mongodb - Find all duplicate documents in a MongoDB collection by a key field

Question

Suppose I have a collection with some set of documents, something like this:

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}

I want to find all the duplicated entries in this collection by the "name" field. For example, "foo" appears twice and "bar" appears three times.

score 151 · Accepted Answer

The accepted answer is terribly slow on large collections, and it does not return the _ids of the duplicate records.

Aggregation is much faster and can return the _ids:

db.collection.aggregate([
  { $group: {
    _id: { name: "$name" },   // replace `name` here twice
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
]);

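In the first stage of the aggregation pipeline, the $group operator aggregates documents by the name field and stores the _id value of each grouped record in uniqueIds. The $sum operator adds up the values passed to it, in this case the constant 1, which counts the number of grouped records into the count field.

In the second stage of the pipeline, we use $match to keep only the groups with a count of at least 2, i.e. the duplicates.

Then we sort the most frequent duplicates first and limit the results to the top 10.

This query outputs up to $limit records with duplicate names, along with their _ids. For example: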

{
  "_id" : {
    "name" : "Toothpick"
  },
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv"
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}
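
To make the stages concrete, here is a minimal plain-JavaScript sketch that mimics what $group, $addToSet, $sum, $match, $sort and $limit do, without touching a database. The sample documents, their _id strings and the helper name findDuplicates are made up for illustration; this is not how the server implements the pipeline.

```javascript
// Hypothetical sample data shaped like the documents in the question.
const docs = [
  { _id: "a1", name: "foo" },
  { _id: "a2", name: "bar" },
  { _id: "a3", name: "baz" },
  { _id: "a4", name: "foo" },
  { _id: "a5", name: "bar" },
  { _id: "a6", name: "bar" },
];

// findDuplicates is an illustrative helper, not a MongoDB API.
function findDuplicates(documents, field, limit) {
  // $group: one bucket per distinct value of the field.
  const groups = new Map();
  for (const doc of documents) {
    const key = doc[field];
    if (!groups.has(key)) {
      groups.set(key, { _id: { [field]: key }, uniqueIds: [], count: 0 });
    }
    const group = groups.get(key);
    // $addToSet: add the _id only if it is not already in the set.
    if (!group.uniqueIds.includes(doc._id)) group.uniqueIds.push(doc._id);
    // $sum: 1 adds the constant 1 for every document in the group.
    group.count += 1;
  }
  return [...groups.values()]
    .filter((g) => g.count >= 2)        // $match: { count: { $gte: 2 } }
    .sort((a, b) => b.count - a.count)  // $sort: { count: -1 }
    .slice(0, limit);                   // $limit
}

const duplicates = findDuplicates(docs, "name", 10);
console.log(JSON.stringify(duplicates, null, 2));
```

With this sample data the "bar" group (count 3) comes first, followed by "foo" (count 2); "baz" is dropped by the filter.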

score 17 · Accepted Answer

Note: this solution is the easiest to understand, but it is not the best.

You can use mapReduce to find out how many times a given field value occurs across the documents:

var map = function(){
   if(this.name) {
        emit(this.name, 1);
   }
}

var reduce = function(key, values){
    return Array.sum(values);
}

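// With { inline: 1 } the results come back in res.results; no output collection is created.
var res = db.collection.mapReduce(map, reduce, {out: {inline: 1}});
res.results
    .filter(function (r) { return r.value > 1; })
    .sort(function (a, b) { return b.value - a.value; });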
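
For readers who want to see what the map and reduce functions above compute, here is a rough plain-JavaScript emulation. The helper runMapReduce and the sample array are hypothetical, emit is passed in as an argument rather than being a global as in the shell, and the real server may call reduce several times per key (or not at all for a key with a single value), so treat this as a sketch of the idea only.

```javascript
// runMapReduce is an illustrative stand-in, not a MongoDB API.
function runMapReduce(documents, map, reduce) {
  const emitted = new Map();
  // emit(key, value) collects the emitted values per key.
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // Like the shell, call map with `this` bound to each document.
  for (const doc of documents) map.call(doc, emit);
  // Fold each key's values into a single value.
  return [...emitted].map(([key, values]) => ({ _id: key, value: reduce(key, values) }));
}

// Hypothetical sample data; the last document has no name and is skipped.
const people = [
  { name: "foo" }, { name: "bar" }, { name: "baz" },
  { name: "foo" }, { name: "bar" }, { name: "bar" }, {},
];

const counts = runMapReduce(
  people,
  function (emit) { if (this.name) emit(this.name, 1); },
  (key, values) => values.reduce((sum, v) => sum + v, 0) // stands in for Array.sum
);

// Keep names seen more than once, most frequent first.
const repeated = counts
  .filter((r) => r.value > 1)
  .sort((a, b) => b.value - a.value);
console.log(repeated);
```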

score 5 · Accepted Answer

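For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful in that it can return the _ids of the duplicate records.

For PHP, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method: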

$connection = 'mongodb://localhost:27017';
$con        = new Mongo($connection); // mongo db connection

$db         = $con->test; // database 
$collection = $db->prb; // table

$keys       = array("name" => 1); // select the name field and group by it

// set initial values
$initial    = array("count" => 0);

// JavaScript function to perform
$reduce     = "function (obj, prev) { prev.count++; }";

$g          = $collection->group($keys, $initial, $reduce);

echo "<pre>";
print_r($g);

The output will look like this:

Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] => 
                    [count] => 1
                )

            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )

        )

    [count] => 3
    [keys] => 2
    [ok] => 1
)

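The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that the array still needs filtering to keep only the elements with a count greater than 1, since every name that occurs at all is counted at least once. Again, refer to the MongoDB cookbook recipe for finding duplicates using group for the canonical solution using group.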
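
The filtering step on the group output can be sketched as follows, in JavaScript for consistency with the shell examples in this thread. The groupResult object is hand-copied from the output above (representing the empty name as an empty string is an assumption), and only entries whose count is above 1 are kept as duplicates.

```javascript
// Shape and values mirror the printed group output; nothing is fetched from a server.
const groupResult = {
  retval: [
    { name: "", count: 1 },        // assumption: the blank name is an empty string
    { name: "MongoDB", count: 2 },
  ],
  count: 3,
  keys: 2,
  ok: 1,
};

// A name is a duplicate only if it was counted more than once.
const duplicateNames = groupResult.retval.filter((entry) => entry.count > 1);
console.log(duplicateNames);
```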

score 3 · Accepted Answer

With the aggregation pipeline framework, documents with duplicate key values can be identified easily:

// Desired unique index: 
// db.collection.ensureIndex({ firstField: 1, secondField: 1 }, { unique: true})

db.collection.aggregate([
  { $group: { 
    _id: { firstField: "$firstField", secondField: "$secondField" }, 
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  }}, 
  { $match: { 
    count: { $gt: 1 } 
  }}
])

Ref: useful information on an official mongo lab blog:

https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework
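
As a rough illustration of grouping on a compound key, the plain-JavaScript sketch below treats the pair (firstField, secondField) as one key, the way the _id document in the $group stage does. The sample rows are hypothetical, and serializing the pair with JSON.stringify is just one convenient way to get a comparable key in JavaScript, not what MongoDB does internally.

```javascript
// Hypothetical rows; the field names follow the pipeline above.
const rows = [
  { _id: 1, firstField: "a", secondField: "x" },
  { _id: 2, firstField: "a", secondField: "y" },
  { _id: 3, firstField: "a", secondField: "x" },
  { _id: 4, firstField: "b", secondField: "x" },
];

const byPair = new Map();
for (const row of rows) {
  // Serialize the pair so that equal pairs land on the same Map key.
  const key = JSON.stringify([row.firstField, row.secondField]);
  if (!byPair.has(key)) {
    byPair.set(key, {
      _id: { firstField: row.firstField, secondField: row.secondField },
      uniqueIds: [],
      count: 0,
    });
  }
  const group = byPair.get(key);
  group.uniqueIds.push(row._id); // the sample _ids are unique, so no set check is needed here
  group.count += 1;
}

// Pairs that occur more than once would violate the desired unique index.
const duplicatePairs = [...byPair.values()].filter((g) => g.count > 1);
console.log(JSON.stringify(duplicatePairs));
```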

score 1 · Accepted Answer

The highest-voted answer here has this:

uniqueIds: { $addToSet: "$_id" },

That also returns a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:

db.collection.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

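To explain this: if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) that work with the GROUP BY statement. They let you, for example, find the total number of times a column value appears in a table.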

SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type         |
+----------+-----------------+
|        3 | Contact         |
|        1 | Practice        |
|        1 | Prospect        |
|        1 | Task            |
+----------+-----------------+

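As you can see, the output shows how many times each my_type value appears. To find duplicates in MongoDB, we tackle the problem in a similar way. MongoDB boasts aggregation operations that group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. It is a similar concept to aggregate functions in SQL.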

Assuming a collection called contacts, the initial setup looks as follows:

db.contacts.aggregate([ ... ]);

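This aggregate function takes an array of aggregation operators (pipeline stages). In our case we want the $group operator, since the goal is to group the data by the field's count, i.e. the number of occurrences of the field value.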

db.contacts.aggregate([  
    {$group: { 
        _id: {name: "$name"} 
        } 
    }
]);

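There is a little idiosyncrasy to this approach. The _id field is required to use the $group operator. In this case, we are grouping on the $name field. The key inside _id can have any name, but we use name here since it is intuitive.

Running the aggregation with only the $group operator gives a list of all the name values (regardless of whether they appear once or more than once in the collection):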

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"} 
    } 
  }
]);

{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }

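Notice above how the aggregation works. It takes the documents that have a name field and returns a new set of documents containing just the extracted name values.

But what we want to know is how many times each field value reappears. The $group operator takes a count field that uses the $sum operator to add the expression 1 to the total for each document in the group. So $group and $sum together return the running total of those 1s for each value of the given field (e.g. name).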

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"},
    count: {$sum: 1}
    } 
  }
]);

{ "_id" : { "name" : "John" },  "count" : 1  }
{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
{ "_id" : { "name" : "Amanda" },  "count" : 1 }

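Since the goal was to find duplicates, one extra step is needed. To get only the groups that have a count of more than one, we can use the $match operator to filter the results. Within the $match operator, we tell it to look at the count field and to look for counts greater than one, using the $gt operator ("greater than") and the number 1.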

db.contacts.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }

As a side note, if you are using MongoDB through an ORM such as Mongoid for Ruby, you might get this error:

The 'cursor' option is required, except for aggregate with the explain argument

This most likely means that your ORM is out of date and is performing operations that MongoDB no longer supports. So either update your ORM or find a fix. For Mongoid, this was the fix for me:

module Moped
  class Collection
    # Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries.  This overrides
    # `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
    #
    # Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
    #
    #   Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
    #   the command includes the `explain` option. Unless you include the `explain` option, you must specify the
    #   `cursor` option.
    #
    #   To indicate a cursor with the default batch size, specify `cursor: {}`.
    #
    #   To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
    #
    def aggregate(*pipeline)
      # Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
      extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
    end

    private

    def extract_result(response)
      response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
    end
  end
end
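
For comparison with the shell examples above, the same idea as the Ruby patch can be sketched in JavaScript: build the raw aggregate command document with an empty cursor option, then read the documents from cursor.firstBatch in the reply. The helper names buildAggregateCommand and extractResult and the sample reply are hypothetical and written by hand; in a real shell session the command document would be sent with something like db.runCommand, and results larger than the first batch would need further fetching.

```javascript
// Builds the command document; `cursor: {}` requests the default batch size.
function buildAggregateCommand(collectionName, pipeline) {
  return { aggregate: collectionName, pipeline: pipeline, cursor: {} };
}

// Mirrors extract_result in the Ruby patch: prefer the cursor's first batch,
// fall back to the legacy `result` array.
function extractResult(response) {
  return "cursor" in response ? response.cursor.firstBatch : response.result;
}

const command = buildAggregateCommand("contacts", [
  { $group: { _id: { name: "$name" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
]);

// Hand-written reply for illustration only, not captured from a server.
const sampleResponse = {
  cursor: {
    id: 0,
    ns: "test.contacts",
    firstBatch: [{ _id: { name: "Joan" }, count: 3 }],
  },
  ok: 1,
};

console.log(JSON.stringify(command));
console.log(extractResult(sampleResponse));
```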
