mongodb - Fastest way to remove duplicate documents in mongodb

Question

I have approximately 1.7M documents in mongodb (in the future 10M+). Some of them represent duplicate entries which I do not want. The structure of a document is something like this:

{
    _id: 14124412,
    nodes: [
        12345,
        54321
        ],
    name: "Some beauty"
}

A document is a duplicate if it has at least one node in common with another document of the same name. What is the fastest way to remove the duplicates?
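
One way to read this definition, as a minimal sketch in plain JavaScript that needs no database: walk the documents in order and report a document as a duplicate when a document kept earlier has the same name and shares at least one node with it. The function name `findDuplicateIds` and the keep-the-first policy are assumptions made for illustration, not part of the question.

```javascript
// Sketch: in-memory version of the duplicate rule from the question.
// Assumption: the first document seen for a (name, node) pair is the one kept.
function findDuplicateIds(docs) {
  const claimed = new Set();   // (name, node) keys owned by documents we keep
  const duplicateIds = [];

  for (const doc of docs) {
    const keys = doc.nodes.map((node) => JSON.stringify([doc.name, node]));
    if (keys.some((key) => claimed.has(key))) {
      duplicateIds.push(doc._id);               // shares a node with a kept document
    } else {
      keys.forEach((key) => claimed.add(key));  // keep it and claim its pairs
    }
  }
  return duplicateIds;
}

const sample = [
  { _id: 1, nodes: [12345, 54321], name: "Some beauty" },
  { _id: 2, nodes: [54321, 99999], name: "Some beauty" },  // shares 54321 with _id 1
  { _id: 3, nodes: [12345], name: "Other name" },          // same node, different name
];
console.log(findDuplicateIds(sample));  // only _id 2 is reported
```

Note that under this policy a removed duplicate does not claim its own nodes, so a later document that overlaps only with the removed one is kept.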

score 91 · Accepted Answer

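The dropDups: true option is not available in 3.0.

I have a solution that uses the aggregation framework to collect the duplicates and then remove them in one go.

It may be somewhat slower than a system-level "index" change, but it is still a useful way to think about removing duplicate documents.

a. Remove all documents in one go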

var duplicates = [];

db.collectionName.aggregate([
  { $match: { 
    name: { "$ne": '' }  // optional filter: skip documents with an empty name
  }},
  { $group: { 
    _id: { name: "$name"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }},
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to disk instead of hitting the memory limit
)               // You can stop here and print the result to check the duplicates
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )
})

// Optional: print every "_id" that is about to be deleted
printjson(duplicates);     

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})
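
With millions of documents the `duplicates` array can become very large, and a single query document has to stay within MongoDB's 16 MB BSON document limit. A hedged sketch of deleting in batches instead: the helper names `chunk` and `removeInBatches` and the batch size of 1000 are assumptions chosen for illustration, and `collection` is expected to expose `remove` the way a mongo shell collection does.

```javascript
// Split an array into consecutive slices of at most `size` elements.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Assumption: `collection` is a mongo shell collection such as db.collectionName.
function removeInBatches(collection, ids, batchSize) {
  chunk(ids, batchSize).forEach(function (batch) {
    collection.remove({ _id: { $in: batch } });  // one delete per batch
  });
}

// Usage in the mongo shell (not run here):
// removeInBatches(db.collectionName, duplicates, 1000);
```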

b. You can delete documents one by one.

db.collectionName.aggregate([
  // Optional filter; remove this "$match" stage if you do not need it
  { $match: { 
    "source_references.key": { "$ne": '' }  // dotted field names must be quoted
  }},
  { $group: { 
    _id: { key: "$source_references.key" }, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to disk instead of hitting the memory limit
)               // You can stop here and print the result to check the duplicates
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    db.collectionName.remove({_id : {$in: doc.dups }});  // Delete remaining duplicates
})

score 49 · Accepted Answer

If you are certain that you want to permanently delete the documents with duplicate name + nodes entries from the collection, you can add a unique index with the dropDups: true option:

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})

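As the docs say, use extreme caution with this, because it deletes data from your database. Back up your database first in case it does not do exactly what you expect.

UPDATE

This solution is only valid through MongoDB 2.x, as the dropDups option is no longer available in 3.0 (docs).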
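
On 3.0 and later a unique index can still be created once the duplicates have been removed by other means; it then rejects new duplicates rather than dropping existing ones. A minimal sketch, assuming a collection object that exposes `createIndex` as the mongo shell does; the wrapper name `addUniqueNameNodesIndex` is hypothetical.

```javascript
// Assumption: `collection` is a mongo shell collection such as db.test.
function addUniqueNameNodesIndex(collection) {
  // No dropDups here: index creation fails with a duplicate key error
  // if duplicate name + nodes entries are still present.
  return collection.createIndex({ name: 1, nodes: 1 }, { unique: true });
}

// Usage in the mongo shell (not run here):
// addUniqueNameNodesIndex(db.test);
```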

score 1 · Accepted Answer

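The following method merges documents with the same name while keeping only the unique nodes, without duplicating them.

I found using the $out operator to be a simple way. I unwind the array and then group it by adding to a set. The $out operator allows the aggregation result to persist [docs]. If you give it the name of the collection itself, it replaces the collection with the new data. If the name does not exist, it creates a new collection.

Hope this helps.

allowDiskUse may have to be added to the pipeline.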

db.collectionName.aggregate([
  {
    $unwind:{path:"$nodes"},
  },
  {
    $group:{
      _id:"$name",
      nodes:{
        $addToSet:"$nodes"
      }
    }
  },
  {
    $project:{
      _id:0,
      name:"$_id",
      nodes:1
    }
  },
  {
    $out:"collectionNameWithoutDuplicates"
  }
])
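
To make the effect of the unwind-then-group approach concrete, here is a minimal in-memory sketch of the same merge in plain JavaScript. The function name `mergeByName` is an assumption for illustration; it only mimics the grouping logic and does not touch a database.

```javascript
// Sketch: merge documents that share a name, keeping each node once.
function mergeByName(docs) {
  const nodesByName = new Map();   // name -> Set of unique nodes
  for (const doc of docs) {
    for (const node of doc.nodes) {
      // Like $unwind, a document with an empty nodes array contributes nothing.
      if (!nodesByName.has(doc.name)) {
        nodesByName.set(doc.name, new Set());
      }
      nodesByName.get(doc.name).add(node);
    }
  }
  return Array.from(nodesByName, ([name, nodes]) => ({
    name: name,
    nodes: Array.from(nodes),
  }));
}

console.log(mergeByName([
  { _id: 1, name: "Some beauty", nodes: [12345, 54321] },
  { _id: 2, name: "Some beauty", nodes: [54321, 99999] },
]));
```

As in the pipeline, the original `_id` values are not carried over to the merged documents, so this approach suits cases where only name and nodes matter. The order of elements produced by `$addToSet` is not guaranteed, whereas this sketch keeps insertion order.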
