python - How to get a single random document from 1 billion documents in MongoDB using Python?

Question

I need a single random document from a MongoDB collection. My collection currently contains more than 1 billion documents. How can I get a single random document from it?

score 21 · Accepted Answer

I have never worked with MongoDB from Python, but there is a general solution to your problem. Here is a MongoDB shell script for obtaining a single random document:

N = db.collection.count(condition)
db.collection.find(condition).limit(1).skip(Math.floor(Math.random()*N))

condition here is a MongoDB query. If you want to query the entire collection, use condition = null.

It's a general solution, so it works with any MongoDB driver.
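
For Python, the shell script above might translate roughly as follows. This is a minimal sketch, assuming `collection` is a PyMongo `Collection` object; the function name `random_document_by_skip` is made up for illustration.

```python
import random


def random_document_by_skip(collection, condition=None):
    """Return one random document matching `condition`, or None if nothing matches."""
    query = condition or {}
    # Count the matching documents first (N in the shell script above).
    total = collection.count_documents(query)
    if total == 0:
        return None
    # Pick a uniform offset in [0, total - 1] and skip to it.
    offset = random.randrange(total)
    cursor = collection.find(query).skip(offset).limit(1)
    # The cursor can be empty if documents were deleted after the count.
    return next(iter(cursor), None)
```

It could be called as `random_document_by_skip(db["collection_name"])`, where `db` is a hypothetical database handle. Note that skip still walks past every skipped document, so this gets slow on large collections.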

Update

I ran a benchmark to test several implementations. First, I created a test collection with 5567249 documents and an indexed random field rnd.

I chose three methods to compare with each other:

First method:

db.collection.find().limit(1).skip(Math.floor(Math.random()*N))

Second method:

db.collection.find({rnd: {$gte: Math.random()}}).sort({rnd:1}).limit(1)

Third method:

db.collection.findOne({rnd: {$gte: Math.random()}})

I ran each method 10 times and got its average computing time:

method 1: 882.1 msec
method 2: 1.2 msec
method 3: 0.6 msec

This benchmark shows that my solution is not the fastest one.

But the third method is not a good one either, because it returns the first element in the database (in natural order) with rnd >= random(). So its output is not truly random.

I think the second method is the best one for frequent use. But it has one drawback: it requires adding a field to every document and creating an additional index.
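
A rough Python version of the second method could look like this. It is a sketch, assuming `collection` is a PyMongo `Collection` whose documents carry an indexed `rnd` field holding a float in [0, 1). The wrap-around fallback is an addition that is not part of the benchmarked query, and the function name is hypothetical.

```python
import random


def random_document_by_rnd(collection, field="rnd"):
    """Return a random document using an indexed random field, or None if empty."""
    threshold = random.random()
    # Smallest value of `field` that is >= threshold (can use the index on `field`).
    doc = collection.find_one({field: {"$gte": threshold}}, sort=[(field, 1)])
    if doc is None:
        # Assumption: if threshold is above every stored value, wrap around
        # to the document with the smallest value instead of returning nothing.
        doc = collection.find_one({field: {"$exists": True}}, sort=[(field, 1)])
    return doc
```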

score 6 · Accepted Answer

Add an extra field named random to the collection, with values between 0 and 1. You can generate random floats between 0 and 1 for the records with [random.random() for _ in range(0, 10)].

Then:

import random

# mongodb is assumed to be an existing PyMongo database handle
collection = mongodb["collection_name"]

rand = random.random()  # rand is a floating point number in [0, 1)
random_record = collection.find_one({'random': {'$gte': rand}})

MongoDB is expected to implement this natively at some point. The feature request is filed here: https://jira.mongodb.org/browse/SERVER-533

At the time of writing, it has not been implemented yet.
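
The approach above assumes every document already has the random field. One way to backfill it is sketched below, assuming `collection` is a PyMongo `Collection`; the helper name `backfill_random_field` is hypothetical. It updates one document at a time, which would be slow on a billion documents, so treat it as an illustration rather than a tuned migration.

```python
import random


def backfill_random_field(collection, field="random"):
    """Give every document that lacks `field` a random float in [0, 1)."""
    updated = 0
    # Only fetch _id for documents that do not have the field yet.
    missing = collection.find({field: {"$exists": False}}, {"_id": 1})
    for doc in missing:
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {field: random.random()}},
        )
        updated += 1
    return updated
```

After the backfill, an index such as `collection.create_index("random")` keeps the `$gte` lookup from scanning the whole collection.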

score 6 · Accepted Answer

Since MongoDB 3.2, you can do this with the aggregate function and the $sample operator, as described in the docs. It is super fast. The following code selects 20 random documents from the collection:

db.collection.aggregate( [ { $sample: {size: 20} } ] )

If you need to select random documents that meet a particular criterion, you can combine it with the $match operator:

db.collection.aggregate([ 
    { $sample: {size: 20} }, 
    { $match:{"yourField": value} } 
  ])

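Pay attention to the order! On my small database of about 100k documents, the command above takes 15 ms, but if I switch the order it takes 1750 ms (more than 100 times slower). The reason is of course obvious. Also note that in this order, $match filters the 20 documents that were already sampled, so you only get a subset of those 20 random documents...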
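
From Python, the same pipelines could be built and run as sketched below. This assumes `collection` is a PyMongo `Collection`; the helper names and the `match_first` flag are hypothetical and only make the ordering trade-off explicit.

```python
def build_sample_pipeline(size=1, criteria=None, match_first=True):
    """Build an aggregation pipeline that samples `size` random documents."""
    sample = {"$sample": {"size": size}}
    if criteria is None:
        return [sample]
    match = {"$match": criteria}
    # match_first=True: filter, then sample, so up to `size` matching documents return.
    # match_first=False: sample, then filter, which may return fewer than `size`.
    return [match, sample] if match_first else [sample, match]


def random_documents(collection, size=1, criteria=None, match_first=True):
    """Run the pipeline and return the sampled documents as a list."""
    pipeline = build_sample_pipeline(size, criteria, match_first)
    return list(collection.aggregate(pipeline))
```

For a single random document from the whole collection, `random_documents(collection)` would run `[{"$sample": {"size": 1}}]`.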

score 2 · Accepted Answer

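Performance-wise? To put it mildly, this is hard to do without altering your data.

Imagine trying to fetch the document at a rand() position of 1,000,000 out of 1b documents. That will be slow, very slow. This is because MongoDB does not make effective use of indexes when skipping.

As @Calvin said, there is a feature request for MongoDB to return a random document, but it has not been implemented yet.

If you do this regularly, the most efficient way at the moment is to add an auto-incrementing ID to your records (http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field) and use rand() on that.

Edit

To clarify: if you use an auto-incrementing ID, you first have to run one query (unless you track it some other way) to get the maximum value of the field. You can query a counter collection, or the collection itself sorted in reverse order (sort({field:-1})) with limit(1), to get the upper bound for rand().

You also have to account for changes in the data, which means you actually need $gte on that random position.

My idea is explained in more detail here: php mongodb find nth entry in collection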
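
The auto-increment approach described above might be sketched in Python like this. It assumes `collection` is a PyMongo `Collection` and that each document has an indexed integer field that starts at 1; the field name `seq` and the function name are hypothetical.

```python
import random


def random_document_by_sequence(collection, field="seq"):
    """Return a random document using an auto-incrementing integer field."""
    # One query to find the current maximum of the field.
    newest = collection.find_one({field: {"$exists": True}}, sort=[(field, -1)])
    if newest is None:
        return None
    # Assumption: the sequence starts at 1; randint includes both endpoints.
    target = random.randint(1, newest[field])
    # $gte rather than equality, so gaps left by deleted documents are skipped.
    return collection.find_one({field: {"$gte": target}}, sort=[(field, 1)])
```

If documents have been deleted, the document just after a gap is picked more often than its neighbours, so the result is only uniform when the IDs are dense.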

score 1 · Accepted Answer

If your objects have an int id, you can do something like this:

findOne({id: {$gte: rand()}})
