python - Optimizing a loop over a big dataset in Python

Question

This is my first time working at this scale with Python, so I need some help.

I have a mongodb collection (or Python dict) with the following structure:

{
  "_id": { "$oid" : "521b1fabc36b440cbe3a6009" },
  "country": "Brazil",
  "id": "96371952",
  "latitude": -23.815124482000001649,
  "longitude": -45.532670811999999216,
  "name": "coffee",
  "users": [
    {
      "id": 277659258,
      "photos": [
        {
          "created_time": 1376857433,
          "photo_id": "525440696606428630_277659258",
        },
        {
          "created_time": 1377483144,
          "photo_id": "530689541585769912_10733844",
        }
      ],
      "username": "foo"
    },
    {
      "id": 232745390,
      "photos": [
        {
          "created_time": 1369422344,
          "photo_id": "463070647967686017_232745390",
        }
      ],
      "username": "bar"
    }
  ]
}

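Now I want to create two files: one with the summary of each location and one with the weight of each connection between locations. My loop, which works for small datasets, is the following: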

#a is the dataset
data = db.collection.find()
a =[i for i in data]

#here go the connections between the locations
edges = csv.writer(open("edges.csv", "wb"))
#and here the location data
nodes = csv.writer(open("nodes.csv", "wb"))

for i in a:

    #find the users that match
    for q in a:
        if i['_id'] <> q['_id'] and q.get('users') :
            weight = 0
            for user_i in i['users']:
                for user_q in q['users']:
                    if user_i['id'] == user_q['id']:
                        weight +=1
            if weight>0:
                edges.writerow([ i['id'], q['id'], weight])


    #find the number of photos
    photos_number =0
    for p in i['users']:
        photos_number += len(p['photos'])


    nodes.writerow([ i['id'],
                    i['name'],
                    i['latitude'],
                    i['longitude'],
                    len(i['users']),
                    photos_number
                ])

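The scaling problem: I have 20000 locations, each location might have up to 2000 users, and each user might have around 10 photos.

Is there a more efficient way to write the loops above? Maybe multithreading, a JIT, more indexes? If I run the above in a single thread, it can reach up to 20000^2 * 2000 * 10 iterations...

So how can I handle this problem more efficiently? Thanks.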

score 1 · Accepted Answer

Would collapsing this loop:

photos_number =0
for p in i['users']:
    photos_number += len(p['photos'])

down to:

photos_number = sum(len(p['photos']) for p in i['users'])

help at all?
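As a quick check that the two forms agree, here is a minimal sketch. The `location` record and both function names are hypothetical; the record only mimics the shape of the document in the question.

```python
# Hypothetical record shaped like the question's document; values are made up.
location = {
    "users": [
        {"id": 1, "photos": [{"photo_id": "a"}, {"photo_id": "b"}]},
        {"id": 2, "photos": [{"photo_id": "c"}]},
    ]
}


def count_photos_loop(loc):
    # Explicit accumulation, as in the question.
    total = 0
    for p in loc["users"]:
        total += len(p["photos"])
    return total


def count_photos_sum(loc):
    # Same count using sum() over a generator expression.
    return sum(len(p["photos"]) for p in loc["users"])
```

Both functions do the same amount of work per user, so any speedup from the `sum` form would come only from lower interpreter overhead, not from a change in complexity.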

Your weight calculation:

        weight = 0
        for user_i in i['users']:
            for user_q in q['users']:
                if user_i['id'] == user_q['id']:
                    weight +=1

should also be collapsible down to:

        # requires: from itertools import product
        weight = sum(user_i['id'] == user_q['id']
                        for user_i, user_q in product(i['users'], q['users']))

Since True equates to 1, summing all the boolean conditions is the same as counting all the values that are True.
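The boolean-sum idea can be sketched as below. The sample data and function names are made up for illustration. The second function is not part of this answer: it swaps in a set intersection, which avoids comparing every pair of users, and it assumes user ids are unique within a location (the question does not state this).

```python
from itertools import product


def weight_by_product(loc_i, loc_q):
    # Sum of booleans: each True comparison counts as 1.
    return sum(u_i["id"] == u_q["id"]
               for u_i, u_q in product(loc_i["users"], loc_q["users"]))


def weight_by_sets(loc_i, loc_q):
    # Alternative technique (assumption: ids are unique per location):
    # count shared ids via set intersection instead of pairwise comparison.
    ids_i = {u["id"] for u in loc_i["users"]}
    ids_q = {u["id"] for u in loc_q["users"]}
    return len(ids_i & ids_q)


# Made-up sample locations sharing users 2 and 3.
loc_a = {"users": [{"id": 1}, {"id": 2}, {"id": 3}]}
loc_b = {"users": [{"id": 2}, {"id": 3}, {"id": 4}]}
```

With up to 2000 users per location, the pairwise version performs up to 2000 * 2000 comparisons per pair of locations, while the set version is roughly linear in the number of users, so it may matter more than the choice between a loop and `sum`.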
