python - Searching in a large dataset

Question

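I have a list of user:friends (50,000) and a list of event attendees (25,000 events, each with its list of attendees). I want to find the top k friends with whom a user attends events. This needs to be done for each user.

I tried traversing the lists, but that is computationally very expensive. I am also trying to do it by building a weighted graph. (Python)

Please let me know if there is any other approach.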

score 1 · Accepted Answer

Python's collection objects (dictionaries, sets, and collections.Counter) make this task easy.

from collections import Counter

def top_k_friends(friends, events, k=2):
    '''Given a dictionary of users mapped to their sets of friends
    and a dictionary of events mapped to sets of their attendees,
    find the top k friends with whom each user attends events.

    '''
    # Python 3: dict.items() replaces iteritems(); print is a function.
    for user, users_friends in friends.items():
        c = Counter()  # friend -> number of events shared with user
        for event, attendees in events.items():
            if user in attendees:
                # Count every friend who attended this event too.
                c.update(users_friends.intersection(attendees))
        print(user, '-->', c.most_common(k))

if __name__ == '__main__':

    friends = {
        'robert' : {'mary', 'marty', 'maggie', 'john'},
        'paul' : {'marty', 'mary', 'amber', 'susan'}
    }

    events = {
        'derby': {'amber', 'mary', 'robert'},
        'pageant': {'maggie', 'paul', 'amber', 'marty', 'john'},
        'fireworks': {'susan', 'robert', 'marty', 'paul'}
    }

    top_k_friends(friends, events)

score 0 · Accepted Answer

Do it in a database (e.g. sqlite), or see norman for a pure-Python in-memory option. Either way, it will be much faster than trying to implement this yourself with lists.

score 0 · Accepted Answer

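I would provide a code sample if I had a better idea of what your current data structures look like, but this sounds like a job for a pandas DataFrame groupby (if you don't actually want to use a database, as others have suggested).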

score 0 · Accepted Answer

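You could do something like this.

I am assuming that a user has relatively few friends, and that the events a given user attended are far fewer than the total number of events.

So, for each of the user's friends, you have a boolean vector of the events they attended.

Take the dot product of each friend's vector with the user's own vector; the friend with the largest result is likely to be the one most similar to the user.

Again, before doing this you will need to filter out some events so that the vectors stay a manageable size.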
