python - Distributing a workload in Python

Question

I have a database of 10,000 adam_ids. For each adam_id, I need to pull down information through an API.

My table looks like this:

`title`
- adam_id
- success (boolean)
- number_of_tries (# of times success=0 when trying to do the pull down)

Here is the function I would like to create:

def pull_down(cursor):
    work_remains = True
    while work_remains:
        cursor.execute("""SELECT adam_id FROM title WHERE success=0
                          AND number_of_tries < 5 ORDER BY adam_id LIMIT 1""")
        # fetch the row once; calling fetchall() first would consume it
        row = cursor.fetchone()
        if row:
            do_api_call(cursor, row[0])
        else:
            work_remains = False

def do_api_call(cursor, adam_id):
    # do api call (sets `success`)
    # %s is the MySQLdb-style placeholder (an assumption); sqlite3 uses ?
    if success:
        cursor.execute("UPDATE title SET success=1 WHERE adam_id=%s", (adam_id,))
    else:
        cursor.execute("UPDATE title SET number_of_tries=number_of_tries+1 WHERE adam_id=%s", (adam_id,))

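How would I run the above with n workers using Python's multiprocessing functionality, instead of one synchronous process? I have started looking at the multiprocessing module (http://docs.python.org/library/multiprocessing.html), but so far it seems pretty difficult to digest.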

score 1 · Accepted Answer

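If the bulk of the work is the API call, which goes out to an external resource, then that is the only part you actually need to parallelize. The database calls are probably really fast. So you might try this: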

1. Batch fetch the adam_id values in one query
2. Put the IDs into a process pool to do the API calls
3. Get the results and commit them to the database

Here is a rough pseudocode example to show the logic flow:

from multiprocessing import Pool

def pull_down(cursor):
    # get all the data in one query (no LIMIT, so every pending ID is fetched)
    cursor.execute("""SELECT adam_id FROM title WHERE success=0
                      AND number_of_tries < 5 ORDER BY adam_id""")

    # Step #1
    adam_id_list = [row[0] for row in cursor.fetchall()]
    if adam_id_list:
        # Step #2
        pool = Pool(4)
        results = pool.map(do_api_call, adam_id_list)
        pool.close()
        pool.join()

        # Step #3
        update_db(results)

def do_api_call(adam_id):
    # do api call
    success = call_api_with_id(adam_id)
    return (adam_id, success)

def update_db(results):
    # loop over results and build batch queries for the succeeded
    # or failed items

    # (obviously this split up could be optimized)
    succeeded = [result[0] for result in results if result[1]]
    failed = [result[0] for result in results if not result[1]]

    submit_success(succeeded)
    submit_failed(failed)
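
The submit_success and submit_failed helpers are left undefined above. The following is a minimal sketch of what they might look like, not the answerer's actual implementation. It assumes a sqlite3 connection passed in as an extra `conn` argument (both the argument and the `demo` function are hypothetical additions) and the table layout from the question. With MySQLdb the placeholder would be `%s` instead of `?`.

```python
import sqlite3

def submit_success(conn, adam_ids):
    # Mark every successfully fetched ID in one batch.
    conn.executemany(
        "UPDATE title SET success = 1 WHERE adam_id = ?",
        [(adam_id,) for adam_id in adam_ids],
    )
    conn.commit()

def submit_failed(conn, adam_ids):
    # Count one more failed attempt for each ID that did not succeed.
    conn.executemany(
        "UPDATE title SET number_of_tries = number_of_tries + 1 WHERE adam_id = ?",
        [(adam_id,) for adam_id in adam_ids],
    )
    conn.commit()

def demo():
    # In-memory table with the same columns as the question's `title` table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE title (adam_id INTEGER PRIMARY KEY, "
        "success INTEGER DEFAULT 0, number_of_tries INTEGER DEFAULT 0)"
    )
    conn.executemany("INSERT INTO title (adam_id) VALUES (?)", [(1,), (2,), (3,)])
    submit_success(conn, [1, 3])
    submit_failed(conn, [2])
    return conn.execute(
        "SELECT adam_id, success, number_of_tries FROM title ORDER BY adam_id"
    ).fetchall()
```

Because all writes happen in the parent process after the pool has finished, a single connection is enough here.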

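Your code only gets complicated if you also try to make the database calls parallel, because then you have to properly give each process its own connection, even though the database is not what is slowing you down anyway.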
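
To illustrate that extra complexity, here is a rough sketch of the parallel-database variant, in which each worker process opens its own connection through the `initializer` argument of `Pool`. It assumes a file-based sqlite3 database at a hypothetical `db_path`; the names `init_worker`, `do_api_call_and_update` and `pull_down_parallel` are made up for this example, and `call_api_with_id` is a stand-in for the real API call.

```python
import sqlite3
from multiprocessing import Pool

_conn = None  # one connection per worker process, set by init_worker

def init_worker(db_path):
    # Runs once in each worker process: open a connection owned by that process.
    global _conn
    _conn = sqlite3.connect(db_path)

def call_api_with_id(adam_id):
    # Placeholder for the real API call; here even IDs "succeed".
    return adam_id % 2 == 0

def do_api_call_and_update(adam_id):
    # Call the API, then write the outcome through this process's connection.
    success = call_api_with_id(adam_id)
    if success:
        _conn.execute("UPDATE title SET success = 1 WHERE adam_id = ?", (adam_id,))
    else:
        _conn.execute(
            "UPDATE title SET number_of_tries = number_of_tries + 1 "
            "WHERE adam_id = ?",
            (adam_id,),
        )
    _conn.commit()
    return adam_id, success

def pull_down_parallel(db_path, adam_id_list, workers=4):
    # Every worker gets its own connection via the pool initializer.
    with Pool(workers, initializer=init_worker, initargs=(db_path,)) as pool:
        return pool.map(do_api_call_and_update, adam_id_list)
```

When run as a script, the call to `pull_down_parallel` should sit under an `if __name__ == "__main__":` guard. Compared with the batch approach, this adds per-process state and concurrent writers to the same database, which is the extra complexity described above.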
