3

長い質問で申し訳ありませんが、それを要約するためのより良い方法を見つけることができませんでした。

Pythonを使用multiprocessingして並列でいくつかの計算を実行するプログラムがあります。プロセス間の通信は、aとaの2つのQueueオブジェクトを使用して行われます。メインプロセスは、計算に使用されるデータでいっぱいになり、次にこのキューを消費して結果をに格納するいくつかのサブプロセスを開始します。work_queueresult_queuework_queueresult_queue

すべてが正常に機能しているように見えますが、サンプルデータの量(つまり、に入るデータの量work_queue)とサブプロセスの数を少し試してみると、私を困惑させているエラーが発生し始めました。時間。

次のコードは問題を示しています。

# -- queue_bug.py --

import sys
import time
import random
import datetime
import traceback
# Need this to catch the Queue.Empty exception
import Queue
import multiprocessing
from multiprocessing import Process
from multiprocessing import Queue as MultiprocessingQueue

# -------------------------------------------------------------------
# do_calculation
# -------------------------------------------------------------------
def do_calculation(p_name, work_queue, result_queue):
    def log(msg):
        print '%s [%s] %s' % (datetime.datetime.now(), p_name, msg)
    log('Starting up...')
    while True:
        # Get work from queue
        try:
            work = work_queue.get(timeout = 0.1)
            test_id   = work[0]
            test_data = work[1]
        except Queue.Empty:
            break
        # this is just a dummy loop
        for i in range(100):
            test_result = [x * random.random() for x in test_data]
        result_queue.put((test_id, test_data, test_result))
    log('Finished')

# -------------------------------------------------------------------
# main
# -------------------------------------------------------------------
def main():
    def log(msg):
        print '%s [main   ] %s' % (datetime.datetime.now(), msg)
    try:
        num_tests = int(sys.argv[1])
        num_procs = int(sys.argv[2])
    except Exception:
        print 'usage: <prog> number-of-tests number-of-subprocesses'
        sys.exit()

    log('Initializing queues...')
    work_queue   = MultiprocessingQueue()
    result_queue = MultiprocessingQueue()
    log('Creating subprocesses...')
    process_list = []
    for i in range(num_procs):
        p_name = 'PROC_%02d' % (i+1)
        log('    Initializing %s' % p_name)
        p = Process(
            target = do_calculation,
            args   = (p_name, work_queue, result_queue),
            name   = p_name)
        p.daemon = True
        process_list.append(p)
    log('Populating the work_queue...')
    for test_id in range(num_tests):
        work_queue.put((test_id, [test_id]*20))
    log('Work_queue size is %d' % work_queue.qsize())
    log('Starting the subprocesses...')
    for p in process_list:
        p.start()
    log('Waiting until the work_queue is empty...')
    while True:
        log('    Work_queue size is %d' % work_queue.qsize())
        if work_queue.qsize() > 0:
            time.sleep(0.5)
        else:
            break
    log('Waiting until the result_queue is completely filled...')
    while True:
        log('    Result_queue size is %d' % result_queue.qsize())
        if result_queue.qsize() < num_tests:
            time.sleep(0.5)
        else:
            break
    log('Getting results...')
    result_dict = {}
    while True:
        try:
            queue_data           = result_queue.get_nowait()
            test_id              = queue_data[0]
            test_data            = queue_data[1]
            test_result          = queue_data[2]
            result_dict[test_id] = test_result
        except Queue.Empty:
            log('    All results loaded from result_queue')
            break
    log('Storing test results in result_summary...')
    result_summary = []
    for test_id in range(num_tests):
        try:
            test_result = result_dict[test_id]
            result_summary.append((test_id, test_result))
        except KeyError:
            ex = traceback.format_exc()
            log('ERROR: Exception found: %s' % ex)
            sys.exit()
    log('Success.')
    return result_summary

if __name__ == '__main__':
    main()

今、私がそれを実行しようとすると:

試行1:10.000の計算、10のサブプロセス-OK

$ python queue_bug.py 10000 10
2012-12-04 19:24:25.430667 [main   ] Initializing queues...
2012-12-04 19:24:25.440521 [main   ] Creating subprocesses...
2012-12-04 19:24:25.440550 [main   ]     Initializing PROC_01
2012-12-04 19:24:25.440576 [main   ]     Initializing PROC_02
2012-12-04 19:24:25.440597 [main   ]     Initializing PROC_03
2012-12-04 19:24:25.440617 [main   ]     Initializing PROC_04
2012-12-04 19:24:25.440637 [main   ]     Initializing PROC_05
2012-12-04 19:24:25.440656 [main   ]     Initializing PROC_06
2012-12-04 19:24:25.440679 [main   ]     Initializing PROC_07
2012-12-04 19:24:25.440699 [main   ]     Initializing PROC_08
2012-12-04 19:24:25.440721 [main   ]     Initializing PROC_09
2012-12-04 19:24:25.440741 [main   ]     Initializing PROC_10
2012-12-04 19:24:25.440759 [main   ] Populating the work_queue...
2012-12-04 19:24:25.494263 [main   ] Work_queue size is 10000
2012-12-04 19:24:25.494301 [main   ] Starting the subprocesses...
2012-12-04 19:24:25.495515 [PROC_01] Starting up...
2012-12-04 19:24:25.495802 [PROC_02] Starting up...
2012-12-04 19:24:25.496212 [PROC_03] Starting up...
2012-12-04 19:24:25.496557 [PROC_04] Starting up...
2012-12-04 19:24:25.496896 [PROC_05] Starting up...
2012-12-04 19:24:25.497300 [PROC_06] Starting up...
2012-12-04 19:24:25.497705 [PROC_07] Starting up...
2012-12-04 19:24:25.498074 [PROC_08] Starting up...
2012-12-04 19:24:25.498258 [main   ] Waiting until the work_queue is empty...
2012-12-04 19:24:25.498349 [main   ]     Work_queue size is 9974
2012-12-04 19:24:25.498661 [PROC_09] Starting up...
2012-12-04 19:24:25.499765 [PROC_10] Starting up...
2012-12-04 19:24:25.998914 [main   ]     Work_queue size is 0
2012-12-04 19:24:25.998954 [main   ] Waiting until the result_queue is completely filled...
2012-12-04 19:24:25.998976 [main   ]     Result_queue size is 10000
2012-12-04 19:24:25.998993 [main   ] Getting results...
2012-12-04 19:24:26.029774 [PROC_06] Finished
2012-12-04 19:24:26.029798 [PROC_03] Finished
2012-12-04 19:24:26.029824 [PROC_08] Finished
2012-12-04 19:24:26.029853 [PROC_02] Finished
2012-12-04 19:24:26.029868 [PROC_01] Finished
2012-12-04 19:24:26.029898 [PROC_07] Finished
2012-12-04 19:24:26.029921 [PROC_09] Finished
2012-12-04 19:24:26.029942 [PROC_10] Finished
2012-12-04 19:24:26.031040 [PROC_04] Finished
2012-12-04 19:24:26.031057 [PROC_05] Finished
2012-12-04 19:24:26.087804 [main   ]     All results loaded from result_queue
2012-12-04 19:24:26.087844 [main   ] Storing test results in result_summary...
2012-12-04 19:24:26.092477 [main   ] Success.

試行2:70.000の計算、10のサブプロセス-エラー

$ python queue_bug.py 70000 10
2012-12-04 19:25:01.083092 [main   ] Initializing queues...
2012-12-04 19:25:01.093483 [main   ] Creating subprocesses...
2012-12-04 19:25:01.093520 [main   ]     Initializing PROC_01
2012-12-04 19:25:01.093548 [main   ]     Initializing PROC_02
2012-12-04 19:25:01.093570 [main   ]     Initializing PROC_03
2012-12-04 19:25:01.093591 [main   ]     Initializing PROC_04
2012-12-04 19:25:01.093612 [main   ]     Initializing PROC_05
2012-12-04 19:25:01.093632 [main   ]     Initializing PROC_06
2012-12-04 19:25:01.093656 [main   ]     Initializing PROC_07
2012-12-04 19:25:01.093676 [main   ]     Initializing PROC_08
2012-12-04 19:25:01.093699 [main   ]     Initializing PROC_09
2012-12-04 19:25:01.093720 [main   ]     Initializing PROC_10
2012-12-04 19:25:01.093738 [main   ] Populating the work_queue...
2012-12-04 19:25:01.395974 [main   ] Work_queue size is 70000
2012-12-04 19:25:01.396012 [main   ] Starting the subprocesses...
2012-12-04 19:25:01.397601 [PROC_01] Starting up...
2012-12-04 19:25:01.398183 [PROC_02] Starting up...
2012-12-04 19:25:01.398545 [PROC_03] Starting up...
2012-12-04 19:25:01.399021 [PROC_04] Starting up...
2012-12-04 19:25:01.399621 [PROC_05] Starting up...
2012-12-04 19:25:01.400137 [PROC_06] Starting up...
2012-12-04 19:25:01.400675 [PROC_07] Starting up...
2012-12-04 19:25:01.401200 [PROC_08] Starting up...
2012-12-04 19:25:01.401645 [main   ] Waiting until the work_queue is empty...
2012-12-04 19:25:01.401691 [PROC_09] Starting up...
2012-12-04 19:25:01.401738 [main   ]     Work_queue size is 69959
2012-12-04 19:25:01.402387 [PROC_10] Starting up...
2012-12-04 19:25:01.902063 [main   ]     Work_queue size is 58415
2012-12-04 19:25:02.402640 [main   ]     Work_queue size is 47302
2012-12-04 19:25:02.903067 [main   ]     Work_queue size is 36145
2012-12-04 19:25:03.403650 [main   ]     Work_queue size is 24992
2012-12-04 19:25:03.904065 [main   ]     Work_queue size is 13481
2012-12-04 19:25:04.404643 [main   ]     Work_queue size is 1951
2012-12-04 19:25:04.588562 [PROC_02] Finished
2012-12-04 19:25:04.588580 [PROC_06] Finished
2012-12-04 19:25:04.588611 [PROC_10] Finished
2012-12-04 19:25:04.588631 [PROC_03] Finished
2012-12-04 19:25:04.589705 [PROC_04] Finished
2012-12-04 19:25:04.589741 [PROC_09] Finished
2012-12-04 19:25:04.589764 [PROC_05] Finished
2012-12-04 19:25:04.589791 [PROC_08] Finished
2012-12-04 19:25:04.589814 [PROC_01] Finished
2012-12-04 19:25:04.589844 [PROC_07] Finished
2012-12-04 19:25:04.905065 [main   ]     Work_queue size is 0
2012-12-04 19:25:04.905098 [main   ] Waiting until the result_queue is completely filled...
2012-12-04 19:25:04.905121 [main   ]     Result_queue size is 70000
2012-12-04 19:25:04.905140 [main   ] Getting results...
2012-12-04 19:25:05.012083 [main   ]     All results loaded from result_queue
2012-12-04 19:25:05.012140 [main   ] Storing test results in result_summary...
2012-12-04 19:25:05.020498 [main   ] ERROR: Exception found: Traceback (most recent call last):
  File "queue_bug.py", line 95, in main
    test_result = result_dict[test_id]
KeyError: 10647

2回目の試行で、からデータを読み取ろうとするとKeyErrorresult_dictが発生します。この辞書は、から取得したデータでいっぱいresult_queueなので、関連しているのではないかと思います。

また、失敗する引数の組み合わせ(たとえば、70000/10)で実行するたびに、別のキーで発生することに気付きKeyErrorました。これは、並行性/同期の問題を示しているようです。

最後になりましたが、サンプルデータのサイズまたはサブプロセスの数が増えると、それを再現する可能性が高くなります。

何か案は?

4

2 に答える 2

3

サンプルコードに少しデバッグを追加しましたが、問題が見つかったと思います。の使用に関する既存のコメントとは別にJoinableQueue、主な問題は、結果を処理するときに、最後に次のようなことを行うことです。

try:
    queue_data = result_queue.get_nowait()
except Queue.Empty:
    break

ただしQueue.Empty、キューが実際には空でなくても、get_nowait()タイムアウトが早すぎるために発生する可能性があります。代わりに、次のことを試してください。

try:
    queue_data = result_queue.get_nowait()
except Queue.Empty:
    if result_queue.qsize() < 1:
        break

つまり、ループから抜け出す前に、キューが実際に空であることを確認してください。

于 2012-12-04T19:11:56.157 に答える
3

あなたのバグは、結果キューからデータを取得するループにあります

 while result_queue.qsize() > 0:
        try:
            queue_data           = result_queue.get()
            test_id              = queue_data[0]
            test_data            = queue_data[1]
            test_result          = queue_data[2]
            result_dict[test_id] = test_result
        except Queue.Empty:
            log('    All results loaded from result_queue')
            break

get_nowaitまだ結果が残っているのに、あなたは空に戻っていました。諦める前に終わっていなかったスレッドの数が増えると、舞台裏で頭がおかしくなっているのではないかと思います。

于 2012-12-04T19:12:39.400 に答える