
I am trying to parallelize a Python job that corrects the text of many documents, so naturally I came across joblib. Each task corrects one specific document. The code is structured as follows:

import multiprocessing

from joblib import Parallel, delayed

if __name__ == '__main__':
    lexicon = build_compact_lexicon()

    # One task per year, with one worker per CPU core.
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(
        delayed(find_errors)('GDL', i, 1, lexicon) for i in range(1798, 1820)
    )

It uses the function find_errors, summarized here:

def find_errors(newspaper, year, month, lexicon):
    # parse the input newspaper text data using etree parser from LXML
    # detect errors in the text
    return found_errors_type1, found_errors_type2, found_errors_type3

Running this produces the following errors:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 130, in __call__
    return self.func(*args, **kwargs)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 72, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "hellowordParallel.py", line 85, in find_errors
    tree = etree.parse(xml_file_path)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116293)
TypeError: cannot parse from 'NoneType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 392, in find_cookie
    line_string = line.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 24: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 139, in __call__
    tb_offset=1)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 373, in format_exc
    frames = format_records(records)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 274, in format_records
    for token in generate_tokens(linereader):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 514, in _tokenize
    line = readline()
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 265, in linereader
    line = getline(file, lnum[0])
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 136, in updatecache
    with tokenize.open(fullname) as fp:
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 456, in open
    encoding, lines = detect_encoding(buffer.readline)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 433, in detect_encoding
    encoding = find_cookie(first)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 397, in find_cookie
    raise SyntaxError(msg)
  File "<string>", line None
SyntaxError: invalid or missing encoding declaration for '/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/lxml/etree.so'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "hellowordParallel.py", line 160, in <module>
    results = Parallel(n_jobs=num_cores)(delayed(find_errors)('GDL', i, 1, lexicon) for i in range(1798, 1820))
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 810, in __call__
    self.retrieve()
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 727, in retrieve
    self._output.extend(job.get())
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
SyntaxError: invalid or missing encoding declaration for '/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/lxml/etree.so'

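I can't tell whether this is caused by something in my configuration or whether my function is simply not suited to a parallel implementation... (I think it should be...)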
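Reading the traceback from the top, the first exception is `TypeError: cannot parse from 'NoneType'`, raised by `etree.parse(xml_file_path)`. The later `SyntaxError` appears inside `joblib/format_stack.py` while it tries to read `lxml/etree.so` as source text, so it looks like a side effect of formatting the first error rather than a separate problem. One way to check this would be to fail early when the path is missing and to catch exceptions inside the worker. The sketch below is only an illustration: `resolve_xml_path`, `find_errors_checked`, `run_safely` and the `known_paths` dictionary are hypothetical names that are not part of my script, and the real parsing step is left as a comment.

```python
import traceback


def resolve_xml_path(newspaper, year, month, known_paths):
    """Hypothetical stand-in for whatever lookup produces xml_file_path.

    Returns None when no file is known for the requested issue, which is
    the situation the TypeError in the traceback points to.
    """
    return known_paths.get((newspaper, year, month))


def find_errors_checked(newspaper, year, month, known_paths):
    """Fail early with a readable message instead of passing None to the parser."""
    xml_file_path = resolve_xml_path(newspaper, year, month, known_paths)
    if xml_file_path is None:
        raise FileNotFoundError(
            "no XML file for {} {}-{:02d}".format(newspaper, year, month)
        )
    # The real function would call etree.parse(xml_file_path) here.
    return xml_file_path


def run_safely(func, *args):
    """Return (result, None) on success or (None, error_text) on failure.

    The exception is caught inside the worker, so only a plain string
    has to travel back to the parent process.
    """
    try:
        return func(*args), None
    except Exception:
        return None, traceback.format_exc()


if __name__ == "__main__":
    known_paths = {("GDL", 1798, 1): "GDL/1798/01.xml"}  # made-up example data
    for year in (1798, 1799):
        result, error = run_safely(find_errors_checked, "GDL", year, 1, known_paths)
        print(year, "ok" if error is None else error.splitlines()[-1])
```

With this shape, `delayed(run_safely)(find_errors_checked, ...)` could be passed to `Parallel` in place of `delayed(find_errors)(...)`. Running the same loop sequentially first (or with `n_jobs=1`) should also show which year ends up with a missing path.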

Has anyone run into this before?

I hope my question is clear and that there is enough information here for someone to help me!
