python - Using PDFMiner as a library: "AttributeError: 'NoneType' object has no attribute 'getobj'"

Question

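I am writing a script that uploads PDF files and parses them along the way. I use PDFMiner for the parsing.

To turn the file into a PDFMiner document, I use the following function, which follows the PDFMiner usage instructions: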

def load_document(self, _file = None):
    """turn the file into a PDFMiner document"""
    if _file == None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    doc.set_parser(parser)
    if self.options['password']:
        password = self.options['password']
    else:
        password = ""
    doc.initialize(password)
    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc

The expected result is, of course, a nice PDFDocument instance, but instead I get this error:

Traceback (most recent call last):
  File "bzk_pdf.py", line 45, in <module>
    cli.run_cli(BZKPDFScraper)
  File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
    instance = cls(options)
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
    self.doc = self.load_document()
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
    doc.set_parser(parser)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
    x = x.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
    return self.doc.getobj(self.objid)
AttributeError: 'NoneType' object has no attribute 'getobj'

I have no idea where to look, and I could not find anyone else with the same problem.

Some additional information that may help:

My test file: http://www.2shared.com/document/kM_wrI3J/testpdf.html
_file is a Django File object, but using a normal file gives the same result
pdfminer version: 'pdfminer-20110515'
Django: 1.4.3 (probably not relevant)
Python 2.7.3

score 2 · Accepted Answer

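After some experimentation, I found that I was missing a line:

parser.set_document(doc)

With that line added, the function works.

It looks like poor library design to me, but it is possible that I missed something and that this only patches over the error.

Either way, I now have a PDF document containing the data I need.

Here is the final result: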

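# In pdfminer-20110515 both classes live in pdfminer.pdfparser.
from pdfminer.pdfparser import PDFParser, PDFDocument

def load_document(self, _file=None):
    """turn the file into a PDFMiner document"""
    if _file is None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    # Link both directions: the parser needs the document to resolve
    # indirect object references, and the document needs the parser.
    parser.set_document(doc)
    doc.set_parser(parser)

    # Fall back to an empty password when none is configured.
    if 'password' in self.options.keys():
        password = self.options['password']
    else:
        password = ""

    doc.initialize(password)

    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc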
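
As a rough illustration of why the missing call surfaces as this particular AttributeError, here is a toy model of the parser/document relationship. All class and function names below (ToyObjRef, ToyParser, ToyDocument, resolve_info) are hypothetical and are not PDFMiner's actual code. The sketch only assumes what the traceback suggests: an object reference stores the document it was created with and fails on resolve() when that document is None.

```python
class ToyObjRef(object):
    """Hypothetical stand-in for an indirect object reference."""

    def __init__(self, doc, objid):
        self.doc = doc
        self.objid = objid

    def resolve(self):
        # Raises AttributeError when the reference was created without a document.
        return self.doc.getobj(self.objid)


class ToyParser(object):
    """Hypothetical parser that hands its current document to new references."""

    def __init__(self):
        self.doc = None  # stays None until set_document() is called

    def set_document(self, doc):
        self.doc = doc

    def make_ref(self, objid):
        # The reference captures whatever document the parser knows about now.
        return ToyObjRef(self.doc, objid)


class ToyDocument(object):
    """Hypothetical document holding a single object with id 1."""

    def __init__(self):
        self.objects = {1: {"Title": "example"}}

    def getobj(self, objid):
        return self.objects[objid]


def resolve_info(link_parser_to_doc):
    """Create a reference and resolve it, with or without the parser-to-document link."""
    parser = ToyParser()
    doc = ToyDocument()
    if link_parser_to_doc:
        parser.set_document(doc)
    return parser.make_ref(1).resolve()
```

Calling resolve_info(True) returns the stored object, while resolve_info(False) raises an AttributeError about 'getobj' on None, which mirrors the last frame of the traceback in the question.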

score 0 · Accepted Answer

Try opening the file and passing it to the parser, like this:

with open(_file,'rb') as f:
    parser = PDFParser(f)
    # your normal code here

The way you are doing it now, I suspect you are passing the file name as a string.
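
A minimal sketch of that suggestion, assuming the caller may hold either a path or an already-open file object. The helper name open_pdf_stream is hypothetical. Note that the question says _file is already a file object, so this may not be the cause in that case.

```python
def open_pdf_stream(source):
    """Return a binary stream for a path or an already-open file object."""
    if hasattr(source, "read"):
        # Already file-like (for example a Django File object): use it as-is.
        return source
    # Otherwise treat it as a path and open it in binary mode.
    return open(source, "rb")
```

The returned stream is what would be passed to PDFParser. When the helper opened the file itself, the caller is responsible for closing it.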
