
In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:

File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF in Acrobat Pro, it turns out to be secured (or "read protected"). From this link, however, I read that a multitude of services can disable this read protection easily (for example, pdfunlock.com). Diving into the pdfminer source, I see that the error above is raised on these lines:

if check_extractable and not doc.is_extractable:
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

Since a multitude of services can disable this read protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as setting .is_extractable to True.
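
For background, here is a minimal sketch of where a flag like is_extractable usually comes from. It assumes the standard PDF security handler, in which the encryption dictionary carries a permissions integer (/P) and bit 5 (value 16) governs text extraction. As far as I can tell, pdfminer derives is_extractable from that bit, but treat that as an assumption. The names decode_permissions, PRINT, MODIFY, EXTRACT and ANNOTATE are made up for this illustration.

```python
# Permission bits of the /P entry (the PDF spec numbers bits from 1).
# These constant names are illustrative, not part of any library.
PRINT = 1 << 2     # bit 3: printing
MODIFY = 1 << 3    # bit 4: modifying the contents
EXTRACT = 1 << 4   # bit 5: copying or extracting text and graphics
ANNOTATE = 1 << 5  # bit 6: adding or changing annotations


def decode_permissions(p):
    """Map a /P permissions integer (usually negative) to booleans."""
    return {
        "print": bool(p & PRINT),
        "modify": bool(p & MODIFY),
        "extract": bool(p & EXTRACT),
        "annotate": bool(p & ANNOTATE),
    }


# A fully restricted document: none of the four bits is set.
print(decode_permissions(-3904)["extract"])  # False
# Printing and extraction allowed, modification and annotation not.
print(decode_permissions(-44)["extract"])    # True
```

The flag only records what the document's author asked viewers to honour, so setting it to True on the in-memory object would not change anything in the file itself.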

Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!

================================================

Below you will find the code with which I currently extract the text from PDFs that are not read-protected.

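# Python 2.7, matching the traceback above (cStringIO, pdfminer).
from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def getTextFromPDF(rawFile):
    # Set up the pdfminer pipeline; the converter writes extracted text to outfp.
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)

    # Wrap the raw PDF bytes in a file-like object for pdfminer.
    fileData = StringIO()
    fileData.write(rawFile)
    # check_extractable=True is what raises PDFTextExtractionNotAllowed on protected files.
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()

    result = outfp.getvalue()

    outfp.close()
    return result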

8 Answers


I had some issues trying to get qpdf to work in my program. I found a useful library, pikepdf, which is based on qpdf and automatically converts PDFs to be extractable.

The code to use it is pretty straightforward:

import pikepdf

pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')
answered 2018-11-14T14:19:16.670

Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

This issue was fixed in 2020 by disabling check_extractable by default. pdfminer.six now shows a warning instead of raising an error.

A similar question and answer can be found here.

answered 2021-09-12T12:04:39.817