python - Find which page of a PDF document contains a search string using Python

Question

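Which Python package can I use to find out which page a specific "search string" is on?

I looked into several Python PDF packages but could not work out which one to use. PyPDF does not seem to have this feature, and PDFMiner looks like overkill for such a simple task. Any advice?

More precisely: I have several PDF documents and want to extract the pages that lie between the string "Begin" and the string "End".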

score 18 · Accepted Answer

I finally figured out that pyPDF can help. I am posting this in case it helps someone else.

(1) A function to find the string

def fnPDF_FindText(xFile, xString):
    # xFile   : the PDF file in which to look
    # xString : the string (regular expression) to look for
    # Returns the 0-based index of the first matching page, or -1 if none matches.
    import pyPdf, re
    PageFound = -1
    pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
    for i in range(0, pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText() + "\n"
        # Drop non-ASCII characters and lowercase the page text.
        content1 = content.encode('ascii', 'ignore').lower()
        # The page text is lowercased, so match case-insensitively.
        ResSearch = re.search(xString, content1, re.IGNORECASE)
        if ResSearch is not None:
            PageFound = i
            break
    return PageFound
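
Note that the function above passes the search string to re.search, so it is interpreted as a regular expression. A minimal sketch of a literal, case-insensitive search is shown here. It assumes the page texts have already been extracted into a list of strings; find_first_page, page_texts and literal are hypothetical names for illustration, not part of pyPdf.

```python
import re

def find_first_page(page_texts, needle, literal=True):
    """Return the 0-based index of the first page containing needle, or -1.

    page_texts: list of strings, one per page (assumed to be extracted already).
    literal: when True, needle is escaped so that regex metacharacters such
    as '.' or '(' are matched as ordinary characters.
    """
    pattern = re.escape(needle) if literal else needle
    for index, text in enumerate(page_texts):
        if re.search(pattern, text, re.IGNORECASE):
            return index
    return -1

pages = ["Title page", "Section 1. Begin (draft)", "Appendix"]
print(find_first_page(pages, "begin (draft)"))  # prints 1
print(find_first_page(pages, "missing"))        # prints -1
```

Without the escaping, the parentheses in "begin (draft)" would be read as a regex group and the second page would not match.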

(2) A function to extract the pages of interest

def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copies pages xPageStart .. xPageEnd - 1 (0-based) into a new PDF file.
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
    for i in range(xPageStart, xPageEnd):
        output.addPage(pdfOne.getPage(i))
    # Write the output once, after all pages have been added.
    outputStream = file(xFileNameOutput, "wb")
    output.write(outputStream)
    outputStream.close()
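
For the "Begin" / "End" case from the question, the two page numbers still have to be combined into a range. The sketch below shows one way to do that, again assuming the page texts are available as a list of strings. first_page_with and page_range_between are hypothetical helpers. Because range(xPageStart, xPageEnd) excludes xPageEnd, the sketch returns an exclusive stop so that the page containing "End" is included.

```python
import re

def first_page_with(page_texts, needle, start=0):
    """Index of the first page at or after start containing needle, else -1."""
    for index in range(start, len(page_texts)):
        if re.search(re.escape(needle), page_texts[index], re.IGNORECASE):
            return index
    return -1

def page_range_between(page_texts, begin="Begin", end="End"):
    """Return (start, stop) covering the begin page through the end page.

    stop is exclusive, so the pair can be used as xPageStart and xPageEnd.
    Returns None if either marker is missing. Matching is a plain
    case-insensitive substring test, so "End" also matches "Appendix".
    """
    start = first_page_with(page_texts, begin)
    if start == -1:
        return None
    last = first_page_with(page_texts, end, start)
    if last == -1:
        return None
    return start, last + 1

pages = ["cover", "Begin of report", "body", "The End", "index"]
print(page_range_between(pages))  # prints (1, 4)
```

The resulting pair could then be passed as the last two arguments of the extraction function.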

I hope this helps someone else.

score 2 · Accepted Answer

In addition to what @user1043144 mentioned:

To use this with Python 3.x:

Use PyPDF2

import PyPDF2

Use open instead of file

PdfFileReader(open(xFile, 'rb'))
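
Putting those two changes together, a Python 3 version of the search function might look like the sketch below. It assumes an older PyPDF2 release that still provides PdfFileReader, getNumPages, getPage and extractText, the names used in this thread. Later PyPDF2 and pypdf releases rename these (PdfReader, pages, extract_text), so adjust accordingly. find_text_page and find_text_in_file are hypothetical names.

```python
import re

def find_text_page(reader, pattern):
    """Return the 0-based index of the first page matching pattern, or -1.

    reader is assumed to expose the PdfFileReader-style methods
    getNumPages() and getPage(i), with pages offering extractText().
    """
    for i in range(reader.getNumPages()):
        # extractText() returns str in Python 3, so no encode() step is needed.
        text = reader.getPage(i).extractText() or ""
        if re.search(pattern, text, re.IGNORECASE):
            return i
    return -1

def find_text_in_file(path, pattern):
    """Open a PDF with PyPDF2 and search it page by page."""
    # Imported here so the search logic above does not depend on PyPDF2.
    from PyPDF2 import PdfFileReader
    with open(path, "rb") as handle:  # open() replaces Python 2's file()
        return find_text_page(PdfFileReader(handle), pattern)
```

A call such as find_text_in_file("report.pdf", "Begin") would then return the page index, where report.pdf is only a placeholder file name.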
