python - Whitespace gone from PDF extraction, and strange word interpretation

Question

I tried to extract text data from this PDF file using the following snippet.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    content = ""
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

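However, the output I get has no whitespace between most of the words. This makes it difficult to run natural language processing on the text (which is my ultimate goal here).

Also, the "fi" in the word "finger" is consistently interpreted as something else. Since this paper is about spontaneous finger movements, that is rather a problem...

Does anyone know why this is happening? I don't even know where to start!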

score 18 · Accepted Answer

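Instead of PyPDF2, use the pdfminer library, which offers the same functionality, as shown below. I took the code from this and edited it as needed; it produces text with spaces between the words. I am working with Anaconda and Python 3.6. To install PdfMiner for Python 3.6, you can use this link.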

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

class PdfConverter:

    def __init__(self, file_path):
        self.file_path = file_path

    # Convert the PDF file to a string that has spaces between words
    def convert_pdf_to_txt(self):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'  # 'utf16', 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0  # 0 means no page limit
        caching = True
        pagenos = set()  # empty set means all pages
        with open(self.file_path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
        device.close()
        text = retstr.getvalue()  # renamed from `str` to avoid shadowing the builtin
        retstr.close()
        return text

    # Convert the PDF file to text and save it as text_pdf.txt
    def save_convert_pdf_to_txt(self):
        content = self.convert_pdf_to_txt()
        with open('text_pdf.txt', 'wb') as txt_pdf:
            txt_pdf.write(content.encode('utf-8'))


if __name__ == '__main__':
    pdfConverter = PdfConverter(file_path='sample.pdf')
    print(pdfConverter.convert_pdf_to_txt())

score 7 · Accepted Answer

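A PDF file contains no printable space characters; it simply positions the words where they need to go. To work out where the spaces belong, you have to do extra work, probably by assuming that multi-character runs are words and putting spaces between them.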

If you can select the text in a PDF reader and the spaces appear properly, then you at least know there is enough information to reconstruct the text.
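As a rough illustration of the idea of putting spaces between positioned runs, here is a minimal sketch. It assumes you already have each run on a line as a hypothetical `(x_start, x_end, text)` tuple; real extraction libraries expose positions in their own formats, and `join_runs` and `gap_threshold` are names invented for this example. The threshold would need tuning per document.

```python
from typing import List, Optional, Tuple

# Hypothetical input format (an assumption for this sketch): one tuple per
# text run on a single line, given as (x_start, x_end, text).
Run = Tuple[float, float, str]


def join_runs(runs: List[Run], gap_threshold: float = 1.0) -> str:
    """Join positioned runs, inserting a space wherever the horizontal gap
    between one run's end and the next run's start exceeds gap_threshold."""
    ordered = sorted(runs, key=lambda run: run[0])  # left to right
    pieces: List[str] = []
    prev_end: Optional[float] = None
    for x_start, x_end, text in ordered:
        if prev_end is not None and x_start - prev_end > gap_threshold:
            pieces.append(" ")
        pieces.append(text)
        prev_end = x_end
    return "".join(pieces)


if __name__ == "__main__":
    line = [(0.0, 30.0, "voluntary"), (33.0, 55.0, "finger"), (58.0, 90.0, "movements")]
    print(join_runs(line))
```

Runs that sit closer together than the threshold are treated as pieces of the same word, so a word split into several runs is glued back together rather than broken by a space.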

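The "fi" is a typographic ligature and is stored as a single character. You may find the same thing happening with "fl", "ffi", and "ffl". You can use string replacement to substitute the plain letters "fi" for the ligature character.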
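A minimal sketch of that string replacement, assuming the extracted text contains the Unicode ligature code points U+FB00 through U+FB04 (some PDFs use custom font encodings instead, in which case the characters to replace will differ). The name `replace_ligatures` is made up for this example.

```python
# Map each Latin ligature code point to its plain-letter spelling.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Return text with the ligature characters above expanded to letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(replace_ligatures("voluntary \ufb01nger movements"))
```

An explicit table keeps the change limited to these five characters; `unicodedata.normalize("NFKC", text)` would also expand them, but it rewrites many other compatibility characters as well.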

score 0 · Accepted Answer

I solved this problem using R:

library(pdftools)
pdf_file <- "xxx/untitled.pdf"
text <- pdf_text(pdf_file)
cat(text[1])

score 0 · Accepted Answer

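PDFBox is a very good tool for extracting text from PDF files with Java. Text extraction is its strength; if you want to modify, annotate, or display PDF files, another tool may be a better fit. It has code to identify the spaces in the file.

It also has code to handle ligatures, but for that to work you need a particular internationalization library (ICU4J) on the classpath.

You can call the PDFBox text extractor from Python as a command-line program, without writing any Java code.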
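A minimal sketch of calling it from Python with `subprocess`. It assumes Java is on the PATH, that you have downloaded the standalone PDFBox app jar (the jar path here is a placeholder), and that the jar uses the PDFBox 2.x command form `ExtractText <input> <output>`; newer releases use a different subcommand syntax, so check the documentation for your version. It also assumes the output file is written as UTF-8. The function names are invented for this example.

```python
import subprocess
from typing import List


def build_extract_command(jar_path: str, pdf_path: str, txt_path: str) -> List[str]:
    """Build the command line for the PDFBox 2.x ExtractText tool (assumed syntax)."""
    return ["java", "-jar", jar_path, "ExtractText", pdf_path, txt_path]


def extract_text_with_pdfbox(jar_path: str, pdf_path: str, txt_path: str) -> str:
    """Run PDFBox as an external program and return the extracted text."""
    # check=True raises CalledProcessError if Java or PDFBox reports a failure.
    subprocess.run(build_extract_command(jar_path, pdf_path, txt_path), check=True)
    # Assumption: the tool writes the text file as UTF-8.
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()
```

Usage would look like `extract_text_with_pdfbox("pdfbox-app.jar", "sample.pdf", "sample.txt")`, with the jar name replaced by the file you actually downloaded.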
