python - PyPDF2 insists on removing all spaces

Translated from: https://stackoverflow.com/questions/36914276 2016-04-28T12:11:02.573

Viewed 8474 times

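I have read many other Stack Overflow answers and, although this has been asked before, I have not yet found a satisfactory answer. When I try to read a PDF document with PyPDF2, all the words in a sentence are merged into one continuous string. Has anyone made progress in figuring out how to avoid this? Here is the code: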

    import struct

    import pandas as pd
    import PyPDF2
    from nltk import word_tokenize

    # Open the PDF in binary mode
    pdfFileObj = open("notes.pdf", 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # Reading the page count works fine
    print(type(pdfReader.numPages))

    # Read in the first page
    pageObj = pdfReader.getPage(0)

    # The extracted text comes out with no spaces between words
    print(pageObj.extractText())

Here is a sample of the output:

2)Explanationofthedifferencebetweenprobabilityandstatistics.Theroleofprobability
instatisticaldecisionmaking.ExamplesoftheuseofProbabilityinStatistics.
3)Datasummarization(graphicalandnumerical)

4)Probabilityandrandomvariables
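
To make the symptom concrete, here is a minimal sketch that measures how many characters in a piece of extracted text are spaces. The helper name `space_ratio` and the `normal` comparison string are my own illustrations, not part of PyPDF2; the `sample` string is the first line of the output above.

```python
def space_ratio(text: str) -> float:
    """Return the fraction of characters in text that are spaces (newlines ignored)."""
    stripped = text.replace("\n", "")
    if not stripped:
        return 0.0
    return stripped.count(" ") / len(stripped)


# First line of the extracted output shown above: no spaces at all
sample = "2)Explanationofthedifferencebetweenprobabilityandstatistics.Theroleofprobability"

# Hypothetical version of the same text as it presumably reads in the PDF
normal = "2) Explanation of the difference between probability and statistics."

print(space_ratio(sample))
print(space_ratio(normal))
```

A ratio of zero for a whole page of English prose suggests the word gaps were lost during extraction rather than being absent from the document itself.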
