Python - How to convert many individual PDFs to text?

Question

Question: How can I use the Python package "slate" to read many PDFs that are in the same path?

I have a folder containing more than 600 PDFs.

I know how to convert a single PDF to text with the slate package, using the following code:

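import os
import re
import slate

# Names of all PDF files in the folder `path`
migFiles = [filename for filename in os.listdir(path)
            if re.search(r'\.pdf$', filename) is not None]
with open(os.path.join(path, migFiles[0]), 'rb') as f:  # PDFs are binary
    doc = slate.PDF(f)

len(doc)  # number of pages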

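However, this limits me to one PDF at a time, the one specified by migFiles[0] (index 0 is the first PDF in the path).

How can I read many PDFs into text at once and keep them as separate strings or txt files? Should I use a different package? How do I write a for loop that reads every PDF in the path?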

score 0 · Accepted Answer

Try this version:

import glob
import os

import slate

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:  # PDFs are binary files
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            # slate.PDF gives a list of page strings; join them before writing
            txt.write("\n".join(slate.PDF(pdf)))

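This creates, for each PDF, a text file with the same name in the same directory, containing the converted content.

Or, if you want to keep the content in memory, try this version. Note, however, that if the converted content is large, you may run out of available memory.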
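With 600+ files, one unreadable PDF can abort the whole run. Here is a minimal sketch of the same loop with per-file error handling. The names `convert_folder` and `extract_pages` are hypothetical: `extract_pages` is any callable that takes a binary file object and returns a sequence of page strings, and `slate.PDF` is assumed to fit that shape.

```python
import glob
import os


def convert_folder(folder, extract_pages, separator="\n"):
    """Write one .txt file next to every PDF in `folder`.

    `extract_pages` is a hypothetical callable: it receives a binary file
    object and returns a sequence of page strings.
    Returns two lists of PDF paths: (converted, failed).
    """
    converted, failed = [], []
    for pdf_path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        txt_path = os.path.splitext(pdf_path)[0] + ".txt"
        try:
            with open(pdf_path, "rb") as pdf:
                text = separator.join(extract_pages(pdf))
        except Exception:
            # One unreadable PDF should not stop the whole batch.
            failed.append(pdf_path)
            continue
        with open(txt_path, "w", encoding="utf-8") as txt:
            txt.write(text)
        converted.append(pdf_path)
    return converted, failed
```

Under that assumption you would call it as `convert_folder(path, slate.PDF)` and then inspect the returned `failed` list to see which files need a second look.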

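import glob
import os

import slate

pdf_as_text = {}  # maps file name (without extension) to its list of pages

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        # Key on the bare file name, e.g. "somefile" for ".../somefile.pdf"
        file_without_extension = os.path.splitext(os.path.basename(pdf_file))[0]
        pdf_as_text[file_without_extension] = slate.PDF(pdf)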

You can now use pdf_as_text['somefile'] to get the text content of somefile.pdf.
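If memory is a concern, one alternative to the dictionary is a generator that hands back one document at a time, so only the current PDF's text is held in memory. This is a sketch, not part of the original answer: `iter_pdf_texts` and `extract_pages` are hypothetical names, and `extract_pages` stands for any callable (assumed to include `slate.PDF`) that takes a binary file object and returns page strings.

```python
import glob
import os


def iter_pdf_texts(folder, extract_pages, separator="\n"):
    """Yield (name, text) for each PDF in `folder`, one file at a time.

    `name` is the file name without directory or extension.
    """
    for pdf_path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        name = os.path.splitext(os.path.basename(pdf_path))[0]
        with open(pdf_path, "rb") as pdf:
            # Only this file's text exists in memory when it is yielded.
            yield name, separator.join(extract_pages(pdf))
```

A loop such as `for name, text in iter_pdf_texts(path, slate.PDF): ...` can then process or save each text before the next file is read.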

score 0 · Accepted Answer

What you can do is use a simple loop:

docs = []
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        docs.append(slate.PDF(f))
        # or, instead of keeping every file in memory, process it right here

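docs[i] then holds the text of the (i+1)-th PDF file, and you can work with the files whenever you like. Alternatively, you can process each file inside the for loop.

If you want to convert the pages to plain text, you can do the following: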

docs = []
# String used to separate the contents of consecutive pages; use
# separator = '\n' to put each page's contents on a new line
separator = ' '
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        docs.append(separator.join(slate.PDF(f)))  # join the pages into plain text

Or, to write each PDF's text to its own .txt file:

separator = ' '
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        # if filename = "abc.pdf", filename[:-4] = "abc"
        with open(os.path.join(path, filename[:-4] + ".txt"), 'w') as txtfile:
            txtfile.write(separator.join(slate.PDF(f)))
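To see what the separator does on its own, here is a small illustration that needs no PDF at all. The `pages` list is made-up sample data standing in for the page strings of one document.

```python
# Hypothetical page strings, standing in for the pages of one PDF.
pages = ["First page text.", "Second page text."]

spaced = " ".join(pages)   # pages run together on a single line
lined = "\n".join(pages)   # each page starts on its own line

print(spaced)  # First page text. Second page text.
print(lined)
```

A space keeps everything on one line, which suits search or tokenizing; a newline keeps page boundaries visible in the saved .txt file.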
