pdfbox - How to extract paragraphs from a PDF file and save their positions?

Question

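I am using the PDFBox library to extract the content of a PDF file. The content needs to be processed paragraph by paragraph, and for each paragraph I need its position for follow-up processing. With the following code I can extract the entire content of the input PDF: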

PDDocument doc = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
String txt = stripper.getText(doc);
doc.close();

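I have two problems:

I don't know how to extract the content paragraph by paragraph.
I don't know how to save the position of each paragraph for later processing (e.g. highlighting).

Thanks.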

score 0 · Accepted Answer

I use Poppler's command-line tool pdftohtml to extract rich text, but if you need clean paragraph boundaries, the PDF has to be a tagged PDF. If you need the (x, y) coordinates of each paragraph, you will have to dig deeper into Poppler. The Apache PDFBox Java library can also be used. Another option: if you add an annotation at the beginning of each paragraph, you can export the annotations from the PDF as XML, and that XML contains the (x, y) coordinates of each annotation. As far as I can tell, Adobe applies some kind of encryption to the PDF that makes this data hard to find, so it may not be easy (legal hassles aside) to pull it out without Adobe tools.
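For an untagged PDF, one rough way to get both paragraphs and positions is to collect a bounding box per text line and then merge lines that sit close together vertically. The sketch below shows only that grouping step, in plain Java. The names ParagraphGrouper, Box and gapFactor are made up for illustration and are not part of PDFBox. It assumes the line boxes were gathered elsewhere, for example in a PDFTextStripper subclass that overrides writeString(String, List<TextPosition>) and reads TextPosition.getXDirAdj(), getYDirAdj(), getWidthDirAdj() and getHeightDir() (PDFBox 2.x), and that y grows downward from the top of the page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups text lines into paragraphs by vertical spacing. */
final class ParagraphGrouper {

    /** A piece of text with its bounding box (top-left origin, y grows downward). */
    static final class Box {
        final String text;
        final double x, y, width, height;

        Box(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Merges lines (given in reading order) into paragraphs. A new paragraph
     * starts when the gap to the previous line exceeds gapFactor times the
     * previous line's height, or when the next line sits above the previous
     * one (assumed to mean a new column or page).
     */
    static List<Box> group(List<Box> lines, double gapFactor) {
        List<Box> paragraphs = new ArrayList<>();
        Box current = null;   // paragraph being built
        Box previous = null;  // last line seen
        for (Box line : lines) {
            if (current == null) {
                current = line;
            } else {
                double gap = line.y - (previous.y + previous.height);
                boolean jumpedBack = line.y < previous.y;
                if (jumpedBack || gap > gapFactor * previous.height) {
                    paragraphs.add(current);
                    current = line;
                } else {
                    current = merge(current, line);
                }
            }
            previous = line;
        }
        if (current != null) {
            paragraphs.add(current);
        }
        return paragraphs;
    }

    /** Returns the union of two boxes with their texts joined by a space. */
    private static Box merge(Box a, Box b) {
        double left = Math.min(a.x, b.x);
        double top = Math.min(a.y, b.y);
        double right = Math.max(a.x + a.width, b.x + b.width);
        double bottom = Math.max(a.y + a.height, b.y + b.height);
        return new Box(a.text + " " + b.text, left, top, right - left, bottom - top);
    }
}
```

Each returned Box then carries the paragraph text together with a rectangle that could be used for highlighting. The spacing threshold is a heuristic and will need tuning per document; indentation-based paragraph breaks are not detected by this sketch.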
