pdfbox - PDFBox PDFTextStripperByArea coordinate shift

Question

I have a problem with coordinates. The PDFTextStripperByArea region seems to be pushed up too far.

Consider the following example snippet:

...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();

// define region for extraction -- the coordinates and dimensions are x, y, width, height
Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);

// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right 
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height );
contentStream.close();

// extract the text from the defined region
stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region"); 
... 
document.save(...); ...

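The cyan rectangle covers the intended area nicely. The stripper, on the other hand, misses a few lines at the bottom of the rectangle and includes a few lines above it. The region looks as if it were shifted "up" along the y coordinate. What is going on?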

score 1 · Accepted Answer

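Text is usually contained within a placement rectangle. Sometimes the text is not at the expected position inside that rectangle, and PDFBox uses the rectangle to guess where the text is. As a result, text whose box starts outside the capture area and flows into it may not be extracted.

Rough sketch: the text box starts outside the capture area, but the text flows inside it. It may not be captured.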

____________
|Page      |
|   _______|
|   |Area ||
|   |     ||
| ..|.....||
| ⁞ |Text⁞||
| ⁞ |____⁞||
| ⁞......⁞ |
|__________|
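
The situation in the sketch can be reproduced with plain java.awt.geom rectangles. The following is a minimal illustration, not PDFBox code: it assumes, purely for the sake of the example, that an extractor decides membership by testing a single anchor point of each text box against the region. Whether PDFBox does exactly this is not confirmed by the answer. The class name, the method names and all coordinates are hypothetical.

```java
import java.awt.geom.Rectangle2D;

// Illustration only: compares a point-based membership test with an overlap test.
class RegionCaptureSketch {

    // Point test: is the text box's anchor (here its x/y corner) inside the area?
    static boolean anchorInside(Rectangle2D area, Rectangle2D textBox) {
        return area.contains(textBox.getX(), textBox.getY());
    }

    // Overlap test: does any part of the text box intersect the area?
    static boolean overlaps(Rectangle2D area, Rectangle2D textBox) {
        return area.intersects(textBox);
    }

    public static void main(String[] args) {
        // Hypothetical capture area: x 40..100, y 20..80
        Rectangle2D area = new Rectangle2D.Float(40, 20, 60, 60);
        // Hypothetical text box that starts left of the area and flows into it: x 20..90, y 50..90
        Rectangle2D textBox = new Rectangle2D.Float(20, 50, 70, 40);

        System.out.println("anchor inside: " + anchorInside(area, textBox));
        System.out.println("overlaps:      " + overlaps(area, textBox));
    }
}
```

With these numbers the box clearly overlaps the area, yet its anchor lies outside it, so a point-based test would skip the text even though it is visible inside the region.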
