java - PDFbox を使用してドキュメント内の単語の座標を決定する

Question

私は PDFbox を使用して PDF ドキュメント内の単語/文字列の座標を抽出していますが、これまでのところ個々の文字の位置を決定することに成功しています。これは、PDFbox doc からのこれまでのコードです。

package printtextlocations;

import java.io.*;
import org.apache.pdfbox.exceptions.InvalidPasswordException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

import java.io.IOException;
import java.util.List;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * @param text The text to be processed
     */
    @Override /* this is questionable, not sure if needed... */
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}

これにより、次のように、スペースを含む各文字の位置を含む一連の行が生成されます。

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

「P」は文字です。私は PDFbox で単語を検索する関数を見つけることができませんでした。また、スペースも含まれていても、これらの文字を単語に正確に連結して検索できるほど Java に精通していません。他の誰かが同様の状況にありましたか? もしそうなら、どのようにアプローチしましたか? 部分を単純化するために、単語の最初の文字の座標だけが本当に必要ですが、そのような出力に対して文字列をどのように一致させるかについては、私を超えています。

score 10 · Accepted Answer

PDFBox には、単語を自動的に抽出できる機能はありません。現在、データを抽出してブロックに収集する作業を行っています。これが私のプロセスです。

ドキュメントのすべての文字 (グリフと呼ばれる) を抽出し、それらをリストに格納します。
リストをループして、各グリフの座標を分析します。それらが重なっている場合 (現在のグリフの上部が前のグリフの上部と下部の間に含まれている場合、または現在のグリフの下部が前のグリフの上部と下部の間に含まれている場合)、同じ行に追加します。
この時点で、ドキュメントのさまざまな行を抽出しました (注意してください。ドキュメントが複数列の場合、「行」という表現は、垂直方向に重なるすべてのグリフ、つまり同じ垂直方向を持つすべての列のテキストを意味します)。座標）。
次に、現在のグリフの左座標を前のグリフの右座標と比較して、それらが同じ単語に属しているかどうかを判断できます (PDFTextStripper クラスは、試行錯誤に基づいて getSpacingTolerance() メソッドを提供します)。 , 「通常の」スペースの値. 左右の座標の差がこの値よりも小さい場合, 両方のグリフは同じ単語に属します.

私はこの方法を自分の仕事に適用しましたが、うまくいきました。

score 5 · Accepted Answer

元のアイデアに基づいて、PDFBox 2 のテキスト検索のバージョンを示します。コード自体はラフですが、シンプルです。すぐに始められるはずです。

import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.Set;
import lu.abac.pdfclient.data.PDFTextLocation;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PrintTextLocator extends PDFTextStripper {

    private final Set<PDFTextLocation> locations;

    public PrintTextLocator(PDDocument document, Set<PDFTextLocation> locations) throws IOException {
        super.setSortByPosition(true);
        this.document = document;
        this.locations = locations;
        this.output = new Writer() {
            @Override
            public void write(char[] cbuf, int off, int len) throws IOException {
            }
            @Override
            public void flush() throws IOException {
            }

            @Override
            public void close() throws IOException {
            }
        };
    }

    public Set<PDFTextLocation> doSearch() throws IOException {

        processPages(document.getDocumentCatalog().getPages());
        return locations;
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        super.writeString(text);

        String searchText = text.toLowerCase();
        for (PDFTextLocation textLoc:locations) {
            int start = searchText.indexOf(textLoc.getText().toLowerCase());
            if (start!=-1) {
                // found
                TextPosition pos = textPositions.get(start);
                textLoc.setFound(true);
                textLoc.setPage(getCurrentPageNo());
                textLoc.setX(pos.getXDirAdj());
                textLoc.setY(pos.getYDirAdj());
            }
        }

    }


}

score 1 · Accepted Answer

これを見てください、私はそれがあなたが必要としているものだと思います.

https://jackson-brain.com/using-pdfbox-to-locate-text-coordinates-within-a-pdf-in-Java/

コードは次のとおりです。

import java.io.File;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PrintTextLocations extends PDFTextStripper {

public static StringBuilder tWord = new StringBuilder();
public static String seek;
public static String[] seekA;
public static List wordList = new ArrayList();
public static boolean is1stChar = true;
public static boolean lineMatch;
public static int pageNo = 1;
public static double lastYVal;

public PrintTextLocations()
        throws IOException {
    super.setSortByPosition(true);
}

public static void main(String[] args)
        throws Exception {
    PDDocument document = null;
    seekA = args[1].split(",");
    seek = args[1];
    try {
        File input = new File(args[0]);
        document = PDDocument.load(input);
        if (document.isEncrypted()) {
            try {
                document.decrypt("");
            } catch (InvalidPasswordException e) {
                System.err.println("Error: Document is encrypted with a password.");
                System.exit(1);
            }
        }
        PrintTextLocations printer = new PrintTextLocations();
        List allPages = document.getDocumentCatalog().getAllPages();

        for (int i = 0; i < allPages.size(); i++) {
            PDPage page = (PDPage) allPages.get(i);
            PDStream contents = page.getContents();

            if (contents != null) {
                printer.processStream(page, page.findResources(), page.getContents().getStream());
            }
            pageNo += 1;
        }
    } finally {
        if (document != null) {
            System.out.println(wordList);
            document.close();
        }
    }
}

@Override
protected void processTextPosition(TextPosition text) {
    String tChar = text.getCharacter();
    System.out.println("String[" + text.getXDirAdj() + ","
            + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
            + text.getXScale() + " height=" + text.getHeightDir() + " space="
            + text.getWidthOfSpace() + " width="
            + text.getWidthDirAdj() + "]" + text.getCharacter());
    String REGEX = "[,.\\[\\](:;!?)/]";
    char c = tChar.charAt(0);
    lineMatch = matchCharLine(text);
    if ((!tChar.matches(REGEX)) && (!Character.isWhitespace(c))) {
        if ((!is1stChar) && (lineMatch == true)) {
            appendChar(tChar);
        } else if (is1stChar == true) {
            setWordCoord(text, tChar);
        }
    } else {
        endWord();
    }
}

protected void appendChar(String tChar) {
    tWord.append(tChar);
    is1stChar = false;
}

protected void setWordCoord(TextPosition text, String tChar) {
    tWord.append("(").append(pageNo).append(")[").append(roundVal(Float.valueOf(text.getXDirAdj()))).append(" : ").append(roundVal(Float.valueOf(text.getYDirAdj()))).append("] ").append(tChar);
    is1stChar = false;
}

protected void endWord() {
    String newWord = tWord.toString().replaceAll("[^\\x00-\\x7F]", "");
    String sWord = newWord.substring(newWord.lastIndexOf(' ') + 1);
    if (!"".equals(sWord)) {
        if (Arrays.asList(seekA).contains(sWord)) {
            wordList.add(newWord);
        } else if ("SHOWMETHEMONEY".equals(seek)) {
            wordList.add(newWord);
        }
    }
    tWord.delete(0, tWord.length());
    is1stChar = true;
}

protected boolean matchCharLine(TextPosition text) {
    Double yVal = roundVal(Float.valueOf(text.getYDirAdj()));
    if (yVal.doubleValue() == lastYVal) {
        return true;
    }
    lastYVal = yVal.doubleValue();
    endWord();
    return false;
}

protected Double roundVal(Float yVal) {
    DecimalFormat rounded = new DecimalFormat("0.0'0'");
    Double yValDub = new Double(rounded.format(yVal));
    return yValDub;
}
}

依存関係:

PDFBox、FontBox、Apache 共通ログインターフェイス。

コマンドラインに次のように入力して実行できます。

javac PrintTextLocations.java 
sudo java PrintTextLocations file.pdf WORD1,WORD2,....

出力は次のようになります。

[(1)[190.3 : 286.8] WORD1, (1)[283.3 : 286.8] WORD2, ...]

java - PDFbox を使用してドキュメント内の単語の座標を決定する

5 に答える 5

Related

Reference