java - PdfBox: extract text with the same font family from a PDF

Question

I need to extract blocks of text from a PDF. The text I am after shares one characteristic: it all uses the same font-family. Any ideas? Cheers

Edit: Let me ask this another way: how can I extract only the "bold" text from a PDF page?

score 0 · Accepted Answer

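import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {
// Extracts all text from the given PDF (PDFBox 1.x API); returns null on failure.
public String pdftoText(String fileName) {
    File f = new File(fileName);
    if (!f.isFile()) {
        System.out.println("File does not exist.");
        return null;
    }
    PDDocument pdDoc = null;
    try (InputStream in = new FileInputStream(f)) {
        PDFParser parser = new PDFParser(in);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        pdDoc = new PDDocument(cosDoc);
        return new PDFTextStripper().getText(pdDoc);
    } catch (IOException ex) {
        Logger.getLogger(PDFTextParser.class.getName()).log(Level.SEVERE, null, ex);
        return null;
    } finally {
        // Closing the PDDocument also closes the underlying COSDocument.
        if (pdDoc != null) {
            try { pdDoc.close(); } catch (IOException ignored) { }
        }
    }
}
}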

Before running: add pdfbox.jar to your project.
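
Note that the method above returns all of the text, whatever font it is set in. For the "bold only" variant from the edit, one possible approach is to pair each run of text with the name of the font it was drawn in and keep only the runs whose font name looks bold. The sketch below shows just that filtering step in plain Java. The names `BoldTextFilter`, `Run`, `isBold` and `extractBold` are hypothetical, and the "name contains bold" test is a heuristic that assumes the PDF uses conventionally named fonts such as `ABCDEF+Arial-BoldMT`. It will miss bold text whose font name does not say so.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: keeps only text runs whose font name looks bold. */
final class BoldTextFilter {

    /** One run of text plus the name of the font it was drawn with. */
    static final class Run {
        final String text;
        final String fontName;

        Run(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Heuristic (an assumption, not a PDF rule): bold fonts have "bold" in their name. */
    static boolean isBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Joins the text of all bold-looking runs, separated by single spaces. */
    static String extractBold(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            if (isBold(run.fontName)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(run.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Chapter 1", "ABCDEF+Arial-BoldMT"));
        runs.add(new Run("Plain body text.", "ABCDEF+ArialMT"));
        runs.add(new Run("Summary", "Helvetica-Bold"));
        System.out.println(extractBold(runs)); // prints "Chapter 1 Summary"
    }
}
```

In PDFBox itself, one place to obtain such text and font pairs is a subclass of `PDFTextStripper` that overrides `writeString(String, List<TextPosition>)`. Each `TextPosition` exposes its font through `getFont()`, and the font name is available via `getBaseFont()` in 1.x or `getName()` in 2.x. That wiring is not shown here. The same filter works for the original "same font family" question if `isBold` is swapped for a comparison against the family name you are looking for.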
