java - Getting specific parts of the data from a PDF

Question

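I need to get keyword-related data out of a PDF file. These are the keywords: title, scope of the PDF, the person who proposed the PDF, version, summary, status, regulator.

Is there a tool for getting data out of a PDF? Thanks in advance.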

score 2 · Accepted Answer

You can use PDFBox from Apache. To be honest, I have not used it myself, but I have read a lot about it on forums.

Other alternatives are iText and JPedal.

You can try those if you are interested, but I am confident PDFBox will meet your requirements.

Thanks

score 0 · Accepted Answer

Use PDFBox

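import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Written against the PDFBox 1.x API (PDFParser constructed from an InputStream).
public class PDFTextReader
{
    // Returns all text in the PDF, or null if the file cannot be opened or parsed.
    static String pdftoText(String fileName) {
        PDFParser parser;
        String parsedText = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(file));
        } catch (IOException e) {
            System.err.println("Unable to open PDF Parser. " + e.getMessage());
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            // pdfStripper.setParagraphStart(FIND_START_VALUE);
            // pdfStripper.setParagraphEnd(FIND_END_VALUE);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.err.println("An exception occurred in parsing the PDF Document. "
                    + e.getMessage());
        } finally {
            try {
                // Closing the PDDocument also closes the COSDocument it wraps.
                if (pdDoc != null)
                    pdDoc.close();
                else if (cosDoc != null)
                    cosDoc.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return parsedText;
    }

    public static void main(String[] args) {
        // Pass the path to the PDF as the first command-line argument.
        if (args.length != 1) {
            System.err.println("Usage: java PDFTextReader <file.pdf>");
            return;
        }
        System.out.println(pdftoText(args[0]));
    }
}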

I tried this and it extracted the part I needed. It might help you.
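
Once the whole text is available as a String, one way to cut out a single part is plain string slicing between two markers. The following is only a minimal sketch: the class name `SectionExtractor`, the method `between`, and the marker strings are hypothetical and not part of PDFBox, and it assumes the markers appear literally in the extracted text.

```java
public class SectionExtractor {

    /**
     * Returns the trimmed text between the first occurrence of startMarker
     * and the next occurrence of endMarker. If endMarker is not found, the
     * rest of the text is returned. Returns null if startMarker is missing.
     */
    static String between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = text.indexOf(endMarker, start);
        if (end < 0) {
            end = text.length();
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by a text extractor.
        String text = "Summary\nThis policy covers X.\nStatus\nDraft";
        System.out.println(between(text, "Summary", "Status"));
    }
}
```

The real markers depend on how the text comes out of your particular PDFs, so print the extracted text first and check it.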

score 0 · Accepted Answer

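Consider Apache PDFBox.

It extracts the text from the PDF, which you can then parse to get the information you need. It is free.

There is also another tool, iText, but it is released under the AGPL, so if you are working on a closed-source commercial project you will need to purchase an iText license.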
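
The "parse it" step could look something like the sketch below. It assumes each field appears in the extracted text on its own line as `Keyword: value`; that layout, the class name `KeywordFieldExtractor`, and the English labels are all assumptions for illustration, so adjust them to what your PDFs actually contain. It uses only the Java standard library (Java 9+ for `List.of`).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordFieldExtractor {

    // Hypothetical labels; replace with the labels that appear in your PDFs.
    static final List<String> KEYWORDS = List.of(
            "Title", "Scope", "Proposer", "Version", "Summary", "Status", "Regulator");

    /** Returns keyword -> value for the first "Keyword: value" line of each keyword. */
    static Map<String, String> extractFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            // (?im): case-insensitive; ^ and $ match at line boundaries.
            Pattern p = Pattern.compile(
                    "(?im)^[ \\t]*" + Pattern.quote(keyword) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$");
            Matcher m = p.matcher(text);
            if (m.find()) {
                fields.put(keyword, m.group(1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by a PDF text extractor.
        String sample = "Title: Data Retention Policy\nVersion: 1.2\nStatus: Draft\n";
        extractFields(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Keywords that are not found are simply left out of the map, so the caller can tell which fields were missing.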
