pdf - Split a PDF into separate files based on text

Question

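I have one large PDF document that consists of multiple records. Each record usually takes up one page, but some take two. Every record begins with a predefined text that is always the same.

My goal is to split this PDF into separate PDFs. The split should always happen just before the "header text" is found.

Note: I am looking for a tool or library that uses Java or Python. It must be free and available on Win 7.

Any ideas? As far as I know, imagemagick will not work for this. Can iText do this? I have never used it and it looks fairly complex, so I would need some hints.

EDIT:

The marked answer led me to the solution. For completeness, here is my exact implementation: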

public void splitByRegex(String filePath, String regex,
        String destinationDirectory, boolean removeBlankPages) throws IOException,
        DocumentException {

    logger.entry(filePath, regex, destinationDirectory);
    destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
    PdfReader reader = null;
    Document document = null;
    PdfCopy copy = null;
    Pattern pattern = Pattern.compile(regex);        

    try {
        reader = new PdfReader(filePath);
        final String RESULT = destinationDirectory + "/record%d.pdf";
        // loop over all the pages in the original PDF
        int n = reader.getNumberOfPages();
        // Page numbers are 1-based, so the last page is n itself
        for (int i = 1; i <= n; i++) {

            final String text = PdfTextExtractor.getTextFromPage(reader, i);
            if (pattern.matcher(text).find()) {
                if (document != null && document.isOpen()) {
                    logger.debug("Match found. Closing previous Document..");
                    document.close();
                }
                String fileName = String.format(RESULT, i);
                logger.debug("Match found. Creating new Document " + fileName + "...");
                document = new Document();
                copy = new PdfCopy(document,
                        new FileOutputStream(fileName));
                document.open();
                logger.debug("Adding page to Document...");
                copy.addPage(copy.getImportedPage(reader, i));

            } else if (document != null && document.isOpen()) {
                logger.debug("Found Open Document. Adding additional page to Document...");
                // Skip the page only when blank-page removal is enabled and the page is blank
                if (!removeBlankPages || !isBlankPage(reader, i)) {
                    copy.addPage(copy.getImportedPage(reader, i));
                }
            }
        }
        logger.exit();
    } finally {
        if (document != null && document.isOpen()) {
            document.close();
        }
        if (reader != null) {
            reader.close();
        }
    }
}

private boolean isBlankPage(PdfReader reader, int pageNumber)
        throws IOException {

    // see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
    PdfDictionary pageDict = reader.getPageN(pageNumber);
    // We need to examine the resource dictionary for /Font or
    // /XObject keys.  If either are present, they're almost
    // certainly actually used on the page -> not blank.
    PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
    if (resDict != null) {
        return resDict.get(PdfName.FONT) == null
                && resDict.get(PdfName.XOBJECT) == null;
    } else {
        return true;
    }
}

score 5 · Accepted Answer

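You can build a tool tailored to your requirements with iText.

Whenever you are looking for code samples for (current versions of) the iText library, consult iText in Action, 2nd Edition. Its code samples are online and searchable by keyword here.

In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.

Burst.java shows how to split one PDF into multiple smaller PDFs. The central code: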

PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";

// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
    // step 1
    document = new Document();
    // step 2
    copy = new PdfCopy(document,
            new FileOutputStream(String.format(RESULT, ++i)));
    // step 3
    document.open();
    // step 4
    copy.addPage(copy.getImportedPage(reader, i));
    // step 5
    document.close();
}
reader.close();

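This sample splits a PDF into single-page PDFs. In your case you need to split by a different criterion. That only means that inside the loop you sometimes have to add more than one imported page (and therefore decouple the loop index from the number of the page being imported).

To recognize the page on which a new dataset starts, take ExtractPageContentSorted2.java as a guide. This sample shows how to parse the text content of a page into a string. The central code: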
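One way to decouple the loop index from the page number is to work out the page range of each record first and copy the pages afterwards. The following is only a minimal sketch of that grouping step. `RecordRanges` and `group` are hypothetical names, not part of iText, and the sketch assumes you have already determined for every page whether it contains the header text. Pages before the first header are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of iText.
final class RecordRanges {

    private RecordRanges() {
    }

    /**
     * Groups pages into records.
     *
     * @param startsRecord startsRecord[k] is true when page k + 1 contains the header text
     * @return one int[] {firstPage, lastPage} per record, using 1-based page numbers
     */
    static List<int[]> group(boolean[] startsRecord) {
        List<int[]> ranges = new ArrayList<>();
        int first = -1; // first page of the record being collected, -1 if none yet
        for (int page = 1; page <= startsRecord.length; page++) {
            if (startsRecord[page - 1]) {
                if (first != -1) {
                    // a new header closes the previous record on the page before it
                    ranges.add(new int[] {first, page - 1});
                }
                first = page;
            }
        }
        if (first != -1) {
            // the last record runs to the end of the document
            ranges.add(new int[] {first, startsRecord.length});
        }
        return ranges;
    }
}
```

For each returned range you would then open one Document and PdfCopy as in Burst.java and call copy.addPage(copy.getImportedPage(reader, p)) for every page p in the range before closing the document.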

PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    System.out.println("\nPage " + i);
    System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();

You simply search for the record start text: if the text of a page contains it, a new record starts there.
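The search itself can be sketched with plain java.util.regex. This sketch assumes that the extracted page text may differ from the visible header in whitespace (for example a line break where the page shows a space), so it matches the header words literally but allows any run of whitespace between them. `HeaderPatterns` and `forHeader` are hypothetical names chosen for illustration, not part of iText.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of iText.
final class HeaderPatterns {

    private HeaderPatterns() {
    }

    /**
     * Builds a pattern that finds the header text in extracted page text,
     * treating each word literally and tolerating any whitespace between words.
     */
    static Pattern forHeader(String headerText) {
        String trimmed = headerText.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("headerText must not be blank");
        }
        StringBuilder regex = new StringBuilder();
        for (String word : trimmed.split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+");
            }
            // Pattern.quote keeps characters such as ( ) . from acting as regex syntax
            regex.append(Pattern.quote(word));
        }
        return Pattern.compile(regex.toString());
    }
}
```

A page would then count as a record start when forHeader(header).matcher(pageText).find() returns true, where pageText is the string returned by PdfTextExtractor.getTextFromPage(reader, i).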

score 1 · Answer

Apache PDFBox has a PDFSplit utility that you can run from the command line.

score 0 · Answer

If you are not a coder, PDF Content Split is probably the easiest way to do this without reinventing the wheel, and it has an easy-to-use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html

Hope that helps.
