java - How can I extract images and their metadata from a PDF?

Question

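Is it possible to use Java to extract images from a PDF file and export them to a specific folder without losing their original creation and modification dates? I tried to achieve this with iText and PDFBox, but without success. Any ideas or examples are welcome.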

score 6 · Accepted Answer

Images do not contain metadata; they are stored as raw data that has to be assembled into an image. I wrote two blog posts explaining how image data is stored in a PDF file: https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-how-are-images-stored/ and https://blog.idrsolutions.com/2010/09/understanding-the-pdf-file-format-images/
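To make "raw data that has to be assembled into an image" concrete, here is a minimal sketch of the simplest case. It assumes the stream has already been decoded to 8-bit RGB samples laid out row by row; real PDFs also use other color spaces, bit depths, and filters, which this does not handle. The class and method names are made up for illustration and are not part of any PDF library.

```java
import java.awt.image.BufferedImage;

final class RawImageAssembler {
    /**
     * Builds an image from decoded 8-bit RGB samples stored row by row
     * as R, G, B, R, G, B, ... (assumption: no other color space or bit depth).
     */
    static BufferedImage fromRgb(byte[] samples, int width, int height) {
        if (samples.length < width * height * 3) {
            throw new IllegalArgumentException(
                    "Not enough sample data for " + width + "x" + height);
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Bytes are signed in Java, so mask each sample to 0..255
                int r = samples[i++] & 0xFF;
                int g = samples[i++] & 0xFF;
                int b = samples[i++] & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        // A 2x1 image: one red pixel followed by one blue pixel
        byte[] samples = { (byte) 0xFF, 0, 0, 0, 0, (byte) 0xFF };
        BufferedImage image = fromRgb(samples, 2, 1);
        // prints "FF0000 0000FF"
        System.out.printf("%06X %06X%n",
                image.getRGB(0, 0) & 0xFFFFFF, image.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

Nothing in the sample array carries a date or an author, which is the point being made above: the pixel data alone has no room for it.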

score 4 · Accepted Answer

I disagree with the others and have a POC for your question: you can extract the XMP metadata of the images with PDFBox in the following way.

// Uses the PDFBox 1.x API (getAllPages, PDXObjectImage)
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public void getXMPInformation() {
    // Open the PDF document; give up if it cannot be loaded
    PDDocument document;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }
    try {
        // Get all pages and loop through them
        List<?> pages = document.getDocumentCatalog().getAllPages();
        for (Object p : pages) {
            PDPage page = (PDPage) p;
            PDResources resources = page.getResources();
            if (resources == null) {
                continue; // no resources on this page, hence no images
            }
            // Get all images on the page
            Map<?, ?> images;
            try {
                images = resources.getImages();
            } catch (IOException e) {
                e.printStackTrace();
                continue;
            }
            if (images == null) {
                continue;
            }
            // Check all images for metadata
            for (Map.Entry<?, ?> entry : images.entrySet()) {
                String key = (String) entry.getKey();
                PDXObjectImage image = (PDXObjectImage) entry.getValue();
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: analyzing for metadata");
                if (metadata == null) {
                    System.out.println("No metadata found for this image.");
                } else {
                    try {
                        InputStream xmlInputStream = metadata.createInputStream();
                        System.out.println("--------------------------------------------------------------------------------");
                        System.out.println(convertStreamToString(xmlInputStream));
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the image; write2file appends the suffix itself
                String name = getUniqueFileName(key, image.getSuffix());
                System.out.println("Writing image: " + name);
                try {
                    image.write2file(name);
                } catch (IOException e) {
                    e.printStackTrace(); // skip this image and keep going
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    } finally {
        // Always release the document, even if something above failed
        try {
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

And the "helper methods":

// Needs java.io.BufferedReader, java.io.File, java.io.IOException,
// java.io.InputStream and java.io.InputStreamReader

public String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to a String we use BufferedReader.readLine().
     * We iterate until the BufferedReader returns null, which means there is
     * no more data to read. Each line is appended to a StringBuilder, which
     * is returned as a String.
     */
    if (is == null) {
        return "";
    }
    StringBuilder sb = new StringBuilder();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append("\n");
        }
    } finally {
        is.close();
    }
    return sb.toString();
}

// Counts the extracted images; a field of the enclosing class
private int imageCounter = 0;

private String getUniqueFileName(String prefix, String suffix) {
    /*
     * Try prefix-0, prefix-1, ... until a name is found whose file does not
     * exist yet. The counter advances on every attempt, so the loop cannot
     * spin forever on an existing file. The name is returned without suffix.
     */
    String uniqueName;
    File f;
    do {
        uniqueName = prefix + "-" + imageCounter;
        imageCounter++;
        f = new File(uniqueName + "." + suffix);
    } while (f.exists());
    return uniqueName;
}

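Note: this is a quick and dirty proof of concept, not well-styled code.

The images must already contain XMP metadata when they are placed in InDesign, before the PDF document is built. XMP metadata can be set with Photoshop, for example. Be aware that not all IPTC/Exif/... information is converted into XMP metadata; only a small number of fields are converted.

I use this method on JPG and PNG images placed in PDFs built with InDesign. It works well, and I can get all the image information from the finished PDFs (picture coating) after the production steps.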
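Once the XMP packet is available as a string (for instance the text printed by convertStreamToString above), the dates still have to be pulled out of the XML. The following is a minimal sketch using only the JDK's DOM parser. It assumes the dates are stored under the standard XMP properties xmp:CreateDate and xmp:ModifyDate, either as child elements or as attributes of rdf:Description; whether a given image actually carries them depends on the tool that wrote it. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class XmpDates {
    static final String XMP_NS = "http://ns.adobe.com/xap/1.0/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the value of an xmp:* property such as "CreateDate", or null if absent. */
    static String readXmpProperty(String xmpPacket, String localName) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // XMP packets have no DOCTYPE, so refuse one to avoid entity tricks
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xmpPacket.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
        // Form 1: property written as a child element, <xmp:CreateDate>...</xmp:CreateDate>
        NodeList elements = doc.getElementsByTagNameNS(XMP_NS, localName);
        if (elements.getLength() > 0) {
            return elements.item(0).getTextContent().trim();
        }
        // Form 2: property written as an attribute on rdf:Description
        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Element description = (Element) descriptions.item(i);
            if (description.hasAttributeNS(XMP_NS, localName)) {
                return description.getAttributeNS(XMP_NS, localName);
            }
        }
        return null;
    }
}
```

Calling `XmpDates.readXmpProperty(packet, "CreateDate")` returns the raw date string as written in the packet; turning it into a date object (for example with `java.time.OffsetDateTime.parse`, when the value includes a time zone offset) is left out of the sketch.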

score 3 · Accepted Answer

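Short answer

Maybe, but probably not.

Long answer

PDF natively supports JPEG, JPEG2000 (which is becoming more common), CCITT (fax) Group 3 and 4, and JBIG2 (really rare). Images in these formats can be copied into the PDF byte for byte, preserving any metadata stored inside the file. Creation/modification dates are usually part of the file system rather than the image.

JPEG: Doesn't look like it supports internal metadata.

JPEG2000: Yep. There is potentially lots of stuff in there.

CCITT: Doesn't look like it.

JBIG2: Um, I think so, but it's not clear from the specs I just skimmed.

All other image formats have to be converted to pixels and then compressed in some way (often Flate/ZIP). Such a conversion could preserve the metadata as part of the PDF's XML metadata or of the image's dictionary, but I have never heard of that happening. It just gets thrown away.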
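As an illustration of the Flate case, here is a minimal round-trip sketch with the JDK's zlib classes, which implement the same deflate compression that Flate refers to. It assumes a plain Flate stream with no predictor applied, and the class name is made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class FlateSketch {
    /** Inflates zlib/deflate data (assumption: no predictor was applied). */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read means the data is cut off
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Compresses data the same way, only so the example can round-trip. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

What comes back from `inflate` is exactly the bytes that went into `deflate`, i.e. pixel samples. Anything the source file stored outside its pixels (dates, camera data) was never in that stream, so it cannot be recovered from it.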

score 1 · Accepted Answer

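When an image is embedded in a PDF, its original creation and modification dates are usually not stored; only the raw pixel data is compressed and saved. However, according to Wikipedia:

Raster images in PDF (called Image XObjects) are represented by dictionaries with an associated stream.

The dictionary contains metadata, and it is possible that the dates are among it.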
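If a date can be recovered this way, the remaining step for the question is to put it on the exported file. A minimal sketch with java.nio follows; it assumes the dates have already been parsed into `Instant` values, and the class and method names are hypothetical. Note that not every file system stores a creation time, so that part may have no effect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.FileTime;
import java.time.Instant;

final class ExportedFileDates {
    /** Stamps an already exported image file with the recovered dates. */
    static void applyDates(Path file, Instant created, Instant modified) {
        BasicFileAttributeView view =
                Files.getFileAttributeView(file, BasicFileAttributeView.class);
        try {
            // Arguments: lastModifiedTime, lastAccessTime (null = unchanged), createTime
            view.setTimes(FileTime.from(modified), null, FileTime.from(created));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `ExportedFileDates.applyDates(Paths.get("image-0.jpg"), created, modified)` would be called right after the image has been written to disk.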

score 0 · Accepted Answer

Use the Snowtide API to get the metadata from the PDF file, with PDFTextStream.jar. At the end, all PDF properties are returned and printed on the command line.

// Add PDFTextStream.jar from the Snowtide web site to your build path
import java.io.IOException;
import java.util.Iterator;
import java.util.Set;
import com.snowtide.pdf.PDFTextStream;

public static void getPDFMetaData(String pdfFilePath) throws IOException {
    // Open the input PDF file at the given location
    PDFTextStream stream = new PDFTextStream(pdfFilePath);
    try {
        // Get the collection of all document attribute names
        Set attributeKeys = stream.getAttributeKeys();

        // Print the values of all document attributes to System.out
        Iterator iter = attributeKeys.iterator();
        while (iter.hasNext()) {
            String attrKey = (String) iter.next();
            System.out.println(attrKey + " = " + stream.getAttribute(attrKey));
        }
    } finally {
        // Release the file once the attributes have been printed
        stream.close();
    }
}
