c# - How to extract FlateDecoded images from a PDF using PDFSharp

Question

How can I extract FlateDecoded images (such as PNG) from a PDF document using PDFSharp?

I found this comment in a PDFSharp sample:

// TODO: You can put the code here that converts vom PDF internal image format to a
// Windows bitmap
// and use GDI+ to save it in PNG format.
// [...]
// Take a look at the file
// PdfSharp.Pdf.Advanced/PdfImage.cs to see how we create the PDF image formats.

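Does anyone have a solution to this problem?

Thanks for your replies.

EDIT: Since I can't answer my own question within 8 hours, I'm doing it this way:

Thanks for the very quick reply.

I added some code to the method "ExportAsPngImage", but I didn't get the desired result. It just extracts a few more images (PNG), and they have the wrong colors and are distorted.

Here is my actual code: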

        PdfSharp.Pdf.Filters.FlateDecode flate = new PdfSharp.Pdf.Filters.FlateDecode();
        byte[] decodedBytes = flate.Decode(bytes);

        System.Drawing.Imaging.PixelFormat pixelFormat;

        switch (bitsPerComponent)
        {
            case 1:
                pixelFormat = PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = PixelFormat.Format8bppIndexed;
                break;
            case 24:
                pixelFormat = PixelFormat.Format24bppRgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bitsPerComponent);
        }

        Bitmap bmp = new Bitmap(width, height, pixelFormat);
        var bmpData = bmp.LockBits(new Rectangle(0, 0, width, height), ImageLockMode.WriteOnly, pixelFormat);
        int length = (int)Math.Ceiling(width * bitsPerComponent / 8.0);
        for (int i = 0; i < height; i++)
        {
            int offset = i * length;
            int scanOffset = i * bmpData.Stride;
            Marshal.Copy(decodedBytes, offset, new IntPtr(bmpData.Scan0.ToInt32() + scanOffset), length);
        }
        bmp.UnlockBits(bmpData);
        using (FileStream fs = new FileStream(@"C:\Export\PdfSharp\" + String.Format("Image{0}.png", count), FileMode.Create, FileAccess.Write))
        {
            bmp.Save(fs, System.Drawing.Imaging.ImageFormat.Png);
        }

Is that the right way? Or should I choose a different approach? Thanks a lot!

score 1 · Accepted Answer

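To get a Windows BMP, you need to create a bitmap header and then copy the image data into the bitmap. PDF images are byte-aligned (every new row starts on a byte boundary), whereas Windows BMPs are DWORD-aligned (every new row starts on a DWORD boundary; for historical reasons a DWORD is 4 bytes). All the information needed for the bitmap header can be found in the filter parameters or calculated from them.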
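To illustrate the alignment difference, here is a minimal sketch that repacks byte-aligned PDF rows into DWORD-aligned rows. The class and method names (`RowAlignment`, `PdfRowLength`, `BmpStride`, `ToDwordAligned`) are made up for this example and are not part of PDFSharp. It assumes the stream has already been decoded, and it does not handle row order (BMP files are usually stored bottom-up) or RGB/BGR channel order.

```csharp
using System;

static class RowAlignment
{
    // Bytes per row in PDF sample data: each row is padded to a whole byte.
    public static int PdfRowLength(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 7) / 8;
    }

    // Bytes per row in a Windows bitmap: each row is padded to a multiple of 4 bytes.
    public static int BmpStride(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 31) / 32 * 4;
    }

    // Copies byte-aligned rows into a DWORD-aligned buffer; padding bytes stay zero.
    public static byte[] ToDwordAligned(byte[] pdfData, int width, int height, int bitsPerPixel)
    {
        int rowLength = PdfRowLength(width, bitsPerPixel);
        int stride = BmpStride(width, bitsPerPixel);
        if (pdfData.Length < rowLength * height)
            throw new ArgumentException("Not enough sample data for the given dimensions.");

        byte[] result = new byte[stride * height];
        for (int y = 0; y < height; y++)
            Buffer.BlockCopy(pdfData, y * rowLength, result, y * stride, rowLength);
        return result;
    }
}
```

If you write into a `System.Drawing.Bitmap` via `LockBits` instead of building a BMP file by hand, use `BitmapData.Stride` as the destination row length rather than computing it yourself.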

The color palette is another FlateEncoded object in the PDF. Copy that into the BMP as well.
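As a rough sketch of the palette step: once the palette stream has been decoded, it can be turned into color values for the bitmap. This assumes the palette is a lookup table with 3 bytes per entry in R, G, B order (that is, the base color space is DeviceRGB); other base color spaces need different handling. `PaletteSketch.ToArgb` is a hypothetical helper, not a PDFSharp API.

```csharp
using System;

static class PaletteSketch
{
    // Converts a decoded RGB lookup table (3 bytes per entry: R, G, B)
    // into opaque 32-bit ARGB values, one per palette entry.
    public static int[] ToArgb(byte[] lookup)
    {
        if (lookup.Length % 3 != 0)
            throw new ArgumentException("Expected 3 bytes per palette entry.");

        int[] colors = new int[lookup.Length / 3];
        for (int i = 0; i < colors.Length; i++)
        {
            int r = lookup[3 * i];
            int g = lookup[3 * i + 1];
            int b = lookup[3 * i + 2];
            colors[i] = unchecked((int)0xFF000000) | (r << 16) | (g << 8) | b;
        }
        return colors;
    }
}
```

With GDI+, the values can then be applied to an indexed bitmap by reading `bmp.Palette`, setting `Entries[i] = Color.FromArgb(colors[i])`, and assigning the palette object back to `bmp.Palette`.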

This has to be done for several formats (1 bit per pixel, 8 bpp, 24 bpp, 32 bpp).

score 1 · Accepted Answer

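Here is complete code for doing this.

I'm extracting UPS shipping labels from a PDF, so I know the format in advance. If the type of the extracted image is unknown, you need to check bitsPerComponent and handle it accordingly. Also, only the first image on the first page is handled here.

Note: I'm using TryUnfilter to "deflate"; it uses whatever filter was applied and decodes the data in place. There is no need to call "Deflate" explicitly.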

    var file = @"c:\temp\PackageLabels.pdf";

    var doc = PdfReader.Open(file);
    var page = doc.Pages[0];

    {
        // Get resources dictionary
        PdfDictionary resources = page.Elements.GetDictionary("/Resources");
        if (resources != null)
        {
            // Get external objects dictionary
            PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
            if (xObjects != null)
            {
                ICollection<PdfItem> items = xObjects.Elements.Values;

                // Iterate references to external objects
                foreach (PdfItem item in items)
                {
                    PdfReference reference = item as PdfReference;
                    if (reference != null)
                    {
                        PdfDictionary xObject = reference.Value as PdfDictionary;
                        // Is external object an image?
                        if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                        {
                            // do something with your image here
                            // only the first image is handled here
                            var bitmap = ExportImage(xObject);
                            bitmap.Save(@"c:\temp\exported.png", System.Drawing.Imaging.ImageFormat.Png);
                            break; // stop after the first image
                        }
                    }
                }
            }
        }
    }

Using these helper functions:

    private static Bitmap ExportImage(PdfDictionary image)
    {
        string filter = image.Elements.GetName("/Filter");
        switch (filter)
        {
            case "/FlateDecode":
                return ExportAsPngImage(image);

            default:
                throw new ApplicationException(filter + " filter not implemented");
        }
    }

    private static Bitmap ExportAsPngImage(PdfDictionary image)
    {
        int width = image.Elements.GetInteger(PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfImage.Keys.Height);
        int bitsPerComponent = image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent);   

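        // TryUnfilter applies the stream's filter (FlateDecode here) in place.
        if (!image.Stream.TryUnfilter())
            throw new ApplicationException("Could not decode the image stream");
        byte[] decoded = image.Stream.Value;

        // Assumes 8 bits per pixel, which is known in advance for these labels.
        Bitmap bmp = new Bitmap(width, height, System.Drawing.Imaging.PixelFormat.Format8bppIndexed);
        BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        // PDF rows are byte-aligned, bitmap rows are padded to bmpData.Stride, so copy row by row.
        for (int y = 0; y < height; y++)
            Marshal.Copy(decoded, y * width, IntPtr.Add(bmpData.Scan0, y * bmpData.Stride), width);
        bmp.UnlockBits(bmpData);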

        return bmp;
    }

score 0 · Accepted Answer

PDFs can contain images with masks and various color space options, so simply decoding the image object may not work correctly in some cases.

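So the code also needs to check for image masks (/ImageMask) and other properties of the image object (to see whether the image uses inverted colors or indexed colors) in order to recreate the image as it is displayed in the PDF. See the Image object, /ImageMask and /Decode dictionary in the official PDF Reference.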
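As one example of such a property, a /Decode array of [1 0] on a grayscale image means the samples are inverted. The sketch below maps 8-bit single-component samples through a /Decode pair. It assumes a DeviceGray image with 8 bits per component and that you have already read the two /Decode numbers from the image dictionary; `DecodeSketch.ApplyDecode` is a hypothetical helper, not a PDFSharp API.

```csharp
using System;

static class DecodeSketch
{
    // Maps 8-bit single-component samples through a /Decode pair [dMin dMax]
    // and returns gray levels 0..255.
    // [0 1] leaves the samples unchanged; [1 0] inverts them.
    public static byte[] ApplyDecode(byte[] samples, double dMin, double dMax)
    {
        byte[] result = new byte[samples.Length];
        for (int i = 0; i < samples.Length; i++)
        {
            // Linear interpolation between dMin and dMax over the 0..255 sample range.
            double v = dMin + samples[i] * (dMax - dMin) / 255.0;
            v = Math.Max(0.0, Math.Min(1.0, v));
            result[i] = (byte)Math.Round(v * 255.0);
        }
        return result;
    }
}
```

Image masks (/ImageMask true) and soft masks are a separate matter and would need their own handling on top of this.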

I'm not sure whether PDFSharp can find image mask objects inside a PDF, but iTextSharp can access them (see the PdfName.MASK object type).

Commercial tools such as PDF Extractor SDK can extract images both in their original format and in "rendered" form.

I work for ByteScout, the maker of PDF Extractor SDK.
