c# - Extracting FlateDecode images using iTextSharp

Question

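I want to extract images from a PDF. I am currently using iTextSharp. Some images are extracted correctly, but most come out distorted and with the wrong colors. I experimented with various PixelFormats but did not find a solution to my problem...

This is the code that distinguishes the image type: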

if (filter == "/FlateDecode")
{
   // ...
   int w = int.Parse(width);
   int h = int.Parse(height);
   int bpp = tg.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;

   byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)tg);
   byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
   byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, tg.GetAsDict(PdfName.DECODEPARMS));

   PixelFormat[] pixFormats = new PixelFormat[23] { 
         PixelFormat.Format24bppRgb,
         // ... all Pixel Formats
    };
    for (int i = 0; i < pixFormats.Length; i++)
    {
        Program.ToPixelFormat(w, h, pixFormats[i], streamBytes, bpp, images);
    }
}

This is the code that saves the image into a MemoryStream. Saving the images to a folder will be implemented later.

private static void ToPixelFormat(int width, int height, PixelFormat pixelformat, byte[] bytes, int bpp, IList<Image> images)
{
    Bitmap bmp = new Bitmap(width, height, pixelformat);
    BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, width, height),
       ImageLockMode.WriteOnly, pixelformat);
    Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
    bmp.UnlockBits(bmd);
    using (var ms = new MemoryStream())
    {
       bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Tiff);
       // ToArray() returns only the bytes written; GetBuffer() may include unused capacity
       bytes = ms.ToArray();
    }
    images.Add(bmp);
}

Please help me.

score 3 · Accepted Answer

Even though you have found a solution to your problem, let me offer a suggestion for fixing the code above.

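I believe the distortion is caused by a mismatch in row data boundaries. PdfReader returns data whose rows end on byte boundaries. For example, for a grayscale image (8 bits per pixel) that is 21 pixels wide, you get 21 bytes of data for each image row. The Bitmap class works with 32-bit (4-byte) row boundaries. When you create a bitmap 21 pixels wide, the Bitmap class generates a grayscale bitmap with a stride (row width in bytes) of 24 bytes. That means you cannot simply copy the bytes retrieved from PdfReader into the new bitmap with Marshal.Copy(), as ToPixelFormat() does.

The first pixel of the second row is the 22nd byte of the source byte array, but because of the bitmap's 32-bit boundary the destination bitmap expects it as the 25th byte. To solve this problem, I had to create a byte array sized to account for the 32-bit boundary of each data row.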
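The two row widths being compared here can be computed directly. The following is a minimal sketch, not part of the original answer; the class and method names are made up for illustration, and it assumes the bits-per-pixel value already accounts for all color components (for example 24 for 8-bit RGB).

```csharp
using System;

public static class StrideMath
{
    // Bytes per row in the decoded PDF stream: each row is padded
    // only up to the next whole byte.
    public static int PackedRowBytes(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 7) / 8;
    }

    // Bytes per row in a GDI+ Bitmap: each row is rounded up to a
    // multiple of 4 bytes (32 bits).
    public static int AlignedStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }
}
```

When the two values are equal (for instance an 8-bit grayscale image whose width is a multiple of 4), a single Marshal.Copy() happens to work; when they differ, every row after the first lands at the wrong offset and the image appears skewed.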

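Taking the 32-bit row boundary into account, copy the data row by row from the byte array retrieved from PdfReader into the new byte array. You now have bytes whose row boundaries match those of the Bitmap class, so you can copy them into the new Bitmap with Marshal.Copy().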
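The row-by-row copy described above might look like the sketch below. It is an illustration rather than the answerer's actual code: the `RowPadding` class and `PadRows` method are hypothetical names, and the sketch assumes the source buffer holds tightly packed rows as returned after FlateDecode and predictor decoding.

```csharp
using System;

public static class RowPadding
{
    // Copies byte-aligned source rows into a buffer whose rows each
    // start on a 4-byte boundary, leaving the padding bytes as zero.
    public static byte[] PadRows(byte[] source, int width, int height, int bitsPerPixel)
    {
        int packed = (width * bitsPerPixel + 7) / 8;          // source row size
        int stride = ((width * bitsPerPixel + 31) / 32) * 4;  // destination row size

        if (source.Length < packed * height)
            throw new ArgumentException(
                "Source has fewer bytes than width, height and bit depth imply.", "source");

        byte[] padded = new byte[stride * height];
        for (int row = 0; row < height; row++)
        {
            Buffer.BlockCopy(source, row * packed, padded, row * stride, packed);
        }
        return padded;
    }
}
```

The returned array can then be passed to Marshal.Copy(padded, 0, bmd.Scan0, padded.Length), provided the Stride reported by the locked BitmapData equals the stride computed here. Checking that equality before copying is a cheap safeguard.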

score 2 · Accepted Answer

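I found a solution to my own problem. To extract all images from all pages, there is no need to implement the different filters yourself. iTextSharp has an image renderer that saves every image in its original image type.

See http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx. You don't need to implement the HttpHandler...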
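For readers who cannot reach the linked page, the general shape of the render-listener approach is sketched below. This assumes iTextSharp 5.x and its `iTextSharp.text.pdf.parser` namespace; the `ImageCollector` and `ImageExtraction` names are invented for this example, and it is not a copy of the linked code. `GetImage()` may throw for image types the library cannot decode, which this sketch does not handle.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Collects every image the content parser encounters (illustrative name).
public class ImageCollector : IRenderListener
{
    public readonly List<byte[]> Images = new List<byte[]>();
    public readonly List<string> FileTypes = new List<string>();

    // Text callbacks are required by the interface but unused here.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;
        Images.Add(image.GetImageAsBytes());   // bytes of an image file
        FileTypes.Add(image.GetFileType());    // suggested file extension
    }
}

public static class ImageExtraction
{
    // Runs the collector over every page of the PDF at pdfPath.
    public static ImageCollector ExtractAll(string pdfPath)
    {
        var collector = new ImageCollector();
        var reader = new PdfReader(pdfPath);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                parser.ProcessContent(page, collector);
            }
        }
        finally
        {
            reader.Close();
        }
        return collector;
    }
}
```

Each entry in `Images` can be written to disk with File.WriteAllBytes, using the matching entry in `FileTypes` as the extension.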

score 1 · Accepted Answer

PDF supports a pretty wide variety of image formats, so I don't think I would take the approach you've chosen here. You need to determine the image format from the bytes in the stream itself. For example, a JPEG starts with the bytes FF D8, and a JFIF file additionally carries the ASCII marker JFIF at offset 6.
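A signature check of this kind could be sketched as follows. The `ImageSniffer` class is a made-up helper, and the list covers only a few common formats. Note that a decoded FlateDecode stream normally contains raw samples with no file header, so a check like this would be expected to report "unknown" for it; it is mainly useful for streams that embed a complete file, such as DCTDecode (JPEG).

```csharp
using System;

public static class ImageSniffer
{
    // Guesses a container format from the leading signature bytes.
    public static string GuessFormat(byte[] data)
    {
        if (StartsWith(data, 0xFF, 0xD8)) return "jpeg";
        if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return "png";
        if (StartsWith(data, 0x49, 0x49, 0x2A, 0x00)) return "tiff"; // little-endian TIFF
        if (StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) return "tiff"; // big-endian TIFF
        if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return "gif";
        if (StartsWith(data, 0x42, 0x4D)) return "bmp";
        return "unknown";
    }

    // True when data begins with exactly the given prefix bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```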

.NET (3.0+) does come with a method that will attempt to pick the right decoder: BitmapDecoder.Create. See http://msdn.microsoft.com/en-us/library/system.windows.media.imaging.bitmapdecoder.aspx

If that doesn't work you may want to consider some third-party imaging libraries. I've used ImageMagick.NET and LeadTools (way overpriced).
