c# - How to find blank pages in a PDF file

Question

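I am unable to detect blank pages in a PDF file. I searched the internet but could not find a good solution.

Using iTextSharp, I tried checking the page size and the XObjects, but neither gives accurate results.

I tried: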

if (xobjects == null || textContent == null || size < 20 bytes)
    then "blank"
else
    "not blank"

But most of the time this gives the wrong answer.

The code is below. I am using the iTextSharp library.

For the XObjects:

// resourceDic is of type PdfDictionary
PdfDictionary xobjects = resourceDic.GetAsDict(PdfName.XOBJECT);
// I know that if xobjects is null the page is blank, but sometimes a blank page returns a non-null xobjects.

For the content stream:

// reader = new PdfReader(filename);
RandomAccessFileOrArray f = reader.SafeFile;

byte[] contentBytes = reader.GetPageContent(pageNum, f);
// I measured the size of contentBytes, but for a blank page it is sometimes more than 20 bytes.

For the text content:

String extractedText = PdfTextExtractor.GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy());
// Sometimes a blank page returns text more than 20 characters long.

score 2 · Accepted Answer

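A very simple way to find empty pages is to use a Ghostscript command line that invokes the bbox device.

Ghostscript's bbox device computes the coordinates of the "bounding box": the smallest rectangle that encloses every point on the page where pixels are rendered.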

gs \
  -o /dev/null \
  -sDEVICE=bbox \
   input.pdf

On Windows:

gswin32c.exe ^
  -o nul ^
  -sDEVICE=bbox ^
   input.pdf

Result:

GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 6.
Page 1
%%BoundingBox: 27 281 548 804
%%HiResBoundingBox: 27.000000 281.000000 547.332031 804.000000
Page 2
%%BoundingBox: 0 0 0 0
%%HiResBoundingBox: 0.000000 0.000000 0.000000 0.000000
Page 3
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 4
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 5
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 6
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000

As you can see, page 2 of the input document is empty: its bounding box is 0 0 0 0.
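Since the question is about C#, the check above could be automated roughly as follows. This is only a sketch under a few assumptions: the class and method names (`BlankPageFinder`, `FindBlankPages`, `RunBbox`) are made up for illustration, the executable name and null device are passed in by the caller, and it relies on the bbox device writing its `%%BoundingBox:` lines to stderr, one per page and in page order.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class BlankPageFinder
{
    // Returns the 1-based numbers of pages whose bounding box is "0 0 0 0".
    // Assumption: one "%%BoundingBox:" line per page, in page order.
    public static List<int> FindBlankPages(string bboxOutput)
    {
        const string prefix = "%%BoundingBox:";
        var blank = new List<int>();
        int page = 0;
        foreach (string raw in bboxOutput.Split('\n'))
        {
            string line = raw.Trim();
            // "%%HiResBoundingBox:" lines do not start with the prefix and are skipped.
            if (!line.StartsWith(prefix, StringComparison.Ordinal)) continue;
            page++;
            string[] parts = line.Substring(prefix.Length)
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 4 && Array.TrueForAll(parts, p => p == "0"))
                blank.Add(page);
        }
        return blank;
    }

    // Hypothetical helper: runs Ghostscript with the same options as the
    // command lines above and returns whatever it wrote to stderr.
    // nullDevice would be "nul" on Windows or "/dev/null" elsewhere.
    public static string RunBbox(string gsExe, string nullDevice, string pdfPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = gsExe,
            Arguments = "-o " + nullDevice + " -sDEVICE=bbox \"" + pdfPath + "\"",
            UseShellExecute = false,
            RedirectStandardError = true,
            CreateNoWindow = true
        };
        using (Process p = Process.Start(psi))
        {
            string output = p.StandardError.ReadToEnd();
            p.WaitForExit();
            return output;
        }
    }
}
```

For example, `BlankPageFinder.FindBlankPages(BlankPageFinder.RunBbox("gswin32c.exe", "nul", "input.pdf"))` would yield a list containing only 2 for the output shown above. Counting `%%BoundingBox:` lines instead of parsing the `Page N` lines keeps the parser independent of how the two kinds of line are interleaved.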

score 1 · Accepted Answer

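I assume you have already tried .Trim() on the string, so I won't recommend that on its own.

What is the actual content of the 20+ character string you get from a blank page? I suspect it is nothing but newline characters (as happens when someone presses Enter ten or more times just to reach a new page instead of inserting a page break), in which case: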

// Replace is an instance method on string, so the calls are chained.
String extractedText = PdfTextExtractor
    .GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy())
    .Replace(Environment.NewLine, "")
    .Replace("\n", "")
    .Trim();

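Please let me know what the output contains after this.

Another possibility is that it is blank text containing non-breaking spaces or other characters that are not actually spaces. You would need to find and replace those manually, or you could use a regular expression such as [0-9a-zA-Z] to determine whether the page is blank.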
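The regular-expression idea could look something like the sketch below. The names `TextBlankCheck` and `LooksBlank` are hypothetical, and the input is assumed to be the string returned by `PdfTextExtractor.GetTextFromPage`. A page counts as blank when its extracted text contains no ASCII letter or digit, so newlines, ordinary spaces and non-breaking spaces are all ignored without having to replace them one by one.

```csharp
using System.Text.RegularExpressions;

public static class TextBlankCheck
{
    // Matches any single ASCII letter or digit.
    private static readonly Regex Alphanumeric = new Regex("[0-9a-zA-Z]");

    // True when the text is null, empty, or has no ASCII letter or digit.
    public static bool LooksBlank(string extractedText)
    {
        return string.IsNullOrEmpty(extractedText)
            || !Alphanumeric.IsMatch(extractedText);
    }
}
```

Note that this treats a page containing only punctuation or only non-Latin text as blank. If that matters for your documents, a check based on `char.IsLetterOrDigit` may be a better fit. It also looks at text only, so a page that holds just an image would still be reported as blank.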
