itext - iText - Cannot read the content of a PDF generated by PD4ML

Question

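I have a problem reading PDF content with iText. I have tested all the different techniques; they all work on standard PDF documents, but there is one PDF document that I need to fix and whose content I cannot retrieve.

This document was generated by PD4ML. It can be read in Acrobat Reader, but not in Open Office.

For example, this code: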

    PdfReader reader = new PdfReader(src);
    try (FileOutputStream out = new FileOutputStream(result)) {
        // Write the raw content stream of page 1 to the result file
        out.write(reader.getPageContent(1));
    }
    reader.close();

produces this output:

    q Q q 29.18088 102.1433 536.9282 675.0511 re W n
    /Cs1 cs 1 1 1 sc
    29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q
    q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n
    29.18088 102.1433 536.9282 675.0511 re W n
    q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q
    /Cs1 cs 0.2 0.2 0.2 sc [...]
    BT 20 0 0 20 40 0 Tm /G1 1 Tf
    [ <0033> 1 <004800550049> 1 <00520055005000440051004600480003> 1 <0044005100470003>

But when I try to get the text content, no text items appear, as if the text were stored in a different format.

This code:

    PdfReader reader = new PdfReader(src);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    // try-with-resources flushes and closes the writer when the loop is done
    try (PrintWriter out = new PrintWriter(new FileOutputStream(result))) {
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            TextExtractionStrategy strategy =
                    parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
    }
    reader.close();

only produces spaces. The same happens with LocationTextExtractionStrategy.

The command PdfContentReaderTool.listContentStream(new File(src), out); produces:

    =============Page 1====================
    - - - - - Dictionary - - - - - -
    (/Parent=Dictionary of type: /Pages, /Contents=Stream, /Type=/Page, /Resources=Dictionary, /MediaBox=[0, 0, 595.29, 841.89])
    Subdictionary /Parent = (/Type=/Pages, /MediaBox=[0, 0, 595.29, 841.89], /Count=6, /Kids=[2 0 R, 14 0 R, 26 0 R, 30 0 R, 34 0 R, 38 0 R])
    Subdictionary /Resources = (/XObject=Dictionary, /ProcSet=[/PDF, /Text, /ImageB, /ImageC, /ImageI], /ColorSpace=Dictionary, /Font=Dictionary)
    Subdictionary /XObject = (/Im1=Stream of type: /XObject)
    Subdictionary /ColorSpace = (/Cs1=[/ICCBased, 12 0 R])
    Subdictionary /Font = (/G2=Dictionary of type: /Font, /G1=Dictionary of type: /Font)
    Subdictionary /G2 = (/BaseFont=/HCNQGU+font000000001c036002, /DescendantFonts=[50 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
    Subdictionary /G1 = (/BaseFont=/HCZCBJ+font000000001c036002, /DescendantFonts=[43 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
    - - - - - XObject Summary - - - - - -
    ------ /Im1 - subtype = /Image = 9148 bytes ------
    - - - - - Content Stream - - - - - -
    q Q q 29.18088 102.1433 536.9282 675.0511 re W n
    /Cs1 cs 1 1 1 sc
    29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q
    q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n
    29.18088 102.1433 536.9282 675.0511 re W n
    q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q
    /Cs1 cs 0.2 0.2 0.2 sc [...] 0.5634084 29.18088 711.2756 cm
    BT 20 0 0 20 40 0 Tm /G1 1

But the text extraction part is empty.

I don't understand why the text cannot be read. Is there anything else I can do or test before extracting the text?
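
One check that needs no PDF library, offered as a minimal sketch rather than a diagnosis: the hex strings in the page content shown earlier, such as <004800550049>, can be split into 2-byte codes. Assuming the /Identity-H encoding listed for /G1 and /G2 means two bytes per code, each value selects a glyph and still has to be mapped through the font's /ToUnicode stream; it is not a Unicode character by itself. The class HexGlyphCodes and its decode method are hypothetical helpers written for this illustration, not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an iText class): splits a PDF hex string into 2-byte codes. */
class HexGlyphCodes {

    /** Accepts "<00480055>" or "00480055" and returns one int per 2-byte code. */
    public static List<Integer> decode(String hexString) {
        // Drop the angle brackets and any whitespace inside the hex string.
        String hex = hexString.replaceAll("[<>\\s]", "");
        // Assumption: codes are exactly two bytes (four hex digits) each.
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("Expected whole 2-byte codes: " + hexString);
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    public static void main(String[] args) {
        // Hex strings copied from the page content in the question.
        String[] samples = {"<0033>", "<004800550049>", "<0044005100470003>"};
        for (String sample : samples) {
            StringBuilder line = new StringBuilder(sample).append(" ->");
            for (int code : decode(sample)) {
                line.append(String.format(" 0x%04X", code));
            }
            System.out.println(line);
        }
    }
}
```

If the codes listed this way have no entries in the /ToUnicode stream of /G1, or the entries map to spaces, that would be consistent with the extraction returning only whitespace.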

Any pointers are welcome.

Gilles
