java - Reading a PDF document with iText sometimes fails

Question

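I am reading PDF documents with iText and I am getting an ArrayIndexOutOfBoundsException. The strange thing is that it only happens for certain files, and only at certain locations within those files. I suspect it has something to do with how the PDF is encoded at those locations, but I cannot work out what the problem is.

I have looked at the question "Read pdf using iText", but the asker there seems to have solved his problem by changing the location of the file. That does not work for me, because the exception occurs at specific locations inside some files. So it is not the file as a whole that triggers the exception, but particular pages.

The stack trace is: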

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.pdfextractor.main.Extractor.main(Extractor.java:61)

Line 61 corresponds to the following line:
content = extractor.getTextFromPage(page);
So it is clearly the call to getTextFromPage() that fails.

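import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class Extractor {

    public static void main(String[] args) throws IOException {
        // Keywords to look for later (not used yet in this snippet)
        ArrayList<String> keywords = new ArrayList<String>();
        keywords.add("location");
        keywords.add("Mass Spectrometry");
        keywords.add("vacuole");
        keywords.add("cytosol");

        // Collect the names of all files in the papers directory
        String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
        File directoryToRead = new File(directory);
        String[] sa_filesToRead = directoryToRead.list();
        List<String> filesToRead = Arrays.asList(sa_filesToRead);

        Iterator<String> fileItr = filesToRead.iterator();
        while (fileItr.hasNext()) {
            String nextFile = fileItr.next();

            PdfReader reader = new PdfReader(directory + nextFile);
            int noPages = reader.getNumberOfPages();
            PdfTextExtractor extractor = new PdfTextExtractor(reader);

            String content = "";
            for (int page = 1; page <= noPages; page++) {
                System.out.println(page);
                // This is the call that throws the ArrayIndexOutOfBoundsException
                content = extractor.getTextFromPage(page);
            }
        }
    }
}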

score 1 · Accepted Answer

Most Java classes and libraries expect the index passed to a method like getTextFromPage(int) to be zero-based. That is, getTextFromPage(0) should return the text from page 1, getTextFromPage(1) the text from page 2, and so on.

The for loop that triggers the ArrayIndexOutOfBoundsException starts its index at 1.

Are you sure that iText's getTextFromPage(int) starts counting at 1 rather than at the (almost) standard 0?

score 0 · Accepted Answer

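I had a similar problem, and it always occurred at places where the text contained special characters. I wonder whether there is a way to work around the encoding.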

(Update) I ran into this problem with com.itextpdf.itextpdf 5.1.3, but after updating to 5.3.4 the problem was fixed.
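
Where upgrading is not immediately possible, one possible workaround is to catch the exception page by page, so that a single page that cannot be decoded does not abort the whole run and the failing page numbers are recorded. The following is only a minimal sketch of that idea: PageFailureScan, PageExtractor and findFailingPages are hypothetical names and not part of iText, and the lambda in main is a stand-in that simulates a failure on page 3 rather than a real iText call.

```java
import java.util.ArrayList;
import java.util.List;

public class PageFailureScan {

    /** Hypothetical callback wrapping whatever call extracts the text of one page. */
    public interface PageExtractor {
        String extract(int page) throws Exception;
    }

    /**
     * Tries every page from 1 to pageCount and returns the numbers of the
     * pages whose extraction threw an exception. Other pages are unaffected.
     */
    public static List<Integer> findFailingPages(int pageCount, PageExtractor extractor) {
        List<Integer> failing = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                extractor.extract(page);
            } catch (Exception e) {
                // Record the page and carry on with the next one
                failing.add(page);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // Stand-in extractor (an assumption for illustration): page 3 fails
        List<Integer> failing = findFailingPages(5, page -> {
            if (page == 3) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
            }
            return "text of page " + page;
        });
        System.out.println("Failing pages: " + failing); // prints "Failing pages: [3]"
    }
}
```

With the code from the question, the call would presumably look like findFailingPages(reader.getNumberOfPages(), extractor::getTextFromPage). This only narrows down which pages are affected and lets the rest be processed; it does not recover the text of the failing pages.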

score 0 · Accepted Answer

Have you tried posting to the very active iText mailing list?

answered 2009-11-18T08:25:50.190
