java - Rewriting the same text into an existing PDF document using PDFBox

Question

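This question is very important to me, and I would greatly appreciate your help.

I created a simple PDF document with PDFBox. What I am trying to do is read the existing document and then rewrite the same text at the same positions.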

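1) First, I create a PDF named "Musique.pdf".

2) I read this existing document.

3) I extract the text of the document using PDFTextStripper.

3) I find the position of each character in the document (x, y, width, fs, etc.).

4) I build a table containing the x and y of each character, e.g. table1[0]=x1, table1[1]=y1, table1[2]=x2, table1[3]=y2, etc.

5) Then I create a loop ("boucle") of PDPageContentStream to rewrite each character at the right position.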

The problem is:

The first line is rewritten perfectly, but the second line is not.

"I noticed that if we have, for example, a text made up of 3 lines that contains 225 characters, the length we get for this text is 231. So it seems that 2 spaces are added at the end of each line, but when we look up the position of each character, the program does not take these added spaces into account."

Please run the code below and tell me how to solve this problem.

My code so far:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package test;

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;


public class Test extends PDFTextStripper{
private static final String src="...";
    private static int i;
    private static float[] table1;
    private static PDPageContentStream content;
    private static float jjj;

public Test() throws IOException {
        super.setSortByPosition(true);
    }


public static void createPdf(String src) throws IOException, COSVisitorException{


 //create  document named "Musique.pdf"

PDRectangle rec= new PDRectangle(400,400);
PDDocument document= null;
document= new PDDocument();
PDPage page= new PDPage(rec);
document.addPage(page);
PDFont font= PDType1Font.HELVETICA;
PDPageContentStream canvas1= new PDPageContentStream(document,page,true,true);
canvas1.setFont(font, 10);
canvas1.beginText();
canvas1.appendRawCommands("15 385 Td");
canvas1.appendRawCommands("(La musique est très importante dans notre vie moderne. Sans la musique, non)Tj\n");
canvas1.endText();
canvas1.close();
PDPageContentStream canvas2= new PDPageContentStream(document,page,true,true);
canvas2.setFont(font, 11);
canvas2.beginText();
canvas2.appendRawCommands("15 370 Td");
canvas2.appendRawCommands("(Donc il est très necessaire de jouer chaque jours la musique.)Tj\n");
canvas2.endText();
canvas2.close();
document.save("Musique.pdf");
document.close();

                 }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, COSVisitorException {

Test tes= new Test();
tes.createPdf(src);

//read the existing document
PDDocument doc;
doc= PDDocument.load("Musique.pdf");
List pages = doc.getDocumentCatalog().getAllPages(); 
PDPage page = (PDPage) pages.get(0);
//extract the text existed in the document
PDFTextStripper stripper =new PDFTextStripper();
String texte=stripper.getText(doc);
PDStream contents = page.getContents();

  if(contents!=null){

      i=1;
      table1=new float[texte.length()*2]; 
      table1[0]=(float)15.0;
      //the function below call the processTextPosition procedure in order to find the position of each character and put each value in a case of table1
      tes.processStream(page, page.findResources(), page.getContents().getStream()); 

      //after execution of processTextPosition, the analysing of code continue to the below code:

 int iii=0;
int kkk=0;
//create a boucle of PDPageContentStream in order to re-write completly the text in the document
//when you run this code, you must notice a problem with the second line, so how to resolve this problem ?
PDFont font= PDType1Font.HELVETICA;
while(kkk<table1.length){
    content = new PDPageContentStream(doc,page,true,true);
    content.setFont(font, 10);
    content.beginText();
    jjj = 400-table1[kkk+1];
    content.appendRawCommands(""+table1[kkk]+" "+jjj+" Td");
    content.appendRawCommands("("+texte.charAt(iii)+")"+" Tj\n");
    content.endText();
    content.close();
    iii=iii+1;
    kkk=kkk+2;

}

  }
  //save the modified document
  doc.save("Modified-musique.pdf");
  doc.close();

}

      /**
     * @param text The text to be processed
     */

    public void processTextPosition(TextPosition text) {

        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());

         if(i>1){
        table1[i]=text.getXDirAdj();
        System.out.println(table1[i]);
        i=i+1;
        table1[i]=text.getYDirAdj();
        System.out.println(table1[i]);
         i=i+1;
        }
        else{
        table1[i]=text.getYDirAdj(); 
        System.out.println(table1[i]);
        i=i+1;
         }    
    } 
}

Best regards,

Liszt.

score 7 · Accepted Answer

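Both your concept and your code have flaws.

First, the concept: the two items numbered 3:

3) I extract the text of the document using PDFTextStripper.

3) I find the position of each character in the document (x, y, width, fs, etc.).

In my eyes, separating these two steps is a bad idea, because in general you will have a hard time matching the characters from the text extraction with the corresponding glyphs from the content.

For example, it will generally be hard to tell which glyph e in the content corresponds to which character e in the text. Expecting the order of appearance in the content stream to be the same as the order in the parsed text only works for very simple page content.

And there are additional problems caused by replacements. For example, text extraction will very likely expand ligatures, e.g. ff for ﬀ.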
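
The effect of such a replacement on character indices can be sketched without PDFBox. The snippet below uses Unicode NFKC normalization from the JDK as a stand-in for whatever expansion the text extraction performs (that equivalence is an assumption); the class name LigatureDemo and the sample word are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of PDFBox.
class LigatureDemo {
    // Expands compatibility characters such as the ligature U+FB00 into plain letters.
    // Assumption: NFKC is used here only as a stand-in for the extractor's own expansion.
    static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String glyphs = "e\uFB00ect";       // e, ff-ligature, e, c, t: five glyphs
        String extracted = expand(glyphs);  // six characters after expansion
        // From the ligature onwards, index k in the extracted text no longer
        // refers to glyph k in the content.
        System.out.println(glyphs.length() + " glyphs, " + extracted.length() + " characters");
    }
}
```

Any table that pairs the n-th extracted character with the n-th glyph position is off by one after each such expansion.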

Furthermore, there is the problem of going back and forth between font encodings and string encodings, which can be quite lossy.
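
As a rough analogy only, the same kind of loss can be seen with plain Java charsets: a character survives a round trip through an encoding that contains it and is replaced when the encoding does not. PDF font encodings are a different mechanism, so the sketch below (with the made-up class name EncodingRoundTrip) illustrates the idea rather than PDFBox behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
class EncodingRoundTrip {
    // Encodes the string to bytes and decodes it again with the same charset.
    static String roundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String word = "tr\u00e8s"; // "très", as used in the sample document
        // ISO-8859-1 contains the accented e, so the word survives.
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1));
        // US-ASCII does not, so the accented e comes back as '?'.
        System.out.println(roundTrip(word, StandardCharsets.US_ASCII));
    }
}
```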

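Furthermore, text extraction may add whitespace characters to the text that do not exist in the content. For example, it may add line breaks where it recognizes jumps in the y direction, and spaces where it recognizes jumps in the x direction.

By the way, this is probably the reason for your observation:

"I noticed that if we have, for example, a text made up of 3 lines that contains 225 characters, the length we get for this text is 231. So it seems that 2 spaces are added at the end of each line, but when we look up the position of each character, the program does not take these added spaces into account."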
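
The numbers in that observation fit a two-character line separator after each of the 3 lines: 225 + 3 × 2 = 231. The sketch below reproduces this arithmetic in plain Java. It assumes the added characters are CR LF, which is a guess and not something the question confirms; the class LineSeparatorCount and its methods are hypothetical helpers.

```java
// Hypothetical demo class, not part of PDFBox.
class LineSeparatorCount {
    // Builds a stand-in for extracted text: each line is followed by CR LF.
    // Assumption: the extractor appends "\r\n" after every line.
    static String sample(int lines, int charsPerLine) {
        StringBuilder sb = new StringBuilder();
        for (int l = 0; l < lines; l++) {
            for (int c = 0; c < charsPerLine; c++) {
                sb.append('x');
            }
            sb.append("\r\n");
        }
        return sb.toString();
    }

    // Drops CR and LF so that the remaining characters line up with the glyphs again.
    static String withoutLineBreaks(String extracted) {
        return extracted.replace("\r", "").replace("\n", "");
    }

    public static void main(String[] args) {
        String extracted = sample(3, 75);
        System.out.println(extracted.length());                    // 3 * (75 + 2)
        System.out.println(withoutLineBreaks(extracted).length()); // 3 * 75
    }
}
```

Filtering like this only covers line breaks. Spaces that the extraction inserts at jumps in the x direction cannot be told apart from real space glyphs this way, which is one more reason to handle text and positions in the same callback.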

Furthermore, your code blows up the PDF size:

5) Then I create a loop ("boucle") of PDPageContentStream to rewrite each character at the right position.

while(kkk<table1.length){
    content = new PDPageContentStream(doc,page,true,true);
    ...
}

At the very least, I would recommend creating only a single additional content stream...

How about starting with something like this:

// read the existing document
PDDocument doc;
doc = PDDocument.load(musiqueFileName);
List<?> pages = doc.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) pages.get(0);

PDPageContentStream content = new PDPageContentStream(doc, page, true, true);

TestRewriter rewriter = new TestRewriter(content);
rewriter.processStream(page, page.findResources(), page.getContents().getStream());

content.close();

// save the modified document
doc.save(modifiedMusiqueFileName);
doc.close();

Here TestRewriter is also a subclass of PDFTextStripper:

public static class TestRewriter extends PDFTextStripper
{
    final PDPageContentStream canvas;

    public TestRewriter(PDPageContentStream canvas) throws IOException
    {
        this.canvas = canvas;
    }

    /**
     * @param text
     *            The text to be processed
     */
    public void processTextPosition(TextPosition text)
    {
        try
        {
            PDFont font = PDType1Font.HELVETICA;
            canvas.setFont(font, 10);
            canvas.beginText();
            canvas.appendRawCommands("" + (text.getXDirAdj()) + " " + (400 - text.getYDirAdj()) + " Td");
            canvas.appendRawCommands("(" + text.getCharacter() + ")" + " Tj\n");
            canvas.endText();
        }
        catch(IOException e)
        {
            e.printStackTrace();
        }
    }
}

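This is still far from perfect, but it might help you continue...

If you need to parse the actual text in parallel, integrate more of the PDFTextStripper method processTextPosition into yours to combine the functionalities.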
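
Two of the rough edges in TestRewriter can be sketched in plain Java: the characters (, ) and \ have a special meaning inside a PDF literal string and need a backslash in front of them before being written as (…) Tj, and the page height 400 is hard-coded. The class PdfTextOps and its methods below are hypothetical helpers, not PDFBox API; the y value is treated as measured from the top of the page, as the answer's code does.

```java
// Hypothetical helper class, not part of PDFBox.
class PdfTextOps {
    // Escapes the characters that are special inside a PDF literal string "( ... )".
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds the raw operators for showing text at a position.
    // pageHeight replaces the hard-coded 400; yFromTop is assumed to be
    // measured downwards from the top edge of the page.
    static String showTextAt(float x, float yFromTop, float pageHeight, String text) {
        return x + " " + (pageHeight - yFromTop) + " Td (" + escapeLiteral(text) + ") Tj\n";
    }

    public static void main(String[] args) {
        System.out.print(showTextAt(15f, 15f, 400f, "(La musique)"));
    }
}
```

Inside processTextPosition the result could be passed to canvas.appendRawCommands in place of the two string concatenations. Note that Float.toString switches to exponent notation for very small or very large values, which PDF numbers do not allow, so real code would format the coordinates explicitly.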
