java - Extracting text from a selected area of a PDF to a txt file

Question

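The idea is as follows.

The user selects a PDF file, the file is converted to an image, and that image is displayed in the application.

On the image, the user can select the areas they want to read from the PDF file. Once the selection is finished, the program reads the text at those positions from the original PDF in the background and saves it to a txt file.

It is important that the image generated from the PDF has the same dimensions as the PDF itself.

The following code converts the PDF to images. I am using pdfrenderer-0.9.1.jar.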

import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import javax.imageio.ImageIO;
import com.sun.pdfview.PDFFile;
import com.sun.pdfview.PDFPage;


public class Pdf2Image {

public static void main(String[] args) {

    File file = new File("E:\\invoice-template-1.pdf");
    RandomAccessFile raf;
    try {
        raf = new RandomAccessFile(file, "r");

        FileChannel channel = raf.getChannel();
        ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        PDFFile pdffile = new PDFFile(buf);
        // render every page to an image; PDFRenderer numbers pages from 1
        int num=pdffile.getNumPages();
        for(int i=1;i<=num;i++)
        {
            PDFPage page = pdffile.getPage(i);

            //get the width and height for the doc at the default zoom              
            int width=(int)page.getBBox().getWidth();
            int height=(int)page.getBBox().getHeight();             

            Rectangle rect = new Rectangle(0,0,width,height);
            int rotation=page.getRotation();
            Rectangle rect1=rect;
            if(rotation==90 || rotation==270)
                rect1=new Rectangle(0,0,rect.height,rect.width);

            //generate the image
            BufferedImage img = (BufferedImage)page.getImage(
                        rect.width, rect.height, //width & height
                        rect1, // clip rect
                        null, // null for the ImageObserver
                        true, // fill background with white
                        true  // block until drawing is done
                );

            ImageIO.write(img, "png", new File("E:/invoice-template-"+i+".png"));
        }
    } 
    catch (FileNotFoundException e1) {
        System.err.println(e1.getLocalizedMessage());
    } catch (IOException e) {
        System.err.println(e.getLocalizedMessage());
    }
}
}

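Next, the image is shown to the user in an ImageView component of a JavaFX application. Can you help me get the exact mouse position when the user selects the part of the image whose text they want to read from the PDF file?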
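One possible way to turn a mouse drag into a region is sketched below. It assumes the drag start and end points are captured in the ImageView's mouse-pressed and mouse-released handlers (in JavaFX, MouseEvent.getX() and getY() are relative to the node), and that the PDF region uses a top-left origin like the image. The class name SelectionMapper and its two methods are hypothetical helpers, not part of JavaFX or PDFBox.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: converts a mouse drag on the displayed image
// into a rectangle expressed in PDF page units.
final class SelectionMapper {
    private SelectionMapper() {}

    // Builds a rectangle from the drag start and end points,
    // whichever direction the user dragged in.
    static Rectangle2D.Double fromDrag(double x1, double y1, double x2, double y2) {
        double x = Math.min(x1, x2);
        double y = Math.min(y1, y2);
        return new Rectangle2D.Double(x, y, Math.abs(x2 - x1), Math.abs(y2 - y1));
    }

    // Scales a selection from displayed-image coordinates to page coordinates.
    // If the image is shown at exactly the page size, both factors are 1.
    static Rectangle2D.Double toPageRect(Rectangle2D sel,
                                         double shownWidth, double shownHeight,
                                         double pageWidth, double pageHeight) {
        double sx = pageWidth / shownWidth;
        double sy = pageHeight / shownHeight;
        return new Rectangle2D.Double(sel.getX() * sx, sel.getY() * sy,
                                      sel.getWidth() * sx, sel.getHeight() * sy);
    }

    public static void main(String[] args) {
        // Sample values only: a drag from bottom-right to top-left on an image
        // displayed at twice the size of a 612 x 792 page.
        Rectangle2D.Double drag = fromDrag(140, 375, 38, 275);
        Rectangle2D.Double onPage = toPageRect(drag, 1224, 1584, 612, 792);
        System.out.println(onPage);
    }
}
```

In the JavaFX handlers, the pressed coordinates would be stored in fields and fromDrag called on release. The resulting rectangle could then be passed to addRegion, provided the top-left origin assumption holds for the PDF library in use.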

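I use the following code to read the PDF file and get the text at the specified positions. The only problem is that I have to enter the positions manually :(. I am using pdfbox-1.3.1.jar. I would like to keep the positions the client selects on the image in a list, and then read the text from the PDF file at all of those positions.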

    File file = new File("E:/invoice-template-1.pdf");
    PDDocument document = PDDocument.load(file);
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);
    Rectangle rect1 = new Rectangle(38, 275, 15, 100);
    Rectangle rect2 = new Rectangle(54, 275, 40, 100); 
    stripper.addRegion("row1column1", rect1);
    stripper.addRegion("row1column2", rect2);
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    int j = 0;

    for (PDPage page : pages) {
        // extract the text of every registered region on this page
        stripper.extractRegions(page);
        List<String> regions = stripper.getRegions();
        for (String region : regions) {
            String text = stripper.getTextForRegion(region);
            System.out.println("Region: " + region + " on Page " + j);
            System.out.println("\tText: \n" + text);
        }
        j++; // advance the page counter used in the output
    }
    document.close();

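For example, in the following invoice I want to select four areas and export their text. As areas are selected on the image, their dimensions are stored in a list. I then iterate over the list and export the text at those positions from the PDF file.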
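The flow of keeping the selections in a list and then exporting them could look roughly like this. It is a minimal sketch: the RegionExport class is hypothetical, and the extractor function merely stands in for whatever actually reads the text of a region (for instance a PDFTextStripperByArea call), so no PDF library is needed to try it.

```java
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: remembers the selected regions and writes
// one line of extracted text per region to a txt file.
final class RegionExport {
    private final List<Rectangle2D> regions = new ArrayList<>();

    // Called each time the user finishes a selection.
    void add(Rectangle2D region) {
        regions.add(region);
    }

    // Runs the extractor over every stored region, in selection order.
    List<String> collect(Function<Rectangle2D, String> extractor) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < regions.size(); i++) {
            String text = extractor.apply(regions.get(i)).trim();
            lines.add("region" + i + ": " + text);
        }
        return lines;
    }

    // Writes the collected lines to the target file as UTF-8.
    void writeTo(Path target, Function<Rectangle2D, String> extractor) throws IOException {
        Files.write(target, collect(extractor), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        RegionExport export = new RegionExport();
        export.add(new Rectangle2D.Double(38, 275, 15, 100));
        export.add(new Rectangle2D.Double(54, 275, 40, 100));
        Path out = Files.createTempFile("regions", ".txt");
        // Placeholder extractor: a real one would query the PDF for this region.
        export.writeTo(out, r -> "text at x=" + r.getX());
        System.out.println("Wrote " + out);
    }
}
```

Passing the extractor as a function keeps the list handling independent of the PDF library, so the same loop works whether one region or four are selected.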
