c# - PDFテキストを検索し、座標を取得した後に長方形を描いて見つかった単語を強調表示し、テキストを強調表示してPDFを保存します

Question

テキスト座標を取得する方法を手伝ってくれる人はいますか? これは可能ですか？ユーザーがテキストボックスに単語を入力し、アプリが iTextSharp を使用して既存の PDF を読み取り、見つかった場合は一致する単語を強調表示し、強調表示されたテキストで PDF を保存する Windows フォームアプリが必要だったからです。これまでのところ、黄色の長方形の描画を含め、ほとんどすべての作業が完了していますが、不足しているのは、一致したパターンのテキスト座標を取得してそれらを強調表示する方法です。事前に感謝します: (ちなみに: sb は検索テキストボックスです) 、tb は PDF テキストが表示されるリッチテキストボックスです)

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using iTextSharp.text;
using System.Text.RegularExpressions;

namespace manipulatePDF
{
    public partial class Form1 : Form
    {
        string oldFile;
        Document document = new Document();
        StringBuilder text = new StringBuilder();
    public Form1()
    {
        InitializeComponent();
    }
    private void open_Click(object sender, EventArgs e)
    {
        reset_Click(sender, e);

        openFileDialog1.Filter = "PDF Files (.pdf)|*.pdf";
        openFileDialog1.FilterIndex = 1;

        if (openFileDialog1.ShowDialog() == System.Windows.Forms.DialogResult.OK)
        {
            label1.Text = "File Location: " + openFileDialog1.FileName;
            oldFile = openFileDialog1.FileName;

            // open the reader
            PdfReader reader = new PdfReader(oldFile);

            iTextSharp.text.Rectangle size = reader.GetPageSizeWithRotation(1);
            document.SetPageSize(size);

            for (int cPage = 1; cPage <= reader.NumberOfPages; cPage++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(reader, cPage, strategy);
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
                reader.Close();
            }
            tb.Text = text.ToString();
        }
    }
    private void save_Click(object sender, EventArgs e)
    {
        saveFileDialog1.InitialDirectory = "C: ";
        saveFileDialog1.Title = "Save the PDF File";
        saveFileDialog1.Filter = "PDF files (*.pdf)|*.pdf";

        if (saveFileDialog1.ShowDialog() == System.Windows.Forms.DialogResult.OK)
        {
            PdfReader reader = new PdfReader(oldFile);
            string newFile = saveFileDialog1.FileName;

            // open the writer
            FileStream fs = new FileStream(newFile, FileMode.Create, FileAccess.Write);
            PdfWriter writer = PdfWriter.GetInstance(document, fs);

            document.Open();

            // the pdf content
            PdfContentByte cb = writer.DirectContent;

            // select the font properties
            PdfGState graphicsState = new PdfGState();
            graphicsState.FillOpacity = 10;
            cb.SetGState(graphicsState);

            int index = 0;
            while (index < text.ToString().LastIndexOf(sb.Text))
            {
                if (contain.Checked == true)
                {
                    tb.Find(sb.Text, index, tb.TextLength, RichTextBoxFinds.MatchCase);
                    tb.SelectionBackColor = Color.Gold;
                    index = tb.Text.IndexOf(sb.Text, index) + 1;
                }
                else if (exact.Checked == true)
                {
                    tb.Find(sb.Text, index, tb.TextLength, RichTextBoxFinds.WholeWord);
                    tb.SelectionBackColor = Color.Gold;
                    index = tb.Text.IndexOf(sb.Text, index) + 1;
                }
            }

            int count = 0; //counts the pattern occurance
            for (int cPage = 1; cPage <= reader.NumberOfPages; cPage++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(reader, cPage, strategy);
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                string textToSearch = sb.Text;
                int lastStartIndex = currentText.IndexOf(textToSearch, 0, StringComparison.CurrentCulture);

                while (lastStartIndex != -1)//if the pattern was found
                {
                    count++;
                    lastStartIndex = currentText.IndexOf(textToSearch, lastStartIndex + 1, StringComparison.CurrentCulture);

                    BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
                    cb.SetFontAndSize(bf, 10);

                    cb.SetColorFill(new CMYKColor(0f, 0f, 1f, 0f));
                    cb.Rectangle(document.PageSize.Width - 500f, 600f, 100f, 100f);
                    cb.Fill();
                }

                if (count != 0)
                {
                    if (contain.Checked == true)
                    {
                        label2.Text = "Number of pages: " + cPage + " - " + textToSearch + " found " + count + " times. \n";
                    }
                    else if (exact.Checked == true)
                    {
                        //finds the words that are bounded by a space or a dot and store in cCount
                        //returns the count of matched pattern = count - cCount
                    }
                }

                text.Append(currentText);
                // create the new page and add it to the pdf
                PdfImportedPage page = writer.GetImportedPage(reader, cPage);
                cb.AddTemplate(page, 0, 0);

                document.NewPage();
                //PdfStamper stamper = new PdfStamper(reader, fs);
                ////Create a rectangle for the highlight. NOTE: Technically this isn't used but it helps with the quadpoint calculation
                //iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(60.6755f, 749.172f, 94.0195f, 735.3f);
                ////Create an array of quad points based on that rectangle. NOTE: The order below doesn't appear to match the actual spec but is what Acrobat produces
                //float[] quad = { rect.Left, rect.Bottom, rect.Right, rect.Bottom, rect.Left, rect.Top, rect.Right, rect.Top };

                ////Create our hightlight
                //PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rect, null, PdfAnnotation.MARKUP_HIGHLIGHT, quad);

                ////Set the color
                //highlight.Color = BaseColor.YELLOW;

                ////Add the annotation
                //stamper.AddAnnotation(highlight, 1);
            }

            // close the streams
            document.Close();
            fs.Close();
            writer.Close();
            reader.Close();
        }
    }
    private void reset_Click(object sender, EventArgs e)
    {
        tb.Text = "";
    }
}

score 3 · Accepted Answer

さて、Vb.NET 2010 を使用して作成されたダウンロード可能な例を追加しました。これは、まさに必要なことを行います。これは、Chris が参照した同じスレッドの別の投稿で利用できます。そのコードはすべてのフォントタイプ、フォントサイズで機能し、検索する単語/文に一致するすべてを返し、各一致を x/y 位置の四角形として UI に返し、最後にそれらすべてを強調表示して新しいファイルに保存します。 PDF の場合、検索語、カルチャ別の比較タイプ、ソース PDF パス、宛先 PDF パスなどの初期パラメーターを指定するだけです。唯一実装されていないのは、検索語/文が複数行にまたがる特定のケースですが、TextChunk クラスで SameLine() メソッドを使用できるため、コードを簡単に変更できるはずです。

c# - PDFテキストを検索し、座標を取得した後に長方形を描いて見つかった単語を強調表示し、テキストを強調表示してPDFを保存します

1 に答える 1

Related

Reference