c# - How to read hyperlinks together with their anchor text from a PDF file in C#

Question

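I can get the link value from a PDF file, for example http://google.com, but I also need the anchor text value, for example "click here". How can I get the text of the anchor link?

I used the approach from the question "Reading hyperlinks from pdf file" to get the URL value from the PDF file. For example: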

Anchor a = new Anchor("Test Anchor");
a.Reference = "http://www.google.com";
myParagraph.Add(a);

Here I get http://www.google.com, but I need to get the anchor value, i.e. Test Anchor.

I need your suggestions.

score 5 · Accepted Answer

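You need to identify the area of the PDF file where the link is placed, then use iTextSharp to read the text under the link.

This way you can extract the text under the link. The limitation of this approach is that if the link area is wider than the text, the extraction reads all of the text under that area.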

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private void GetAllHyperlinksFromPDFDocument(string pdfFilePath)
{
    string linkTextBuilder = "";
    string linkReferenceBuilder = "";

    PdfDictionary PageDictionary = default(PdfDictionary);
    PdfArray Annots = default(PdfArray);
    PdfReader R = new PdfReader(pdfFilePath);

    //BinaryHyperlink is not an iTextSharp type and its definition is not shown here
    List<BinaryHyperlink> ret = new List<BinaryHyperlink>();

    //Loop through each page
    for (int i = 1; i <= R.NumberOfPages; i++)
    {
        //Get the current page
        PageDictionary = R.GetPageN(i);

        //Get all of the annotations for the current page
        Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

        //Make sure we have something
        if ((Annots == null) || (Annots.Size == 0))
            continue;

        //Loop through each annotation

        foreach (PdfObject A in Annots.ArrayList)
        {
            //Convert the itext-specific object as a generic PDF object
            PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);

            //Make sure this annotation has a link
            if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
                continue;

            //Make sure this annotation has an ACTION
            if (AnnotationDictionary.Get(PdfName.A) == null)
                continue;

            //Get the ACTION for the current annotation
            PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.GetAsDict(PdfName.A);
            if (AnnotationAction.Get(PdfName.S).Equals(PdfName.URI))
            {
                //Get action link URL : linkReferenceBuilder
                PdfString Link = AnnotationAction.GetAsString(PdfName.URI);
                if (Link != null)
                    linkReferenceBuilder = Link.ToString();

                //Get action link text : linkTextBuilder
                var LinkLocation = AnnotationDictionary.GetAsArray(PdfName.RECT);
                List<string> linestringlist = new List<string>();
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(((PdfNumber)LinkLocation[0]).FloatValue, ((PdfNumber)LinkLocation[1]).FloatValue, ((PdfNumber)LinkLocation[2]).FloatValue, ((PdfNumber)LinkLocation[3]).FloatValue);
                RenderFilter[] renderFilter = new RenderFilter[1];
                renderFilter[0] = new RegionTextRenderFilter(rect);
                ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
                linkTextBuilder = PdfTextExtractor.GetTextFromPage(R, i, textExtractionStrategy).Trim();
                //Both values are overwritten by the next link, so store or return them here (for example in ret)
            }
        }
    }

    //Release the file handle held by the reader
    R.Close();
}

score 1 · Accepted Answer

Unfortunately, I don't think you can do this, at least not without a lot of guesswork. In HTML this is easy because a hyperlink and its text are stored together, like this:

<a href="http://www.example.com/">Click here</a>
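To make the contrast concrete, here is a minimal sketch of reading both values from that one element. It assumes the markup is a single well-formed `<a>` element (real-world HTML usually needs a proper HTML parser), and the names `HtmlAnchorDemo` and `ReadAnchor` are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class HtmlAnchorDemo
{
    // Returns (href, text) for a single well-formed <a> element.
    public static Tuple<string, string> ReadAnchor(string anchorMarkup)
    {
        XElement a = XElement.Parse(anchorMarkup);
        string href = (string)a.Attribute("href"); // null if the attribute is missing
        string text = a.Value;                     // concatenated text content of the element
        return Tuple.Create(href, text);
    }
}
```

The URL and the anchor text come out of the same node, which is exactly the relationship a PDF does not record.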

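In a PDF, however, these two entities are not stored with any form of relationship. What we think of as a "hyperlink" inside a PDF is technically a PDF annotation that just happens to sit on top of some text. You can see this by opening a PDF in an editing program such as Adobe Acrobat Pro: you can change the text, but the "clickable" area does not change. You can also move and resize the "clickable" area and place it anywhere in the document.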

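When you create a PDF, iText/iTextSharp abstracts this away so that you don't have to think about it. You can create a "hyperlink" with clickable text, but when the PDF is generated, the text is ultimately written as normal text, the coordinates of a rectangle are computed, and an annotation is attached to that rectangle.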
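Doing those two steps by hand shows how loosely the text and the link are coupled. This is only a sketch, assuming iTextSharp 5.x; the class name `ManualLinkDemo`, the coordinates, and the rectangle size are illustrative guesses rather than values computed from font metrics.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class ManualLinkDemo
{
    public static void Write(string outputPath)
    {
        using (var stream = new FileStream(outputPath, FileMode.Create))
        {
            var document = new Document();
            PdfWriter writer = PdfWriter.GetInstance(document, stream);
            document.Open();

            // Step 1: ordinary text, with no link information attached to it.
            ColumnText.ShowTextAligned(writer.DirectContent, Element.ALIGN_LEFT,
                new Phrase("Click here"), 100, 700, 0);

            // Step 2: a separate link annotation on a rectangle that merely
            // covers roughly the same area (assumed coordinates).
            var area = new Rectangle(100, 695, 160, 712);
            PdfAnnotation link = PdfAnnotation.CreateLink(writer, area,
                PdfAnnotation.HIGHLIGHT_INVERT, new PdfAction("http://www.example.com/"));
            writer.AddAnnotation(link);

            document.Close();
        }
    }
}
```

Nothing in the resulting file ties the annotation to the phrase, so moving the rectangle elsewhere would still give a valid, clickable link.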

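I did say that you could try to guess, though. To do that, you need to get the rectangle of each annotation and then find the text that is located at those same coordinates. Because of padding, however, the two will not match exactly. If you absolutely must get the text under a hyperlink, this is the only way I know of. Good luck!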
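One way to tolerate the padding mismatch is to accept a piece of text when most of its own bounding box falls inside the link rectangle, instead of requiring an exact match. The sketch below is library-independent: `Box`, `TextChunk`, `LinkTextGuesser`, and the 0.5 coverage threshold are all assumptions made for illustration, and the chunk positions would have to come from whatever text extraction you use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper types; none of these come from iTextSharp.
public readonly struct Box
{
    public readonly float Left, Bottom, Right, Top;

    public Box(float x1, float y1, float x2, float y2)
    {
        // Normalize the corners, since a PDF /Rect may list them in either order.
        Left = Math.Min(x1, x2);
        Right = Math.Max(x1, x2);
        Bottom = Math.Min(y1, y2);
        Top = Math.Max(y1, y2);
    }

    public float Area => (Right - Left) * (Top - Bottom);

    // Area shared with another box (0 if they do not intersect).
    public float OverlapArea(Box other)
    {
        float w = Math.Min(Right, other.Right) - Math.Max(Left, other.Left);
        float h = Math.Min(Top, other.Top) - Math.Max(Bottom, other.Bottom);
        return (w > 0 && h > 0) ? w * h : 0f;
    }
}

public sealed class TextChunk
{
    public string Text { get; }
    public Box Bounds { get; }
    public TextChunk(string text, Box bounds) { Text = text; Bounds = bounds; }
}

public static class LinkTextGuesser
{
    // Joins, in input order, the chunks whose area lies mostly inside the link rectangle.
    public static string GuessAnchorText(Box link, IEnumerable<TextChunk> chunks,
                                         float minCoverage = 0.5f)
    {
        var hits = chunks
            .Where(c => c.Bounds.Area > 0
                     && c.Bounds.OverlapArea(link) / c.Bounds.Area >= minCoverage)
            .Select(c => c.Text);
        return string.Join(" ", hits);
    }
}
```

Measuring coverage relative to the chunk, not the link, means a link rectangle that is slightly too small or shifted by a point or two still matches its text. A rectangle that is much wider than its text will still pick up neighbouring words, so the result remains a guess.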
