c# - Get content between two HTML tags using HtmlAgilityPack

Question

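I have a very large help document authored in Word, which was used to generate an even larger and more unwieldy HTM document. Using C# and HtmlAgilityPack, I would like to grab and display just one section of this file at any point in my application. The sections are split up like this: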

<!--logical section starts here -->
<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section A</a></h1>
</div>
 <div> Lots of unnecessary markup for simple formatting... </div>
 .....
<!--logical section ends here -->

<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section B</a></h1>
</div>

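Logically speaking, there is an h1 with the section name inside an a tag. I want to select everything from the outer div that contains that h1 up to the next div that contains an h1, excluding that next div.

Each section name sits in an <a> tag under an h1, which has multiple children (about 6 each)
The logical section is marked with comments
These comments do not exist in the actual document

My attempt: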

var startNode = helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(., '"+sectionName+"')]");
//go up one level from the a node to the h1 element
startNode=startNode.ParentNode;

//get the start index as the index of the div containing the h1 element
int startNodeIndex = startNode.ParentNode.ChildNodes.IndexOf(startNode);

//here I am not sure how to get the endNode location. 
var endNode =?;

int endNodeIndex = endNode.ParentNode.ChildNodes.IndexOf(endNode);

//select everything from the start index to the end index
var nodes = startNode.ParentNode.ChildNodes.Where((n, index) => index >= startNodeIndex && index <= endNodeIndex).Select(n => n);

Since I have not been able to find documentation on this, I am not sure how to get from my start node to the next h1 element. Any suggestions would be appreciated.

score 5 · Accepted Answer

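I think this will do it, though it assumes that H1 tags only appear in section headings. If that is not the case, you can add a Where on the Descendants call to apply other filters to any H1 nodes it finds. Note that this collects all siblings of the div it finds, up to the next sibling that holds a section name.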

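using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

private List<HtmlNode> GetSection(HtmlDocument helpDocument, string SectionName)
{
    // Find the div whose text is just the section name. Trim() is needed because
    // InnerText also contains the whitespace around the nested h1.
    HtmlNode startNode = helpDocument.DocumentNode.Descendants("div")
        .FirstOrDefault(d => d.InnerText.Trim().Equals(SectionName, StringComparison.InvariantCultureIgnoreCase));
    if (startNode == null)
        return null; // section not found

    // Collect the siblings after the heading div, stopping at the next sibling that contains an h1.
    List<HtmlNode> section = new List<HtmlNode>();
    HtmlNode sibling = startNode.NextSibling;
    while (sibling != null && !sibling.Descendants("h1").Any())
    {
        section.Add(sibling);
        sibling = sibling.NextSibling;
    }

    return section;
}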
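
The question also asked for the heading div itself to be part of the result, which GetSection above leaves out. The following is a minimal sketch of that variant, not a tested drop-in. It assumes the sections are sibling nodes under one parent, as in the sample markup, and that an h1 never appears inside a section body. `SectionExtractor` and `GetSectionWithHeading` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SectionExtractor
{
    // Hypothetical helper: returns the heading div plus every following
    // sibling up to (not including) the next sibling that contains an h1.
    public static List<HtmlNode> GetSectionWithHeading(HtmlDocument doc, string sectionName)
    {
        // Find the h1 whose decoded, trimmed text equals the section name.
        HtmlNode heading = doc.DocumentNode.Descendants("h1")
            .FirstOrDefault(h => HtmlEntity.DeEntitize(h.InnerText).Trim()
                .Equals(sectionName, StringComparison.OrdinalIgnoreCase));

        // Walk up to the nearest enclosing div (assumption: every h1 has one).
        HtmlNode startDiv = heading?.Ancestors("div").FirstOrDefault();

        var section = new List<HtmlNode>();
        if (startDiv == null)
            return section; // section not found: empty list instead of null

        section.Add(startDiv);
        for (HtmlNode sibling = startDiv.NextSibling;
             sibling != null && !sibling.Descendants("h1").Any();
             sibling = sibling.NextSibling)
        {
            section.Add(sibling);
        }
        return section;
    }
}
```

To display the section, the collected nodes can be joined back into markup, for example with `string.Concat(nodes.Select(n => n.OuterHtml))`. Whitespace text nodes between the divs are kept in the list so the rebuilt markup matches the source.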

score 0 · Accepted Answer

So what you really want as a result is the div around the h1 tag? If yes, then this should work.

helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(@name, '"+sectionName+"')]/ancestor::div");

Depending on your HTML, SelectNodes works as well, like this:

helpDocument.DocumentNode.SelectNodes("//h1/a[starts-with(@name,'_Toc')]/ancestor::div");

Oh, and while testing this I noticed that the dot in the contains function was not working for me. Once I changed it to the name attribute, everything worked fine.
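
As a rough illustration of the starts-with query above, the sketch below lists the text of every heading anchor it matches, which can help when deciding what section name to search for. It assumes the Word-generated anchors all carry a name beginning with `_Toc`, as in the question's sample. `SectionIndex` and `ListSectionNames` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SectionIndex
{
    // Hypothetical helper: returns the text of every h1 anchor whose name
    // attribute starts with "_Toc" (the bookmark prefix in the sample).
    public static List<string> ListSectionNames(HtmlDocument doc)
    {
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//h1/a[starts-with(@name,'_Toc')]");

        // SelectNodes returns null rather than an empty collection
        // when nothing matches, so guard before enumerating.
        if (anchors == null)
            return new List<string>();

        return anchors
            .Select(a => HtmlEntity.DeEntitize(a.InnerText).Trim())
            .ToList();
    }
}
```

The null check matters in practice: enumerating the result of SelectNodes directly throws a NullReferenceException on documents with no matching anchors.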
