csquery - CsQuery InnerText of an HTML node, including anchor text

Question

I am using CsQuery to parse WordPress blog posts so that I can run a text clustering analysis on them. I'd like to extract the text from the relevant <p> node.

var content = dom["div.entry-content>p"];
if (content.Length == 1)
{
    System.Diagnostics.Debug.WriteLine(content[0].InnerHTML);
    System.Diagnostics.Debug.WriteLine(content[0].InnerText);
}

For one of the posts, InnerHTML looks like this:

An MIT Europe project that attempts to <a title="Wired News: Gizmo Puts Cards 
on the Table" href="http://www.wired.com/news/technology/0,1282,61265,00.html?
tw=rss.TEK">connect two loved ones seperated by distance</a> through the use 
of two tables, a bunch of RFID tags and a couple of projectors.

And the corresponding InnerText looks like this:

An MIT Europe project that attempts to through the use of two tables, a bunch of RFID tags and a couple of projectors.

In other words, the inner text is missing the anchor text. I could parse the HTML myself, but I'm hoping there is a way to have CsQuery give me:

An MIT Europe project that attempts to connect two loved ones seperated by distance through the use of two tables, a bunch of RFID tags and a couple of projectors.

(Italics mine.) How can I get this?

score 1 · Accepted Answer

Try using HtmlAgilityPack:

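using System;
using HAP = HtmlAgilityPack;
...
var doc = new HAP.HtmlDocument();
doc.LoadHtml("Your html");
// Select the node by its XPath; SelectSingleNode returns null when nothing matches.
var node = doc.DocumentNode.SelectSingleNode(@"node xPath");
// InnerText is a property, not a method.
Console.WriteLine(node.InnerText);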
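To make the snippet above concrete, here is a minimal sketch applied to a shortened version of the paragraph from the question. The class name AnchorTextDemo, the helper ExtractText, the example.com link, and the XPath for div.entry-content are illustrative assumptions, not part of the original post. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AnchorTextDemo
{
    // Hypothetical helper: returns the text of the first node matching the XPath,
    // or null when nothing matches.
    public static string ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // HtmlNode.InnerText concatenates the text of all descendant nodes,
        // so the text inside <a> is included. DeEntitize turns entities such
        // as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(node.InnerText);
    }

    public static void Main()
    {
        // Shortened sample markup modelled on the question; the href is a placeholder.
        const string html =
            "<div class=\"entry-content\"><p>An MIT Europe project that attempts to " +
            "<a href=\"http://example.com/\">connect two loved ones seperated by distance</a> " +
            "through the use of two tables.</p></div>";

        Console.WriteLine(ExtractText(html, "//div[@class='entry-content']/p"));
    }
}
```

The null check matters in practice: if the XPath does not match the page, SelectSingleNode returns null and reading InnerText directly would throw.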

The XPath is the path to the node on the page.

For example, in Google Chrome press F12, select the node, right-click it and choose "Copy XPath".

The XPath of this topic's header: //*[@id="question-header"]/h1/a
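If you would rather stay within CsQuery, its jQuery-style Text() method may be enough, since it collects the text of the selected elements together with their descendants. The sketch below is an assumption about how that could look for the question's selector. The class name CsQueryTextDemo, the helper ExtractText, and the placeholder link are illustrative only, and the CsQuery NuGet package is assumed to be referenced.

```csharp
using System;
using CsQuery; // assumption: the CsQuery NuGet package is installed

public static class CsQueryTextDemo
{
    // Hypothetical helper: returns the combined text of every element matched
    // by the CSS selector, including text nested inside child elements.
    public static string ExtractText(string html, string selector)
    {
        CQ dom = CQ.Create(html);
        return dom[selector].Text();
    }

    public static void Main()
    {
        // Shortened sample markup modelled on the question; the href is a placeholder.
        const string html =
            "<div class=\"entry-content\"><p>An MIT Europe project that attempts to " +
            "<a href=\"http://example.com/\">connect two loved ones seperated by distance</a> " +
            "through the use of two tables.</p></div>";

        Console.WriteLine(ExtractText(html, "div.entry-content>p"));
    }
}
```

With the code from the question, that would mean calling content.Text() on the selection instead of reading content[0].InnerText.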
