c# - How to extract information from HTML in C#?

Question

How can I extract information from HTML in C#? I am using C# with a WinRT class library.

I would like to extract the main content and images from this page: http://lifehacker.com/5923026/remains-of-the-day-google-image-search-gets-knowledge-graph-integration

Here is part of the page's source:

<html xmlns="http://www.w3.org/1999/xhtml" class="feature_chompcommentimages feature_s3upload feature_switch feature_powwowtest" xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>

  **<title>Remains of the Day: Google Image Search Gets Knowledge Graph Integration</title>**
          <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <meta http-equiv="content-language" content="en" />
  <meta http-equiv="refresh" content="86400" />
  <meta name="robots" content="all" />
                      <meta name="keywords" content="For What It&#039;s Worth, remainders, in brief, Lifehacker" />
                  <meta property="fb:page_id" content="7568536355" />
                              <meta name="title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      **<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />**
                      <link rel="image_src" href="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/original.jpg" />
          <meta property="og:image" content="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/xlarge.jpg" />
                  <meta property="og:site_name" content="Lifehacker"/>
      <meta property="og:title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      <meta property="og:description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS." />
      <meta property="og:type" content="article" />

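Using SyndicationFeed.Title.Text (with using Windows.Web.Syndication;), I can already extract the title: Remains of the Day: Google Image Search Gets Knowledge Graph Integration

Please help me extract the description: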

<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />
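
One dependency-free way to read such a meta tag is a regular expression over the downloaded page text. The following is only a minimal sketch: the class and method names (MetaExtractor, GetMetaContent) are hypothetical, it assumes the HTML is already available as a string, and it assumes the name/property attribute comes before content with double-quoted values, as in the source quoted above. It is not a general HTML parser.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class MetaExtractor
{
    // Returns the content of <meta name="key" ...> or <meta property="key" ...>,
    // or null when no such tag is found.
    public static string GetMetaContent(string html, string key)
    {
        // Assumes name/property precedes content and both use double quotes.
        string pattern =
            "<meta\\s+(?:name|property)\\s*=\\s*\"" + Regex.Escape(key) +
            "\"\\s+content\\s*=\\s*\"([^\"]*)\"";

        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        if (!match.Success)
        {
            return null;
        }

        // Decode entities such as &#039; and drop surrounding whitespace.
        return WebUtility.HtmlDecode(match.Groups[1].Value).Trim();
    }
}
```

Calling MetaExtractor.GetMetaContent(html, "description") would give the description text, and the same call with "og:image" would give the image URL from the og:image tag.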

I also need to extract the main content inside

<div id="container"> <script type="text/javascript">

<!-- %JUMP:More &raquo;% --><\/p>\n<ul>\n<li><a href=\"http:\/\/insidesearch.blogspot.com\/2012\/07\/find-smarter-more-comprehensive-search.html\">Find Smarter, More Comprehensive Search by Image Results<\/a> <i>Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identity any photo that you upload or link to and provide more information about the subject.<\/i> [Google Blog]<\/li>\n<li>

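Content: "Find Smarter, More Comprehensive Search by Image Results Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identity any photo that you upload or link to and provide more information about the subject. [Google Blog]"

Thank you very much!!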
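
The markup in that script block is escaped as a JavaScript string (\/, \", \n), so it has to be unescaped before the text can be read. A rough sketch of that step follows, assuming the escaped fragment has already been cut out of the page as a string. The names ScriptContentExtractor, UnescapeJsString and ToPlainText are hypothetical, \uXXXX escapes are not handled, and tag stripping by regex is only a shortcut for simple fragments like this one.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ScriptContentExtractor
{
    // Undoes simple JavaScript string escaping: \/ \" \\ \n \r \t.
    // \uXXXX sequences are left unhandled in this sketch.
    public static string UnescapeJsString(string escaped)
    {
        return Regex.Replace(escaped, @"\\(.)", m =>
        {
            switch (m.Groups[1].Value)
            {
                case "n": return "\n";
                case "r": return "\r";
                case "t": return "\t";
                default: return m.Groups[1].Value; // \/ -> /, \" -> ", \\ -> \
            }
        });
    }

    // Converts the escaped HTML fragment into plain text.
    public static string ToPlainText(string escapedHtml)
    {
        string html = UnescapeJsString(escapedHtml);

        // Remove comments such as <!-- %JUMP:More &raquo;% --> first, then tags.
        string noComments = Regex.Replace(html, "<!--.*?-->", " ", RegexOptions.Singleline);
        string noTags = Regex.Replace(noComments, "<[^>]+>", " ");

        // Decode entities and collapse runs of whitespace into single spaces.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Link or image URLs could be collected from the unescaped HTML in a similar way before the tags are stripped.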

[7/4/12]
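Sorry, to clarify: I am trying to extract text (as a string) and images (as a link or a BitmapImage) from HTML, either by parsing the HTML directly or by converting it to XML first and then parsing that.

I am using HtmlAgilityPack from htmlagilitypack.codeplex.com and the tutorial at 4guysfromrolla.com/articles/011211-1.aspx. Since HtmlAgilityPack is not supported there, I am still wondering whether there is a better solution for Metro style apps. For example, it has a method to convert HTML to XML, but WinRT no longer supports XmlTextReader from .NET.

Thanks again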

score 0 · Accepted Answer

Jerry, rather than parsing this XML yourself, I would recommend using an RSS library. Take a look at RssToolkit.
