c# - C# Text Matching HTML

Question

I'm trying to interact with a really crappy "web-service" (cleverly disguised as simple aspx page...) but I don't control the page so I can't tweak the output so I'm stuck with it. The format is always the same like this:

<b>
   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>
</b>
<br />123 North Main
<br />Hume, ACT
<br />(999) 888-8888

So, I need to parse out the URL, Name, Address, City, State, and Phone? It's not really properly formed XML so I can't use XML parser, and RegEx seems painfully nasty, so am I stuck with String.Match and IndexOf etc?

Thanks for your suggestions... James

score 2 · Accepted Answer

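You can use an HTML parser to parse the page. HtmlAgilityPack is a free and robust one. You can query the result with XQuery or any other processor available for .NET. For the drawbacks of using regex to parse HTML pages, see this thread.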
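As a rough illustration of the parser route, here is a minimal sketch using HtmlAgilityPack on the sample from the question. It assumes the HtmlAgilityPack NuGet package is referenced, that the quotes in the real response are plain `"` characters, and that the three lines after the link are top-level text nodes. The class and method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of the base class library

public static class AgilityPackSketch
{
    // Returns the URL, the name, and then every non-empty top-level text line.
    public static List<string> ExtractFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<string>();

        // The link inside <b> carries the URL (href) and the name (inner text).
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//b/a");
        if (link != null)
        {
            fields.Add(link.GetAttributeValue("href", ""));
            fields.Add(link.InnerText.Trim());
        }

        // The remaining values are bare text nodes sitting after each <br />.
        fields.AddRange(doc.DocumentNode.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText.Trim())
            .Where(t => t.Length > 0));

        return fields;
    }

    public static void Main()
    {
        string html =
            "<b>\n   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n</b>\n" +
            "<br />123 North Main\n<br />Hume, ACT\n<br />(999) 888-8888";

        foreach (string field in ExtractFields(html))
            Console.WriteLine(field);
    }
}
```

Splitting "Hume, ACT" into city and state is still a plain `Split(',')` afterwards; the parser only spares you the tag handling.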

score 1 · Accepted Answer

Assuming the HTML elements <b>, </b>, and <br /> stay static, you don't need regular expressions. My solution is to find the indexes of the elements and take the substring from one index to the next. For example:

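// Locate the <b>...</b> block that wraps the link.
int bStartIndex = html.IndexOf("<b>");
int bEndIndex = html.IndexOf("</b>");
// Length of the content between the tags ("<b>" is 3 characters long).
int urlSize = bEndIndex - bStartIndex - 3;
// This yields the whole <a ...>...</a> element; the href value still has to be cut out of it the same way.
string url = html.Substring(bStartIndex + 3, urlSize);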

Yes, this method is a rough hack, but given the "really crappy web-service" situation, I think it is a fair and straightforward solution, tedious as it is.
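To show how the same index-to-index idea can be carried through to the URL and the name, here is a minimal sketch. The `Between` helper and the class name are made up for this example, and it assumes the quotes in the real response are plain `"` characters and that the markers always appear in this order.

```csharp
using System;

public static class LinkExtractor
{
    // Returns the text between the first occurrence of `open` at or after `from`
    // and the next occurrence of `close`, or null if either marker is missing.
    public static string Between(string s, string open, string close, int from = 0)
    {
        int start = s.IndexOf(open, from, StringComparison.Ordinal);
        if (start < 0) return null;
        start += open.Length;
        int end = s.IndexOf(close, start, StringComparison.Ordinal);
        return end < 0 ? null : s.Substring(start, end - start);
    }

    public static void Main()
    {
        string html =
            "<b>\n   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n</b>\n" +
            "<br />123 North Main\n<br />Hume, ACT\n<br />(999) 888-8888";

        // Everything inside <b>...</b>, i.e. the whole <a ...>...</a> element.
        string anchor = Between(html, "<b>", "</b>");
        // The href value sits between href=" and the next quote.
        string url = Between(anchor, "href=\"", "\"");
        // The name sits between the end of the opening <a ...> tag and </a>.
        string name = Between(anchor, ">", "</a>");

        Console.WriteLine(url);   // http://www.google.com/
        Console.WriteLine(name);  // Google Inc
    }
}
```

Returning null for a missing marker keeps the failure visible instead of letting `Substring` throw on a negative length.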

score 0 · Accepted Answer

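In the past I tried many other ways of getting the inner values out with framework methods. But the format is so custom that I think the only way is to loop over every line of the response. Once you read the link value, you have the URL.
Whenever you start reading a plain string in a line, it will be the address, then the city and state, and so on. If for some reason the properties arrive in a different order, the code will fail. I would recommend (if possible) having the service at least return JSON, which is easy to deserialize. Otherwise you have to build your own deserializer to pull out the data as needed.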
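A minimal sketch of that kind of hand-rolled, order-dependent deserializer might look like this. It assumes exactly the layout from the question (a link block followed by three `<br />` lines, with "City, State" on the second one); the class and method names are made up, and it throws as soon as the layout differs.

```csharp
using System;

public static class ContactLines
{
    // Splits the response on "<br />" and reads the pieces by position:
    // [0] is the <b> link block, then address, "City, State", and phone.
    public static (string Address, string City, string State, string Phone) Parse(string html)
    {
        string[] parts = html.Split(new[] { "<br />" }, StringSplitOptions.None);
        if (parts.Length != 4)
            throw new FormatException("Expected a link block followed by three <br /> lines.");

        string[] cityState = parts[2].Split(',');
        if (cityState.Length != 2)
            throw new FormatException("Expected 'City, State' on the second line.");

        return (parts[1].Trim(), cityState[0].Trim(), cityState[1].Trim(), parts[3].Trim());
    }

    public static void Main()
    {
        string html =
            "<b>\n   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n</b>\n" +
            "<br />123 North Main\n<br />Hume, ACT\n<br />(999) 888-8888";

        var c = Parse(html);
        Console.WriteLine(c.Address); // 123 North Main
        Console.WriteLine(c.City);    // Hume
        Console.WriteLine(c.State);   // ACT
        Console.WriteLine(c.Phone);   // (999) 888-8888
    }
}
```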

score 0 · Accepted Answer

You can use Regex.Replace like this (if it is always formatted exactly the same way):

string crappyXML =
    "<b>\n" +
    "   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n" +
    "</b>\n" +
    "<br />123 North Main\n" +
    "<br />Hume, ACT\n" +
    "<br />(999) 888-8888";

string betterXML = Regex.Replace(crappyXML, "</b><br />", "</b><br>");

(If there is whitespace in between, you may need to account for it.)

betterXML then looks like this:

"<b>
   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>
</b>
<br>123 North Main
<br />Hume, ACT
<br />(999) 888-8888";

Then you can run another regex:

betterXML = Regex.Replace(betterXML, "<br />", "</br><br>");

Which gives:

"<b>
   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>
</b>
<br>123 North Main
</br><br>Hume, ACT
</br><br>(999) 888-8888";

Then just do this:

betterXML += "</br>";

to close the last tag.

Again, none of my Regex.Replace calls account for whitespace; you will need to add that.

From there you should be able to loop through it with an XML parser and pull out the data.
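Putting those steps together, a minimal sketch could look like the following. Two things here are my own assumptions and not part of the steps above: the patterns use `\s*` to absorb the line breaks between tags, and the result is wrapped in a made-up `<root>` element because an XML parser needs a single root. It will still fail on input that is not well-formed XML, for example an unescaped `&` in a name.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BrToXml
{
    public static XElement ToXml(string html)
    {
        // Step 1: the first <br /> (right after </b>) becomes an opening <br>.
        string xml = Regex.Replace(html, @"</b>\s*<br />", "</b><br>");
        // Step 2: every remaining <br /> closes the previous element and opens the next.
        xml = Regex.Replace(xml, @"\s*<br />", "</br><br>");
        // Step 3: close the last element.
        xml = xml.TrimEnd() + "</br>";
        // Wrap in a single root element so the fragment parses as one document.
        return XElement.Parse("<root>" + xml + "</root>");
    }

    public static void Main()
    {
        string html =
            "<b>\n   <a href=\"http://www.google.com/\" target=\"_blank\">Google Inc</a>\n</b>\n" +
            "<br />123 North Main\n<br />Hume, ACT\n<br />(999) 888-8888";

        XElement root = ToXml(html);
        XElement link = root.Element("b").Element("a");

        Console.WriteLine(link.Attribute("href").Value); // http://www.google.com/
        Console.WriteLine(link.Value);                   // Google Inc
        foreach (string line in root.Elements("br").Select(e => e.Value))
            Console.WriteLine(line);                     // address, "City, State", phone
    }
}
```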

Hope that helps! Let me know if you have any questions.
