c# - Why am I picking up foreign characters, and how can I remove them?

Question

When I use the HTML Agility Pack to get the InnerText of an H3 tag, I pick up extra characters (Â) that are not in the source.

I don't know where these characters come from or how to remove them.

Extracted string:

Â WeekÂ 1

HTML source:

<h3>
<span> </span>Week 1</h3>

Current code:

private void getWeekNumber(string url)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.Load(new System.IO.StringReader(url));

    foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
    {
        MessageBox.Show(h3.InnerText);
    }
}

Current workaround (lifted from somewhere on Stack Overflow; I've lost the link):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Method = "GET";

using (var stream = request.GetResponse().GetResponseStream())
using (var reader = new System.IO.StreamReader(stream, Encoding.UTF8))
{
    result = reader.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

htmlDoc.Load(new System.IO.StringReader(result));

foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
{
    MessageBox.Show(h3.InnerText);
}

score 4 · Accepted Answer

You need to set the encoding before that...

htmlDoc.Load(new System.IO.StringReader(url), Encoding.UTF8);

This tells the Agility Pack that the characters are UTF-8 rather than some other encoding.
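As a rough sketch of that idea, the snippet below hands the raw response bytes to the Agility Pack together with an explicit encoding. It assumes the page really is served as UTF-8 and relies on the `HtmlDocument.Load(Stream, Encoding)` overload; `WeekHeadings`, `ReadH3Texts` and `ReadH3TextsFromUrl` are names invented for this example, not part of the library.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static class WeekHeadings
{
    // Parses HTML from a byte stream, decoding it as UTF-8 (assumed page encoding).
    public static List<string> ReadH3Texts(Stream htmlStream)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.Load(htmlStream, Encoding.UTF8);

        var texts = new List<string>();
        HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//h3");
        if (nodes == null)
        {
            return texts; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode h3 in nodes)
        {
            texts.Add(h3.InnerText);
        }
        return texts;
    }

    // Hypothetical helper: downloads the page and passes the response stream on undecoded.
    public static List<string> ReadH3TextsFromUrl(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            return ReadH3Texts(stream);
        }
    }
}
```

Keeping the parsing in `ReadH3Texts` separate from the download makes it possible to try the decoding on a `MemoryStream` without touching the network.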

The reason you need to do it here is that this is the point where the text gets parsed incorrectly. After this, you are storing literal Â characters.
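To see where the Â can come from, here is a minimal sketch that reproduces the effect without any HTML parsing. It assumes the blanks in the page are non-breaking spaces (U+00A0) sent as UTF-8 and that the bytes are then decoded as ISO-8859-1; `MojibakeDemo` and `DecodeUtf8AsLatin1` are made-up names for this illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with the wrong
    // single-byte encoding (ISO-8859-1) to reproduce the stray characters.
    public static string DecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    static void Main()
    {
        // Assumed page content: non-breaking spaces before "Week" and before "1".
        string original = "\u00A0Week\u00A01";
        string garbled = DecodeUtf8AsLatin1(original);

        // U+00A0 is the two bytes C2 A0 in UTF-8; read one byte at a time,
        // C2 becomes "Â" (U+00C2) and A0 stays a non-breaking space.
        Console.WriteLine(garbled == "\u00C2\u00A0Week\u00C2\u00A01"); // prints "True"
    }
}
```

Once the text has been decoded this way, the Â is an ordinary character in the string, which is why the encoding has to be right at the point where the bytes are read.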

You may also find "Characters in string changed after downloading HTML from the internet" interesting.

score 1 · Accepted Answer

It may be your character encoding; try setting the encoding to UTF-8.
