c# - Get a specific part of a string when it occurs multiple times in the string

Question

I'm trying to grab member profile links from a forum and display them in a console app. What I want to do is get all the links from the web page and print them.

Currently I fetch the page source like this:

String source = new WebClient().DownloadString("URL");

What I want to do is loop through that string and find every block that looks like this:

<h3 class='ipsType_subtitle'>
         <strong><a href='http://www.website.org/community/user/8416-unreal/' title='View Profile'>!Unreal</a></strong>
</h3>

Then, once I have that block, I want to pull out the URL, like this:

http://www.website.org/community/user/8416-unreal/

This is the code I have tried so far. It works, but it only gets one of the links.

    WebClient c = new WebClient();
    String members = c.DownloadString("http://www.powerbot.org/community/members/");
    int times = Regex.Matches(members, "<h3 class='ipsType_subtitle'>").Count;
    Console.WriteLine(times.ToString());

    for (int i = 1; i < times; i++)
    {
        try
        {
            int start = members.IndexOf("<h3 class='ipsType_subtitle'>");
            members = members.Substring(start, 500);
            String[] next = members.ToString().Split(new string[] { "a href='" }, StringSplitOptions.None);
            String[] link = next[1].Split(' ');
            Console.WriteLine(link[0].Replace("'", ""));
        }
        catch(Exception e) { Console.WriteLine("Failed: " + e.ToString()); }
    }

    Console.Read();

Thanks.

score 1 · Accepted Answer

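// Requires the HtmlAgilityPack package and `using System.Linq;`.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(members); // members is the page source downloaded in the question

// For every <h3 class='ipsType_subtitle'>, take the href of its first <a>.
var links = doc.DocumentNode
    .Descendants("h3")
    .Where(h => h.Attributes["class"] != null && h.Attributes["class"].Value == "ipsType_subtitle")
    .Select(h => h.Descendants("a").First().Attributes["href"].Value)
    .ToArray();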

score 0 · Accepted Answer

Here is your code with a few changes. It should work now. That said, you certainly did not pick the best approach for this task.

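using System;
using System.Net;
using System.Text.RegularExpressions;

WebClient c = new WebClient();
String members = c.DownloadString("http://www.powerbot.org/community/members/");
int times = Regex.Matches(members, "<h3 class='ipsType_subtitle'>").Count;
Console.WriteLine(times.ToString());

var member = string.Empty; // the chunk extracted for the current profile

// Start at 0 so every match is visited; starting at 1 skips the last one.
for (int i = 0; i < times; i++)
{
    try
    {
        int start = members.IndexOf("<h3 class='ipsType_subtitle'>");
        // Clamp the window so Substring/Remove cannot run past the end of the page.
        int length = Math.Min(500, members.Length - start);
        member = members.Substring(start, length);

        // Drop the chunk just read so the next IndexOf finds the following profile.
        // Note: this assumes consecutive headers are more than 500 characters apart.
        members = members.Remove(start, length);

        String[] next = member.Split(new string[] { "a href='" }, StringSplitOptions.None);
        String[] link = next[1].Split(' ');
        Console.WriteLine(link[0].Replace("'", ""));
    }
    catch (Exception e) { Console.WriteLine("Failed: " + e.ToString()); }
}

Console.Read();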
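
For comparison, here is a minimal sketch of the same string-scanning idea that does not modify the page source or rely on a fixed 500-character window. It passes a start index to IndexOf so that each search resumes where the previous match ended. The class and method names (ProfileLinkScanner, ExtractLinks) are made up for illustration, and the sketch assumes the markup looks exactly like the sample in the question (single-quoted attributes, one link per header).

```csharp
using System;
using System.Collections.Generic;

public static class ProfileLinkScanner
{
    // Marker strings copied from the question's HTML (assumption: single-quoted attributes).
    private const string Header = "<h3 class='ipsType_subtitle'>";
    private const string HrefStart = "a href='";

    // Returns the first href value that follows each header marker in the page source.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        int pos = 0;

        while (true)
        {
            // Resume the search where the previous match ended instead of at index 0.
            int header = html.IndexOf(Header, pos, StringComparison.Ordinal);
            if (header < 0) break;

            int hrefStart = html.IndexOf(HrefStart, header + Header.Length, StringComparison.Ordinal);
            if (hrefStart < 0) break;
            hrefStart += HrefStart.Length;

            // The URL ends at the closing single quote.
            int hrefEnd = html.IndexOf('\'', hrefStart);
            if (hrefEnd < 0) break;

            links.Add(html.Substring(hrefStart, hrefEnd - hrefStart));
            pos = hrefEnd;
        }

        return links;
    }
}
```

With the page source from the question in members, the links could then be printed with foreach (var link in ProfileLinkScanner.ExtractLinks(members)) Console.WriteLine(link);. This is still plain string matching, so it breaks as soon as the forum changes its markup.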

score 0 · Accepted Answer

A better way is to use HtmlAgilityPack.

answered 2012-05-28T11:01:17.750

score 0 · Accepted Answer

The most correct way to parse HTML is to use an HTML parser such as HtmlAgilityPack. No other approach will parse the page reliably.

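The proof of this is the concept of "balanced parentheses". You cannot parse such strings with classical regular expressions, because you have to keep track of the nesting, as in ((x)), and a regular expression is a stateless construct with no memory for that.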

Regular expressions are not bad, but they are not suited to this kind of parsing.
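
To make the "balanced parentheses" point concrete, here is a small sketch (the names ParenChecker and IsBalanced are hypothetical, not from any library). Even the simplest correct check needs a counter that can grow without bound, and that counter is exactly the state a classical regular expression does not have. HTML tags nest the same way, with the added complication that a parser must also remember which tag is open at each level.

```csharp
public static class ParenChecker
{
    // Returns true when every '(' has a matching ')' in the right order.
    public static bool IsBalanced(string text)
    {
        int depth = 0; // current nesting depth: the state that has to be remembered

        foreach (char ch in text)
        {
            if (ch == '(')
            {
                depth++;
            }
            else if (ch == ')')
            {
                depth--;
                if (depth < 0) return false; // a ')' appeared before its '('
            }
        }

        return depth == 0; // anything left over means an unclosed '('
    }
}
```

For example, ParenChecker.IsBalanced("((x))") is true, while "((x)" and ")(" are both rejected.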

Hope this helps.
