c# - C# Web crawler/parser/spider

Question

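I am new to C# and WinForms and want to write a Web crawler (parser) that can parse a web page and display its links hierarchically. I also don't know how to make the bot crawl to a specific hyperlink depth.

So I guess I have two questions:

How do I make the bot crawl to a specified link depth?
How do I display all the hyperlinks hierarchically?

P.S. A code sample would be great.

P.P.S. The form has one button (button1) and one RichTextBox (richTextBox1).

Here is my code. I know it is very ugly... (all of it lives in a single button handler):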

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Windows.Forms;

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void button1_Click(object sender, EventArgs e)
    {
        // Go to this URL (it must be known before the request is created):
        string url = UrlTextBox.Text = "http://www.yahoo.com";
        if (!(url.StartsWith("http://") || url.StartsWith("https://")))
            url = "http://" + url;

        // Declarations
        string anotherTest = @"(((ht){1}tp[s]?://)[-a-zA-Z0-9@:%_\+.~#?&\\]+)";
        List<string> savedUrls = new List<string>();
        List<string> titles = new List<string>();

        // Scrape the whole HTML source; "using" closes the response and reader.
        string s;
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            s = sr.ReadToEnd();
        }

        try
        {
            // Get URLs:
            Match m = Regex.Match(s, anotherTest,
                                  RegexOptions.IgnoreCase | RegexOptions.Compiled,
                                  TimeSpan.FromSeconds(1));

            while (m.Success)
            {
                savedUrls.Add(m.Groups[1].ToString());
                m = m.NextMatch();
            }

            // Get the title:
            Match m2 = Regex.Match(s, @"<title>\s*(.+?)\s*</title>");
            if (m2.Success)
            {
                titles.Add(m2.Groups[1].Value);
            }

            // Show the title (a page may not have one):
            if (titles.Count > 0)
                richTextBox1.Text += titles[0] + "\n";

            // Show URLs:
            TrimUrls(ref savedUrls);
        }
        catch (RegexMatchTimeoutException)
        {
            Console.WriteLine("The matching operation timed out.");
        }
    }

    // Prints each distinct URL that contains a dot, skipping the W3C namespace URL.
    private void TrimUrls(ref List<string> urls)
    {
        List<string> d = urls.Distinct().ToList();
        foreach (var v in d)
        {
            if (v.IndexOf('.') != -1 && v != "http://www.w3.org")
            {
                richTextBox1.Text += v + "\n";
            }
        }
    }
}

One more question: does anyone know how to save the result as a tree in XML?

score 2 · Accepted Answer

I would also strongly recommend the HTML Agility Pack.

With the Html Agility Pack you can do something like this:

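// Requires: using System; using System.Collections.Generic; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
var urls = new List<String>();
// SelectNodes returns null when nothing matches, so guard before iterating.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var x in anchors)
        urls.Add(x.Attributes["href"].Value);
}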
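
The href values collected this way are often relative ("/news", "page2.html") or not crawlable at all ("mailto:..."), so they usually need to be resolved against the page URL before they can be downloaded. The following is only a minimal sketch of that step, using `System.Uri` from the base class library; `LinkNormalizer` and `ToAbsoluteUrls` are hypothetical names introduced for illustration and are not part of the Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;

public static class LinkNormalizer
{
    // Hypothetical helper: turns raw href values into absolute http/https
    // URLs, in first-seen order and without duplicates.
    public static List<string> ToAbsoluteUrls(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl);
        var seen = new HashSet<string>();
        var result = new List<string>();

        foreach (var href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href)) continue;

            // Resolves "/a", "../b" or "c.html" against the page URL;
            // hrefs that are already absolute are kept as they are.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, href.Trim(), out absolute)) continue;

            // Skip schemes a web crawler cannot download (mailto:, javascript:, ...).
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps) continue;

            // Drop the #fragment so "page#top" and "page" count as one URL.
            var clean = absolute.GetLeftPart(UriPartial.Query);
            if (seen.Add(clean)) result.Add(clean);
        }
        return result;
    }
}
```

Assuming `pageUrl` holds the address the HTML was downloaded from, the call would look like `var absoluteUrls = LinkNormalizer.ToAbsoluteUrls(pageUrl, urls);`.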

Edit:

You can also do something like this, but be sure to add exception handling.

public class ParsResult
{
    public ParsResult Parent { get; set; }
    public String Url { get; set; }
    public Int32 Depth { get; set; }
}


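// Requires: using System; using System.Collections.Generic; using System.Linq;
//           using System.Net; using HtmlAgilityPack;
private readonly List<ParsResult> _results = new List<ParsResult>();
private Int32 _maxDepth = 5;

// Depth-limited recursive crawl. Pass urlToCheck for the start page;
// the recursive calls read the URL from parent instead.
public void Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
{
    if (depth >= _maxDepth) return;
    if (urlToCheck == null && parent == null) return; // nothing to download
    String html;
    using (var wc = new WebClient())
        html = wc.DownloadString(urlToCheck ?? parent.Url);

    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var aNods = doc.DocumentNode.SelectNodes("//a");
    if (aNods == null || !aNods.Any()) return;
    foreach (var aNode in aNods)
    {
        var url = aNode.Attributes["href"];
        if (url == null)
            continue;
        var result = new ParsResult
        {
            Depth = depth,
            Parent = parent,
            Url = url.Value // may be relative; DownloadString needs an absolute URL
        };
        _results.Add(result);
        Console.WriteLine("{0} - {1}", depth, result.Url);
        Foo(depth: depth + 1, parent: result);
    }
}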
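
Because every result keeps a reference to its parent, the flat `_results` list already encodes a tree, which covers both the hierarchical display and the XML question. The following is a rough sketch using LINQ to XML (`System.Xml.Linq`), not the only way to do it. `CrawlNode` mirrors the shape of `ParsResult` and is redeclared only so the sketch compiles on its own; `CrawlTree`, the element names `crawl` and `link`, and the attribute names are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Same shape as ParsResult in the answer, under a hypothetical name.
public class CrawlNode
{
    public CrawlNode Parent { get; set; }
    public string Url { get; set; }
    public int Depth { get; set; }
}

public static class CrawlTree
{
    // Builds nested <link> elements that follow the Parent references.
    // Results whose Parent is null become children of the root element.
    public static XElement ToXml(IList<CrawlNode> results, string rootUrl)
    {
        var root = new XElement("crawl", new XAttribute("url", rootUrl));
        AddChildren(root, null, results);
        return root;
    }

    private static void AddChildren(XElement target, CrawlNode parent, IList<CrawlNode> results)
    {
        // A linear scan per node is quadratic overall; fine for a sketch.
        foreach (var node in results.Where(r => ReferenceEquals(r.Parent, parent)))
        {
            var element = new XElement("link",
                new XAttribute("url", node.Url),
                new XAttribute("depth", node.Depth));
            target.Add(element);
            AddChildren(element, node, results);
        }
    }

    // Plain-text view of the same tree: two spaces of indent per level.
    public static string ToIndentedText(XElement element, int level = 0)
    {
        var line = new string(' ', level * 2) + (string)element.Attribute("url") + "\n";
        return line + string.Concat(
            element.Elements().Select(e => ToIndentedText(e, level + 1)));
    }
}
```

With a list of nodes in hand, `CrawlTree.ToXml(nodes, startUrl).Save("links.xml")` would write the tree to a file (the file name is just an example), and the string returned by `ToIndentedText` could be assigned to `richTextBox1.Text`.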
