c# - Get links from a web page

Question

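I need to get all of the item links (URLs) from this web page into a text file, one per line (that is, a list such as "Item #1", "Item #2", and so on).

The web page is http://dota-trade.com/equipment?order=name, and as you scroll down it shows roughly 500 to 1000 items.

Which programming language should I use, or how can I go about doing this? I also already have some experience with iMacros.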

score 1 · Accepted Answer

You need to download HtmlAgilityPack (it is available as a NuGet package).

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication5
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient wc = new WebClient();
            var sourceCode = wc.DownloadString("http://dota-trade.com/equipment?order=name");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(sourceCode);
            var node = doc.DocumentNode;
            // Select only anchors that carry an href, so Attributes["href"] is never null.
            var nodes = node.SelectNodes("//a[@href]");
            List<string> links = new List<string>();
            foreach (var item in nodes)
            {
                var link = item.Attributes["href"].Value;
                links.Add(link.Contains("http") ? link : "http://dota-trade.com" +link);
            }
            int index = 1;
            while (true)
            {
                sourceCode = wc.DownloadString("http://dota-trade.com/equipment?order=name&offset=" + index.ToString());
                doc = new HtmlDocument();
                doc.LoadHtml(sourceCode);
                node = doc.DocumentNode;
                // Select only anchors that carry an href, so Attributes["href"] is never null.
                nodes = node.SelectNodes("//a[@href]");
                var cont = node.SelectSingleNode("//tr[@itemtype='http://schema.org/Thing']");
                if (cont == null) break; 
                foreach (var item in nodes)
                {
                    var link = item.Attributes["href"].Value;
                    links.Add(link.Contains("http") ? link : "http://dota-trade.com" + link);
                }
                index++;
            }
            System.IO.File.WriteAllLines(@"C:\Users\Public\WriteLines.txt", links);
        }
    }
}
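
The paging loop in the C# code above (request the next offset, stop when a page has no item rows) can be sketched in isolation as follows. This is only an illustration in Ruby, the language of the other answer in this thread: `collect_pages`, `fetch_page`, and `max_pages` are hypothetical names, the fake pages stand in for real HTTP responses, and whether the site's `offset` parameter counts pages or items is not confirmed here.

```ruby
# Hypothetical helper: call fetch_page with offsets 0, 1, 2, ... and collect
# whatever each page yields, stopping at the first page with no items.
# max_pages is a safety cap (an assumption, not part of the original answer).
def collect_pages(fetch_page, max_pages: 1000)
  results = []
  (0...max_pages).each do |offset|
    items = fetch_page.call(offset)
    break if items.nil? || items.empty?
    results.concat(items)
  end
  results
end

# Stand-in for a real HTTP fetch: two fake pages with items, then an empty one.
fake_pages = [["/item/1", "/item/2"], ["/item/3"], []]
fetch = ->(offset) { fake_pages[offset] }

puts collect_pages(fetch)
```

Passing the fetch step in as a lambda keeps the stopping rule separate from the network code, so the loop can be checked without hitting the site.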

score 0 · Accepted Answer

I'd recommend using a language that supports regular expressions. I use Ruby a lot, so I'd do something like this:

require 'net/http'
require 'uri'

uri = URI.parse("http://dota-trade.com/equipment?order=name")

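# Use request_uri (path plus query string) so that "?order=name" is sent too.
req = Net::HTTP::Get.new(uri.request_uri)
http = Net::HTTP.new(uri.host, uri.port)
response = http.request(req)

# scan returns every match; with one capture group, each entry is a one-element array.
links = response.body.scan(/<a.+?href="(.+?)"/).flatten

This is off the top of my head, but links should now be an array holding every captured href value. (String#match would return only the first hit, which is why scan is used here.)

puts links.join("\n")

That last line should dump what you need, though the links probably won't include the host. If you want the host included, do this:

puts links.map { |l| "http://dota-trade.com" + l }.join("\n")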

Note that this is untested.
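
As a rough sketch of the host-prefixing step, the links could be resolved with `URI.join` instead of string concatenation, so that hrefs which are already absolute are left alone. `extract_links` and the sample HTML are hypothetical and only for illustration; the regular expression assumes double-quoted href attributes, and a real HTML parser would be more robust.

```ruby
require 'uri'

# Hypothetical helper: pull every double-quoted href out of an HTML string
# and resolve it against base_url, dropping duplicates.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\s[^>]*?href="([^"]+)"/i).flatten
  absolute = hrefs.map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::InvalidURIError
      nil # skip hrefs that are not valid URIs
    end
  end
  absolute.compact.uniq
end

# Made-up sample markup standing in for the downloaded page body.
sample = <<~HTML
  <a href="/item/1">Item 1</a>
  <a class="x" href="http://example.com/item/2">Item 2</a>
  <a href="/item/1">Item 1 again</a>
HTML

puts extract_links(sample, "http://dota-trade.com/")
```

To get the text file the question asks for, the resulting array could be written out with `File.write("links.txt", links.join("\n"))`, where the file name is just a placeholder.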
