ruby - Ruby Anemone spider adding a tag to each URL visited

Question

I have a crawl set up:

require 'anemone'

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

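However, I'd like the spider to add a Google Analytics anti-tracking tag to every URL it visits, without necessarily clicking the links.

I could run the spider once to store all of the URLs, then use WATIR to run through them adding the tag, but I want to avoid this because it is slow and because I like the skip_links_like and page depth functions.

How could I implement this?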

score 3 · Accepted Answer

You want to add something to the URL before you load it, right? You can use focus_crawl for that.

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
    anemone.focus_crawl do |page|
        page.links.map do |url|
            # url will be a URI (probably URI::HTTP) so adjust
            # url.query as needed here and then return url from
            # the block.
            url
        end
    end
    anemone.on_every_page do |page|
        puts page.url
    end
end

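The focus_crawl method is intended to filter the URL list:

Specify a block which will select which links to follow on each page. The block should return an Array of URI objects.

However, you can also use it as a general-purpose URL filter.

For example, if you wanted to add atm_source=SiteCon&atm_medium=Mycampaign to all the links, your page.links.map would look something like this: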
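To illustrate the filtering use that the documentation describes, here is a minimal sketch that works on a plain array of URI objects, without Anemone. The helper name select_links and the rule of skipping PDF links are assumptions made for this example only.

```ruby
require 'uri'

# Hypothetical helper: keeps only the links worth following.
# `links` is an array of URI objects, like page.links in the answer.
def select_links(links)
  links.reject { |uri| uri.path.to_s.downcase.end_with?('.pdf') }
end

links = [
  URI('http://www.website.co.uk/about'),
  URI('http://www.website.co.uk/brochure.pdf')
]

selected = select_links(links)
puts selected.map(&:to_s)
```

Inside a crawl, the focus_crawl block would presumably just return select_links(page.links).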

page.links.map do |uri|
    # Grab the query string, break it into components, throw out
    # any existing atm_source or atm_medium components. The to_s
    # does nothing if there is a query string but turns a nil into
    # an empty string to avoid some conditional logic.
    q = uri.query.to_s.split('&').reject { |x| x =~ /^atm_(source|medium)=/ }

    # Add the atm_source and atm_medium that you want.
    q << 'atm_source=SiteCon' << 'atm_medium=Mycampaign'

    # Rebuild the query string 
    uri.query = q.join('&')

    # And return the updated URI from the block
    uri
end

If your atm_source or atm_medium contain characters that are not URL-safe, URI-encode them.
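One way to handle the encoding is to let Ruby's standard uri library build the query string. The sketch below is an assumption-laden variant of the block above, not part of the original answer: the helper name with_params and the sample values are made up for illustration. Note that URI.encode_www_form uses form encoding, so a space becomes "+"; check that your analytics setup accepts that.

```ruby
require 'uri'

# Hypothetical helper: returns a copy of `uri` with the given
# parameters set, replacing any existing values for the same keys.
def with_params(uri, params)
  tagged = uri.dup
  # decode_www_form splits "a=1&b=2" into [["a", "1"], ["b", "2"]].
  pairs = URI.decode_www_form(tagged.query.to_s)
  pairs.reject! { |key, _| params.key?(key) }
  # encode_www_form percent-encodes unsafe characters in keys and values.
  tagged.query = URI.encode_www_form(pairs + params.to_a)
  tagged
end

uri = URI('http://www.website.co.uk/page?id=7&atm_source=old')
tagged = with_params(uri, 'atm_source' => 'Site Con', 'atm_medium' => 'My&campaign')
puts tagged
```

In the focus_crawl block this would replace the manual split and join, for example page.links.map { |uri| with_params(uri, 'atm_source' => 'SiteCon', 'atm_medium' => 'Mycampaign') }.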
