ruby - Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Question

Suppose I am trying to crawl a website while skipping pages whose URLs end like this:

http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117

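I am currently building the crawler in Ruby with the Anemone gem. I am using the skip_links_like method, but my patterns never seem to match. I am trying to make this as generic as possible, so that it depends not on subpage but only on the trailing =2105925 (digits).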

I have tried /=\d+$/ and /\?.*\d+$/, but neither seems to work.

This is similar to "Skipping web-pages with extension pdf, zip from crawling in Anemone", but I cannot get it to work with digits instead of extensions.

Also, when I test the pattern =\d+$ at http://regexpal.com/, it successfully matches http://misc.com/test/index.php?page=news&subpage=20060118

Edit:

Here is my entire code. I wonder if anyone can see exactly what is wrong.

require 'anemone'
...
Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true) do |anemone|
  anemone.skip_links_like /\?.*\d+$/
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end

My output looks something like this:

...
Now checking: http://MISC.com/about_us/index.php?page=press_and_news&subpage=20110711
Successfully checked
...

score 3 · Accepted Answer

Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true, :skip_query_strings => true) do |anemone|
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end
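
Note that :skip_query_strings applies to every link that carries a query string, not only those ending in digits. If that is too broad, a narrower filter could be sketched as below. This is only an illustration: links_to_follow and DIGIT_QUERY are hypothetical names that do not appear in the question, and the focus_crawl wiring in the trailing comment is an assumption about how the helper might be plugged into the crawl, not something tested against the asker's site.

```ruby
require 'uri'

# Hypothetical pattern taken from the question: a query string ending in digits.
DIGIT_QUERY = /\?.*\d+$/

# Hypothetical helper: keep only the links whose full URL (query string
# included) does NOT match the pattern.
def links_to_follow(links, pattern = DIGIT_QUERY)
  links.reject { |link| link.to_s =~ pattern }
end

# Sample links modelled on the URLs in the question.
links = [
  URI('http://misc.com/test/index.php?page=news&subpage=20060118'),
  URI('http://misc.com/test/index.php?page=news'),
  URI('http://misc.com/about_us/')
]

kept = links_to_follow(links)
puts kept.map(&:to_s)

# Inside the crawl block this might be wired in roughly like so
# (assumption: focus_crawl yields a page whose #links are URI objects):
#   anemone.focus_crawl { |page| links_to_follow(page.links) }
```

Because the helper compares the pattern against link.to_s, the query string is part of what gets matched, so pages such as ?page=news are still crawled while ?...subpage=20060118 is dropped.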

score 2 · Accepted Answer

Actually, /\?.*\d+$/ does work:

~> irb
> all systems are go wirble/hirb/ap/show <
ruby-1.9.2-p180 :001 > "http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117".match /\?.*\d+$/
 => #<MatchData "?page=press_and_news&subpage=20060117">
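
One possible explanation for why the same pattern matches in irb but not during the crawl, offered as an assumption rather than something confirmed in this thread: if skip_links_like compares each pattern against only the path component of a link, the query string is never seen by the regex. The sketch below uses nothing but Ruby's standard URI library to show the difference between matching the full URL and matching the path.

```ruby
require 'uri'

url     = URI('http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117')
pattern = /\?.*\d+$/

# Matching against the whole URL string includes the query string.
full_match = !!(url.to_s =~ pattern)

# Matching against the path alone excludes everything from "?" onward.
path_match = !!(url.path =~ pattern)

puts "path:  #{url.path}"
puts "query: #{url.query}"
puts "full URL matches: #{full_match}"
puts "path matches:     #{path_match}"
```

If that assumption holds, any pattern that relies on the "?" or on the subpage value cannot match, which would be consistent with the output shown in the question.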
