I have written a web scraper in Ruby, but the websites I am scraping have changed their design, so my scraper is failing. Is there a smart and simple solution to this inherent problem of scrapers (e.g. some kind of pattern matching, XPaths, comparing DOM trees, etc.)?

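require 'eventmachine'
require 'em-http-request'
require 'nokogiri'

# url and opts are assumed to be defined elsewhere
EM.run {
  http_request = EM::HttpRequest.new(url, opts).get
  http_request.callback { |http|
    # Parse the response body as HTML (one parse is enough)
    doc = Nokogiri::HTML(http.response)
    puts doc.css(".poster_information")
    puts doc.css(".date")
    puts doc.css(".comment_block")
    EM.stop # leave the event loop once the page is processed
  }
  http_request.errback { EM.stop } # also stop if the request fails
}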

In the code snippet above, I am scraping a forum page for poster information, date posted and comments, using CSS selectors. Now suppose the webmaster changes the layout of the forum: the CSS selectors will fail, and so will my whole scraper. I do not want to update my scraper every time the website's layout changes. Is there any way for my scraper to detect a layout change and still find the correct path to the desired content? I have no way of knowing when the website will change. I am just trying to make my scraper automated and fault tolerant.

1 Answer

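You can write an integration test that runs periodically and notifies you when the page changes. If the page structure changes frequently, you could also extract the selector patterns into configuration and build a UI that makes it easy to edit the selectors you actually scrape with. By the way, you might also be interested in checking out Capybara to control your scraper at a higher level. capybara-webkit is available if you also need JS support.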
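
As a rough illustration of the first two suggestions, here is a minimal sketch that keeps the selectors in configuration and reports which of them no longer match. The field names, the inline YAML and the helpers `missing_fields` and `check_layout!` are hypothetical, made up for this example. The only thing assumed about `doc` is that it responds to `css` and returns a collection, as a Nokogiri document does.

```ruby
require 'yaml'

# Hypothetical config: in practice this could live in its own file
# so the selectors can be edited without touching the scraper code.
SELECTOR_CONFIG = <<~YAML
  poster: ".poster_information"
  date: ".date"
  comments: ".comment_block"
YAML

SELECTORS = YAML.safe_load(SELECTOR_CONFIG)

# Returns the names of the configured fields whose selector matches nothing.
# `doc` is anything that responds to #css and returns a collection,
# such as a Nokogiri::HTML document.
def missing_fields(doc, selectors = SELECTORS)
  selectors.select { |_name, css| doc.css(css).empty? }.keys
end

# Raises if any field is missing, so a scheduled run (cron, CI) fails
# loudly when the layout has probably changed.
def check_layout!(doc, selectors = SELECTORS)
  missing = missing_fields(doc, selectors)
  raise "Layout changed? No match for: #{missing.join(', ')}" unless missing.empty?
  true
end
```

Inside the callback from the question, this could be called as `check_layout!(Nokogiri::HTML(http.response))` before extracting anything. It only detects that a selector stopped matching; it does not find the new path for you, so the selectors in the config still have to be updated by hand.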

answered 2012-07-18T16:38:35.603