ruby - How to check for empty pages and HTTPErrors in a Ruby web crawler

Question

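This code fetches a random image from Google Images. However, when the web crawler searches for a term that Google returns no results for, an error occurs. An error also occurs when Google hands the crawler an image that no longer exists. How should I write this code so that, when an error occurs, it runs again and tries to fetch a different image?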

require 'open-uri'
require 'nokogiri'
url = "https://www.google.com/search?hl=en&q=" + rand(0-999999).to_s + "&ion=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.42553238,d.dmg&biw=1354&bih=622&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=sNEfUf-fHvLx0wG7uoG4DQ"
googim = Nokogiri::HTML(open(url))
googimstr = googim.to_s
durl = googim.to_s.split('imgurl=')[1].split('&amp')[0]

name = durl.reverse.split("/")[0].reverse

open("./data/images/#{name}", 'wb') do |file|
  file << open(durl).read
end

These are the two kinds of errors I get.

First error:

usr/lib/ruby/2.0.0/open-uri.rb:353:in `open_http': 400 Bad Request (OpenURI::HTTPError)
    from /usr/lib/ruby/2.0.0/open-uri.rb:708:in `buffer_open'
    from /usr/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
    from /usr/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
    from /usr/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
    from /usr/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
    from /usr/lib/ruby/2.0.0/open-uri.rb:688:in `open'
    from /usr/lib/ruby/2.0.0/open-uri.rb:34:in `open'
    from wc.rb:11:in `block in <main>'
    from /usr/lib/ruby/2.0.0/open-uri.rb:36:in `open'
    from /usr/lib/ruby/2.0.0/open-uri.rb:36:in `open'
    from wc.rb:10:in `<main>'

Second error:

wc.rb:6:in `split': invalid byte sequence in UTF-8 (ArgumentError)
    from wc.rb:6:in `<main>'

score 2 · Accepted Answer

You can wrap the relevant part of the code in a begin/end block and rescue the exception. For example:

begin
  open("./data/images/#{name}", 'wb') do |file|
    file << open(durl).read
  end
rescue => e
  puts "some failure: #{e}"
end
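
To actually run again after a failure, as the question asks, the rescue can be combined with Ruby's `retry` keyword. The following is only a minimal sketch: `with_retries` and its `max_attempts` and `on` parameters are hypothetical names, not part of open-uri, and the list of rescued classes is an assumption based on the two errors shown in the question.

```ruby
require 'open-uri'

# Hypothetical helper: runs the block and re-runs it when one of the
# listed exception classes is raised, up to max_attempts times in total.
# The block receives the current attempt number (starting at 1).
def with_retries(max_attempts: 3, on: [OpenURI::HTTPError, ArgumentError])
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *on => e
    warn "attempt #{attempts} failed: #{e.message}"
    retry if attempts < max_attempts
    raise # give up and re-raise the last error
  end
end
```

The whole fetch-and-save sequence from the question would go inside the block, so that each retry builds a new random search URL. Capping the number of attempts avoids looping forever if Google keeps rejecting the requests.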

Here is a link to the Exceptions, Catch, and Throw chapter of the Pickaxe (Programming Ruby): http://www.ruby-doc.org/docs/ProgrammingRuby/html/tut_exceptions.html
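
The second error in the question (`invalid byte sequence in UTF-8`) and the empty-results case can also be guarded before any download is attempted. This is a sketch under assumptions: `extract_image_url` is a hypothetical helper name, it relies on `String#scrub` (available from Ruby 2.1), and it assumes the page still marks image links with `imgurl=` as in the question's code.

```ruby
# Hypothetical helper: returns the first image URL found in the page text,
# or nil when the page contains no "imgurl=" marker (e.g. no results).
def extract_image_url(html)
  clean = html.scrub('')          # drop invalid byte sequences
  clean[/imgurl=([^&]+)/, 1]      # text after "imgurl=" up to the next "&"
end
```

A nil return value signals an empty page, so the caller can pick a new search term instead of calling `split` on nil.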
