ruby - Adjusting timeouts for Nokogiri connections

Question

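Why does nokogiri wait a few seconds (3-5) when the server is busy and I request pages one at a time, yet when the same requests run in a loop it does not wait and throws a timeout message? I wrap the request in a timeout block, but nokogiri does not wait anywhere near that long. Is there a recommended way to handle this?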

# this is a method from the eng class
def get_page(url, page_type)
  begin
    # Timeout.timeout replaces the bare Kernel#timeout, which newer Rubies no longer provide
    Timeout.timeout(10) do
      # Get a Nokogiri::HTML::Document for the page we're interested in...
      @@doc = Nokogiri::HTML(open(url))
    end
  rescue Timeout::Error
    puts "Time out connection request"
    raise
  end
end

# this is a snippet from the main app calling the eng class
# receives a hash with urls and goes through them one by one
def retrieve_in_loop(links)
  # exclusive range: 0..links.length would run one index past the last link
  (0...links.length).each do |idx|
    url = links[idx]
    puts "Visiting link #{idx} of #{links.length}"
    puts "link: #{url}"
    begin
      @@eng.get_page(url, product)
    rescue StandardError => e
      puts "Error getting url: #{idx} #{url}"
      puts "This link will be skipped. Continuing with next one"
    end
  end
end

score 7 · Accepted Answer

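The timeout block is simply the maximum time the code inside it is allowed to run before an exception is raised. It has no effect on anything inside Nokogiri or OpenURI.

You can set the timeout to a year, but OpenURI can still time out whenever it likes.

So your problem is most likely that OpenURI is timing out on the connection attempt itself. Nokogiri has no timeouts; it is just a parser.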
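To illustrate the point, here is a minimal sketch in which a stand-in for the library call gives up on its own schedule, regardless of the outer limit. FakeLibraryTimeout, fake_open and which_limit_fires are hypothetical names made up for this example; they are not part of OpenURI or Nokogiri.

```ruby
require "timeout"

# Hypothetical stand-in for a library's own timeout error (not a real OpenURI class).
class FakeLibraryTimeout < StandardError; end

# Hypothetical stand-in for open(url): gives up on its own after inner_limit seconds.
def fake_open(inner_limit)
  sleep inner_limit
  raise FakeLibraryTimeout, "library gave up after #{inner_limit}s"
end

# Wraps the call in an outer Timeout block and reports which limit fired first.
def which_limit_fires(outer_limit, inner_limit)
  Timeout.timeout(outer_limit) { fake_open(inner_limit) }
rescue FakeLibraryTimeout
  :library_timeout
rescue Timeout::Error
  :outer_timeout
end

p which_limit_fires(10, 0.1) # generous outer limit, short inner limit
p which_limit_fires(0.1, 10) # short outer limit, generous inner limit
```

The first call reports :library_timeout even though the outer block allowed 10 seconds, which mirrors what the question describes. The outer block only wins when its own limit is the shorter one.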

Adjusting the read timeout

The only timeout you can adjust through OpenURI is the read timeout. It does not seem possible to change the connection timeout this way:

open(url, :read_timeout => 10)
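As a rough sketch of how that option might be used in practice, the helper below passes :read_timeout and rescues the timeout errors instead of letting them escape. read_page is a hypothetical name, and the sketch assumes a Ruby recent enough to provide URI.open (on older versions the bare open shown above plays the same role).

```ruby
require "open-uri"
require "net/http" # defines Net::ReadTimeout and Net::OpenTimeout

# Hypothetical helper: returns the page body as a String,
# or nil when the server is too slow to connect or respond.
def read_page(url, read_timeout: 10)
  URI.open(url, read_timeout: read_timeout, &:read)
rescue Net::ReadTimeout
  warn "No data from #{url} within #{read_timeout}s"
  nil
rescue Net::OpenTimeout
  warn "Could not open a connection to #{url}"
  nil
end
```

A caller could then write something like html = read_page(url, read_timeout: 5) and skip the link when the result is nil, rather than relying on an outer timeout block.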

Adjusting the connection timeout

To adjust the connection timeout, you have to use Net::HTTP directly instead:

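require 'net/http'
require 'nokogiri'

uri = URI.parse(url)

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')  # needed for https URLs
http.open_timeout = 10  # seconds allowed for opening the connection
http.read_timeout = 10  # seconds allowed for each read from the socket

# request_uri keeps the query string and is never empty, unlike uri.path
response = http.get(uri.request_uri)

Nokogiri.parse(response.body)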
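Since the question is about a busy server being hit in a loop, one possible way to build on the timeouts above is to retry a request a few times before giving up on a link. This is only a sketch under that assumption: with_timeout_retries and fetch_body are hypothetical helpers, and the attempt count and wait time are arbitrary illustrative values.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: re-runs the block when a Net::HTTP timeout fires,
# pausing between attempts, and re-raises once the attempts are used up.
def with_timeout_retries(attempts: 3, wait: 2)
  tries = 0
  begin
    tries += 1
    yield
  rescue Net::OpenTimeout, Net::ReadTimeout
    raise if tries >= attempts
    sleep wait
    retry
  end
end

# Hypothetical fetcher: sets both timeouts, then fetches with retries.
def fetch_body(url, open_timeout: 10, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https") # so https URLs also work
  http.open_timeout = open_timeout
  http.read_timeout = read_timeout
  # request_uri includes the query string and is "/" for a bare host
  with_timeout_retries { http.get(uri.request_uri).body }
end
```

In the question's get_page, the body returned by fetch_body(url) could be handed to Nokogiri::HTML in place of open(url). Rescuing Net::OpenTimeout and Net::ReadTimeout separately would also show which of the two limits the busy server is actually hitting.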

You can also find some additional discussion here:

Ruby Net::HTTP timeout
Increasing the timeout for Net::HTTP
