ruby - Scrubyt gives a 404 error when clicking a link using the _details method

Question

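This may be a similar problem to my previous two questions (see here and here), but this time I'm trying to use the _detail command to click the link automatically so that I can scrape the details page for each individual event.

The code I'm using is: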

require 'rubygems'
require 'scrubyt'

nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail do
      dates "1-4 October"
      times "7:30pm"
    end
  end

  next_page "Next Page", :limit => 20
end

nuffield_data.to_xml.write($stdout,1)

Is there any way to print out the URL that event_detail is trying to access? The error doesn't seem to tell me which URL returned the 404.

UPDATE: The links may be relative. Could this be causing the problem? Is there a way to deal with that?

score 1 · Accepted Answer

I had the same problem with relative links and fixed it like this... you need to set the :resolve param to the correct base URL

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail :resolve => 'http://www.nuffieldtheatre.co.uk/cn/events' do
      dates "1-4 October"
      times "7:30pm"
    end
  end
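
To see why a relative link can end in a 404, it may help to look at how a relative href turns into an absolute URL. The following is only a sketch using Ruby's standard URI library, not scrubyt's own resolution logic (which may behave differently); the href value is made up for illustration.

```ruby
require 'uri'

# URL of the listing page, taken from the question.
listing_url = 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

# Hypothetical relative href, as it might appear in the listing's HTML.
relative_href = 'event_detail.php?id=42'

# URI.join resolves the href against the page it was found on,
# replacing the last path segment of the base URL.
absolute_url = URI.join(listing_url, relative_href).to_s
puts absolute_url
```

If the scraper resolves the same href against a different base (for example the site root), the resulting URL points somewhere else, which is one plausible source of the 404.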

score 1 · Accepted Answer

    sudo gem install ruby-debug

This will give you access to a nice Ruby debugger. Start the debugger by altering your script:

    require 'rubygems'
    require 'ruby-debug'
    Debugger.start
    Debugger.settings[:autoeval] = true if Debugger.respond_to?(:settings)

    require 'scrubyt'

    nuffield_data = Scrubyt::Extractor.define do
      fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

      event do
        title 'The Coast of Mayo'
        link_url
        event_detail do
          dates "1-4 October"
          times "7:30pm"
        end
      end

      next_page "Next Page", :limit => 2

    end

    nuffield_data.to_xml.write($stdout,1)

Then find out where scrubyt is throwing an exception - in this case:

    /Library/Ruby/Gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:52:in `fetch'

Find the scrubyt gem on your system, and add a rescue clause to the method in question so that the end of the method looks like this:

      if @@current_doc_protocol == 'file'
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
      else
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
        store_host_name(self.get_current_doc_url)   # in case we're on a new host
      end
    rescue
      debugger
      self # the self is here because debugger doesn't like being at the end of a method
    end

Run your script again and you will be dropped into the debugger when the exception is raised. Try typing this at the debug prompt to see the offending URL:

    @@current_doc_url

You can also add a debugger statement anywhere in that method if you want to check what is going on.

This is basically how I found the answer to your previous question.
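
If an interactive debugger is more than you need, a lighter variant of the same idea is a rescue that logs the URL and then re-raises. This is a minimal sketch in plain Ruby: `report_url_on_failure` and the failing fetcher are hypothetical stand-ins, not part of scrubyt, and the URL is invented.

```ruby
require 'stringio'

# Hypothetical helper (not part of scrubyt): run the block that fetches +url+
# and, if it raises, write the URL to +log+ before re-raising the error.
def report_url_on_failure(url, log = $stderr)
  yield url
rescue StandardError => e
  log.puts "fetch failed for #{url}: #{e.class}: #{e.message}"
  raise
end

# Stand-in fetcher that always fails, simulating a 404 response.
failing_fetch = lambda { |_url| raise IOError, '404 Not Found' }

log = StringIO.new
begin
  report_url_on_failure('http://www.nuffieldtheatre.co.uk/cn/events/missing.php', log, &failing_fetch)
rescue IOError
  # The original exception still propagates after being logged.
end
puts log.string
```

The same rescue-log-raise shape could be placed around the failing call inside the gem instead of the `debugger` line.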

Good luck.

score 0 · Accepted Answer

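Sorry, I have no idea why this would be nil. Every time I run this, it returns the URL. The method self.fetch requires a URL, which you can access as the local variable doc_url. If this returns nil as well, you should post your code with the debugger calls included.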

score 0 · Accepted Answer

I tried accessing doc_url, but that also seems to return nil. Once I have access to my server (later today), I'll post the code with the debugging bits included.
