python - Scrapy - Get the final redirected URL

Question

I'm trying to get the final redirected URL in Scrapy. For example, suppose an anchor tag has a specific format:

<a href="http://www.example.com/index.php" class="FOO_X_Y_Z" />

Then I need to get the URL that its href redirects to (if it does not redirect and simply returns 200, that is fine too). For example, I collect the matching anchor tags like this:

from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() turns the selected @href nodes into plain strings
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        final_url = get_final_url(anchor)  # << I would need something like this

        # Save final_url

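So if visiting http://www.example.com/index.php sends me through 10 redirects and finally stops at http://www.example.com/final.php, then that last URL is what I need get_final_url() to return.

I've thought about hacking my way to a solution, but I'm asking here to see whether Scrapy already provides one.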

score 3 · Accepted Answer

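Again assuming that anchor contains the actual URL, I achieved this with urllib2:

import urllib2
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() turns the selected @href nodes into plain strings
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        # urlopen(url, data, timeout): no POST data, 1-second timeout
        final_url = urllib2.urlopen(anchor, None, 1).geturl()

        # Save final_url

urllib2.urlopen() returns a file-like object with two additional methods. One of them, geturl(), returns the final URL (after all redirects have been followed). It is not part of Scrapy, but it works.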
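
The urllib2 module exists only in Python 2; in Python 3 the same function lives in urllib.request. Below is a minimal sketch of the get_final_url() helper from the question under that assumption. The 5-second default timeout is an arbitrary choice, not something from the answer above.

```python
import urllib.request


def get_final_url(url, timeout=5.0):
    """Follow HTTP redirects and return the URL of the final response."""
    # urlopen() follows 3xx redirects by default; geturl() reports the URL
    # of the response that was actually retrieved at the end of the chain.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Called as get_final_url("http://www.example.com/index.php"), this would return the address where the redirect chain stops. Note that the call blocks until the server answers, so inside a Scrapy callback it holds up the rest of the crawl while it waits.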

score 0 · Accepted Answer

Use response.headers, which returns the response headers as a dictionary. The new URL is the value stored under the "Location" key.

In [1]: response.headers
Out[1]: 
{'Date': 'Thu, 09 Jun 2016 00:18:18 GMT',
 'Location': 'https:/www.protiviti.com/en-US/Pages/default.aspx',
 'Server': 'nginx/1.9.1',
 'X-Ms-Invokeapp': '1; RequireReadOnly'}
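
A Location header appears only on the redirect response itself, and it names just the next hop rather than the end of the chain. Because Scrapy follows redirects by default, a callback normally sees the final response, which has no such header. The sketch below shows how a single hop could be resolved from a status code and a headers mapping. The names redirect_target and REDIRECT_STATUSES are made up for illustration and are not part of Scrapy.

```python
from urllib.parse import urljoin

# Status codes that carry a Location header pointing at the next URL.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status, headers):
    """Return the absolute URL a redirect response points to, or None."""
    if status not in REDIRECT_STATUSES:
        return None
    location = headers.get("Location")
    if location is None:
        return None
    # Scrapy exposes header values as bytes; plain dicts usually hold str.
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # Location may be relative, so resolve it against the requesting URL.
    return urljoin(request_url, location)
```

To reach the final URL this way, the function would have to be applied once per hop until it returns None.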

score -4 · Accepted Answer


It's very simple:

print response.url #(inside parse() )

answered 2012-10-07T15:30:32.763
