xml - How to crawl each href with Scrapy

Question

How can I crawl each href with Scrapy? I know how to print them all, but I want to be able to visit each of those links. This is intranet data, so you will not be able to access the links yourself. Also, how do I format the dates when the data is written to the file? Do I need to add a list of URLs to start_urls? Do I need to change the InitSpider to a CrawlSpider?

<row>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">14256238845</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">353918053831794</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">310260548400764</cell>
<cell type="href" href="/dis/packages.jsp?view=timeline&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100&date=20130423T020032243">2013-04-23 02:00:32.243</cell>
<cell type="plain">2013-04-23 02:00:32.243</cell>
<cell type="plain">3 - PackageCreation</cell>
<cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="href" href="/dis/sessions.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view sessions</cell>
<cell type="href" href="/dis/errors_agg.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view errors</cell>
</row>

This is what I have so far; it prints everything:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from scrapy.selector import XmlXPathSelector

from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'txtUserName': 'myuser', 'txtPassword': 'xxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the login response to see if we are successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin.
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad password :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        # Emit one item per <cell>, with its text and (if present) its href.
        xhs = XmlXPathSelector(response)
        columns = xhs.select('//table[3]/row/cell')
        for column in columns:
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item

Output I currently get in the CSV file:

14256238845
3.53918E+14
3.10261E+14
00:32.2
00:32.2
3 - PackageCreation
400006
view sessions
view errors

Output I want in the CSV file:

14256238845
353918053831794
310260548400764
2013-04-23 02:00:32.243
2013-04-23 02:00:32.243
3 - PackageCreation
400006
view sessions
view errors

score 1 · Accepted Answer

Whenever you want to follow a URL, you can yield a Request object from your callback.
For example: yield Request(extracted_url_link, callback=your_parse_function)
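
The hrefs in the question are relative, so each one has to be resolved against the page URL before it can be handed to a Request. The following is a minimal sketch of that step using only the standard library. `BASE_URL`, `SAMPLE_ROW`, and `iter_links` are names made up for this illustration, and the sample row is a shortened, well-formed copy of the one in the question.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Assumed for illustration: the page the row was scraped from.
BASE_URL = "https://qvpweb01.ciq.labs.att.com:8080/dis/"

# Shortened copy of the row from the question (hypothetical sample data).
SAMPLE_ROW = """
<row>
  <cell type="plain">3 - PackageCreation</cell>
  <cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
</row>
"""


def iter_links(row_xml, base_url):
    """Yield (absolute_url, cell_text) for every cell that carries an href."""
    row = ET.fromstring(row_xml.strip())
    for cell in row.iter("cell"):
        href = cell.get("href")
        if href:  # plain cells have no href and are skipped
            yield urljoin(base_url, href), cell.text


links = list(iter_links(SAMPLE_ROW, BASE_URL))
for url, text in links:
    # Inside a spider, this is where `yield Request(url, callback=...)` would go.
    print(text, "->", url)
```

In the spider itself, the same loop would live in `parse`, with the `print` replaced by a yielded Request whose callback parses the linked page.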

See the second example at the following link:
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example

Another way to specify which URLs to crawl is to use SgmlLinkExtractor. You can write rules, and the spider will then crawl every URL on any page that matches a rule. See the example at the following URL:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
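
At its core, a link-extractor rule is a filter: a regular expression decides which extracted URLs get followed. The sketch below shows only that filtering idea with the standard library; it is not Scrapy's implementation. `ALLOW_PATTERN`, `should_follow`, and the choice to follow only `packages.jsp` links are assumptions for illustration, and the first candidate URL is shortened from the one in the question.

```python
import re

# Hypothetical rule: follow only the package pages.
ALLOW_PATTERN = re.compile(r"/dis/packages\.jsp")

# Candidate URLs as they might look after extraction (shortened samples).
candidate_urls = [
    "https://qvpweb01.ciq.labs.att.com:8080/dis/packages.jsp?view=list",
    "https://qvpweb01.ciq.labs.att.com:8080/dis/profile_download?profileId=400006",
]


def should_follow(url):
    """Return True when the URL matches the allow pattern anywhere."""
    return ALLOW_PATTERN.search(url) is not None


to_crawl = [url for url in candidate_urls if should_follow(url)]
print(to_crawl)
```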

After crawling, the date is just a string. You can convert it to a Python datetime object and then use a formatting function such as strftime to render it however you like.
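
A minimal sketch of that conversion, using the timestamp from the question's sample row. The variable names and the second output format are arbitrary choices for illustration.

```python
from datetime import datetime

raw = "2013-04-23 02:00:32.243"  # value taken from the question's sample row

# %f accepts 1 to 6 fractional digits, so ".243" parses as 243000 microseconds.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")

# strftime's %f always prints 6 digits; drop the last three to keep milliseconds.
formatted = parsed.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]

# Any other layout works the same way, for example a date-only form.
date_only = parsed.strftime("%Y/%m/%d")

print(formatted)
print(date_only)
```

If the CSV is being viewed in a spreadsheet program, values such as 3.53918E+14 and 00:32.2 may be display formatting applied by that program rather than a change in the stored data, so it is worth checking the file in a plain text editor first.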

I hope this answers your question.
