
As we see:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []

    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)

    return items

Scrapy just gets one page response and finds URLs in that response. I think that is only a surface crawl!

But I want to follow more URLs, down to a defined depth.

What can I do to implement that?

Thank you!


3 Answers


I did not quite understand your question, but I noticed some problems in your code. Some of them may be related to your question (see the comments in the code):

sites = hxs.select('//ul/li')
items = []

for site in sites:
    item = Website()
    # this extracts a list, so you probably want .extract()[0]
    item['name'] = site.select('a/text()').extract()
    # '//a[...]' looks like it should get the links within `site`, but it
    # actually gets the links from the entire page; use './/a[...]' instead.
    # And again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
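
A possible correction along those lines might look like this (just a sketch, untested against your page; it assumes every `<li>` actually contains a link):

for site in sites:
    item = Website()
    # take the first match instead of the whole list
    item['name'] = site.select('a/text()').extract()[0]
    # './/a[...]' restricts the search to the current <li>
    item['url'] = site.select('.//a[contains(@href, "http")]/@href').extract()[0]
    item['description'] = site.select('text()').extract()
    items.append(item)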
answered 2012-06-25

Have a look at the documentation on Requests and Responses.

As you scrape the first page, you gather some links that you use to generate second requests, each with a second callback function that scrapes the second level. In the abstract that sounds complex, but you will see from the example code in the documentation that it is quite straightforward.
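
For instance, a minimal sketch of that two-level pattern, using the same old HtmlXPathSelector API as in your question (Website is assumed to be your item class, and the XPaths are placeholders):

from urlparse import urljoin  # Python 2, as Scrapy used at the time
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # first level: collect links and request each one,
    # sending the responses to a second callback
    for href in hxs.select('//ul/li/a/@href').extract():
        yield Request(urljoin(response.url, href),
                      callback=self.parse_second_level)

def parse_second_level(self, response):
    # second level: scrape the pages you followed
    hxs = HtmlXPathSelector(response)
    for site in hxs.select('//ul/li'):
        item = Website()
        item['name'] = site.select('a/text()').extract()
        yield item

Because parse yields Request objects instead of returning a list of items, Scrapy schedules those pages and calls parse_second_level with each response.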

Furthermore, the CrawlSpider example is more fleshed out and gives you template code that you may simply want to adapt to your situation.

Hope this gets you started.

answered 2012-06-25

You can crawl more pages by using CrawlSpider, which can be imported from scrapy.contrib.spiders, and defining rules for which kinds of links you want your crawler to follow.

Follow the notes here on how to define your rules

By the way, consider changing the function name; from the docs:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
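
For example, a minimal sketch of such a spider, using the scrapy.contrib API mentioned above (the domain, start URL, and XPaths are placeholders, and Website is assumed to be your item class):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class WebsiteSpider(CrawlSpider):
    name = 'websites'
    allowed_domains = ['example.com']       # placeholder domain
    start_urls = ['http://example.com/']    # placeholder start page

    # follow every extracted link and send the responses to parse_item
    # (not parse, because of the warning above)
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = Website()
        item['url'] = response.url
        item['name'] = hxs.select('//title/text()').extract()
        return item

If you want to stop at a defined depth, the DEPTH_LIMIT setting (in settings.py) caps how many levels deep Scrapy will follow links.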

answered 2014-02-04