web-scraping - How to save web pages crawled with Scrapy to memory

Question

I am now able to crawl the web using the following Scrapy script:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from lxml import html

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from scrapy.spider import BaseSpider
from scrapy import log

#from tutorial.items import TutorialItem
from tutorial.items import DmozItem


class StayuncleCrawlerSpider(CrawlSpider):

    name = 'stayuncle_crawler'

    allowed_domains = ['stayuncle.com']
    start_urls = ['http://www.stayuncle.com/']
    CrawlSpider.DOWNLOAD_DELAY=.25;



    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)     ]

def parse_item(self,response,spider):

             doc = html.fromstring(response.body)
             item = DmozItem()
             item['title'] = doc.xpath('//meta[@property="og:title"]/@content')
             item['link'] = response.url
             item['desc'] = doc.xpath('//meta[@name="description"]/@content')
             yield self.parse_save(self,response)
             yield item



    # self.log('A response from %s just arrived!' % response.url)

def parse_save(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Here is the log:

/Users/Nand/crawledData/tutorial/tutorial/spiders/stack_crawler.py:16: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(allow=('pages/')), callback='parse_item', follow=True),
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:7: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider, Rule
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:8: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:8: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:11: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:12: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:28: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:29: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(), callback='parse_save', follow=True)
2016-06-09 17:13:28 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-06-09 17:13:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-06-09 17:13:28 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-06-09 17:13:28 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-09 17:13:28 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-09 17:13:28 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-09 17:13:28 [scrapy] INFO: Spider opened
2016-06-09 17:13:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-09 17:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-09 17:13:28 [py.warnings] WARNING: /usr/local/lib/python2.7/site-packages/scrapy/core/downloader/__init__.py:65: UserWarning: StayuncleCrawlerSpider.DOWNLOAD_DELAY attribute is deprecated, use StayuncleCrawlerSpider.download_delay instead
  (type(spider).__name__, type(spider).__name__))

2016-06-09 17:13:29 [scrapy] DEBUG: Crawled (404) <GET http://www.stayuncle.com/robots.txt> (referer: None)
2016-06-09 17:13:29 [scrapy] DEBUG: Redirecting (302) to <GET http://www.stayuncle.com/home> from <GET http://www.stayuncle.com/>
2016-06-09 17:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/home> (referer: None)
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'stayuncle.tumblr.com': <GET http://stayuncle.tumblr.com/>
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'facebook.com': <GET http://facebook.com/stayuncle>
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET http://twitter.com/stayuncle>
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/home> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.stayuncle.com/home> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/cdn-cgi/l/email-protection> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.cloudflare.com': <GET https://www.cloudflare.com/sign-up?utm_source=email_protection>
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/career> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/StayUncle?ref=hl>
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET https://www.twitter.com/stayuncle>
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/howwechose> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (404) <GET http://www.stayuncle.com/index.html> (referer: http://www.stayuncle.com/career)
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/about> (referer: http://www.stayuncle.com/home)
2016-06-09 17:13:31 [scrapy] DEBUG: Ignoring response <404 http://www.stayuncle.com/index.html>: HTTP status code is not handled or not allowed
2016-06-09 17:13:31 [scrapy] DEBUG: Filtered offsite request to 'in.linkedin.com': <GET https://in.linkedin.com/pub/nand-singh/1b/31b/464>
2016-06-09 17:13:31 [scrapy] INFO: Closing spider (finished)
2016-06-09 17:13:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2748,
 'downloader/request_count': 9,
 'downloader/request_method_count/GET': 9,
 'downloader/response_bytes': 32186,
 'downloader/response_count': 9,
 'downloader/response_status_count/200': 6,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/404': 2,
 'dupefilter/filtered': 23,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 6, 9, 11, 43, 31, 709558),
 'log_count/DEBUG': 19,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'offsite/domains': 7,
 'offsite/filtered': 22,
 'request_depth_max': 2,
 'response_received_count': 8,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2016, 6, 9, 11, 43, 28, 793762)}
2016-06-09 17:13:31 [scrapy] INFO: Spider closed (finished)

However, I want to save all of the crawled web pages as HTML files on my local machine. Can someone guide me with a code snippet so that I can achieve this?
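
One possible way to write each crawled page to its own HTML file is sketched below. This is a minimal sketch, not a definitive implementation: the helper names `filename_for` and `save_page` and the `pages` output directory are hypothetical, introduced here for illustration, and the code assumes Python 3. It replaces the expression `response.url.split("/")[-2]` from the script, which evaluates to `www.stayuncle.com` for top-level URLs such as `http://www.stayuncle.com/home` and `http://www.stayuncle.com/career`, so those pages would all be written to the same file.

```python
import hashlib
import os
import re
from urllib.parse import urlparse


def filename_for(url):
    """Build a readable, unique file name from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Host plus path, with every run of characters outside [A-Za-z0-9.-]
    # collapsed to a single underscore.
    stem = re.sub(r"[^A-Za-z0-9.-]+", "_", parsed.netloc + parsed.path).strip("_")
    # A short hash of the full URL keeps pages that differ only in their
    # query string from overwriting each other.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return "{}-{}.html".format(stem or "index", digest)


def save_page(url, body, out_dir="pages"):
    """Write the raw response bytes to out_dir and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_for(url))
    with open(path, "wb") as f:
        f.write(body)
    return path
```

To use this from the spider, `parse_item` would need to be a method of `StayuncleCrawlerSpider` (indented inside the class) with the signature `parse_item(self, response)`, and it could call `save_page(response.url, response.body)` before yielding the item. As posted, `parse_item` and `parse_save` appear at module level, so the `callback='parse_item'` rule likely has no method to resolve, which would be consistent with the log reporting 0 scraped items. The deprecation warnings in the log also point to importing `CrawlSpider` and `Rule` from `scrapy.spiders`, using `scrapy.linkextractors.LinkExtractor` instead of `SgmlLinkExtractor`, and setting `download_delay` rather than `DOWNLOAD_DELAY`.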
