python - How to pass a Selenium HTML page to HtmlXPathSelector

Question

I need to scrape a page that uses JavaScript, which is why I am using Selenium. The problem is that Selenium cannot fetch the data I need.

I want to try fetching the data with HtmlXPathSelector instead.

How can I pass the HTML that Selenium generates to HtmlXPathSelector?

score 6 · Accepted Answer

Here is my solution: just create an HtmlXPathSelector from Selenium's page_source.

hxs = HtmlXPathSelector(text=sel.page_source)
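For anyone on a more recent Scrapy, where Selector has taken over from HtmlXPathSelector, the same idea might look roughly like the sketch below. The helper name selector_from_driver and the selector_cls parameter are my own additions for illustration, and I am assuming driver is a Selenium WebDriver object that exposes page_source.

```python
def selector_from_driver(driver, selector_cls=None):
    """Build an XPath-capable selector from the HTML Selenium currently holds."""
    if selector_cls is None:
        # Assumption: a Scrapy release that provides scrapy.selector.Selector,
        # the successor to the older HtmlXPathSelector.
        from scrapy.selector import Selector
        selector_cls = Selector
    # page_source is the page as Selenium sees it, typically after scripts ran.
    return selector_cls(text=driver.page_source)
```

With that in place, something like selector_from_driver(driver).xpath("//table").extract() should return the matching markup. The class is injectable only so that it can be swapped for a stub in tests.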

score 0 · Accepted Answer

Try creating the Response manually:

from scrapy.http import TextResponse
from scrapy.selector import HtmlXPathSelector

body = '''<html></html>'''

response = TextResponse(url='', body=body, encoding='utf-8')

hxs = HtmlXPathSelector(response)
hxs.select("/html")
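If the HTML comes from Selenium as a text string, a small wrapper along these lines can build the response. This is only a sketch: response_from_html and response_cls are hypothetical names, and encoding the body to bytes before handing it over is a conservative choice on my part rather than something TextResponse requires when an encoding is given.

```python
def response_from_html(url, html, response_cls=None, encoding="utf-8"):
    """Wrap an HTML string in a Scrapy-style text response."""
    if response_cls is None:
        # Imported lazily so the helper can be exercised with a stand-in class.
        from scrapy.http import TextResponse
        response_cls = TextResponse
    # Pass bytes plus the matching encoding so decoding is unambiguous.
    return response_cls(url=url, body=html.encode(encoding), encoding=encoding)
```

On recent Scrapy versions the returned TextResponse can be queried directly, for example response_from_html("http://www.example.com/demo", body).xpath("/html").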

score 0 · Accepted Answer

Manual response with Selenium:

from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
from scrapy.selector import HtmlXPathSelector
import time
from selenium import selenium

class DemoSpider(BaseSpider):
    name = "Demo"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/demo"]

    def __init__(self):
        BaseSpider.__init__(self)
        # connect to a Selenium RC server on localhost:4444
        self.selenium = selenium("127.0.0.1", 4444, "*chrome", self.start_urls[0])
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()

    def parse(self, response):
        sel = self.selenium
        sel.open(response.url)
        time.sleep(2.0)  # wait for javascript execution

        # build the response object from Selenium
        body = sel.get_html_source()
        sel_response = TextResponse(url=response.url, body=body, encoding='utf-8')
        hxs = HtmlXPathSelector(sel_response)
        hxs.select("//table").extract()
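The fixed time.sleep(2.0) above either wastes time or is too short when the page is slow. One alternative is to poll until the content you need shows up. The helper below is a generic sketch, not part of Scrapy or Selenium; wait_until and its parameters are names I made up.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In the spider this could replace the sleep with something like wait_until(lambda: "<table" in sel.get_html_source()), assuming the table is what the JavaScript renders.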
