javascript - Using Scrapy and Selenium to scrape a page that requires login and loads its data with JavaScript

Question

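I am trying to scrape a page that requires login and loads its data with JavaScript. So far I can log in successfully using Scrapy. However, because the data is loaded by JavaScript, my spider cannot see the data I need.

After some searching, I found that Selenium might be a possible solution. I want to use Selenium to open a browser and view the page. It seems I need to use the Selenium WebDriver tool, but I don't know how to do that. Does anyone know where and how to add Selenium code to my spider?

Thanks a lot.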

#My spider looks like

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request, FormRequest

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from selenium import selenium
import time

from login.items import SummaryItem

class titleSpider(BaseSpider):
    name = "titleSpider"
    allowed_domains = ["domain.com"]
    start_urls = ["https://www.domain.com/login"]

    # Authentication
    def parse(self, response):
        return [FormRequest.from_response(response,
                formdata={'session_key': 'myusername', 'session_password': 'mypassword'},
                callback=self.after_login)]

    # Request the webpage
    def after_login(self, response):
        # check login succeed before going on
        if "Error" in response.body:
            print "Login failed"
        else:
            print "Login successfully"
            return Request(url="https://www.domain.com/result1",
               callback=self.parse_page) # this page has some data loaded using javascript


    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # How can I know selenium passes authentication? 
        self.selenium = selenium("localhost", 4444, "*firefox", "https://www.domain.com/result1")
        print "Starting the Selenium Server!"
        self.selenium.start()
        print "Successfully, Started the Selenium Server!"

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors
        CrawlSpider.__del__(self)

    # Parse the page
    def parse_page(self, response):

        item = SummaryItem()
        hxs = HtmlXPathSelector(response)
        item['name']=hxs.select('//span[@class="name"]/text()').extract() # my spider cannot see the name.

        # Should I add selenium codes here? Can it load the page that requires authentication?
        sel= self.selenium
        sel.open(response.url)
        time.sleep(4)
        item['name']=sel.select('//span[@class="name"]/text()').extract() # 

        return item

score 0 · Accepted Answer

You could try something like this:

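# Imports go at the top of the module; the methods go inside the spider class.
from time import sleep

from selenium import webdriver

def __init__(self):
    BaseSpider.__init__(self)
    self.verificationErrors = []  # printed in __del__, so it must exist
    # Launch a real Firefox instance driven by WebDriver
    self.selenium = webdriver.Firefox()

def __del__(self):
    # Close the browser when the spider is destroyed
    self.selenium.quit()
    print(self.verificationErrors)

def parse(self, response):

    # Initialize the webdriver, get login page
    sel = self.selenium
    sel.get(response.url)
    sleep(3)  # crude fixed wait for the page's JavaScript to finish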
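
The question also asks how Selenium can get past authentication when the login was done by Scrapy. One possible approach, which is not part of the answer above, is to copy the session cookies from the Scrapy login response into the browser. The sketch below is a minimal Python 3 helper that uses only the standard library; the function name `to_selenium_cookies` is made up for illustration. It turns raw `Set-Cookie` header values into the dictionaries that WebDriver's `add_cookie()` accepts. It assumes the session cookie is actually set on the response you pass in; if the site sets it during an earlier redirect, this will miss it.

```python
from http.cookies import SimpleCookie


def to_selenium_cookies(set_cookie_headers):
    """Convert raw Set-Cookie header values into dicts for add_cookie().

    Hypothetical helper: accepts str or bytes values, since Scrapy
    returns header values as bytes.
    """
    cookies = []
    for raw in set_cookie_headers:
        if isinstance(raw, bytes):
            raw = raw.decode("latin-1")
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            cookie = {"name": name, "value": morsel.value}
            # Copy only the attributes that were present in the header.
            if morsel["path"]:
                cookie["path"] = morsel["path"]
            if morsel["domain"]:
                cookie["domain"] = morsel["domain"]
            if morsel["secure"]:
                cookie["secure"] = True
            cookies.append(cookie)
    return cookies


# Possible usage inside after_login(self, response), untested against a real site:
#   self.selenium.get("https://www.domain.com/login")  # browser must be on the domain first
#   for c in to_selenium_cookies(response.headers.getlist("Set-Cookie")):
#       self.selenium.add_cookie(c)
#   self.selenium.get("https://www.domain.com/result1")
```

Whether this is enough depends on the site: some tie the session to more than cookies, in which case filling in the login form with Selenium itself may be the simpler route.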
