python - Scrapy pagination with Selenium

Question

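I'm trying to scrape links from a table that uses pagination. I can get Selenium to iterate through the pages, and I can get the links from the first page, but when I try to combine the two, the process stops once it reaches the last page and there is no longer a "Next Page" button, and I get nothing.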

I don't know how to gracefully tell it to just return the data to a CSV. I'm using a while True: loop, so it's rather puzzling.
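One way to let the loop end cleanly on the last page is to use the plural find_elements lookup, which returns an empty list when nothing matches instead of raising NoSuchElementException. The sketch below assumes that approach. The function name collect_links, its parameters, and the fixed time.sleep pause are illustrative choices, not part of the original spider; the default XPath is the one from the question. The locator strings are the values of Selenium's By.XPATH and By.PARTIAL_LINK_TEXT constants, so the function only needs a driver passed in.

```python
import time

# Locator strategy strings: the values of Selenium's By.XPATH and
# By.PARTIAL_LINK_TEXT constants.
XPATH = "xpath"
PARTIAL_LINK_TEXT = "partial link text"


def collect_links(driver,
                  links_xpath='//tr[@class ="resultsW"]/td[2]/a',
                  next_text="Next Page",
                  pause=1.0,
                  max_pages=1000):
    """Collect hrefs from every results page; stop when no next link exists.

    All names and defaults here are assumptions made for illustration.
    """
    hrefs = []
    for _ in range(max_pages):  # hard cap instead of an unbounded while True
        # Read the current page before navigating away from it.
        for anchor in driver.find_elements(XPATH, links_xpath):
            hrefs.append(anchor.get_attribute("href"))

        # find_elements returns [] (it does not raise) when nothing matches.
        next_links = driver.find_elements(PARTIAL_LINK_TEXT, next_text)
        if not next_links:
            break  # last page reached: fall through and return the data
        next_links[0].click()
        time.sleep(pause)  # crude wait for the next page to render
    return hrefs
```

Inside parse, this could be called as links = collect_links(self.driver) after the search button is clicked. An explicit wait on the results table would probably be more robust than the fixed pause.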

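My other question is about using XPath to target the links I'm trying to parse. The links are held in two different tr classes: one set is under //tr[@class ="resultsY"] and the other is under //tr[@class ="resultsW"]. Is there some kind of OR statement I can use to target all of the links in one go?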

One solution I found, '//tr[@class ="resultsY"] | //tr[@class ="resultsW"]', gives me an error every time.
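XPath 1.0 allows an or inside a single predicate, which is one way to match both row classes with one expression. The sketch below only builds the expression string; the helper name rows_with_classes is hypothetical. It compares @class for exact equality, so it assumes each row carries exactly one class name, as in the table shown in the question.

```python
def rows_with_classes(classes, tag="tr"):
    """Build an XPath matching <tag> elements whose class equals any given name."""
    if not classes:
        raise ValueError("at least one class name is required")
    predicate = " or ".join('@class="{}"'.format(name) for name in classes)
    return "//{}[{}]".format(tag, predicate)


ROWS = rows_with_classes(["resultsY", "resultsW"])
LINKS = ROWS + "/td[2]/a"  # the anchor in the second cell of each matching row
print(LINKS)
# prints: //tr[@class="resultsY" or @class="resultsW"]/td[2]/a
```

The resulting string can be passed to the same find_elements_by_xpath call used in the spider.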

Here is the HTML table:

<tr class="resultsW">
    <td></td>
    <td>
        <a href="fdafda"></a>        <---- a link I'm after
    </td>
    <td></td>
</tr>
<tr class="resultsW">
    <td></td>
    <td>
        <a href="fdafda"></a>        <---- a link I'm after
    </td>
    <td></td>
</tr>

And here is my Scrapy spider:

import time
from scrapy.item import Item, Field
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from scrapy.selector import HtmlXPathSelector

class ElyseAvenueItem(Item):
    link = Field()   
    link2 = Field()

class ElyseAvenueSpider(BaseSpider):
    name = "s1"
    allowed_domains = ["nces.ed.gov"]
    start_urls = [
    'https://nces.ed.gov/collegenavigator/']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        select = Select(self.driver.find_element_by_id("ctl00_cphCollegeNavBody_ucSearchMain_ucMapMain_lstState"))
        select.deselect_by_visible_text("No Preference")
        select.select_by_visible_text("Alabama")
        self.driver.find_element_by_id("ctl00_cphCollegeNavBody_ucSearchMain_btnSearch").click()

        # Here is the while loop. It gets to the end of the table, finds no
        # more "Next Page" link, and the process stops.

        '''while True:
            el1 = self.driver.find_element_by_partial_link_text("Next Page")
            if el1:
                el1.click()
            else:
                #return(items)
                self.driver.close()'''
        hxs = HtmlXPathSelector(response)

        # Here I tried:
        #   titles = self.driver.find_elements_by_xpath('//tr[@class ="resultsW"] | //tr[@class ="resultsY"]')
        # and I got an error.

        titles = self.driver.find_elements_by_xpath('//tr[@class ="resultsW"]')
        items = []
        for title in titles:
            item = ElyseAvenueItem()

            # Here I'd like to be able to target all of the hrefs... not sure how.

            link = title.find_element_by_xpath('//tr[@class ="resultsW"]/td[2]/a')
            item["link"] = link.get_attribute('href')
            items.append(item)
        yield(items)
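For the per-row lookup, note that an XPath starting with // searches the whole document even when it is called on a row element, so every row would return the same first link. The sketch below shows one possible alternative: a path relative to each row, with one record yielded per link and the records written out with the standard csv module. The names iter_row_links and write_links_csv are hypothetical, and the rows are assumed to be element objects like those returned by the find_elements_by_xpath call in the spider.

```python
import csv

XPATH = "xpath"  # the value of Selenium's By.XPATH constant


def iter_row_links(rows, relative_xpath="./td[2]/a"):
    """Yield {"link": href} for each anchor found inside each row element."""
    for row in rows:
        # The leading "." scopes the search to this row only.
        for anchor in row.find_elements(XPATH, relative_xpath):
            href = anchor.get_attribute("href")
            if href:  # skip anchors without an href
                yield {"link": href}


def write_links_csv(records, fileobj):
    """Write the records to an open text file as a one-column CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=["link"])
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count  # number of data rows written
```

Within a Scrapy spider, yielding each item on its own (rather than one list of items) and running scrapy crawl s1 -o links.csv should let Scrapy's feed export write the CSV instead, depending on the Scrapy version in use.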
