python - Unable to get a value from XPath

Question

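With the code below, shouldn't I get the value "INDICES" back instead of an empty result? I have seen this approach work on other sites, but it seems to fail here. Every attempt below returns an empty result.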

import requests
from lxml import html

pageContent=requests.get('https://finviz.com/futures.ashx')
tree = html.fromstring(pageContent.content)
Indicies = tree.xpath('//div[contains(@class, "tile_header is-indices")]//*')
Indicies = tree.xpath('//*[@id="futures"]/div/div[2]/div[1]/a[1]/div[1]/text()')
Indicies = tree.xpath('/html/body/div[2]/div/div/div/div[2]/div[1]/a[1]/div[1]')
print([e.text_content() for e in tree.xpath('//div[@class="tile_header is-indices" and @style]')])
Indicies = tree.xpath("//div[contains(text(), 'tile_header is-indices')]//*")
Indicies = tree.xpath("//a[contains(text(), 'tile_header is-indices')]//*")

var groups = [{"ticker":"INDICES","label":"Indices","contracts":[{"label":"DJIA","ticker":"YM","cot":"124601,124603"}

score 1 · Accepted Answer

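What's going on?

The page you are trying to scrape has content that only appears after its JavaScript has run (this is the difference between the page you fetch with curl and the page you inspect in the browser). You need something that renders the JavaScript before lxml can find what you are looking for.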
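The snippet pasted at the end of the question suggests the raw HTML already carries the data as a JavaScript assignment, `var groups = [...]`. If that holds, one alternative is to read that JSON directly instead of rendering the page. The following is only a minimal sketch: `SAMPLE_HTML` is a hypothetical stand-in for `pageContent.text`, the helper name `extract_groups` is made up for illustration, and the real page may format the assignment differently.

```python
import json
import re

# Hypothetical stand-in for pageContent.text. The real page is assumed to
# embed a similar "var groups = [...]" assignment, as in the question.
SAMPLE_HTML = """
<script>
var groups = [{"ticker":"INDICES","label":"Indices","contracts":[{"label":"DJIA","ticker":"YM","cot":"124601,124603"}]}];
</script>
"""


def extract_groups(page_text):
    """Return the list assigned to `var groups`, or [] if it is not found."""
    match = re.search(r"var\s+groups\s*=\s*", page_text)
    if match is None:
        return []
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing ";" and markup).
    groups, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return groups


groups = extract_groups(SAMPLE_HTML)
tickers = [group["ticker"] for group in groups]
print(tickers)
```

This only works while the data stays embedded in the page source as valid JSON. Rendering the page in a browser is the more general route.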

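Enter Selenium

Selenium can help with this. It automates a (web) browser, which means you need an add-on called a webdriver. You can install it with your operating system's package manager or download it manually (gecko, chrome, opera and others are available). Either add it to your PATH or pass executable_path="location\to\geckodriver.exe" when you create the webdriver. Newer Selenium 4 releases take the driver path through a Service object rather than executable_path.

So, what would this look like in your code (using chrome as the webdriver)?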

from selenium import webdriver
from lxml import html
driver = webdriver.Chrome()
driver.get('https://finviz.com/futures.ashx')

tree = html.fromstring(driver.page_source)
Indicies = tree.xpath('//div[contains(@class, "tile_header is-indices")]//*')
Indicies = tree.xpath('//*[@id="futures"]/div/div[2]/div[1]/a[1]/div[1]/text()')
Indicies = tree.xpath('/html/body/div[2]/div/div/div/div[2]/div[1]/a[1]/div[1]')
print([e.text_content() for e in tree.xpath('//div[@class="tile_header is-indices" and @style]')])
Indicies = tree.xpath("//div[contains(text(), 'tile_header is-indices')]//*")
Indicies = tree.xpath("//a[contains(text(), 'tile_header is-indices')]//*")

driver.quit()

You can pass options so that the browser runs quietly (--log-level=3), headless (--headless), and so on. The arguments themselves depend on which webdriver you use.

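from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create the options object before adding arguments to it
chrome_options = Options()
chrome_options.add_argument("--add-first-thing")
chrome_options.add_argument("--add-second-thing")
driver = webdriver.Chrome(options=chrome_options)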

As you can see, Selenium takes care of rendering the page, and you can then parse the result with lxml, Beautiful Soup, or Selenium itself.
