xpath - Scrapy xpath aria-select=false

Question

I am trying to use Scrapy to get the transcript information from some Khan Academy videos. For example: https://www.khanacademy.org/math/algebra-basics/basic-alg-foundations/alg-basics-negative-numbers/v/opposite-of-a-number

When I try to select the Transcript button with XPath, response.xpath('//div[contains(@role, "tablist")]/a').extract(), I only get information about the tab that has aria-selected="true", which is the About section. I would need to use Scrapy to change aria-selected on the Transcript button from false to true and then retrieve the necessary information.

Could anyone clarify how I can accomplish this?

Much appreciated!

score 1 · Accepted Answer

If you inspect the network activity, you can see that an AJAX request is made to retrieve the transcript when the page loads.

In this case the request goes to https://www.khanacademy.org/api/internal/videos/2Zk6u7Uk5ow/transcript?casing=camel&locale=en&lang=en and the API URL appears to be built from the YouTube video ID, so it is very easy to recreate:

import json

import scrapy


class MySpider(scrapy.Spider):
    #...
    transcript_url_template = 'https://www.khanacademy.org/api/internal/videos/{}/transcript?locale=en&lang=en'

    def parse(self, response):
        # find the YouTube id in the og:video meta tag
        youtube_id = response.xpath("//meta[@property='og:video']/@content").re_first('v/(.+)')
        # create the transcript API url using the YouTube id
        url = self.transcript_url_template.format(youtube_id)
        # download the data and hand it to parse_transcript
        yield scrapy.Request(url, self.parse_transcript)

    def parse_transcript(self, response):
        # convert the JSON body to Python objects
        data = json.loads(response.body)
        # parse your data!
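
If you want to check the URL-building step outside of Scrapy, here is a minimal standalone sketch using only the standard library. The helper names `extract_youtube_id` and `build_transcript_url` are made up for illustration, and the example `og:video` value is an assumption about what that meta tag contains; the query parameters are copied from the request seen in the network tab.

```python
import re
from urllib.parse import urlencode

# Base of the internal API URL observed in the browser's network tab.
API_BASE = "https://www.khanacademy.org/api/internal/videos/{}/transcript"


def extract_youtube_id(og_video_url):
    """Return the part after 'v/' (same pattern as the spider), or None."""
    match = re.search(r"v/(.+)", og_video_url)
    return match.group(1) if match else None


def build_transcript_url(youtube_id, locale="en", lang="en"):
    """Build the transcript API URL for a given YouTube id."""
    # Parameter names and order mirror the observed request.
    query = urlencode({"casing": "camel", "locale": locale, "lang": lang})
    return API_BASE.format(youtube_id) + "?" + query


if __name__ == "__main__":
    # Hypothetical og:video content value, for illustration only.
    video_id = extract_youtube_id("https://www.youtube.com/v/2Zk6u7Uk5ow")
    print(build_transcript_url(video_id))
```

Since this is an internal API, the URL format may change without notice, so it is worth checking that `extract_youtube_id` did not return None before making the request.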
