python - How to scrape contents from multiple tables in a webpage

Question

I want to scrape contents from multiple tables in a web page. The HTML code looks like this:

<div class="fixtures-table full-table-medium" id="fixtures-data">             
    <h2 class="table-header"> Date 1    </h2>
    <table class="table-stats">
        <tbody>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>team 1</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>team 2</a>                
                        </span>
                    </p>
                </td>
            </tr>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='match-details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>team 3</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>team 4</a>                
                        </span>
                    </p>
                </td>
            </tr>
        </tbody>
    </table>

    <h2 class="table-header"> Date 2    </h2>
    <table class="table-stats">
        <tbody>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='match-details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>team X</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>team Y</a>                
                        </span>
                    </p>
                </td>
            </tr>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='match-details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>Team A</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>Team B</a>                
                        </span>
                    </p>
                </td>
            </tr>
        </tbody>
    </table>
</div>

There are more matches under each date (9, 2, or 1, depending on the matches played that day). The number of tables is 63 (equal to the number of days).

For each date, I want to extract the matches between teams, along with which team is home and which is away.

I am using the Scrapy shell and tried the following commands:

 title = sel.xpath("//td[@class = 'match-details']")[0] 
 l_home = title.xpath("//span[@class = 'team-home teams']/a/text()").extract()

This printed a list of the home teams, and the following printed a list of all the away teams:

 l_Away = title.xpath("//span[@class = 'team-away teams']/a/text()").extract()

This gave me a list of all the dates:

sel.xpath("/html/body/div[3]/div/div/div/div[4]/div[2]/div/h2/text()").extract()

What I want is, for every date, to get the matches played on that day (and also which team is home and which is away).

My items.py looks like this:

date = Field()
home_team = Field()
away_team2 = Field()

Please help me write the parse function and the Item class.

Thanks in advance.

score 3 · Accepted Answer

Here is an example of the logic, from the scrapy shell:

>>> for table in response.xpath('//table[@class="table-stats"]'):
...     date = table.xpath('./preceding-sibling::h2[1]/text()').extract()[0]
...     print(date)
...     for match in table.xpath('.//tr[@class="preview" and @id]'):
...         home_team = match.xpath('.//span[@class="team-home teams"]/a/text()').extract()[0]
...         away_team = match.xpath('.//span[@class="team-away teams"]/a/text()').extract()[0]
...         print(home_team, away_team)
... 
 Date 1    
team 1 team 2
team 3 team 4
 Date 2    
team X team Y
Team A Team B

In the parse() method, you would need to instantiate an Item instance in the inner loop and yield it:

def parse(self, response):
    for table in response.xpath('//table[@class="table-stats"]'):
        date = table.xpath('./preceding-sibling::h2[1]/text()').extract()[0]
        for match in table.xpath('.//tr[@class="preview" and @id]'):
            home_team = match.xpath('.//span[@class="team-home teams"]/a/text()').extract()[0]
            away_team = match.xpath('.//span[@class="team-away teams"]/a/text()').extract()[0]

            item = MyItem()
            item['date'] = date
            item['home_team'] = home_team
            item['away_team'] = away_team
            yield item

where MyItem is:

from scrapy.item import Item, Field


class MyItem(Item):
    date = Field()
    home_team = Field()
    away_team = Field()
