python - BeautifulSoup: Get the contents of a specific table

Question

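My local airport disgracefully blocks users without IE, and its site looks awful. I want to write a Python script that fetches the contents of the arrivals and departures pages every few minutes and shows them in a more readable way.

My tools of choice are mechanize, for tricking the site into believing I am using IE, and BeautifulSoup, for parsing the page to get the flight data table.

Quite honestly, I got lost in the BeautifulSoup documentation. I can't work out how to get the table (whose title I know) out of the whole document, or how to get a list of rows from that table.

Any ideas?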

score 53 · Accepted Answer

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.

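from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

html = urlopen(url).read()  # url: address of the page to parse
bs = BeautifulSoup(html, 'html.parser')
# Find the table whose id is "Table1", then collect all of its tr elements
table = bs.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']=="Table1")
rows = table.findAll(lambda tag: tag.name=='tr')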
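
To try the lookup without fetching anything, here is a minimal self-contained sketch. The HTML fragment, its cell values, and the names sample_html and find_table_rows are all made up for illustration. It assumes bs4 is installed and uses the built-in html.parser. It also uses the keyword form find("table", id=...), which for a plain id match should select the same table as the lambda version.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented for this sketch
sample_html = """
<html><body>
<table id="Other"><tr><td>ignore me</td></tr></table>
<table id="Table1">
  <tr><td>XX001</td><td>10:15</td></tr>
  <tr><td>XX002</td><td>11:40</td></tr>
</table>
</body></html>
"""

def find_table_rows(html, table_id):
    """Return the tr elements of the table with the given id, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:  # find() returns None when nothing matches
        return []
    return table.find_all("tr")

rows = find_table_rows(sample_html, "Table1")
# Text of every td cell, one inner list per row
cells = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
```

Checking for None before calling find_all avoids an AttributeError when the page layout changes and the table is no longer there.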

score 17 · Accepted Answer

from bs4 import BeautifulSoup

soup = BeautifulSoup(HTML, "html.parser")  # HTML holds the page source

# The first argument to find() is the tag name to search for.
# The second is an optional dict of attr->value pairs that
# narrows the results to tags with those attributes.
table = soup.find("table", {"title": "TheTitle"})

# findAll() returns every matching tr; list() copies them into a plain list
rows = list(table.findAll("tr"))

# Now rows contains each tr in the table (as a BeautifulSoup object),
# and you can search them to pull out the times.
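
As a rough sketch of that last step, the snippet below pulls the text of one column out of each row. The markup, the flight codes, and the assumption that the time sits in the second td cell are all invented here. The real page will need its own column index.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the title "TheTitle" comes from the snippet above
sample_html = """
<table title="TheTitle">
  <tr><th>Flight</th><th>Time</th></tr>
  <tr><td>XX101</td><td>08:30</td></tr>
  <tr><td>XX202</td><td>09:45</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"title": "TheTitle"})

times = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:  # skip rows without data cells, such as the th header row
        continue
    # Assumed layout: the time is in the second column
    times.append(cells[1].get_text(strip=True))
```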

score 13 · Accepted Answer

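Here is a working example for a generic <table>. (It does not use your page, because that page needs JavaScript execution to load the table data.)

It extracts the table data from a page listing GDP (Gross Domestic Product) by country.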

from bs4 import BeautifulSoup
html = ... # read your html with urllib/requests etc.
soup = BeautifulSoup(html, 'lxml')  # 'lxml' requires the lxml package

htmltable = soup.find('table', { 'class' : 'table table-striped' })
# where the dictionary specifies unique attributes for the 'table' tag
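
One caveat with matching on a multi-class string: as far as I know, passing 'table table-striped' compares against the class attribute as a whole, so a table whose classes appear in another order, or with an extra class, would be missed. A CSS selector is a possible alternative. The sketch below uses made-up markup and assumes a bs4 version that ships select_one (which relies on the soupsieve package).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in a different order, plus an extra one
sample_html = (
    '<table class="table-striped table table-hover">'
    "<tr><td>1</td></tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Dict form: compares against the full class string, so order and extras matter
exact = soup.find("table", {"class": "table table-striped"})

# Selector form: requires both classes to be present, in any order
by_selector = soup.select_one("table.table.table-striped")
```

If the site's markup is stable, the dict form is fine. The selector form is just more forgiving when the class list changes.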

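The function below parses an HTML segment that starts with a <table> tag followed by multiple <tr> (table row) tags with inner <td> (table data) tags. It returns a list of rows, each holding its inner columns. It accepts <th> (table header) cells only in the first row.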

def tableDataText(table):    
    """Parses a html segment started with tag <table> followed 
    by multiple <tr> (table rows) and inner <td> (table data) tags. 
    It returns a list of rows with inner columns. 
    Accepts only one <th> (table header/data) in the first row.
    """
    def rowgetDataText(tr, coltag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]  
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append(rowgetDataText(tr, 'td') ) # data row       
    return rows

Using it, we get (first two rows):

list_table = tableDataText(htmltable)
list_table[:2]

[['Rank',
  'Name',
  "GDP (IMF '19)",
  "GDP (UN '16)",
  'GDP Per Capita',
  '2019 Population'],
 ['1',
  'United States',
  '21.41 trillion',
  '18.62 trillion',
  '$65,064',
  '329,064,917']]
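
If you want to stay in plain Python, the rows can also be paired with the header row to get one dict per country. This is only a sketch: rows_to_records and to_int are hypothetical helper names, and the data is just the two rows shown above.

```python
# The two rows shown above, copied as-is
list_table = [
    ['Rank', 'Name', "GDP (IMF '19)", "GDP (UN '16)",
     'GDP Per Capita', '2019 Population'],
    ['1', 'United States', '21.41 trillion', '18.62 trillion',
     '$65,064', '329,064,917'],
]

def rows_to_records(rows):
    """Pair every data row with the header row (hypothetical helper)."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

def to_int(text):
    """Strip '$' and thousands separators, then convert to int."""
    return int(text.replace('$', '').replace(',', ''))

records = rows_to_records(list_table)
population = to_int(records[0]['2019 Population'])
```

All cell values come back as strings, so any numeric work needs a conversion step like to_int first.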

This can easily be converted to a pandas.DataFrame for more advanced manipulation.

import pandas as pd

dftable = pd.DataFrame(list_table[1:], columns=list_table[0])
dftable.head(4)
