python - Parsing an HTML table with Python BeautifulSoup

Question

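I am trying to use BeautifulSoup to parse an HTML table that I uploaded to http://pastie.org/8070879, in order to get its three columns (0 to 735, 0.50 to 1.0, and 0.5 to 0.0) as lists. To explain why: I want the integers 0 to 735 to be keys and the decimal numbers to be values.

From reading many other posts on SO, I came up with the following, which does not come close to creating the lists I want. All it does is display the text in the table, as seen here: http://i1285.photobucket.com/albums/a592/TheNexulo/output_zps20c5afb8.png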

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("fide.html"))
table = soup.find('table')

rows = table.findAll('tr')

for tr in rows:
  cols = tr.findAll('td')
  for td in cols:
     text = ''.join(td.find(text=True))
     print text + "|",
  print

I am new to Python and BeautifulSoup, so please be gentle with me. Thanks!

score 3 · Accepted Answer

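HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the input HTML structure. But sometimes (as in this case) that model gets in the way more than it helps. Pyparsing includes some HTML parsing features that are more robust than raw regular expressions but otherwise work in a similar fashion, letting you define the snippets of HTML you care about and ignore the rest. Here is a parser that reads through your posted HTML source: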

from pyparsing import makeHTMLTags,withAttribute,Suppress,Regex,Group

""" looking for this recurring pattern:
          <td valign="top" bgcolor="#FFFFCC">00-03</td>
          <td valign="top">.50</td>
          <td valign="top">.50</td>

    and want a dict with keys 0, 1, 2, and 3 all with values (.50,.50)
"""

td,tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td,tdend,keytd = map(Suppress,(td,tdend,keytd))

realnum = Regex(r'1?\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t:int(t[0]))
DASH = Suppress('-')

# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend + 
                    Group(2*(td + realnum + tdend))("vals"))

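This parser not only picks out the matching triples, it also extracts the start and end integers and the pairs of real numbers (and converts them from strings to ints or floats at parse time).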
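To see what the named results look like, the same expression can be run against the three-cell pattern quoted in the docstring. This is only a sketch: `sample_html` is a made-up stand-in for the real page, and the `<tr>` wrapper around the cells is an assumption.

```python
from pyparsing import Group, Regex, Suppress, makeHTMLTags, withAttribute

# Hypothetical stand-in for the real page: one key cell followed by two value cells.
sample_html = """
<tr>
  <td valign="top" bgcolor="#FFFFCC">00-03</td>
  <td valign="top">.50</td>
  <td valign="top">.50</td>
</tr>
"""

# Same expression as in the answer above.
td, tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td, tdend, keytd = map(Suppress, (td, tdend, keytd))

realnum = Regex(r'1?\.\d+').setParseAction(lambda t: float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t: int(t[0]))
DASH = Suppress('-')

entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
             Group(2 * (td + realnum + tdend))("vals"))

# searchString scans the text and returns one result per match.
entry = entryExpr.searchString(sample_html)[0]
print(entry.start, entry.end, list(entry.vals))
```

The named results `start`, `end`, and `vals` are available as attributes on each match, already converted to numbers by the parse actions.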

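Looking at the table, I'm guessing you actually want a lookup that takes a key like 700 and returns the pair of values (0.99, 0.01), since 700 falls in the range 620-735. This bit of code searches the source HTML text, iterates over the matched entries, and inserts key-value pairs into the dict lookup: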

# search the input HTML for matches to the entryExpr expression, and build up lookup dict
# (sourcehtml is assumed to hold the page's HTML text, e.g. read from the saved file)
lookup = {}
for entry in entryExpr.searchString(sourcehtml):
    for i in range(entry.start, entry.end+1):
        lookup[i] = tuple(entry.vals)

And now to try out some lookups:

# print out some test values
for test in (0,20,100,700):
    print (test, lookup[test])

prints:

0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)
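The dict built above stores one entry per integer in each range. If that expansion is not wanted, one possible alternative (not what the answer does) is to keep the ranges themselves and locate the right one with the standard `bisect` module. The sketch below uses a hypothetical `entries` list: only the 0-3 and 620-735 rows are taken from this post, and a real table would supply one tuple per row.

```python
from bisect import bisect_right

# Hypothetical (start, end, values) entries; only these two rows appear in the post.
entries = [
    (0, 3, (0.50, 0.50)),
    (620, 735, (0.99, 0.01)),
]

# Sort by range start so bisect can find the candidate range.
entries.sort()
starts = [start for start, _end, _vals in entries]


def lookup_range(key):
    """Return the value pair whose range contains key, or None if no range does."""
    i = bisect_right(starts, key) - 1  # last range starting at or before key
    if i < 0:
        return None
    start, end, vals = entries[i]
    return vals if start <= key <= end else None


print(lookup_range(700))
print(lookup_range(2))
print(lookup_range(100))  # not covered by the two sample ranges
```

For a table with a few dozen ranges either approach is fine; the plain dict is simpler, while the range list avoids duplicating values and reports gaps explicitly.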

score 3 · Accepted Answer

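I think the answer above is better than what I would offer, but here is a BeautifulSoup answer that can get you started. It is a bit of a hack, but I figured I would offer it nevertheless.

With BeautifulSoup, you can find all tags with a given attribute in the following way (assuming you already have a soup object set up):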

soup.find_all('td', attrs={'bgcolor':'#FFFFCC'})

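This will find all of your keys. The trick is to associate them with the values you want, which all appear immediately afterwards and come in pairs (if that layout changes, by the way, this solution won't work).

Thus, you can try the following to access what follows your key entries and put those into your_dictionary: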

for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
   your_dictionary[node.string] = node.next_sibling

The problem is that the "next_sibling" is actually a '\n', so you have to do the following to capture the next value (the first value you want):

for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
  your_dictionary[node.string] = node.next_sibling.next_sibling.string

If you want the two following values, you have to double this:

for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
  your_dictionary[node.string] = [node.next_sibling.next_sibling.string, node.next_sibling.next_sibling.next_sibling.next_sibling.string]

Disclaimer: that last line is pretty ugly to me.
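One way to avoid the chained next_sibling calls might be `find_next_siblings`, which returns only tags and therefore skips the '\n' text nodes. The sketch below is an assumption-laden illustration: `sample_html` is a hypothetical two-row table in the shape quoted in the first answer (the markup of the second row is assumed, its values come from the 620-735 lookup mentioned there), and the cell text is converted to floats for convenience.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page would be loaded from the saved file instead.
sample_html = """
<table>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">00-03</td>
    <td valign="top">.50</td>
    <td valign="top">.50</td>
  </tr>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">620-735</td>
    <td valign="top">.99</td>
    <td valign="top">.01</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

your_dictionary = {}
for node in soup.find_all("td", attrs={"bgcolor": "#FFFFCC"}):
    # find_next_siblings returns only <td> tags, so whitespace text nodes are skipped;
    # limit=2 keeps just the two value cells that follow the key cell.
    value_cells = node.find_next_siblings("td", limit=2)
    key = node.get_text(strip=True)
    your_dictionary[key] = [float(cell.get_text(strip=True)) for cell in value_cells]

print(your_dictionary)
```

Like the original loop, this still relies on the two value cells directly following each key cell within the same row.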
