python - Extracting data from an HTML table

Question

I am looking for a way to get certain information out of HTML in a Linux shell environment.

This is the bit that I am interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on..

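What I can do at the moment is write a Java program that uses a SAX parser, or an HTML parser such as jsoup, to extract this information.

However, using Java here seems to add overhead, because I would have to bundle a runnable jar inside the "wrapper" script I want to execute.

I am sure there must be "shell" languages that can do the same thing, e.g. perl, python, bash, etc.

My problem is that I have no experience with any of them. Can somebody help me solve this "fairly easy" problem?

Quick update:

I forgot to mention that the .html document contains more tables and more rows. Sorry about that (early morning).

Update #2:

I don't have root access, so I tried to install Bsoup like this: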

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py   # paste the code from Tichodromas' answer; just in case, this (http://pastebin.com/4Je11Y9q) is what I pasted
$ python htmlParse.py   # run the file

Error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives me the following error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

score 52 · Accepted Answer

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: using class="details" to select the table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets

The result looks like this:

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

Edit2: To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
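
Since the question also asks about storing the values in shell variables, here is a minimal sketch of one way to do that. It assumes Python 3 (for shlex.quote), takes a list of (heading, value) pairs in the same shape as one entry of datasets above, and assumes every heading starts with a letter so that it maps to a valid variable name. The function name to_shell_assignments is made up for this illustration.

```python
import re
import shlex

def to_shell_assignments(dataset):
    """Turn (heading, value) pairs into NAME=value lines for a POSIX shell."""
    lines = []
    for heading, value in dataset:
        # "Success Rate" -> "SUCCESS_RATE": runs of non-alphanumerics become "_".
        name = re.sub(r"[^A-Za-z0-9]+", "_", heading).strip("_").upper()
        # shlex.quote adds single quotes only when the value needs them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines

# Sample data in the same shape as one entry of `datasets`.
dataset = [("Tests", "103"), ("Failures", "24"),
           ("Success Rate", "76.70%"), ("Average Time", "71 ms")]
for line in to_shell_assignments(dataset):
    print(line)
```

If this were saved as, say, make_vars.py (a hypothetical name), a wrapper script could probably load the values with eval "$(python3 make_vars.py)" and then refer to $TESTS, $FAILURES and so on.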

score 3 · Accepted Answer

Assuming your HTML code is stored in a file named mycode.html, here is a bash way:

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

Note: the output is not perfectly aligned.
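
The misalignment comes from the indentation that grep keeps from the source lines and from the headings having different lengths. As a rough sketch of the fix, assuming the pairs have already been read into Python 3 as (key, value) tuples, each key can be stripped and padded to the width of the longest one. The name format_aligned is made up for this illustration.

```python
def format_aligned(pairs):
    """Strip stray whitespace and pad each key so that the colons line up."""
    pairs = [(key.strip(), value.strip()) for key, value in pairs]
    if not pairs:
        return []
    width = max(len(key) for key, _ in pairs)
    return ["{0:<{1}} : {2}".format(key, width, value) for key, value in pairs]

# Keys and values with the leading indentation the grep/sed pipeline leaves behind.
pairs = [("    Tests", "    103"),
         ("    Failures", "    24"),
         ("    Success Rate", "    76.70%")]
for line in format_aligned(pairs):
    print(line)
```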

score 1 · Accepted Answer

undef $/;
$text = <DATA>;

@tabs = $text =~ m!<table.*?>(.*?)</table>!gms;
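for (@tabs) {
    # Collect the header and data cells of this table, then print them pairwise.
    # Printing inside the loop keeps earlier tables from being overwritten.
    @th = m!<th>(.*?)</th>!gms;
    @td = m!<td>(.*?)</td>!gms;
    for $i (0..$#th) {
        printf "%-16s\t: %s\n", $th[$i], $td[$i];
    }
}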

__DATA__
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>

It outputs the following:

Tests               : 103
Failures            : 24
Success Rate        : 76.70%
Average Time        : 71 ms
Min Time            : 0 ms
Max Time            : 829 ms

score 1 · Accepted Answer

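A Python solution that uses only the standard library (it takes advantage of the fact that the HTML happens to be well-formed XML). It can handle more than one row of data.

(Tested with Python 2.6 and 2.7. The question was updated to say that the OP is using Python 2.4, so this answer may not be very useful in that case. ElementTree was added in Python 2.5.)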

from xml.etree.ElementTree import fromstring

HTML = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
  <tr valign="top" class="whatever">
    <td>A</td>
    <td>B</td>
    <td>C</td>
    <td>D</td>
    <td>E</td>
    <td>F</td>
  </tr>
</table>"""

tree = fromstring(HTML)
rows = tree.findall("tr")
headrow = rows[0]
datarows = rows[1:]

for num, h in enumerate(headrow):
    data = ", ".join([row[num].text for row in datarows])
    print "{0:<16}: {1}".format(h.text, data)

Output:

Tests           : 103, A
Failures        : 24, B
Success Rate    : 76.70%, C
Average Time    : 71 ms, D
Min Time        : 0 ms, E
Max Time        : 829 ms, F
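
The question mentions that the real document contains more than one table. The following is a sketch of how the same ElementTree approach might be extended to that case on Python 3. It assumes the whole document is still well-formed XML and that the tables of interest carry class="details". The sample document and the name details_rows are made up for this illustration.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical sample document: a wrapper element holding two tables.
DOC = """
<body>
  <table class="summary"><tr><th>Ignored</th></tr><tr><td>x</td></tr></table>
  <table class="details">
    <tr><th>Tests</th><th>Failures</th></tr>
    <tr><td>103</td><td>24</td></tr>
    <tr><td>A</td><td>B</td></tr>
  </table>
</body>"""

def details_rows(xml_text, css_class="details"):
    """Yield one {heading: value} dict per data row of each matching table."""
    root = fromstring(xml_text)
    for table in root.iter("table"):
        if table.get("class") != css_class:
            continue  # skip tables we are not interested in
        rows = table.findall("tr")
        if not rows:
            continue
        headings = [th.text for th in rows[0]]  # first row holds the field names
        for row in rows[1:]:
            yield dict(zip(headings, (td.text for td in row)))

for record in details_rows(DOC):
    print(record)
```

Returning one dict per row, instead of joining the values per column, keeps each record separate. That is probably easier to work with when the rows describe different test runs.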

score 1 · Accepted Answer

Here is a Python regex-based solution that I have tested on Python 2.7. It does not rely on the xml module, so it will work even if the XML is not fully well-formed.

import re
# input args: html string
# output: tables as a list, column max length
def extract_html_tables(html):
  tables=[]
  maxlen=0
  rex1=r'<table.*?/table>'
  rex2=r'<tr.*?/tr>'
  rex3=r'<(td|th).*?/(td|th)>'
  s = re.search(rex1,html,re.DOTALL)
  while s:
    t = s.group()  # the table
    s2 = re.search(rex2,t,re.DOTALL)
    table = []
    while s2:
      r = s2.group() # the row 
      s3 = re.search(rex3,r,re.DOTALL)
      row=[]
      while s3:
        d = s3.group() # the cell
        #row.append(strip_tags(d).strip() )
        row.append(d.strip() )

        r = re.sub(rex3,'',r,1,re.DOTALL)
        s3 = re.search(rex3,r,re.DOTALL)

      table.append( row )
      if maxlen<len(row):
        maxlen = len(row)

      t = re.sub(rex2,'',t,1,re.DOTALL)
      s2 = re.search(rex2,t,re.DOTALL)

    html = re.sub(rex1,'',html,1,re.DOTALL)
    tables.append(table)
    s = re.search(rex1,html,re.DOTALL)
  return tables, maxlen

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""
print extract_html_tables(html)
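
As an alternative that also avoids third-party packages and does not require well-formed XML, the standard-library HTML parser can be used instead of regular expressions. This is only a sketch under some assumptions: it targets Python 3 (the module is named html.parser there), it does not handle nested tables, and the class name TableExtractor is made up for this illustration.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # list of text pieces while inside <td>/<th>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Store the stripped cell text in the current row of the current table.
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

html = """
<table class="details">
  <tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
  <tr class="Failure"><td>103</td><td>24</td><td>76.70%</td></tr>
</table>"""

parser = TableExtractor()
parser.feed(html)
parser.close()
for table in parser.tables:
    if not table:
        continue
    headings, data_rows = table[0], table[1:]
    for row in data_rows:
        for heading, value in zip(headings, row):
            print("{0:<16}: {1}".format(heading, value))
```

Unlike the regex version above, the cells come back as plain text with the td/th tags already removed.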
