python - Problem using BeautifulSoup to parse tags and extract values

Question

I have some HTML that looks like this:

<tr>
  <td>some text</td>
  <td>some other text</td>
  <td>some <b>problematic</b> other <br /> text</td>
</tr>

And this Python, which tries to get each tag and print its inner value:

soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
for row in soup.findAll('tr'):
    print repr(row) # this prints the whole 'tr' element text just fine.
    for col in row.contents:
        print col.string

Printing the whole row shows the captured HTML correctly, but col.string prints None for the last element:

some text
some other text
None

I'm not very familiar with BeautifulSoup or Python, but it looks like the inner tags in the last element are causing a parsing problem.

Thanks

score 0 · Accepted Answer

You can upgrade to BeautifulSoup version 4 and use .stripped_strings:

soup = BeautifulSoup(data)
for row in soup.find_all('tr'):
    print '\n'.join(row.stripped_strings)
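
As a rough illustration of why the last cell prints None: .string only returns a value when a tag has a single child, and the third td contains several text nodes plus the b and br tags. The sketch below assumes Python 3 with the bs4 package installed and uses the built-in "html.parser" backend. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
data = """
<tr>
  <td>some text</td>
  <td>some other text</td>
  <td>some <b>problematic</b> other <br /> text</td>
</tr>
"""

soup = BeautifulSoup(data, "html.parser")
cells = soup.find_all("td")

# .string is None as soon as a tag has more than one child.
strings = [cell.string for cell in cells]

# .stripped_strings yields every text fragment in the tag, whitespace-trimmed,
# and skips fragments that are only whitespace.
fragments = [list(cell.stripped_strings) for cell in cells]

print(strings)
print(fragments)
```

So the parse itself is fine. The mixed content of the third cell is simply not reachable through .string.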

In BeautifulSoup 3, you need to search for all contained text nodes instead:

for row in soup.findAll('tr'):
    print '\n'.join(el.strip() for el in row.findAll(text=True) if el.strip())
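
If you want one value per column rather than one line per text fragment, one possible approach is to join the fragments of each td separately. This is a minimal sketch assuming Python 3 and bs4. The helper name cell_texts is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

data = (
    "<tr><td>some text</td><td>some other text</td>"
    "<td>some <b>problematic</b> other <br /> text</td></tr>"
)

def cell_texts(html):
    """Return one list per row, holding the flattened text of each cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # Join the trimmed fragments of each cell with a single space.
        rows.append([" ".join(td.stripped_strings) for td in row.find_all("td")])
    return rows

for row in cell_texts(data):
    for value in row:
        print(value)
```

Iterating over row.find_all("td") instead of row.contents also avoids the whitespace-only text nodes that sit between the cells.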
