python - BeautifulSoup: findAll('table') returns not only all the tables but also the text between them

Question

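I'm trying to isolate one part of a web page, but unfortunately it isn't contained in any element that I can pull out.

The closest I can get is to grab the whole body of the web page and then try to remove the tables (they are the only part I don't want).

The code I'm using: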

storyText = soup.body
toRemove = storyText.findAll('table')
for each in toRemove:
    print(each)

The problem at the moment is that the toRemove line returns the tables along with the text contained between them.

So I get:

<body>
<table>
    table stuff
</table>
    Text, not in tags </br> #This is what I want.
<table>
    table stuff
</table>
</body>

I worked around the problem as follows:

# Isolate body
findBody = soup.body
new = str(findBody)
# Section off the text from the tables before it.
sec = new.split('</table>')
# Select story area
newStory = sec[3]
# Section off the text from the tables after it.
newSec = newStory.split('<table')
# Select the story area; this is the area that we want.
story = newSec[0]

There must be a much cleaner way to do this, so I'm still looking for answers.

score 0 · Accepted Answer

Your code works fine on my Mac. Which version did you use? I'm using Beautiful Soup 4.

(I don't recommend Beautiful Soup 3, because it is no longer being developed. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ )

Here is my code:
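If you are not sure which version you have, a quick check along these lines should tell you. This is only a sketch: it assumes Beautiful Soup 4 is installed under its usual package name bs4. Beautiful Soup 3 was imported as `from BeautifulSoup import BeautifulSoup` instead, so an ImportError here would suggest that version 4 is not installed.

```python
import bs4

# bs4.__version__ holds the release string of the installed Beautiful Soup 4.
print(bs4.__version__)
```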

from bs4 import BeautifulSoup

contents = '''<body>
<table>
     table stuff1
</table>
     Text, not in tags </br> #This is what I want.
<table>
     table stuff2
</table>
</body>'''

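# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(contents, 'html.parser')

storyText = soup.body
toRemove = storyText.findAll('table')  # findAll is the older alias of find_all
for each in toRemove:
    print(each)
    each.extract()  # detach the table from the tree

print('----result-------------')
print(soup)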

It produces the following output:

<table>
    table stuff1
</table>
<table>
    table stuff2
</table>
----result-------------
<body>

    Text, not in tags  #This is what I want.

</body>
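
If you only need the loose text and would rather not modify the tree, another option might be to collect the strings that are direct children of body. The following is a minimal sketch, not a tested solution for your real page: it assumes the wanted text sits directly under body (not inside another tag), and the sample markup and the names direct_strings and story are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<body>
<table><tr><td>table stuff1</td></tr></table>
    Text, not in tags
<table><tr><td>table stuff2</td></tr></table>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# string=True matches text nodes; recursive=False limits the search to
# direct children of <body>, so text nested inside the tables is skipped.
direct_strings = soup.body.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and join what is left.
story = " ".join(s.strip() for s in direct_strings if s.strip())
print(story)
```

Unlike the split('</table>') workaround, this does not depend on how many tables come before the text.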
