python - Getting content from broken tags using Beautiful Soup

Question

score 2 · Accepted Answer

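Remove text=True from your code and it should work fine. With text=True, find_all only returns tags whose .string is not None, and the first <a> holds both text and a <br>, so its .string is None.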

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <html>
... <body>
... <a href = "http:\\www.google.com">Google<br>
... <a href = "http:\\www.example.com">Example</a>
... </body>
... </html>
... ''')
>>> [a.get_text().strip() for a in soup.find_all('a')]
[u'Google', u'Example']
>>> [a.get_text().strip() for a in soup.find_all('a', text=True)]
[u'Example']
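
The difference between the two calls in the session above comes down to .string versus get_text(). The following is a minimal sketch of that behaviour, not part of the original answer. It assumes bs4 4.4.0 or later (where string= is the newer name for text=), the built-in "html.parser" backend, and a simplified snippet with a closed first <a> so the result does not depend on how a particular parser repairs the broken markup.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the markup in the question (assumption).
html = '<p><a href="#one">Google<br></a><a href="#two">Example</a></p>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# .string is only set when a tag has exactly one child; the first <a>
# has two children (the text and the <br>), so it yields None.
strings = [a.string for a in links]              # [None, 'Example']

# get_text() concatenates every text node below the tag instead.
texts = [a.get_text(strip=True) for a in links]  # ['Google', 'Example']

# string=True (text=True in older code) keeps only tags whose .string is set.
filtered = [a.get_text(strip=True) for a in soup.find_all("a", string=True)]

print(strings, texts, filtered)
```

This is why dropping the filter and calling get_text() on every <a> recovers the "Google" link as well.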

score 0 · Accepted Answer

Try this code:

from BeautifulSoup import BeautifulSoup

text = '''
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>
'''

soup = BeautifulSoup(text)

for link in soup.findAll('a'):
    # .string is None when a tag has more than one child (e.g. text plus <br>)
    if link.string is not None:
        print(link.string)

Here is the output when I run the code:

Example

Replace text with text = open('sol.html').read(), or whatever you need to put there.
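
Loading the markup from a file could look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's code: the helper name first_link_texts is hypothetical, the file is assumed to be UTF-8, and it uses bs4 with the built-in "html.parser" backend. Instead of relying on .string, it takes the first text node under each <a>, which also picks up a link whose closing tag is missing.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def first_link_texts(path, encoding="utf-8"):
    """Return the first non-empty text fragment under each <a> in an HTML file.

    Hypothetical helper for illustration; the encoding default is an assumption.
    """
    html = Path(path).read_text(encoding=encoding)
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for link in soup.find_all("a"):
        # First text node below the tag, or None if the tag holds no text.
        first = link.find(string=True)
        if first is not None and first.strip():
            texts.append(first.strip())
    return texts
```

Calling first_link_texts('sol.html') would then return a list of link labels for that file.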
