python - Finding specific comments in HTML code using Python

Question

I can't find a specific comment using Python. My main goal is to find all the links between two specific comments, something like a parser. I tried this with BeautifulSoup:

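from urllib import urlopen  # Python 2; on Python 3 use urllib.request
from bs4 import BeautifulSoup

over = urlopen("http://www.gamespot.com").read()
soup = BeautifulSoup(over)
print soup.find("<!--why-->")  # prints None: find() looks for a tag with that name, not a comment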

But it doesn't work. I think I might need to use a regex rather than BeautifulSoup.

Please help.

Example: I have HTML code like this

<!--why-->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not-->

Edit: There may be other things, such as tags, between the two comments.

And I need to store all the links.

score 6 · Accepted Answer

If you want all the comments, you can use findAll with a callable:

>>> from bs4 import BeautifulSoup, Comment
>>> 
>>> s = """
... <p>header</p>
... <!-- why -->
... www.test1.com
... www.test2.org
... <!-- why not -->
... <p>tail</p>
... """
>>> 
>>> soup = BeautifulSoup(s)
>>> comments = soup.findAll(text = lambda text: isinstance(text, Comment))
>>> 
>>> comments
[u' why ', u' why not ']

Once you have them, you can use the usual tricks to navigate from there:

>>> comments[0].next
u'\nwww.test1.com\nwww.test2.org\n'
>>> comments[0].next.split()
[u'www.test1.com', u'www.test2.org']

Depending on what the page actually looks like, you may need to tweak this a little, and you will have to pick out the comments you want, but it should get you started.
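When tags sit between the two comments, as the question's edit describes, a single .next step only reaches the first text node. One way to handle that is to walk every node after the opening comment until the closing one appears. The following is a minimal sketch in Python 3 with bs4, not part of the original answer: the helper name text_between_comments is hypothetical, the sample HTML is the question's snippet with a header and tail added, and treating anything that starts with "www." as a link is an assumption.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

HTML = """
<p>header</p>
<!--why-->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not-->
<p>tail</p>
"""


def text_between_comments(soup, start, end):
    """Return whitespace-separated tokens found between two comments.

    Hypothetical helper: `start` and `end` are the comment texts,
    compared after stripping surrounding whitespace.
    """
    # Locate the opening comment by its stripped text.
    first = soup.find(
        string=lambda t: isinstance(t, Comment) and t.strip() == start
    )
    if first is None:
        return []

    tokens = []
    # next_elements yields every following node in document order,
    # including tags and the strings nested inside them.
    for node in first.next_elements:
        if isinstance(node, Comment):
            if node.strip() == end:
                break  # reached the closing comment
            continue  # ignore unrelated comments in between
        if isinstance(node, NavigableString):
            tokens.extend(node.split())
    return tokens


soup = BeautifulSoup(HTML, "html.parser")
tokens = text_between_comments(soup, "why", "why not")

# Assumption: a "link" is any token that starts with "www.".
links = [t for t in tokens if t.startswith("www.")]
print(links)
```

If the real page holds its links in anchor tags rather than bare text, the loop would instead need to check the Tag nodes it passes and read their href attributes.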

Edit:

If you really only want the comments whose text matches something specific, you can do something like

>>> comments = soup.findAll(text = lambda text: isinstance(text, Comment) and text.strip() == 'why')
>>> comments
[u' why ']

Or, starting from the full list of comments, you can filter them after the fact with a list comprehension:

>>> [c for c in comments if c.strip().startswith("why")]
[u' why ', u' why not ']
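If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser instead of the regex the question considers. This is an alternative technique, not what the answer above uses. The class name BetweenComments, the SAMPLE string, and the "www." prefix test for links are all assumptions made for illustration.

```python
from html.parser import HTMLParser

SAMPLE = """
<!--why-->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not-->
"""


class BetweenComments(HTMLParser):
    """Collect text that appears between two named comments (hypothetical class)."""

    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end
        self.inside = False
        self.tokens = []

    def handle_comment(self, data):
        # data is the comment body without the <!-- and --> delimiters.
        name = data.strip()
        if not self.inside and name == self.start:
            self.inside = True
        elif self.inside and name == self.end:
            self.inside = False

    def handle_data(self, data):
        # Called for text nodes, including text nested inside tags.
        if self.inside:
            self.tokens.extend(data.split())


parser = BetweenComments("why", "why not")
parser.feed(SAMPLE)
parser.close()

# Assumption: a "link" is any token that starts with "www.".
links = [t for t in parser.tokens if t.startswith("www.")]
print(links)
```

Because the parser is event-driven, tags between the two comments need no special handling: their text simply arrives through handle_data while the inside flag is set.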
