python - How to retrieve HTML tags based on a regular expression

Question

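I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want to get every tag that includes the string "name", and I have an HTML document like this: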

<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>

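I could probably use a regex to catch every match between an opening "<" and a closing ">", but I also want to traverse the parsed tree starting from those matches, so that I can get the siblings, parents, or "nextElements". In the example above, that amounts to getting <head>*</head>, or maybe <h2>*</h2>, once I know they are parents or siblings of the tags that contained the match.

I've tried BeautifulSoup, but it seems most useful when you already know what kind of tag you're looking for, or can select it by its contents. In this case I want to get a match first, take that match as a starting point, and then navigate the tree the way BeautifulSoup and other HTML parsers allow.

Any suggestions?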

score 2 · Accepted Answer

Use lxml.html. It's a good parser, and it supports XPath, which can express what you need easily.

The example below uses this XPath expression:

//*[contains(text(),'name')]/parent::*/following-sibling::*[1]/*[@class='name']/text()

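In plain English, that translates to:

Find any tag whose text contains the word 'name', then get its parent, then that parent's next sibling, find inside it any tag with the class 'name', and finally return the text content of that tag.

The result of running the code is: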

['This is also a tag to be retrieved']

Here is the full code:

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

import lxml.html

doc = lxml.html.fromstring(text)
# $stuff is an XPath variable, bound by the stuff= keyword argument
print(doc.xpath('//*[contains(text(), $stuff)]/parent::*/'
    'following-sibling::*[1]/*[@class=$stuff]/text()', stuff='name'))
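
Because the question asks for regular-expression matching rather than a plain substring test, the same idea can be sketched with lxml's EXSLT regular-expression extension (the re:test XPath function in the http://exslt.org/regular-expressions namespace), followed by step-by-step navigation with getparent() and getnext(). This is only a sketch: the pattern na[a-z]e and the names EXSLT_RE, matches and results are illustrative assumptions, not part of the original answer.

```python
import lxml.html

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

# Namespace prefix that enables EXSLT regular expressions in lxml's XPath
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

doc = lxml.html.fromstring(text)

# Step 1: elements whose first text node matches the (illustrative) pattern
matches = doc.xpath("//*[re:test(text(), $pattern)]",
                    namespaces=EXSLT_RE, pattern="na[a-z]e")

# Step 2: use each match as the starting point for tree navigation
results = []
for el in matches:
    parent = el.getparent()        # None only for the root element
    if parent is None:
        continue
    sibling = parent.getnext()     # next sibling element of the parent, or None
    inner = sibling.xpath("*[@class='name']/text()") if sibling is not None else []
    results.append((el.tag, parent.tag, inner))

print(results)
```

Splitting the walk into separate calls makes it easy to inspect each intermediate element, at the cost of being longer than a single XPath expression.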

Obligatory reading: the "don't parse HTML with regular expressions" answer here: https://stackoverflow.com/a/1732454/17160
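
As a small illustration of why that advice exists, the sketch below applies a naive regular expression to nested markup. The <div> snippet is a hypothetical example made up for this illustration and does not come from the question. The point is only that a pattern has no notion of nesting, while a parser builds the tree for you.

```python
import re

# Hypothetical nested markup, made up for this illustration
html = "<div><div>inner name</div> outer</div>"

# A naive attempt to capture "a div and its contents"
m = re.search(r"<div>.*?</div>", html)

# The non-greedy match stops at the FIRST closing tag, so the result
# pairs the outer opening tag with the inner closing tag.
print(m.group(0))  # <div><div>inner name</div>
```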

score 1 · Accepted Answer

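Given the following conditions (a tag is retrieved if either one holds):

The match occurs in the value of an attribute on the tag
The match occurs in a text node that is a direct child of the tag

You can use Beautiful Soup: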

from bs4 import BeautifulSoup
from bs4 import NavigableString
import re

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

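soup = BeautifulSoup(html, "html.parser")
p = re.compile("name")

def match(patt):
    def closure(tag):
        # Text nodes that are direct children of the tag
        for c in tag.contents:
            if isinstance(c, NavigableString):
                if patt.search(str(c)):
                    return True
        # Attribute values; multi-valued attributes such as class are lists
        for v in tag.attrs.values():
            values = v if isinstance(v, list) else [v]
            if any(patt.search(x) for x in values):
                return True
        return False
    return closure

for t in soup.find_all(match(p)):
    print(t)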

Output:

<title>This tag includes 'name', so it should be retrieved</title>
<h1 class="name">This is also a tag to be retrieved</h1>
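
Since the question also asks about moving from each match to its parents and siblings, here is a minimal sketch of that step. It assumes a reasonably recent Beautiful Soup 4 (the string= argument to find_all; older releases call it text=) and the standard-library html.parser backend. The names text_hits, class_hits and summary are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("name")

# Tags that directly contain a text node matching the pattern
text_hits = [s.parent for s in soup.find_all(string=pattern)]
# Tags with a class value matching the pattern
class_hits = list(soup.find_all(class_=pattern))

# Navigate from each match to its parent and its next sibling tag
summary = []
for tag in text_hits + class_hits:
    nxt = tag.find_next_sibling()  # None when no later sibling tag exists
    summary.append((tag.name, tag.parent.name,
                    nxt.name if nxt is not None else None))

print(summary)
```

From any matched tag, attributes such as .parent and methods such as find_next_sibling() and find_previous_sibling() give the surrounding nodes, which is the kind of traversal the question describes.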
