python - Matching a string pattern in Python

Question

I have a string that may contain links:

<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a> ...

How can I extract the text of all the HTML tags ("Hello", "Hello2", "Hello3"), not the links themselves? I'm thinking of a list that contains all of the texts.

score 1 · Accepted Answer

Using lxml:

import lxml.html as LH

content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

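# Parse the HTML fragment into an element tree
doc = LH.fromstring(content)
# text_content() returns the text of each <a>, including text inside nested tags
texts = [elt.text_content() for elt in doc.xpath('//a')]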
print(texts)

yields

['Hello', 'Hello2', 'Hello3', 'go home, dude!']
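
If installing lxml is not an option, roughly the same result can be had from the standard library's html.parser module. The following is only a minimal sketch, not part of the original answer: the names LinkTextParser and extract_link_texts are made up for illustration, and it assumes the <a> tags are not nested inside each other.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text inside each <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.texts = []     # finished link texts, in document order
        self._parts = None  # text pieces of the currently open <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a>; nested tags such as <b>
        # are ignored, but their text still arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.texts.append("".join(self._parts))
            self._parts = None


def extract_link_texts(html):
    """Return a list with the text of every <a> element in html."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

texts = extract_link_texts(content)
print(texts)
```

For the sample string this should print the same list as the lxml version. On badly broken real-world markup, lxml is likely to be more forgiving than this sketch.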
