html - Parsing HTML in Python 2.7 with regular expressions - I don't really understand it

Question

Sorry if this is a bit silly, but I really need some help with Python.

['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

So I have this tuple, and I need to cut out what is inside the href attribute and what is inside the <a> tag. Basically, I want to end up with a tuple like this:

[["needs to be cut out", "Foo to BAR"], ["this also needs to be cut out", "BAR to Foo"]]

The href attribute contains a lot of special symbols, for example:

<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">

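As I see it, an HTML parser is overkill when I don't need to parse an object tree and only need the URLs and the words from a web page. But I really can't work out how to form the regular expression. The one I wrote seems to be completely wrong, so I'm asking whether someone can help me.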

score 1 · Accepted Answer

Use an HTML parser anyway. Python ships with a few, and the xml.etree.ElementTree API is easier to get working than a regular expression, even for simple <a> tags with arbitrary attributes:

from xml.etree import ElementTree as ET

texts = []
for linktext in linkslist:
    link = ET.fromstring(linktext)
    texts.append([link.attrib['href'], link.text])

If you find that some of the links have nested <span>, <b>, <i> or other inline tags that mark up the link text further, you can use ' '.join(link.itertext()) to get the text out of anything nested under the <a> tag:

for linktext in linkslist:
    link = ET.fromstring(linktext)
    texts.append([link.attrib['href'], ' '.join(link.itertext())])

This gives:

>>> from xml.etree import ElementTree as ET
>>> linkslist = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']     
>>> texts = []
>>> for linktext in linkslist:
...     link = ET.fromstring(linktext)
...     texts.append([link.attrib['href'], ' '.join(link.itertext())])
... 
>>> texts
[['needs to be cut out', 'Foo to BAR'], ['this also needs to be cut out', 'BAR to Foo']]
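
To make the nested-tag case concrete, here is a minimal sketch. The two sample links are hypothetical (they are not from the question): one has nested inline markup, the other has an escaped ampersand in its href. It joins with an empty string rather than a space, on the assumption that you want the original spacing preserved; joining with ' ' would insert an extra space at every tag boundary.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample links, not taken from the question.
linkslist = [
    '<a href="first"><b>Foo</b> to <i>BAR</i></a>',
    '<a href="?a=p.stops&amp;l=en">Stops</a>',
]

texts = []
for linktext in linkslist:
    link = ET.fromstring(linktext)
    # link.text only covers the text before the first child element
    # (None for the first sample), so collect every fragment instead.
    texts.append([link.attrib['href'], ''.join(link.itertext())])

print(texts)
```

Note that the parser also decodes &amp; in the attribute value, so the href comes back with a plain &. ElementTree expects well-formed XML, so a fragment with an unclosed <a> tag would raise a ParseError.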

score 1 · Accepted Answer

You can use BeautifulSoup for parsing HTML entities.

According to your question, you already have the following list:

l = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

All you need is the following code:

from BeautifulSoup import BeautifulSoup

parsed_list = []

for each in l:
    soup = BeautifulSoup(each)
    parsed_list.append([soup.find('a')['href'], soup.find('a').contents[0]])

Hope that helps :)

score 0 · Accepted Answer

For that, I would use Easy Html Parser (EHP).

Check out https://github.com/iogf/ehp

from ehp import Html  # import not shown in the original answer

lst = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>', '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">']

# Walk every tag in each fragment and keep the ones that carry an href.
data = [(tag.text(), attr.get('href')) for indi in lst
            for tag, name, attr in Html().feed(indi).walk() if attr.get('href')]


data

Output:

[('Foo to BAR', 'needs to be cut out'), ('BAR to Foo', 'this also needs to be cut out'), ('', u'?a=p.stops&direction_id=23600&interval=1&t=wml&l=en')]
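
If installing a third-party package is not an option, roughly the same result can be sketched with the HTMLParser class from the standard library, which also tolerates the unclosed <a> tag in the last entry. The names LinkCollector and extract_links are made up for this example, and the try/except import is only there so the sketch runs on both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class LinkCollector(HTMLParser):
    """Collect [text, href] entries for every <a> tag that has an href."""

    def __init__(self):
        HTMLParser.__init__(self)  # old-style class on 2.7, so no super()
        self.links = []      # one [text, href] entry per link seen so far
        self._open = False   # True while we are inside an <a> element

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if tag == 'a' and href:
            self.links.append(['', href])
            self._open = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._open = False

    def handle_data(self, data):
        # Append text to the most recent link while its tag is still open.
        if self._open:
            self.links[-1][0] += data


def extract_links(fragments):
    """Return (text, href) tuples for all links in the given HTML fragments."""
    pairs = []
    for fragment in fragments:
        parser = LinkCollector()
        parser.feed(fragment)
        parser.close()
        pairs.extend((text, href) for text, href in parser.links)
    return pairs


lst = ['<a href="needs to be cut out">Foo to BAR</a>',
       '<a href="this also needs to be cut out">BAR to Foo</a>',
       '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">']

data = extract_links(lst)
print(data)
```

The parser unescapes &amp; inside attribute values, so the href of the last entry comes back with plain & characters, and its text is an empty string because the tag has no content.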
