python - Get data between two tags in Python

Question

<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>

Using Python, I want to get the value from the anchor tag, which should be "Granular computing based data mining in the views of rough set and fuzzy set".

I tried using lxml:

from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)

I get the following output:

['\r\n\t\t','\r\n\t\t\t\t\t\t\t\t\tgranular computing based','data','mining','in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']

score 3 · Accepted Answer

You can use the text_content method:

import lxml.html as LH

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)
for elt in root.xpath('//a'):
    print(elt.text_content())

yields

Granular computing based
data
mining
in the views of rough set and fuzzy set

or, to collapse the extra whitespace, you could use

print(' '.join(elt.text_content().split()))

to get

Granular computing based data mining in the views of rough set and fuzzy set
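
As a minimal sketch of why the split/join idiom works: str.split() with no arguments splits on any run of whitespace (spaces, tabs, newlines) and discards empty pieces, so joining the result with a single space collapses everything. The raw string below is an assumption that approximates what text_content() returns for the sample markup; it needs no lxml to run.

```python
# Approximation of the text_content() result for the sample <a> element
# (assumed here for illustration; the real value comes from lxml).
raw = '\nGranular computing based\ndata\nmining\nin the views of rough set and fuzzy set\n'

# split() with no arguments splits on runs of whitespace and drops empty strings.
words = raw.split()

# Joining with a single space yields one clean line.
collapsed = ' '.join(words)
print(collapsed)
```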

Here is another option which you might find useful:

print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))

yields

Granular computing based data  mining in the views of rough set and fuzzy set

(Note that it leaves an extra space between data and mining, however.)
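
The extra space comes from the whitespace-only text node between the two span elements, which strips down to an empty string and still takes part in the join. One possible variation, not part of the original answer, is to drop those empty pieces before joining. The names texts, pieces and result are illustrative only.

```python
import lxml.html as LH

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)
texts = root.xpath('//a/descendant-or-self::text()')

# Strip each text node and skip those that were whitespace only,
# such as the newline between the two <span> elements.
pieces = [t.strip() for t in texts if t.strip()]
result = ' '.join(pieces)
print(result)
```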

'//a/descendant-or-self::text()' is a more generalized version of "//a/child::text() | //a/span/child::text()". It iterates through all children, grandchildren, and so on.
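
To illustrate the difference, here is a small sketch using a hypothetical variant of the markup in which "data" is wrapped one level deeper, inside a b element. The markup, the clean helper, and the variable names are assumptions made for this example. The explicit child-based expression should miss the nested text, while descendant-or-self still reaches it.

```python
import lxml.html as LH

# Hypothetical markup: "data" now sits inside <b>, a grandchild of <a>.
html = '''<h3><a href="#">Granular computing based
<span class="snippet"><b>data</b></span>
<span class="snippet">mining</span>
</a></h3>'''

root = LH.fromstring(html)

def clean(nodes):
    """Strip text nodes and drop the whitespace-only ones."""
    return [t.strip() for t in nodes if t.strip()]

# Only looks at text directly under <a> and directly under <a>/<span>.
explicit = clean(root.xpath('//a/child::text() | //a/span/child::text()'))

# Walks every text node below <a>, at any depth.
general = clean(root.xpath('//a/descendant-or-self::text()'))

print(explicit)
print(general)
```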
