python - Parse href from a given html structure using Beautiful Soup

Question

I have the following html structure:

<li class="g">
 <div class="vsc">    
  <div class="alpha"></div>
  <div class="beta"></div>
  <h3 class="r">
   <a href="http://www.stackoverflow.com"></a>
  </h3>
 </div>
</li>

The html structure above is repeated. What is the easiest way to parse all the links (stackoverflow.com) out of it using BeautifulSoup and Python?

score 2 · Accepted Answer

BeautifulSoup 4 offers a convenient way to do this using CSS selectors:

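from bs4 import BeautifulSoup
# Name the parser explicitly; html.parser ships with Python.
soup = BeautifulSoup(html, "html.parser")
# Collect the href of every anchor found inside an <h3 class="r">.
print([a["href"] for a in soup.select('h3.r a')])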

This also has the advantage of constraining the selection by context: it selects only those anchor nodes that are descendants of an h3 node with class r.

Omitting the constraint, or choosing the one that best fits your needs, is just a matter of tweaking the selector. See the CSS selector docs for that.
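
To make the "tweaking" concrete, here is a minimal sketch comparing a loose, a scoped, and a strict selector. The sample markup, the second list item with its `http://www.example.com` URL, and the `hrefs` helper are all made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the second <li> and its URL
# are hypothetical, added only to show what each selector filters out.
SAMPLE_HTML = """
<ul>
 <li class="g">
  <div class="vsc">
   <h3 class="r"><a href="http://www.stackoverflow.com"></a></h3>
  </div>
 </li>
 <li class="other">
  <h3 class="r"><span><a href="http://www.example.com"></a></span></h3>
 </li>
</ul>
"""


def hrefs(soup, selector):
    """Return the href of every anchor matched by a CSS selector."""
    return [a["href"] for a in soup.select(selector)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

loose = hrefs(soup, "a[href]")         # any anchor that has an href
scoped = hrefs(soup, "h3.r a")         # anchors anywhere inside an h3.r
strict = hrefs(soup, "li.g h3.r > a")  # direct children of h3.r, within li.g

print(loose)
print(scoped)
print(strict)
```

With this sample, only the strict selector drops the second link, because that anchor is wrapped in a span and sits outside any `li.g`.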

score 1 · Accepted Answer

Using CSS selectors as proposed by Petri is probably the best way to do it with BS. However, I would recommend using lxml.html and xpath, which are pretty much perfect for this job.

Test html:

html="""
<html>
<li class="g">
<div class="vsc"></div>    
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.correct.com"></a>
</h3>
</li>
<li class="g">
<div class="vsc"></div>    
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.correct.com"></a>
</h3>
</li>
<li class="g">
<div class="vsc"></div>    
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.incorrect.com"></a>
</h3>
</li>
</html>"""

And it's basically a one-liner:

    import lxml.html as lh
    doc = lh.fromstring(html)
    doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')

    Out[264]:
    ['http://www.correct.com', 'http://www.correct.com']
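
If the predicate chain gets hard to read, it can be assembled from a list of required classes. This is only a sketch: `hrefs_for_items` is a hypothetical helper, the sample is a shortened version of the test html, and it keeps the exact `@class` comparison used above, so an element with several classes would not match. It also follows the flattened test html, where the divs and the h3 are direct children of the li. For the nested structure in the question, the path would have to descend through `div.vsc`.

```python
import lxml.html as lh

# Shortened version of the test html: one matching item and one where
# "alpha" is replaced by "gamma".
SAMPLE_HTML = """
<html><body><ul>
<li class="g">
 <div class="vsc"></div><div class="alpha"></div><div class="beta"></div>
 <h3 class="r"><a href="http://www.correct.com"></a></h3>
</li>
<li class="g">
 <div class="vsc"></div><div class="gamma"></div><div class="beta"></div>
 <h3 class="r"><a href="http://www.incorrect.com"></a></h3>
</li>
</ul></body></html>
"""


def hrefs_for_items(html, required_div_classes):
    """Return hrefs from li.g items that have a child div for every class."""
    doc = lh.fromstring(html)
    # One predicate per required class; each is true if any child div
    # has exactly that class attribute.
    predicates = "".join(
        '[div/@class = "%s"]' % name for name in required_div_classes
    )
    query = './/li[@class="g"]%s[h3/@class = "r"]/h3/a/@href' % predicates
    return doc.xpath(query)


links = hrefs_for_items(SAMPLE_HTML, ["vsc", "alpha", "beta"])
print(links)
```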
