python - How do I extract the text between anchor tags?

Question

I need to extract artist names from an HTML page. Here is a snippet of the page:

 </td>
 <td class="playbuttonCell">
   <a class="playbutton preview-track" href="/music/example" data-analytics-redirect="false"  >
      <img class="transparent_png play_icon" width="13" height="13" alt="Play" src="http://cdn.last.fm/flatness/preview/play_indicator.png" style="" />
    </a>
  </td>
  <td class="subjectCell" title="example, played 3 times">
    <div>
      <a href="/music/example-artist"   >Example artist name</a>

I tried this, but it doesn't work.

import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen('http://www.last.fm/user/Jehl/charts?rangetype=overall&subtype=artists').read()
soup = BeautifulSoup(html)
print soup('a')

for link in soup('a'):
    print html

Where am I going wrong?

score 3 · Accepted Answer

You can try this:

In [1]: from bs4 import BeautifulSoup

In [2]: s = # Your string here...

In [3]: soup = BeautifulSoup(s, 'html.parser')

In [4]: for anchor in soup.find_all('a'):
   ...:     print(anchor.text)
   ...:
   ...:

here lies the text i need

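Here, the find_all method returns a list containing every matching anchor tag. You can then print each tag's text property to get the value between the opening and closing tags.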
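As a self-contained illustration of the same idea, here is a minimal sketch that runs against an inline string instead of the live page. The markup is a trimmed copy of the snippet from the question (the wrapping table and tr tags and the shortened img src are my own assumptions), and filtering out empty strings is an extra step I added because anchors that wrap only an image have no text.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question's snippet (not fetched from the live page).
html = """
<table><tr>
  <td class="playbuttonCell">
    <a class="playbutton preview-track" href="/music/example"><img alt="Play" src="play.png" /></a>
  </td>
  <td class="subjectCell" title="example, played 3 times">
    <div><a href="/music/example-artist">Example artist name</a></div>
  </td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims surrounding whitespace; an anchor that only
# wraps an image yields an empty string.
texts = [a.get_text(strip=True) for a in soup.find_all("a")]

# Keep only the anchors that actually carry text.
names = [t for t in texts if t]
print(names)
```

Note that find_all('a') also returns the play-button link, so some filtering is likely needed on the real page.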

score 2 · Accepted Answer

for link in soup.select('td.subjectCell a'):
    print(link.text)

This selects (just as in CSS) the a elements inside td elements that have the subjectCell class.
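To show how the selector narrows the result, here is a small sketch on an inline string. The two-row table and the second artist ("Another artist") are made-up sample data, not taken from the real page; only the class names come from the question's snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample; class names follow the question's snippet.
html = """
<table>
  <tr>
    <td class="playbuttonCell"><a href="/music/example"><img alt="Play" src="play.png" /></a></td>
    <td class="subjectCell"><div><a href="/music/example-artist">Example artist name</a></div></td>
  </tr>
  <tr>
    <td class="playbuttonCell"><a href="/music/another"><img alt="Play" src="play.png" /></a></td>
    <td class="subjectCell"><div><a href="/music/another-artist">Another artist</a></div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The descendant selector skips the play-button anchors because they sit
# in td.playbuttonCell, not td.subjectCell.
artist_links = soup.select("td.subjectCell a")

# Collect (name, href) pairs for each matching anchor.
artists = [(a.get_text(strip=True), a.get("href")) for a in artist_links]
print(artists)
```

Compared with find_all('a'), this avoids having to filter out unrelated links afterwards.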

score -2 · Accepted Answer

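Regular expressions are your friend here, as an alternative to RocketDonkey's answer, which uses BeautifulSoup properly. You could parse soup('a') with a regular expression like the following: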

>([a-zA-Z]*|[0-9]|(\w\s*)*)</a>

Using the re.findall method, you can get the text between the anchor tags directly.
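For reference, a rough sketch of the re.findall approach follows. It does not use the pattern above: that pattern contains two capture groups, so re.findall would return tuples rather than plain strings. The simpler single-group pattern here is my own assumption, and it only matches anchors whose content is plain text with no nested tags. Regular expressions are fragile against real-world HTML, so treat this as an illustration rather than a recommendation.

```python
import re

# One line taken from the question's snippet, wrapped in its parent tags.
html = (
    '<td class="subjectCell"><div>'
    '<a href="/music/example-artist"   >Example artist name</a>'
    '</div></td>'
)

# Hypothetical pattern (not the one from the answer): a single capture group
# for the text between <a ...> and </a>, with no nested tags allowed.
ANCHOR_TEXT = re.compile(r"<a\b[^>]*>([^<]*)</a>", re.IGNORECASE)

# With exactly one group, findall returns a list of the captured strings.
names = [text.strip() for text in ANCHOR_TEXT.findall(html)]
print(names)
```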
