python - Strange problem when trying to parse HTML with BeautifulSoup

Question

I'm trying to write some Python code to collect music chart data from official websites, but I run into a problem when collecting Billboard's data. I chose BeautifulSoup to handle the HTML.

My environment: python-2.7, beautifulsoup-3.2.0

First I parse the HTML:

>>> import BeautifulSoup, urllib2, re
>>> html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
>>> soup = BeautifulSoup.BeautifulSoup(html)

Then I try to collect the data I need, such as the artist name.

HTML:

<div class="listing chart_listing">

<article id="node-1491420" class="song_review no_category chart_albumTrack_detail no_divider">
  <header>
    <span class="chart_position position-down">11</span>
            <h1>Ho Hey</h1>
        <p class="chart_info">
      <a href="/artist/418560/lumineers">The Lumineers</a>            <br>
      The Lumineers          </p>

The artist name is The Lumineers.

>>> print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')})\
... .find("p", {"class":"chart_info"}).a.string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'find'

NoneType! It seems I can't grep the data I need. Maybe my rule is wrong, so I try to grep some basic tags instead.

>>> print str(soup.find("div"))
None
>>> print str(soup.find("a"))
None
>>> print str(soup.find("title"))
<title>The Hot 100 : Page 2  | Billboard</title>
>>> print str(soup)
......entire HTML.....

I'm confused. Why can't I grep basic tags such as div? They are definitely there. What's wrong with my code? When I use the same approach to analyze other charts, there is no problem.

score 1 · Accepted Answer

This looks like a BeautifulSoup 3 issue. If you prettify() the output:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup3(html)
print soup.prettify()

you see the following at the end of the output:

        <script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
  </script>
 </head>
</html>

Since there are two html end tags, it looks like BeautifulSoup 3 is being confused by the JavaScript in this data.
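A quick way to spot this symptom without reading the whole dump is to count the html end tags in the text returned by prettify(). The following is a minimal sketch in Python 3 using only the standard library. The helper name count_closing_html_tags is made up for illustration and is not part of BeautifulSoup, and the sample string simply repeats the tail of the output shown above.

```python
import re


def count_closing_html_tags(text):
    """Return how many literal </html> end tags occur in the given text.

    Hypothetical helper for illustration; pass it the string that
    prettify() returned.
    """
    return len(re.findall(r"</html\s*>", text, flags=re.IGNORECASE))


# Tail of the prettified output quoted in this answer.
prettified_tail = """\
        <script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
  </script>
 </head>
</html>
"""

print(count_closing_html_tags(prettified_tail))  # prints 2
```

A count above 1 suggests the parser closed the document a second time on its own, which matches what the output above shows.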

If you use:

from bs4 import BeautifulSoup as soup4
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup4(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

you get 'The Lumineers' as output.

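If you can't switch to bs4, I suggest writing the html variable out to a file out.txt, then changing the script to read in.txt, copying the output to the input, and cutting away chunks until the parse works.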

from BeautifulSoup import BeautifulSoup as soup3
import re

html = open('in.txt').read()
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
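The script above only shows the reading half of that workflow. The writing half could be sketched as follows, in Python 3 with only the standard library. The helper names dump_html and load_html and the sample markup are made up for illustration, and a temporary directory is used here so the example does not leave files behind.

```python
import os
import tempfile


def dump_html(html, path):
    """Write the downloaded HTML text to a file so it can be edited by hand."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


def load_html(path):
    """Read the (possibly hand-trimmed) HTML text back in."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Made-up sample standing in for the downloaded page.
sample = "<html><body><div>chart</div></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.txt")
    dump_html(sample, out_path)
    round_trip = load_html(out_path)

print(round_trip)
```

In practice you would write to out.txt in the working directory, copy it to in.txt, and delete chunks from in.txt until the lookup succeeds.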

My first guess was to remove <head> ... </head>.

After that, you can solve it programmatically:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

htmlorg = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
# Locate the opening <head tag and the last closing </head> tag
head_start = htmlorg.index('<head')
head_end = htmlorg.rindex('</head>')
# Advance head_end to the '>' that terminates </head>
head_end = htmlorg.index('>', head_end)
# Keep everything before <head and everything after </head>
html = htmlorg[:head_start] + htmlorg[head_end+1:]
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
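The same head-stripping idea can be wrapped in a small function that does not raise when the page has no head section. This is a minimal sketch in Python 3 using only the standard library. The name strip_head and the sample markup are made up for illustration, and it uses regular expressions rather than index/rindex so that a tag such as <header> is not mistaken for <head>.

```python
import re


def strip_head(html):
    """Return html with the span from the first <head ...> tag through the
    last </head> tag removed; return it unchanged if no such span exists.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    start = re.search(r"<head[\s>]", html, flags=re.IGNORECASE)
    ends = list(re.finditer(r"</head\s*>", html, flags=re.IGNORECASE))
    if start is None or not ends or ends[-1].start() < start.start():
        return html  # nothing to strip
    return html[:start.start()] + html[ends[-1].end():]


# Made-up sample: a script inside <head> that embeds closing tags in a string.
sample = (
    "<html><head><script>var s = '</body></html>';</script></head>"
    '<body><div class="chart_listing">ok</div></body></html>'
)

print(strip_head(sample))
```

The returned string can then be handed to the parser in place of the original download.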
