python - How do I use BeautifulSoup when an HTML element has no class name?

Question

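I'm using the following code (a slightly modified version of an early example from Nathan Yau's "Visualize This") to scrape weather data from the WUnderGround site. As you can see, Python grabs the numeric data from elements with the class name "wx-data".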

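However, I would also like to grab the average humidity from DailyHistory.html. The problem is that not every "span" element has a class name, and the average humidity cell is one of those without. How can I select this particular cell using BeautifulSoup and the code below?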

(Here is an example of a page being scraped. Switch to developer mode and search for "wx-data" to see the "span" elements being referenced:

http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html )

import urllib2
from BeautifulSoup import BeautifulSoup

year = 2004    


#create comma-delim file

f = open(str(year) + '_LAXwunder_data.txt','w')

#iterate through month and day
for m in range(1,13):
    for d in range (1,32):

        #Chk if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4,6,9,11]) and d > 30:
            break

        # open wug url
        timestamp = str(year)+'0'+str(m)+'0'+str(d)
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/'+str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)

        #Get temp from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs = {'class':'wx-data'})[5].span.string

        #Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)
        #Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)

        #Build timestamp
        timestamp = str(year)+ mStamp + dStamp

        #Write timestamp and temp to file
        f.write(timestamp + ',' + dayTemp +'\n')

#done - close
f.close()

score 2 · Accepted Answer

You can search for the cell that contains the label text, then move on to the next cell:

humidity = soup.find(text='Average Humidity')
next_cell = humidity.find_parent('td').find_next_sibling('td')
humidity_value = next_cell.string

I'm using BeautifulSoup version 4 here, not 3. You really want to upgrade, since development of version 3 stopped two years ago.

BeautifulSoup 3 can do this particular trick too; use findParent() and findNextSibling() there instead.

Demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get('http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html')
>>> soup = BeautifulSoup(response.content)
>>> humidity = soup.find(text='Average Humidity')
>>> next_cell = humidity.find_parent('td').find_next_sibling('td')
>>> next_cell.string
u'88'
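
For trying the same lookup without hitting the live site, here is a minimal sketch that runs against an inline HTML fragment. The fragment, the helper name `value_after_label`, and the `None` fallback are assumptions made for illustration; the real page markup may differ. It is written for Python 3 and uses the `string=` argument, which newer BeautifulSoup 4 releases accept in place of `text=`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment loosely modelled on the table described in the
# question; the actual DailyHistory.html markup may look different.
SAMPLE_HTML = """
<table>
  <tr>
    <td><span>Mean Temperature</span></td>
    <td><span class="wx-data"><span class="wx-value">57</span></span></td>
  </tr>
  <tr>
    <td><span>Average Humidity</span></td>
    <td>88</td>
  </tr>
</table>
"""


def value_after_label(soup, label):
    """Return the text of the <td> that follows the <td> containing `label`.

    Returns None if the label, its cell, or the following cell is missing,
    so one incomplete page does not abort a long scraping loop.
    """
    text_node = soup.find(string=label)  # exact match on the text node
    if text_node is None:
        return None
    label_cell = text_node.find_parent('td')
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling('td')
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(value_after_label(soup, 'Average Humidity'))
```

Returning `None` instead of raising is only one possible design; inside the scraping loop it lets you write an empty field for days where the label is absent.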

score 0 · Accepted Answer

Thanks to @Martijn_Pieters for helping me put together this final script.

import requests
import urllib2
from bs4 import BeautifulSoup

year = 2003

#create comma-delim file
f = open(str(year) + '_LAXwunder_data.txt','w')
#change the year here, ->run


#iterate through month and day
for m in range(1,13):
    for d in range(1,32): #could sample every other day using range(1,32,2)

        #Chk if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4,6,9,11]) and d > 30:
            break

        # open wug url
        timestamp = str(year)+'.'+str(m)+'.'+str(d)
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/'+str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)
        #Get temp from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs = {'class':'wx-data'})[5].span.string
        #Get average humidity from the cell that follows its label
        humidity = soup.find(text='Average Humidity')
        next_cell = humidity.find_parent('td').find_next_sibling('td')
        avg_humidity = next_cell.string

        #Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)
        #Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)

        #Build timestamp
        timestamp = str(year)+ mStamp + dStamp

        #Write timestamp, temp and humidity to file
        f.write(timestamp + ',' + dayTemp + ',' + avg_humidity + '\n')
        print dayTemp, avg_humidity

#done - close
f.close()
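
As a side note, the month-length checks and manual zero-padding in the script could be handed to the standard `datetime` module. The sketch below is one possible way to do it in Python 3, not part of the original script; the helper names `iter_days` and `build_url` are made up for illustration, and the URL pattern is copied from the script above. It also covers 29 February in leap years such as 2004, which the `d > 28` check skips.

```python
from datetime import date, timedelta


def iter_days(year):
    """Yield every calendar date in `year`, including Feb 29 in leap years."""
    current = date(year, 1, 1)
    while current.year == year:
        yield current
        current += timedelta(days=1)


def build_url(day, airport='LAX'):
    """Build the history URL using the pattern from the script above.

    Month and day are left unpadded, matching the original URL format.
    """
    return ('http://www.wunderground.com/history/airport/'
            f'{airport}/{day.year}/{day.month}/{day.day}/DailyHistory.html')


# Show the zero-padded timestamp and URL for the first three days only.
for day in list(iter_days(2004))[:3]:
    timestamp = day.strftime('%Y%m%d')  # same YYYYMMDD layout as the file output
    print(timestamp, build_url(day))
```

In the scraping loop, `for day in iter_days(year):` would replace the two nested `range` loops, and `day.strftime('%Y%m%d')` would replace the `mStamp`/`dStamp` blocks.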
