python - How do I get the next-page links with BeautifulSoup in Python?

Question

I have this link:

http://www.brothersoft.com/windows/categories.html

I am trying to get the links to the items inside the div. For example:

http://www.brothersoft.com/windows/mp3_audio/midi_tools/

I tried this code:

import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/categories.html'

pageHtml = urllib.urlopen(url).read()

soup = BeautifulSoup(pageHtml)

sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brLeft'})]

for i in sAll:
    print "http://www.brothersoft.com"+i['href']

But the only output I get is:

http://www.brothersoft.com/windows/mp3_audio/

How can I get the output I need?

score 2 · Accepted Answer

The URL http://www.brothersoft.com/windows/mp3_audio/midi_tools/ is not contained in a <div class='brLeft'> tag, so the output http://www.brothersoft.com/windows/mp3_audio/ is the correct result for that code.

If you want to get the URL you need, change

sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brLeft'})]

to

sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brRight'})]
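
Why the class name matters can be seen on a small, self-contained example. The markup below is hypothetical (the real page may be structured differently), and the helper names first_links and all_links are made up for illustration. The sketch uses Python 3 and an explicit "html.parser" argument, whereas the code in this answer is Python 2. Note that div.find('a') returns only the first link inside each div; find_all is one way to collect every link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the category page (an assumption).
html = """
<div class="brLeft"><a href="/windows/mp3_audio/">MP3 and Audio</a></div>
<div class="brRight">
  <a href="/windows/mp3_audio/midi_tools/">MIDI Tools</a>
  <a href="/windows/mp3_audio/players/">Players</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def first_links(soup, css_class):
    """Return the href of the first <a> inside each <div> with the given class."""
    links = []
    for div in soup.find_all("div", class_=css_class):
        a = div.find("a")
        # div.find returns None when the div contains no link at all.
        if a is not None and a.has_attr("href"):
            links.append(a["href"])
    return links


def all_links(soup, css_class):
    """Return the href of every <a> inside each <div> with the given class."""
    return [
        a["href"]
        for div in soup.find_all("div", class_=css_class)
        for a in div.find_all("a", href=True)
    ]


left = first_links(soup, "brLeft")       # category links only
right = first_links(soup, "brRight")     # first sub-category link per div
right_all = all_links(soup, "brRight")   # every sub-category link

print(left)
print(right)
print(right_all)
```

With this markup, selecting brLeft can never yield the midi_tools link, because that link lives in the brRight div.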

Update:

An example of getting the information inside "midi_tools":

# Python 2 code (urllib.urlopen and the print statement)
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/categories.html'
pageHtml = urllib.urlopen(url).read()
soup = BeautifulSoup(pageHtml)
# First link inside each <div class="brRight"> (the sub-category links)
sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brRight'})]
for i in sAll:
    if i is None:    # skip divs that contain no link
        continue
    suburl = "http://www.brothersoft.com"+i['href']    # a sub-category URL such as .../midi_tools/

    content = urllib.urlopen(suburl).read()
    anosoup = BeautifulSoup(content)
    ablock = anosoup.find('table',{'id':'courseTab'})
    if ablock is None:    # skip pages without the listing table
        continue
    for atr in ablock.findAll('tr',{'class':'border_bot '}):
        print atr.find('dt').a.string      # name
        print "http://www.brothersoft.com" + atr.find('a',{'class':'tabDownload'})['href']   # download link
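
The code above builds absolute URLs by concatenating "http://www.brothersoft.com" with each href, which works as long as every href is root-relative. A slightly more robust alternative, sketched here in Python 3, is urljoin from the standard library (in Python 2 the same function is in the urlparse module). The helper name absolute and the sample hrefs are illustrative assumptions, not taken from the site.

```python
from urllib.parse import urljoin

# The page the links were scraped from; relative hrefs are resolved against it.
BASE_URL = "http://www.brothersoft.com/windows/categories.html"


def absolute(href, base=BASE_URL):
    """Resolve an href found on the page at `base` into an absolute URL."""
    return urljoin(base, href)


# Root-relative href, as in the answer's code.
root_relative = absolute("/windows/mp3_audio/midi_tools/")
# Path-relative href: resolved against the directory of the base page.
path_relative = absolute("mp3_audio/")
# Already absolute href: returned unchanged.
already_absolute = absolute("http://example.com/x")

print(root_relative)
print(path_relative)
print(already_absolute)
```

Plain concatenation would produce a malformed URL for the second and third cases, so urljoin is the safer choice when the href format is not known in advance.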
