python - Scraping text using URLs from a list (BeautifulSoup4)

Question

I have a text scraper working for a single URL. The problem is that I need to scrape 25 more URLs. These URLs are almost identical; the only difference is the last letter. Here is the code, to make it clearer:

import urllib2
from bs4 import BeautifulSoup

f = open('(path to file)/names', 'a')
links = ['http://guardsmanbob.com/media/playlist.php?char='+ chr(i) for i in range(97,123)]

response = urllib2.urlopen(links[0]).read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr'):
    if not tr.find('td'): continue
    for td in tr.find('td').findAll('a'):
        f.write(td.contents[0] + '\n')

I can't get this script to run through all the URLs in the list in one go. The most I managed to get was the first song name from each URL. Sorry for my English; I hope you understand me.

score 1 · Accepted Answer

I can't make this script to run all urls from list in one time.

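Put your code in a function with a single parameter, *args (or whatever name you like, as long as you keep the *). In the definition, the * collects all positional arguments into a tuple; at the call site, a * in front of the list unpacks it into separate arguments. It has no official name, but some people (myself included) like to call it the splat operator.

import urllib2

def start_download(*args):
    for value in args:
        ## for debugging purposes
        ## print value

        response = urllib2.urlopen(value).read()
        ## put the rest of your code here

if __name__ == '__main__':
    links = ['http://guardsmanbob.com/media/playlist.php?char=' +
             chr(i) for i in range(97, 123)]

    ## the * unpacks the list so that each URL becomes its own argument
    start_download(*links)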
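
To see what the star does on each side of a call, here is a minimal sketch. The function `collect` is hypothetical (it is not part of the answer) and simply returns whatever it received; the URLs are built the same way as in the question.

```python
def collect(*args):
    # In a definition, *args packs all positional arguments into a tuple.
    return args


links = ['http://guardsmanbob.com/media/playlist.php?char=' + chr(i)
         for i in range(97, 123)]

# Without the star, the whole list arrives as one single argument.
packed = collect(links)

# With the star, the list is unpacked into 26 separate arguments.
unpacked = collect(*links)

print(len(packed))    # 1
print(len(unpacked))  # 26
```

So a function that loops over `args` only sees the individual URLs when the list is passed with a leading star.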

Edit: Alternatively, you can loop over the list of links directly and download each one.

links = ['http://guardsmanbob.com/media/playlist.php?char=' +
         chr(i) for i in range(97, 123)]
for link in links:
    response = urllib2.urlopen(link).read()
    ## put the rest of your code here
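
The same loop can be wrapped in a small function so that the download step is easy to swap out or test. This is a sketch under assumptions, not the answer's own code: it targets Python 3 (where `urlopen` lives in `urllib.request` rather than `urllib2`), and the names `build_links`, `fetch_url` and `download_all` are hypothetical.

```python
import string
from urllib.request import urlopen

BASE_URL = 'http://guardsmanbob.com/media/playlist.php?char='


def build_links(base=BASE_URL):
    # One URL per lowercase letter, equivalent to chr(i) for i in range(97, 123).
    return [base + letter for letter in string.ascii_lowercase]


def fetch_url(url):
    # Download the raw bytes of a single page.
    with urlopen(url) as response:
        return response.read()


def download_all(links, fetch=fetch_url):
    # Call fetch once per link and return a dict mapping each URL to its content.
    return {link: fetch(link) for link in links}
```

Calling `download_all(build_links())` would perform the 26 real requests. Passing a stand-in function as `fetch` lets you check the loop without touching the network.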

Edit 2:

Here is the whole code, with comments, for getting the text of every link and saving it to a file.

import urllib2
from bs4 import BeautifulSoup, SoupStrainer

links = ['http://guardsmanbob.com/media/playlist.php?char=' +
         chr(i) for i in range(97, 123)]

## unnecessary link texts to be removed
not_included = ['News', 'FAQ', 'Stream', 'Chat', 'Media',
                'League of Legends', 'Forum', 'Latest', 'Wallpapers',
                'Links', 'Playlist', 'Sessions', 'BobRadio', 'All',
                'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
                'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
                'U', 'V', 'W', 'X', 'Y', 'Z', 'Misc', 'Play',
                'Learn more about me', 'Chat info', 'Boblights',
                'Music Playlist', 'Official Facebook',
                'Latest Music Played', 'Muppets - Closing Theme',
                'Billy Joel - The River Of Dreams',
                'Manic Street Preachers - If You Tolerate This '
                'Your Children Will Be Next',
                'The Bravery - An Honest Mistake',
                'The Black Keys - Strange Times',
                'View whole playlist', 'View latest sessions',
                'Referral Link', 'Donate to BoB',
                'Guardsman Bob', 'Website template',
                'Arcsin']

for link in links:
    response = urllib2.urlopen(link).read()
    ## parse only the <a> tags
    soup = BeautifulSoup(response, 'html.parser',
                         parse_only=SoupStrainer('a'))

    ## create a file named "test.txt"
    ## write to file and close afterwards
    ## (opened with 'w', so each link overwrites the previous output)
    with open("test.txt", 'w') as output:
        for hyperlink in soup.find_all('a'):
            if hyperlink.text and hyperlink.text not in not_included:
                ##print hyperlink.text
                output.write("%s\n" % hyperlink.text.encode('utf-8'))

The output saved in test.txt looks like this:

[image: screenshot of the saved test.txt]

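I suggest changing test.txt to a different file name (for example, one for the song titles under S) on each pass through the list of links, because every iteration overwrites the previous file.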
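
One way to avoid the overwrite is to derive the file name from the letter at the end of each URL. The sketch below assumes Python 3; the helper names `output_name` and `save_titles` and the `titles_<letter>.txt` pattern are made up for illustration, and the titles would come from the scraping loop.

```python
import os


def output_name(link, directory='.'):
    # The last character of the URL is the letter being scraped,
    # e.g. '...playlist.php?char=s' -> 'titles_s.txt'.
    return os.path.join(directory, 'titles_%s.txt' % link[-1])


def save_titles(link, titles, directory='.'):
    # Write one title per line to a file that is unique to this link.
    path = output_name(link, directory)
    with open(path, 'w', encoding='utf-8') as output:
        for title in titles:
            output.write(title + '\n')
    return path
```

Another option is to keep everything in one file by opening it once before the loop, or by opening it in append mode ('a') as the script in the question does.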
