python - Urllib2 and BeautifulSoup: a nice couple but too slow - urllib3 and threads?

Question

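While looking for ways to optimize my code, I heard good things about threads and urllib3. Apparently, people disagree about which solution is best.

The problem with the script below is its execution time. It is very slow!

Step 1: Fetch this page http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on

Step 2: Parse the page with BeautifulSoup

Step 3: Put the data into an Excel document

Step 4: Repeat this again and again for every country in my list (a big list), changing only "Afghanistan" in the URL to another country.

Here is my code: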

# Assumes wb is an already created workbook (e.g. xlwt.Workbook()), soup is the
# BeautifulSoup class, and name_excel holds the output file name; none of these
# are shown in the original snippet.
ws = wb.add_sheet("BULATS_IA") # We add a new tab in the Excel doc
x = 0 # We need x and y for writing the data into the Excel doc
y = 0
Countries_List = ['Afghanistan','Albania','Andorra','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
Longueur = len(Countries_List)

for Countries in Countries_List:
    y = 0

    # Open the page with the name of the corresponding country in the URL
    htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read()
    s = soup(htmlSource)
    tableGood = s.findAll('table')
    try:
        rows = tableGood[3].findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            y = 0
            x = x + 1
            for td in cols:
                hum = td.text
                ws.write(x,y,hum)
                y = y + 1
                wb.save("%s.xls" % name_excel)

    except (IndexError):
        pass

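So I know everything isn't perfect, but I'm looking forward to learning new things in Python! The script is slow because urllib2 is not that fast, and neither is BeautifulSoup. For the soup, I don't think I can really improve it, but for urllib2 I'm not so sure.

EDIT 1: Is multiprocessing useless with urllib2? It seems interesting in my case. What do you think of this potential solution?!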

# Make sure that the queue is thread-safe!!
# (self.queue is assumed here to be a Queue.Queue, which is thread-safe.)
import urllib2

def producer(self):
    # Only need one producer, although you could have multiple
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            self.queue.put(line.strip())

def consumer(self):
    # Fire up N of these babies for some speed
    while True:
        url = self.queue.get()
        dh = urllib2.urlopen(url)
        with open('/dev/null', 'w') as fh: # gotta put it somewhere
            fh.write(dh.read())

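EDIT 2: URLLIB3. Can anyone tell me more about it?

Re-use the same socket connection for multiple requests (HTTPConnectionPool and HTTPSConnectionPool) (with optional client-side certificate verification). https://github.com/shazow/urllib3

Since I am requesting the same website 122 times for different pages, reusing the same socket connection seems interesting, or am I wrong? Couldn't it be faster?...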

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.bulats.org')
for Pages in Pages_List:
    r = http.request('GET', 'http://www.bulats.org/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=%s' % (Pages))
    s = soup(r.data)

score 9 · Accepted Answer

Consider using something like workerpool. Going by the Mass Downloader example, combined with urllib3 it would look something like this:

import workerpool
import urllib3

URL_LIST = [] # Fill this from somewhere

NUM_SOCKETS = 3
NUM_WORKERS = 5

# We want a few more workers than sockets so that they have extra
# time to parse things and such.

http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
workers = workerpool.WorkerPool(size=NUM_WORKERS)

class MyJob(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        r = http.request('GET', self.url)
        # ... do parsing stuff here


for url in URL_LIST:
    workers.put(MyJob(url))

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
# (If you don't do this, the script might hang due to a rogue undead thread.)
workers.shutdown()
workers.wait()

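You may notice from the Mass Downloader examples that there are multiple ways of doing this. I chose this particular example only because it involves less magic, but any of the other strategies are valid too.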
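
As one illustration of another strategy, here is a minimal sketch of the same fan-out pattern built on the standard library's `concurrent.futures.ThreadPoolExecutor` (Python 3) instead of workerpool. This is a swapped-in technique, not one of the Mass Downloader examples. The names `fetch_all`, `fetch` and `parse` are hypothetical; the fetch function is passed in so that any HTTP client can be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, parse, num_workers=5):
    """Fetch and parse every URL on a thread pool; results keep input order.

    fetch(url) -> body and parse(body) -> result are supplied by the caller
    (hypothetical helpers, not part of urllib3 or workerpool).
    """
    def job(url):
        return parse(fetch(url))

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() runs the jobs concurrently and yields results in input order.
        return list(pool.map(job, urls))


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without network access.
    fake_pages = {"a": "<td>1</td>", "b": "<td>2</td>"}
    print(fetch_all(["a", "b"], fake_pages.get, len))
```

With a shared `http = urllib3.PoolManager(maxsize=NUM_SOCKETS)` as above, the fetch argument could be `lambda url: http.request('GET', url).data`.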

Disclaimer: I am the author of both urllib3 and workerpool.

score 2 · Accepted Answer

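I don't think urllib or BeautifulSoup is slow. I ran a modified version of your code (with the Excel part removed) on my local machine. It took around 100 ms per country to open the connection, download the content, parse it, and print it to the console.

Of that, 10 ms is the total time BeautifulSoup spent parsing the content and printing it to the console, per country. That is fast enough.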
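
A breakdown like this can be measured with a small timing helper. The following is only a sketch (Python 3) under stated assumptions: `timed` and `profile_page` are hypothetical names, and the download and parse steps are passed in as functions, so the stand-ins below can be replaced by the real `urlopen(...).read()` call and the BeautifulSoup parse.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def profile_page(url, fetch, parse):
    """Return how long the download and the parse of one page took."""
    body, fetch_seconds = timed(fetch, url)
    _, parse_seconds = timed(parse, body)
    return {"fetch": fetch_seconds, "parse": parse_seconds}


if __name__ == "__main__":
    # Stand-ins so the sketch runs offline: sleeping imitates network latency.
    def fake_fetch(url):
        time.sleep(0.05)
        return "<table><tr><td>x</td></tr></table>"

    print(profile_page("http://example.invalid/", fake_fetch, str.upper))
```

If the fetch time dominates, the time is going to the network and the server rather than to the parser.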

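I don't think using Scrapy or threading is going to solve the problem, because the problem is the expectation that it is going to be fast.

Welcome to the world of HTTP. It will be slow sometimes, and sometimes very fast. A few reasons for a slow connection:

the server handling your request (it may sometimes return 404),
DNS resolution,
the HTTP handshake,
your ISP's connection stability,
your bandwidth,
the packet loss rate,

etc.

Don't forget that you are making 121 consecutive HTTP requests to a server, and you don't know what kind of server it is. They might also ban your IP address because of the consecutive calls.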
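
One common way to be gentler on the server is to enforce a minimum delay between requests. The following is a minimal single-threaded sketch (Python 3), not something taken from the question's code: the `Throttle` class and its parameters are hypothetical names, and a suitable interval depends on the site.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` at the top of the per-country loop, with for example `throttle = Throttle(0.5)`, would space the requests at least half a second apart. The class is not thread-safe as written.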

Have a look at the Requests library and read its documentation. If you are doing this to learn more Python, don't jump straight to Scrapy.

score 0 · Accepted Answer

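Hi guys,

Some news on the problem! I found this script, which might be useful. I am actually testing it and it looks promising (6.03 seconds to run the script below).

My idea is to find a way to combine it with urllib3. In fact, I am making requests to the same host many times.

PoolManager takes care of reusing connections for you whenever you request the same host. This should cover most scenarios without significant loss of efficiency, but you can always drop down to a lower-level component for more granular control. (urllib3 documentation site)

Anyway, it seems very interesting, and even though I can't yet see how to mix these two pieces (urllib3 and the threading script below), I guess it's doable! :-)

Thanks again for lending a hand, it smells good!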

import Queue
import threading
import urllib2
import time
from bs4 import BeautifulSoup



hosts = ["http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=1", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=2", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=3", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=4", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=5", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=6"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded page parser"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs chunk from out queue
            chunk = self.out_queue.get()

            #parse the chunk
            soup = BeautifulSoup(chunk)
            #print soup.findAll(['table'])

            #print every cell of the first table, if the page has one
            tableau = soup.find('table')
            if tableau is not None:
                for tr in tableau.findAll('tr'):
                    for td in tr.findAll('td'):
                        texte_bu = td.text
                        texte_bu = texte_bu.encode('utf-8')
                        print texte_bu

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
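
One way the two could be mixed, as a minimal Python 3 sketch rather than a tested port of the script above: the worker threads all call one fetch function, and that function is backed by a single shared `urllib3.PoolManager`, so connections to the same host can be reused across threads. The names `make_urllib3_fetch` and `crawl` are hypothetical, and the thread and socket counts are arbitrary assumptions.

```python
import queue
import threading


def make_urllib3_fetch(num_sockets=5):
    """Build a fetch(url) function backed by one shared urllib3.PoolManager."""
    import urllib3  # third-party; imported here so the rest runs without it

    http = urllib3.PoolManager(maxsize=num_sockets)
    return lambda url: http.request("GET", url).data


def crawl(urls, fetch, num_threads=5):
    """Download every URL on worker threads; return {url: body}."""
    todo = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = todo.get()
            if url is None:  # sentinel: no more work for this thread
                return
            try:
                body = fetch(url)
            except Exception as exc:
                body = exc  # record the failure and keep the worker alive
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        todo.put(url)
    for _ in threads:
        todo.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

With urllib3 installed, `pages = crawl(hosts, make_urllib3_fetch())` would download the pages listed in `hosts`; parsing can then run over `pages.values()`.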
