python - Web サイトのコンテンツをスキャンする (高速)

Question

データベースに何千もの Web サイトがあり、すべての Web サイトで特定の文字列を検索したいと考えています。それを行う最も速い方法は何ですか？最初に各 Web サイトのコンテンツを取得する必要があると思います。これが私のやり方です。

import urllib2, re
string = "search string"
source = urllib2.urlopen("http://website1.com").read()
if re.search(word,source):
    print "My search string: "+string

そして文字列を検索します。しかし、これは非常に遅いです。Pythonでそれを加速するにはどうすればよいですか?

score 3 · Accepted Answer

あなたの問題はプログラムではないと思います。何千ものサイトに対して HTTP リクエストを実行しているという事実です。ある種の並列処理を含むさまざまなソリューションを調査することもできますが、解析コードをどれだけ効率的にしても、現在の実装ではリクエストでボトルネックにぶつかることになります。

Queueおよびthreadingモジュールを使用する基本的な例を次に示します。マルチプロセッシングとマルチスレッドの利点 (@JonathanV が言及した投稿など) を読むことをお勧めしますが、これが何が起こっているのかを理解するのに多少役立つことを願っています。

import Queue
import threading
import time
import urllib2

my_sites = [
    'http://news.ycombinator.com',
    'http://news.google.com',
    'http://news.yahoo.com',
    'http://www.cnn.com'
    ]

# Create a queue for our processing
queue = Queue.Queue()


class MyThread(threading.Thread):
  """Create a thread to make the url call."""

  def __init__(self, queue):
    super(MyThread, self).__init__()
    self.queue = queue

  def run(self):
    while True:
      # Grab a url from our queue and make the call.
      my_site = self.queue.get()
      url = urllib2.urlopen(my_site)

      # Grab a little data to make sure it is working
      print url.read(1024)

      # Send the signal to indicate the task has completed
      self.queue.task_done()


def main():

  # This will create a 'pool' of threads to use in our calls
  for _ in range(4):
    t = MyThread(queue)

    # A daemon thread runs but does not block our main function from exiting
    t.setDaemon(True)

    # Start the thread
    t.start()

  # Now go through our site list and add each url to the queue
  for site in my_sites:
    queue.put(site)

  # join() ensures that we wait until our queue is empty before exiting
  queue.join()

if __name__ == '__main__':
  start = time.time()
  main()
  print 'Total Time: {0}'.format(time.time() - start)

特に優れたリソースについてthreadingは、Doug Hellmann の投稿(こちら)、IBM の記事(上記で証明されているように、これが私の一般的なスレッド設定になっています)、実際のドキュメント (こちら) を参照してください。

score 2 · Accepted Answer

multiprocessing を使用して複数の検索を同時に実行することを検討してください。マルチスレッドも機能しますが、適切に管理しないと、共有メモリが呪いに変わる可能性があります。このディスカッションを見て、どの選択肢があなたに適しているかを確認してください.

python - Web サイトのコンテンツをスキャンする (高速)

2 に答える 2

Related

Reference