python - Processing a list of URLs with Newspaper3k (python3 lib) using threads that never finish

Question

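My script reads a list of URLs, puts the list on a queue, and processes it with python-newspaper3k. I have a lot of different URLs, and many of them point to not very popular websites. The problem is that the processing never finishes. Sometimes it does reach the end, but a few threads hit some problem and stall. The trouble appears when python-newspaper tries to parse each HTML page. Here is the code.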

Here I load the URLs into a queue and use newspaper to download and parse each HTML page.

def grab_data_from_queue():
    #while not q.empty(): # check that the queue isn't empty
    while True:
        if q.empty():
            break
        #print q.qsize()
        try:
            urlinit = q.get(timeout=10) # get the item from the queue
            if urlinit is None:
                print('urlinit is None')
                q.task_done()
            url = urlinit.split("\t")[0]
            url = url.strip('/')
            if ',' in url:
                print(', in url')
                q.task_done()
            datecsv = urlinit.split("\t\t\t\t\t")[1]
            url2 = url
            time_started = time.time()
            timelimit = 2
            #page = requests.get(url)
            #page.raise_for_status()

            #print "Trying: " + str(url)

            if len(url) > 30:

                if photo == 'wp':
                    article = Article(url, browser_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0')
                else:
                    article = Article(url, browser_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0', fetch_images=False)
                    imgUrl = ""

                #response = get(url, timeout=10)
                #article.set_html(response.content)

                article.download()
                article.parse()
                print(str(q.qsize()) + " parse passed")
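One thing worth noting about the excerpt as shown: after `q.task_done()` is called for a `None` entry or a URL containing a comma, execution falls through to the rest of the loop body, and no `task_done()` is visible after a successful parse. Either could leave `q.join()` waiting indefinitely. Below is a minimal sketch of a worker that acknowledges every item exactly once. It assumes a hypothetical `process_url` callable standing in for the `article.download()` / `article.parse()` step; the names `worker` and `run` are also illustrative, not taken from the original script.

```python
import queue
import threading


def worker(q, process_url, results, errors):
    """Drain q, calling process_url on each URL and acknowledging every item."""
    while True:
        try:
            # get_nowait() avoids the race between q.empty() and q.get()
            line = q.get_nowait()
        except queue.Empty:
            return
        try:
            if line is None:
                continue  # skip the entry; the finally clause still runs
            url = line.split("\t")[0].strip("/")
            if "," in url:
                continue
            results.append((url, process_url(url)))
        except Exception as exc:
            # Record the failure instead of letting the thread die silently
            errors.append((line, repr(exc)))
        finally:
            q.task_done()  # exactly one task_done() per get()


def run(lines, process_url, n_threads=4):
    """Process all lines with n_threads workers; return (results, errors)."""
    q = queue.Queue()
    for line in lines:
        q.put(line)
    results, errors = [], []
    threads = [
        threading.Thread(
            target=worker, args=(q, process_url, results, errors), daemon=True
        )
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    q.join()
    return results, errors
```

With newspaper3k, `process_url` would presumably build the `Article`, call `download()` and `parse()`, and return whatever fields are needed. Any exception raised for one URL ends up in `errors` together with the offending line, which also shows which entries failed.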

Then I create the threads.

for i in range(4): # aka number of threads
    try:
        t1 = Thread(target=grab_data_from_queue) # target is the above function
        t1.daemon = True # setDaemon() is deprecated since Python 3.10
        t1.start() # start the thread
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(str(exc_tb.tb_lineno) + ' => ' + str(e))


q.join()
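`Queue.join()` takes no timeout, so a single hung worker blocks it forever. One possible alternative, sketched below, is to keep the thread objects and join those against a shared deadline, then report which ones are still alive. The helper names `start_workers` and `wait_for_threads` and the deadline value are assumptions made for illustration.

```python
import threading
import time


def start_workers(target, n_threads=4):
    """Start n_threads daemon threads running target and return them."""
    threads = [threading.Thread(target=target, daemon=True) for _ in range(n_threads)]
    for t in threads:
        t.start()
    return threads


def wait_for_threads(threads, deadline_seconds):
    """Join all threads within one shared deadline; return those still running."""
    end = time.monotonic() + deadline_seconds
    for t in threads:
        # Each join only gets the time remaining until the shared deadline
        t.join(max(0.0, end - time.monotonic()))
    return [t for t in threads if t.is_alive()]
```

Because the workers are daemon threads, the main program can still exit when some of them remain stuck; the returned list only indicates that not everything finished.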

Is there a way to find out which URL is causing trouble and taking a long time to finish? If a URL can't be found, is it possible to stop the daemon thread?
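On both points: as far as I know, Python has no supported way to kill a running thread from outside, so the usual approach is to bound the network call itself (newspaper3k's `Config` exposes a `request_timeout` setting that can be passed via `Article(url, config=...)`) and to track which URL each worker is currently on. A minimal sketch of the tracking side follows. The `InFlight` class, the `tracked` helper, and the `process_url` callable are all hypothetical names introduced here.

```python
import threading
import time


class InFlight:
    """Record which URLs are being processed right now, and since when."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = {}

    def begin(self, url):
        with self._lock:
            self._started[url] = time.monotonic()

    def end(self, url):
        with self._lock:
            self._started.pop(url, None)

    def stuck(self, older_than):
        """Return [(url, seconds_running)] for URLs running longer than older_than."""
        now = time.monotonic()
        with self._lock:
            return [
                (url, now - started)
                for url, started in self._started.items()
                if now - started > older_than
            ]


def tracked(tracker, process_url, url):
    """Run process_url(url) while the tracker knows it is in flight."""
    tracker.begin(url)
    try:
        return process_url(url)
    finally:
        tracker.end(url)  # removed even if process_url raises
```

The main thread could then poll `tracker.stuck(10)` every few seconds and print the result, which names the URLs that have been running for more than ten seconds without relying on the worker ever returning.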
