python - Python urllib: skip a URL on an HTTP or URL error

Question

How can I modify my script to skip a URL if the connection times out or if the URL is invalid or returns a 404?

Python

#!/usr/bin/python

#parser.py: Downloads Bibles and parses all data within <article> tags.

__author__      = "Cody Bouche"
__copyright__   = "Copyright 2012 Digital Bible Society"

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")

score 2 · Accepted Answer

To apply a timeout to your request, add the timeout argument to your call to urlopen. From the docs:

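The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.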
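To make the quoted behaviour concrete, here is a minimal sketch of the two places a timeout can come from. It assumes Python 3's urllib.request (the successor to the urllib2 module used in the question); the fetch helper and its opener parameter are illustrative names, not part of the original script.

```python
import socket
import urllib.request

# Global default, used only when a call does not pass its own timeout.
# Without this line the default is None, meaning "block indefinitely".
socket.setdefaulttimeout(10)


def fetch(url, timeout=3, opener=urllib.request.urlopen):
    """Read url with a per-call timeout (hypothetical helper).

    timeout is passed by keyword on purpose: the second positional
    argument of urlopen is data, not timeout.
    """
    return opener(url, timeout=timeout).read()
```

A per-call value overrides the global default, so fetch(url) waits at most 3 seconds per blocking operation here, while other socket users in the process fall back to 10.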

See this guide's section on how to handle exceptions with urllib2. Actually, I found the whole guide very useful.

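The HTTP status code for a request timeout is 408. Note that this is the code a server sends back; a timeout triggered on your side by the timeout argument carries no code attribute and surfaces as a URLError (or a socket.timeout) instead. Wrapping it up, if you were to handle these errors, you would write something like: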

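from urllib2 import urlopen, URLError

try:
    # Pass timeout by keyword: the second positional argument is data.
    response = urlopen(req, timeout=3) # 3 seconds
except URLError as e:
    # HTTPError, a subclass of URLError, carries the HTTP status in e.code.
    if hasattr(e, 'code'):
        if e.code==408:
            print 'Timeout ', e.code
        if e.code==404:
            print 'File Not Found ', e.code
        # etc etc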
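Because a client-side timeout has no status code, it can help to tell the cases apart explicitly. The following is a rough sketch, assuming Python 3 (urllib.error in place of urllib2); classify is a hypothetical helper name, and the labels it returns are arbitrary.

```python
import socket
import urllib.error


def classify(exc):
    """Return a short label for an exception raised by urlopen."""
    # HTTPError must be tested first: it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return "http %d" % exc.code
    # A timeout while reading the body is raised as socket.timeout directly.
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, urllib.error.URLError):
        # A timeout while connecting arrives wrapped in URLError.reason.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return "url error"
    return "other"
```

Inside an except block this could be used as print(classify(e)) before moving on to the next URL.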

score 1 · Accepted Answer

Try putting your urlopen line inside a try/except statement. Have a look at this:

docs.python.org/tutorial/errors.html, section 8.3

Look at the different exceptions, and when you encounter one, just restart the loop with the continue statement.
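The try/except plus continue pattern described here could look roughly like the sketch below. It assumes Python 3 (urllib.request and urllib.error rather than urllib2); fetch and download_all are hypothetical names, and the parsing and file-writing steps of the original script are left out.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=3):
    """Return the body of url (hypothetical helper, not in the original script)."""
    with urllib.request.urlopen(url, timeout=timeout) as f:
        return f.read()


def download_all(urls, fetch=fetch):
    """Fetch every URL, skipping the ones that fail."""
    pages, skipped = {}, []
    for url in urls:
        try:
            body = fetch(url)
        except (urllib.error.URLError, socket.timeout, ValueError) as e:
            # URLError also covers HTTPError (404 and friends);
            # ValueError is raised for malformed URLs such as "unknown url type".
            skipped.append((url, str(e)))
            continue  # move on to the next link
        pages[url] = body
    return pages, skipped
```

Collecting the skipped URLs with their error text, rather than only printing them, makes it easy to retry or report them once the loop finishes.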
