python - Need help getting BeautifulSoup/urllib to handle errors gracefully and search pages for a string

Question

I'm working on a web crawler using Python and BeautifulSoup, and I've run into a few problems:

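I don't know how to handle errors such as 404 or 503. At the moment, the crawler simply aborts when it hits one.
I don't know how to search a page for a particular string, for example if I want to print the pages that contain the string "Python".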

If anyone has any input on how to achieve either of these, or can push me in the right direction, I'd appreciate it.

Currently, my code is as follows:

    import urllib.request, time, unicodedata
    from bs4 import BeautifulSoup
    num = 0
    def index():
        index = open('index.html', 'w')
        for x in range(len(titles)-1):
            index.write('<a href="' + tocrawl[x] + '" target="blank">' + titles[x+1] + '</a><br>\n')
        index.close()
        return 'Index Created'


    def crawl(args):
        page = urllib.request.urlopen(args).read()
        soup = BeautifulSoup(page)
        soup.prettify().encode('UTF-8')
        titles.append(str(soup.title.string.encode('utf-8'),encoding='utf-8'))
        for anchor in soup.findAll('a', href=True):
            if str(anchor['href']).startswith(https) or str(anchor['href']).startswith(http):
                if anchor['href'] not in tocrawl:
                    if anchor['href'].endswith(searchfor):
                        print(anchor['href'])
                    if not anchor['href'].endswith('.png') and not anchor['href'].endswith('.jpg'):
                        tocrawl.append(anchor['href'])

    tocrawl, titles, descriptions, scripts, results = [], [], [], [], []
    https = 'https://'
    http = 'http://'
    next = 3
    crawl('http://google.com/')
    while 1:
        crawl(tocrawl[num])
        num = num + 1
        if num==next:
            index()
            next = next + 3

I'm using Python 3.2, in case that matters.
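
For reference, here is a minimal sketch of one way both points could be approached. It is not a definitive fix. It assumes that skipping a failed page is acceptable and that a plain substring match on the decoded page is enough. The names `fetch` and `contains_text`, and the `opener` parameter (there only so the function can be exercised without a network connection), are hypothetical and not part of the code above. `urllib.error.HTTPError` is the exception `urlopen` raises for status codes such as 404 or 503, and it is a subclass of `urllib.error.URLError`, so it has to be caught first.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return the raw page bytes, or None if the request fails.

    `opener` is a hypothetical hook: any callable that takes a URL and
    returns an object with a read() method.
    """
    try:
        return opener(url).read()
    except urllib.error.HTTPError as err:
        # Raised for HTTP error statuses such as 404 or 503.
        print("Skipping", url, "- HTTP error", err.code)
    except urllib.error.URLError as err:
        # Raised for lower-level failures (DNS lookup, refused connection).
        print("Skipping", url, "-", err.reason)
    return None


def contains_text(page, needle):
    """Return True if the decoded page contains `needle` (case-sensitive)."""
    # Assumes UTF-8; undecodable bytes are replaced instead of raising.
    text = page.decode("utf-8", errors="replace")
    return needle in text
```

Inside `crawl`, the first line could then become `page = fetch(args)`, returning early when `page` is `None`, and `contains_text(page, "Python")` could decide whether to print the URL. To match only the visible text and not the markup, the string returned by `soup.get_text()` could be checked with `in` instead.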
