python - python scraping

Question

I am trying to get names of restaurants, their addresses and their phone numbers.

My code keeps getting stuck in the second function. The first function works fine. I can't identify any mistake, but the loop never finishes.

I would appreciate it if someone could point out whether I am making an obvious mistake.

Thanks

from urllib2 import urlopen
from csv import writer

def get_urls_of_restaurant():
    list_urls = []
    n = 0
    nn = 0
    for i in range(6):
        url = urlopen('http://www.go.co.tz/index.php/restaurants/masaki?start=' +     str(nn)).readlines() #open URL whis lists restaurants
        while n < len(url):
            if '<h2 class="contentheading">' in url[n]:
                list_urls.append(url[n+1].split('"')[1])
            n += 1
        n = 0
        nn += 3
    list_urls.reverse()
    print "Geting urls done! Get %s" %len(list_urls) + ' urls.'
    return list_urls

def open_url_and_write_data(list_urls):
    n = len(list_urls)-1
    csv_file = open('restdar_guide.csv', 'wb')
    file_writer = writer(csv_file, delimiter=';')
    file_writer.writerow(['Name'] + ['address'] + ['phone'])
    while n >= 0:
        print 'Reading %s' % str(int(len(list_urls))-n) + " element of %s" % len(list_urls) + " element's..."
        url = urlopen('http://www.go.co.tz' + list_urls[n]).readlines()
        num_str = 0
        list_write = []
        while num_str < len(url):
            if '<title>' in url[num_str]:
                list_write.append(url[num_str].split('<')[0][7:])
            if 'Location:</strong>' in url[num_str]:
                list_write.append(url[num_str].split('<')[1][9:])
            else:
                list_write.append('unknown')
            if '<li><strong>Tel:</strong>' in url[num_str]:
                list_write.append(url[num_str].split('<')[2][10:])
            else:
                list_write.append('unknown')
            file_writer.writerow([list_write[0]] + [list_write[1]] + [list_write[2]])
        n -= 1
    csv_file.close()
    print 'Done!'

list_urls = get_urls_of_restaurant()
open_url_and_write_data(list_urls)

score 4 · Accepted Answer

BeautifulSoup might make your life a little easier.

answered 2012-08-28T08:21:37.953

score 2 · Accepted Answer

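When you abort the program, all you get is a KeyboardInterrupt error. Depending on timing, it can be raised on any line inside that one while loop: it is simply whichever instruction happened to be executing when you finally gave up and aborted.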

This is what puts your program into a non-terminating loop:

num_str = 0
...
while num_str < len(url):

You never change the value of num_str, so for any len(url) value greater than 0 this is equivalent to while True:. By the way, this is a perfect place for a for loop.
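As a rough illustration of the for-loop suggestion, here is a minimal sketch written for Python 3 (the question's code is Python 2). The helper names text_after and extract_fields, the sample lines, and the exact markup are assumptions made for the example, not the real pages from the question. The loop visits each line exactly once, so it cannot spin forever, and the 'unknown' defaults are set once before the loop instead of being appended on every non-matching line.

```python
def text_after(line, marker):
    """Return the text that follows marker, up to the next '<'."""
    rest = line.split(marker, 1)[1]
    return rest.split('<', 1)[0].strip()


def extract_fields(lines):
    """Return [name, address, phone] found in the given page lines."""
    # Defaults are assigned once, before the loop.
    name = address = phone = 'unknown'
    # The for loop advances by itself, so there is no counter to forget.
    for line in lines:
        if '<title>' in line:
            name = text_after(line, '<title>')
        if 'Location:</strong>' in line:
            address = text_after(line, 'Location:</strong>')
        if 'Tel:</strong>' in line:
            phone = text_after(line, 'Tel:</strong>')
    return [name, address, phone]


# Hypothetical sample lines, standing in for urlopen(...).readlines().
sample_lines = [
    '<title>Example Cafe</title>\n',
    '<li><strong>Location:</strong> Example Road, Masaki</li>\n',
    '<li><strong>Tel:</strong> 000 000 000</li>\n',
]

print(extract_fields(sample_lines))
```

The row would then be written once per page, after the loop over its lines has finished, rather than once per line.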

That said, as others have pointed out, this is not the best way to do HTML parsing or web scraping. There are plenty of scraping utilities and HTML parsers available, and I would recommend using one of them.
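To give an idea of what using a parser looks like, here is a small sketch built on html.parser from the Python 3 standard library. It is not BeautifulSoup, which is a separate third-party package, and the class name TitleAndLabelParser, the sample markup, and the field names are all assumptions for illustration. It records the title text and whatever text follows a <strong>Label:</strong> element.

```python
from html.parser import HTMLParser


class TitleAndLabelParser(HTMLParser):
    """Collect the <title> text and the text after <strong>Label:</strong>."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False
        self._in_strong = False
        self._pending_label = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'strong':
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False
        elif tag == 'strong':
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self._in_title:
            self.fields['Name'] = text
        elif self._in_strong:
            # Remember the label, e.g. 'Location' or 'Tel'.
            self._pending_label = text.rstrip(':')
        elif self._pending_label is not None:
            # First text after a label is treated as its value.
            self.fields[self._pending_label] = text
            self._pending_label = None


# Hypothetical markup, loosely modelled on the markers in the question.
sample_html = """<html><head><title>Example Cafe</title></head>
<body><ul>
<li><strong>Location:</strong> Example Road, Masaki</li>
<li><strong>Tel:</strong> 000 000 000</li>
</ul></body></html>"""

parser = TitleAndLabelParser()
parser.feed(sample_html)
parser.close()
row = [parser.fields.get(key, 'unknown') for key in ('Name', 'Location', 'Tel')]
print(row)
```

Because the parser tracks tags rather than character offsets, it does not depend on each field sitting on its own line or starting at a fixed column.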

score 1 · Accepted Answer

It looks like "n = len(list_urls)-1" is indented too far. Try aligning it with the following line.
