python - Preventing 503 errors when scraping Google Scholar

Question

I wrote the following code to scrape data from the Google Scholar author listing for the security label. However, every time I run it, I get the following error:

 Traceback (most recent call last):
  File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 53, in <module>
    getProfileFromTag(each)
  File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 32, in getProfileFromTag
    page = urllib.request.urlopen(url)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 504, in error
    result = self._call_chain(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
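
The traceback passes through `http_error_302` before the 503 is raised, so the request was redirected at least once before it failed. Below is a minimal sketch for inspecting where the request ended up, using only the standard library. The names `describe_failure` and `opener` are hypothetical and are not part of the original script; the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def describe_failure(url, opener=urllib.request.urlopen):
    """Fetch url and return a small diagnostic dict instead of raising on HTTP errors."""
    try:
        with opener(url) as response:
            return {"ok": True, "status": response.status, "final_url": response.url}
    except urllib.error.HTTPError as err:
        headers = err.headers
        return {
            "ok": False,
            "status": err.code,        # e.g. 503
            "final_url": err.url,      # URL that produced the error, after redirects
            "retry_after": headers.get("Retry-After") if headers is not None else None,
        }
```

If `final_url` differs from the URL that was requested, the error came from the redirect target rather than from the search page itself. Whether the server sends a `Retry-After` header at all is an assumption; the field is `None` when it does not.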

I assume this happens because GS is blocking my requests. How can I prevent this?

The code is:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import string
import csv
import time

# Lists to store the scraped data
name = []
urlList = []

# Open the CSV file and write the header row
outputFile = open('sample.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['Name', 'URL', 'Total Citations', 'h-index', 'i10-index'])

def getStat(url):
    # Given an author's URL, return a list of that author's stats.
    url = 'https://scholar.google.pl' + url
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    buttons = soup.findAll("td", {"class": "gsc_rsb_std"})
    # Collect the text of each stats cell
    stats = [button.text for button in buttons]
    return stats

def getProfileFromTag(tag):
    url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:" + tag
    while True:
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, 'lxml')

        # Reuse the page parsed above instead of requesting the same URL again
        mydivs = soup.findAll("h3", {"class": "gsc_1usr_name"})
        for each in mydivs:
            for anchor in each.find_all('a'):
                name.append(anchor.text)
                urlList.append(anchor['href'])
                time.sleep(0.001)
        buttons = soup.findAll("button", {"aria-label": "Następna"})
        if not buttons:
            break
        on_click = buttons[0].get('onclick')
        url = 'http://scholar.google.pl' + on_click[17:-1]
        url = url.encode('utf-8').decode('unicode_escape')
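    # Fetch the stats for each profile and write one CSV row per author
    for each, profile_url in zip(name, urlList):
        stats = getStat(profile_url)
        outputWriter.writerow([each, profile_url, stats[0], stats[2], stats[4]])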

tags = ['security']
for each in tags:
    getProfileFromTag(each)
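
The script above sends its requests back to back with the default `urllib` User-Agent. Below is a minimal sketch of a slower, more tolerant fetch: it sets an explicit User-Agent and retries with exponential backoff when the server answers 429 or 503. The names `fetch_with_backoff`, `USER_AGENT` and `RETRY_STATUSES`, and the delay values, are assumptions made for illustration. Nothing here is known to be sufficient to stop Google Scholar from rejecting automated requests.

```python
import time
import urllib.error
import urllib.request

# Hypothetical values for illustration; adjust them for your own use.
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"
RETRY_STATUSES = {429, 503}


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_attempts=4, base_delay=5.0):
    """Return the response body, retrying with exponential backoff on 429/503."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(max_attempts):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            last_attempt = attempt == max_attempts - 1
            # Give up on non-retryable statuses or once the attempts are used up
            if err.code not in RETRY_STATUSES or last_attempt:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
```

In the script, `BeautifulSoup(fetch_with_backoff(url), 'lxml')` could stand in for the direct `urllib.request.urlopen(url)` calls. The `opener` and `sleep` parameters are there so the retry logic can be tested without network access or real waiting.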
