python - Determining the number of sites on a website with Python

Question

I have the following link:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN

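The reference part of the URL carries the following information:

A7 == the parliament (currently the 7th parliament; the previous one was A6, and so on)

2010 == the year

0001 == the document number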
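The reference layout described above can be sketched as a small helper. This is only an illustration: `URL_TEMPLATE`, `make_reference` and `make_url` are made-up names, and the zero-padding to four digits is inferred from the example link.

```python
# Illustrative helpers (hypothetical names) for the reference format
# A<parliament>-<year>-<4-digit document number>.
URL_TEMPLATE = (
    "http://www.europarl.europa.eu/sides/getDoc.do"
    "?type=REPORT&mode=XML&reference={reference}&language=EN"
)


def make_reference(parliament, year, number):
    """Build a reference string such as 'A7-2010-0001'."""
    return "A{}-{}-{:04d}".format(parliament, year, number)


def make_url(parliament, year, number):
    """Build the full document URL for one parliament, year and number."""
    return URL_TEMPLATE.format(reference=make_reference(parliament, year, number))


print(make_url(7, 2010, 1))
```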

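For each year and parliament, I would like to determine the number of documents on the website. The task is complicated by the fact that, for 2010 for example, numbers 186, 195 and 196 have empty pages, while the maximum number is 214. Ideally, the output would be a vector containing all document numbers, excluding the missing ones.

Can anyone tell me whether this is possible in Python?

Best, Thomas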

score 3 · Accepted Answer

First, make sure that scraping the site is legal.

Second, note that when a document does not exist, the HTML file contains the following:

<title>Application Error</title>
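A minimal sketch of that check, assuming the error page always carries this title. `ERROR_TITLE` and `is_missing` are illustrative names, and the tolerance for case and whitespace is my own addition rather than something the site is known to need.

```python
import re

# Pattern for the error page title; case and surrounding whitespace are
# tolerated as a precaution (an assumption, not observed behaviour).
ERROR_TITLE = re.compile(r"<title>\s*Application Error\s*</title>", re.IGNORECASE)


def is_missing(html):
    """Return True if the HTML text looks like the 'Application Error' page."""
    return ERROR_TITLE.search(html) is not None


print(is_missing("<html><head><title>Application Error</title></head></html>"))
```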

Third, use urllib to iterate over everything you need:

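from urllib.request import urlopen

ROOT = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%04d&language=EN"

for p in range(1, 8):  # parliaments A1 through A7
    for y in range(2000, 2011):
        doc = 1
        while True:
            # use urllib to open the url: (root)+p+y+doc
            html = urlopen(ROOT % (p, y, doc)).read()
            # break at the first error page (this also stops at a gap in the numbering)
            if b"application error" in html.lower():
                break
            doc += 1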
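Since the question mentions gaps in the numbering (186, 195 and 196 for 2010), stopping at the first missing document would end the scan too early. One possible variation, as a sketch only: keep going until a run of consecutive misses is seen. `existing_documents`, `exists`, `max_misses` and `limit` are hypothetical names, and the threshold of 20 is an arbitrary guess that would need tuning against the real site. The demonstration uses a simulated site so that it runs offline.

```python
def existing_documents(exists, max_misses=20, limit=9999):
    """Return the document numbers for which exists(number) is true.

    The scan stops after max_misses consecutive missing numbers, so short
    gaps in the numbering do not end it early.
    """
    found = []
    misses = 0
    for number in range(1, limit + 1):
        if exists(number):
            found.append(number)
            misses = 0  # a hit resets the run of misses
        else:
            misses += 1
            if misses >= max_misses:
                break
    return found


# Offline demonstration: simulate 2010 as described in the question.
present = set(range(1, 215)) - {186, 195, 196}
numbers = existing_documents(lambda n: n in present)
print(len(numbers), numbers[-1])  # prints: 211 214
```

In real use, `exists` would fetch the page for a given number and report whether the error title is absent.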

score 1 · Accepted Answer

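Here is a slightly more complete (but hacky) example that seems to work (using httplib2). I'm sure you can customize it for your specific needs.

I'd also repeat Arrieta's warning about making sure the site owner doesn't mind you scraping their content.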

#!/usr/bin/env python3
import httplib2

# Cache responses on disk in the ".cache" directory.
h = httplib2.Http(".cache")

parliament = "A7"
year = 2010

# Create two lists, one list of URLs and one list of document numbers.
urllist = []
doclist = []

urltemplate = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=%s-%d-%04d&language=EN"

# Document numbers are four digits wide and start at 0001.
for document in range(1, 10000):
    url = urltemplate % (parliament, year, document)
    resp, content = h.request(url, "GET")
    # httplib2 returns the body as bytes, so search for a bytes pattern.
    if b"Application Error" not in content:
        print("Document %04d exists" % document)
        urllist.append(url)
        doclist.append(document)
    else:
        print("Document %04d doesn't exist" % document)
print("Parliament %s, year %d has %d documents" % (parliament, year, len(doclist)))

score 1 · Accepted Answer

Here is a solution, but I recommend adding a pause between requests.

from urllib.request import urlopen

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"
MAX_RANGE = 300

for year in [2010, 2011]:
    for page in range(1, MAX_RANGE):
        # Read the raw bytes; the with-block closes the connection afterwards.
        with urlopen(URL_TEMPLATE % (year, page)) as f:
            text = f.read()
        if b"<title>Application Error</title>" in text:
            print("year %d and page %.4d NOT found" % (year, page))
        else:
            print("year %d and page %.4d FOUND" % (year, page))
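The pause between requests could look roughly like this. It is a sketch under assumptions: `fetch_text` and `fetch_all` are hypothetical names, the one-second default is an arbitrary choice, and the `fetch` and `sleep` parameters exist only so the loop can be tried without touching the network.

```python
import time
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download one page and decode it, replacing undecodable bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_text, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        pages[url] = fetch(url)
    return pages
```

For example, `fetch_all(list_of_urls, delay=2.0)` would wait two seconds between consecutive downloads.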
