python - Web scraping with urlopen in Python

Question

I am trying to get some data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

urlopen does not seem to get the HTML code, and I don't understand why. I do the following:

html = urllib.request.urlopen("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")
print (html)

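My code is correct: the same code fetches the HTML source of other web pages, but it does not seem to recognize this address.

It prints: b''

Maybe another library is more appropriate? Why does urlopen not return the HTML code of this web page? Thanks for your help!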

score 4 · Accepted Answer

Personally, I write it like this:

# Python 2.7

import urllib

url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
sock = urllib.urlopen(url)
content = sock.read() 
sock.close()

print content

Et si tu parles français... bonjour sur stackoverflow.com ! (And if you speak French: welcome to stackoverflow.com!)

Update 1

In fact, I now prefer to use the following code:

# Python 2.7

import httplib

conn = httplib.HTTPConnection(host='www.boursorama.com',timeout=30)

req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

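try:
    conn.request('GET', req)
except Exception:
    print 'echec de connexion'  # "connection failed"
    raise  # without a request sent, getresponse() below would fail anyway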

content = conn.getresponse().read()

print content

This code can be adapted to Python 3 simply by changing httplib to http.client (and the print statements to print() calls).
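For reference, here is a minimal Python 3 sketch of the same request, assuming nothing more than the rename to http.client. The fetch_body helper and its port parameter are illustrative names of mine, not part of the code above.

```python
import http.client


def fetch_body(host, path, port=80, timeout=30):
    """Send a GET request and return the raw response body as bytes."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request('GET', path)
        return conn.getresponse().read()
    finally:
        # Close the socket whether or not the request succeeded.
        conn.close()


# Example call (performs a real network request, so it is left commented out):
# content = fetch_body('www.boursorama.com',
#                      '/includes/cours/last_transactions.phtml?symbole=1xEURUS')
# print(content[:200])
```

Note that in Python 3 the returned body is a bytes object, not a str.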


I confirm that with both of these snippets I obtain the source code in which the data you are interested in appears:

        <td class="L20" width="33%" align="center">11:57:44</td>

        <td class="L20" width="33%" align="center">1.4486</td>

        <td class="L20" width="33%" align="center">0</td>

</tr>

                                        <tr>

        <td  width="33%" align="center">11:57:43</td>

        <td  width="33%" align="center">1.4486</td>

        <td  width="33%" align="center">0</td>

</tr>

Update 2

Adding the following snippet to the code above lets you extract the data you need:

for i,line in enumerate(content.splitlines(True)):
    print str(i)+' '+repr(line)

print '\n\n'


import re

regx = re.compile('\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')

print regx.findall(content)

Result (only the end)

.......................................
.......................................
.......................................
.......................................
98 'window.config.graphics = {};\n'
99 'window.config.accordions = {};\n'
100 '\n'
101 "window.addEvent('domready', function(){\n"
102 '});\n'
103 '</script>\n'
104 '<script type="text/javascript">\n'
105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
107 '\t\t\t\tvar sas_formatids = "8968";\n'
108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
111 "\twindow.addEvent('domready', function(){\r\n"
112 'sas_move(1,8968);\t});\r\n'
113 '</script>\n'
114 '<script type="text/javascript">\n'
115 'var _gaq = _gaq || [];\n'
116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
120 "_gaq.push(['_trackPageLoadTime']);\n"
121 "_gaq.push(['_trackPageview']);\n"
122 '(function() {\n'
123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
126 '})();\n'
127 '</script>\n'
128 '</body>\n'
129 '</html>'



[('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]

I hope you are not planning to "play" at trading on Forex: it is one of the best ways to lose money quickly.

Update 3

Sorry! I forgot that you are using Python 3. So I think you need to define the regex like this:

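regx = re.compile(b'\t\t\t\t\t......)

That is, prefix the string with b. Otherwise you will get an error like the one in this question (in Python 3, applying a str pattern to a bytes object raises a TypeError).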
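A small sketch of the bytes version, run against a hand-made sample rather than the live page. The SAMPLE rows and the extract_rows name are assumptions for illustration, and the pattern is a loosened variant (it uses \s* instead of the exact tabs and \r\n), not the exact regex above.

```python
import re

# Bytes pattern: the page body returned by read() is bytes in Python 3.
ROW = re.compile(
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>'
)

# Made-up sample modelled on the rows shown earlier (assumption).
SAMPLE = (
    b'\t<td class="L20" width="33%" align="center">11:57:44</td>\r\n'
    b'\t<td class="L20" width="33%" align="center">1.4486</td>\r\n'
    b'\t<td class="L20" width="33%" align="center">0</td>\r\n'
)


def extract_rows(content):
    """Return (time, price, volume) tuples, decoded from bytes to str."""
    return [tuple(field.decode('ascii') for field in match)
            for match in ROW.findall(content)]


print(extract_rows(SAMPLE))  # [('11:57:44', '1.4486', '0')]
```

Decoding each field is optional; without it, findall returns tuples of bytes.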

score 4 · Accepted Answer

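My guess is that the server is sending compressed data without telling you that it is doing so. Python's standard HTTP library does not decompress compressed responses for you.
I suggest getting httplib2, which can handle compressed formats (and is generally much better than urllib).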

import httplib2
folder = httplib2.Http('.cache')
response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")

print(response) shows us the response from the server:
{'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', '

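This does not confirm that the data was compressed (after all, we are now telling the server that we can handle compression), but it does lend some weight to the theory.

As you may have guessed, the actual content is in content. A quick look shows that it works (I am only pasting a little bit):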
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n\t"http://
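Since content is bytes and the response above reports charset=ISO-8859-1, here is a hedged sketch of turning it into text. charset_of is a hypothetical helper and the raw bytes are a made-up sample; the parsing itself relies on the standard library's email.message.Message.

```python
from email.message import Message


def charset_of(content_type, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset(default)


content_type = 'text/html; charset=ISO-8859-1'   # value from the response above
raw = b'<td>Derni\xe8res transactions</td>'       # made-up sample bytes
print(raw.decode(charset_of(content_type)))
```

With httplib2, the header value would come from response['content-type'].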

Edit: yes, this creates a folder named .cache. I have found that it is always better to work with a folder when it comes to httplib2; you can always delete the folder afterwards.
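If you would rather stay with the standard library, here is a rough sketch of doing the decompression by hand, under the still unverified assumption that the body is gzip-encoded. maybe_gunzip is a made-up helper name; it also checks the gzip magic bytes, because the theory is that the server may not announce the encoding.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def maybe_gunzip(body, content_encoding=None):
    """Return body decompressed if it is (or claims to be) gzip data."""
    announced = (content_encoding or '').strip().lower() == 'gzip'
    if announced or body.startswith(GZIP_MAGIC):
        return gzip.decompress(body)
    return body


# Round trip on local data (no network involved):
packed = gzip.compress(b'<td>1.4486</td>')
print(maybe_gunzip(packed))          # b'<td>1.4486</td>'
print(maybe_gunzip(b'plain text'))   # b'plain text'
```

With urllib.request, you would pass it the result of response.read() together with response.headers.get('Content-Encoding').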

score 2 · Accepted Answer

I tested the URL with httplib2 and with curl in a terminal. Both work fine:

import httplib2

URL = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
h = httplib2.Http()
resp, content = h.request(URL, "GET")
print(content)

To me, it looks like either there is a bug in urllib.request or some really weird client-server interaction is going on.
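One way to probe the client-server interaction theory is to make urllib.request send explicit headers, closer to what a browser or curl sends. This is only a sketch: the header values are guesses, nothing here confirms which header (if any) the server reacts to, and build_request is a hypothetical helper.

```python
import urllib.request

URL = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"


def build_request(url, user_agent='Mozilla/5.0', accept_encoding='identity'):
    """Build a GET request with explicit User-Agent and Accept-Encoding."""
    return urllib.request.Request(url, headers={
        'User-Agent': user_agent,            # assumed value, adjust as needed
        'Accept-Encoding': accept_encoding,  # 'identity' asks for no compression
    })


req = build_request(URL)
print(req.header_items())

# Sending it performs a real network request, so it is left commented out:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status, resp.headers.get('Content-Encoding'))
#     print(resp.read()[:200])
```

Comparing the result with and without these headers would show whether the empty body depends on what the client announces.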
