regex - Can't parse Google Finance HTML

Question

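I'm trying to scrape some stock prices and their changes from Google Finance using Python 3, but I can't tell whether the problem lies with the page or with my regex. I suspect that the SVG graphics or the many script tags throughout the page are preventing the regex engine from processing the markup properly.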

I have tested this regex in a number of online regex builders/testers and it appears to work. As far as I can tell, the pattern itself, as written for this HTML, is fine.

The Google Finance page I'm testing against is https://www.google.com/finance?q=NYSE%3AAAPL, and my Python code is as follows:

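import re
import urllib.request

# Download the quote page and decode it to text
page = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
text = page.read().decode('utf-8')

# Raw string avoids invalid-escape warnings; re.S lets '.' span newlines
# Group 1: last price, group 2: percentage change
m = re.search(r'id="price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)', text, re.S)
print(m.groups())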

This is meant to extract the stock price and its percentage change. I also tried Python 2 with BeautifulSoup:

soup.find(id='price-panel')

But even this simple query returns nothing. This in particular is why I think there is something odd about the HTML.
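
One way to tell "the downloaded page is different" apart from "my query is wrong" is to check whether the id is present in the downloaded text at all. Below is a minimal sketch using only the standard library. `IdCollector` and `collect_ids` are hypothetical helper names, and `sample` is a stand-in for the `text` variable from the code above. This is an illustration under those assumptions, not a confirmed diagnosis.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Hypothetical helper: records every id attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def collect_ids(html_text):
    """Return the set of id values found in html_text."""
    parser = IdCollector()
    parser.feed(html_text)
    parser.close()
    return parser.ids


# Stand-in for the downloaded page; replace with the real `text` when testing
sample = '<div id="price-panel"><span id="ref_22144_l">96.41</span></div>'
print("price-panel" in collect_ids(sample))
```

If `price-panel` is missing from the collected ids for the real download, the markup being fetched differs from what the browser shows, and neither the regex nor `soup.find` could be expected to succeed.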

And here is the most important part of the HTML I'm targeting:

<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
<div>
<span class="nwp">
Real-time:
&nbsp;
<span class="unchanged" id="ref_22144_ltt">3:42PM EDT</span>
</span>
<div class="mdata-dis">
<span class="dis-large"><nobr>NASDAQ
real-time data -
<a href="//www.google.com/help/stock_disclaimer.html#realtime" class="dis-large">Disclaimer</a>
</nobr></span>
<div>Currency in USD</div>
</div>
</div>
</div>
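
As a sanity check, the same pattern can be run offline against a trimmed copy of the excerpt above, which takes the download step out of the picture. In this sketch, `snippet`, `PATTERN`, and `result` are illustrative names, and the snippet drops the "Real-time" block to keep it short.

```python
import re

# Trimmed copy of the excerpt above (only the price and change parts)
snippet = """<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
</div>"""

# Same pattern as in the question, written as a raw string
PATTERN = r'id="price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)'

m = re.search(PATTERN, snippet, re.S)
result = m.groups() if m else None
print(result)  # ('96.41', '-1.16%')
```

Since the pattern matches this excerpt, a failure on the live download points toward the fetched text rather than the expression. Note that the percentage group accepts only one digit before the decimal point, so a change such as (-11.16%) would not match.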

I'm wondering whether anyone has run into a similar problem with this page, or can tell whether something is wrong with my code. Thanks in advance!
