python - Isolating a value with BeautifulSoup in a badly written script

Question

I am trying to parse the HTML source of various pages such as these:

http://www.ielts.org//test_centre_search/results.aspx?TestCentreID=dd50346f-60bc-4a4f-a37f-7e3d34df0bf8 or www.ielts.org//test_centre_search/results.aspx?TestCentreID=feb563e3-43db-4d40-a6b1-223e2fb7191b (there are 800 pages like these)

They all have the same format. I am trying to parse the "Test Fee" value.

<TABLE style="BORDER-RIGHT: buttonshadow 1px solid; BORDER-TOP: buttonhighlight 1px solid; FONT: messagebox; BORDER-LEFT: buttonhighlight 1px solid; COLOR: buttontext; BORDER-BOTTOM: buttonshadow 1px solid; BACKGROUND-COLOR: buttonface" cellSpacing=0 cellPadding=4 width=500>
<TBODY></TBODY></TABLE><table id="Template_ctl21_TestCentreView1_TestCentreTable" Width="400" border="0">
    <tr>
        <td><img src="https://www.ielts.org/TestCentreLogos/TestCentre/dd50346f-60bc-4a4f-a37f-7e3d34df0bf8.jpg" align="right" style="border-width:0px;" /><span class="TestCentreViewTitle">University of Canberra Test Centre</span><BR><BR><span class="TestCentreViewLabel">Address:</span><BR><span class="TestCentreViewBody">IELTS Administrator</span><BR><span class="TestCentreViewBody">Building 16</span><BR><span class="TestCentreViewBody">Wilpena Street, Bruce</span><BR><span class="TestCentreViewBody">ACT - Canberra</span><BR><span class="TestCentreViewBody">2617</span><BR><BR><span class="TestCentreViewLabel">Tel: </span><span class="TestCentreViewBody">61 2 6201 2669</span><BR><span class="TestCentreViewLabel">Fax: </span><span class="TestCentreViewBody">61 2 6201 5089</span><BR><span class="TestCentreViewLabel">Email: </span><a class="TestCentreViewLink" href="mailto:ielts@canberra.edu.au">ielts@canberra.edu.au</a><BR><span class="TestCentreViewLabel">Web: </span><a class="TestCentreViewLink" href="http://www.canberra.edu.au/uceli/ielts">http://www.canberra.edu.au/uceli/ielts</a><BR><BR>**<span class="TestCentreViewLabel">Test Fee: </span><span class="TestCentreViewBody">AUD$330</span>**<BR><BR><div style="overflow-y:scroll;overflow-x:visible;height:250px;;"><table cellspacing="0" cellpadding="2" border="0" style="border-collapse:collapse;">
            <tr>

        </table></div><BR><span class="TestCentreViewBody"><P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-US style="mso-ansi-language: EN-US"><FONT size=3><FONT color=#000000><FONT face=Calibri>The IELTS office now closes at 4:00pm on Friday afternoons.<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></FONT></FONT></FONT></SPAN></P>
<P>&nbsp;</P></span><BR></td>
    </tr>
</table>

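Above is the part of the source that is of interest. What I want to parse is this: **Test Fee: AUD$330**

The problem is that many spans share the same class (TestCentreViewBody), and their number varies from page to page: one page has 5, another has 8, and so on. So I don't know how to isolate this one.

I am looking for a way to isolate this value.

PS: I noticed that the second-to-last one always seems to contain the value I am looking for. So here is what I tried: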

LOL = findAll('span' .. with the 'class' : 'TestCentreViewBody')
Value = LOL[len(lol)-1]

But that doesn't seem to work.

score 1 · Accepted Answer

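Run find_all() for the class TestCentreViewLabel and loop over the results. On each iteration, get the tag's text and check whether it contains the word "Fee". If it does, take the next sibling of the current tag; its contents should be the value you are looking for.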
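
The steps above can be sketched roughly as follows, assuming the bs4 package (BeautifulSoup 4) is installed. The markup is a shortened copy of the sample from the question, and the helper name find_test_fee is hypothetical. find_next_sibling("span") is used rather than the raw next sibling so that whitespace between tags, if any, does not get in the way.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumption: on the real
# pages the label span is followed by the value span as a sibling).
SAMPLE_HTML = """
<td><span class="TestCentreViewLabel">Tel: </span><span class="TestCentreViewBody">61 2 6201 2669</span><BR>
<span class="TestCentreViewLabel">Test Fee: </span><span class="TestCentreViewBody">AUD$330</span><BR>
<span class="TestCentreViewBody">The IELTS office now closes at 4:00pm.</span></td>
"""


def find_test_fee(html):
    """Return the text of the span that follows the "Test Fee" label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Walk over every label span and stop at the one mentioning "Fee".
    for label in soup.find_all("span", class_="TestCentreViewLabel"):
        if "Fee" in label.get_text():
            value = label.find_next_sibling("span")
            if value is not None:
                return value.get_text(strip=True)
    return None


print(find_test_fee(SAMPLE_HTML))
```

For the sample above this prints AUD$330. Because the lookup is keyed on the label text, it does not matter how many TestCentreViewBody spans a given page contains.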

score 0 · Accepted Answer

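If you put the HTML into a string t, this works, at least for the example provided.

import re

# t holds the page's HTML source as a string
# Capture the currency letters before "$" and the digits after it
p = r'TestCentreViewBody">(\w*)\$(\d*)</span>'
re.findall(p, t)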

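It requires a $ somewhere in the fee value and returns a list of (currency, value) tuples. If the amount can have decimal places, change the bit in the second pair of parentheses to ([0-9.]*).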
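
As a quick illustration of both the basic pattern and the decimal variant, here is a small sketch run against two hand-written snippets. The variable names and the USD$199.50 amount are made up for illustration; only AUD$330 comes from the question.

```python
import re

# Pattern from the answer: currency letters, a literal "$", then digits.
p_whole = r'TestCentreViewBody">(\w*)\$(\d*)</span>'
# Variant suggested for amounts with decimal places.
p_decimal = r'TestCentreViewBody">(\w*)\$([0-9.]*)</span>'

t_whole = '<span class="TestCentreViewBody">AUD$330</span>'
# Hypothetical value, not taken from any real page.
t_decimal = '<span class="TestCentreViewBody">USD$199.50</span>'

fees_whole = re.findall(p_whole, t_whole)
fees_decimal = re.findall(p_decimal, t_decimal)
# The digits-only pattern does not match an amount containing a ".".
missed = re.findall(p_whole, t_decimal)

print(fees_whole)
print(fees_decimal)
print(missed)
```

The last call returns an empty list, which is why the decimal variant is needed if any page lists a fee with cents.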

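Hope it works.

Edit:

If you don't know the currency symbol (but there is always some symbol that is neither a letter nor a digit), and "Test Fee: " always appears immediately before the value, you can use: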

p = r'<span class="TestCentreViewLabel">Test Fee: </span><span class="TestCentreViewBody">(\w*)[^\w\d](\d*)</span>'

But the suggested BeautifulSoup solution does more or less the same thing.
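
For completeness, a short sketch of the label-anchored pattern from the edit, run against a fragment of the question's markup. The second snippet, with a pound sign (written as the escape \u00a3) and the amount 150, is a made-up variant added only to show a symbol other than "$".

```python
import re

# Label-anchored pattern from the edit: any single non-word character
# may stand between the currency letters and the digits.
p = (r'<span class="TestCentreViewLabel">Test Fee: </span>'
     r'<span class="TestCentreViewBody">(\w*)[^\w\d](\d*)</span>')

# Fragment of the markup from the question.
t_aud = ('<span class="TestCentreViewLabel">Tel: </span>'
         '<span class="TestCentreViewBody">61 2 6201 2669</span><BR>'
         '<span class="TestCentreViewLabel">Test Fee: </span>'
         '<span class="TestCentreViewBody">AUD$330</span>')
# Hypothetical variant with a different symbol (value invented).
t_gbp = ('<span class="TestCentreViewLabel">Test Fee: </span>'
         '<span class="TestCentreViewBody">GBP\u00a3150</span>')

match_aud = re.search(p, t_aud)
match_gbp = re.search(p, t_gbp)

print(match_aud.groups())
print(match_gbp.groups())
```

Note that this pattern depends on the exact attribute quoting and spacing in the markup, so it is more brittle than walking the parsed tree.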
