python - Parsing HTML with BeautifulSoup in Python

Question

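I am trying to parse HTML in Python with BeautifulSoup, but I can't get at the data I need.

This is a small module of a personal app I want to build; it consists of a web login part that uses credentials. Once the script has logged in to the site, I need to parse some of the information so that I can manage and process it.

This is the HTML after logging in: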

<div class="widget_title clearfix">
        <h2>Account Balance</h2>
    </div>
    <div class="widget_body">
        <div class="widget_content">
            <table class="simple">
                <tr>
                    <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
                    <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
                        150
                    </td>
                </tr>
                <tr>
                    <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
                    <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
                        500                     </td>
                </tr>
                <tr>
                    <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td>
                    <td style="text-align: right; color: #119911; font-weight: bold;">
                        1500                        </td>
                </tr>
                <tr>
                    <td><a href="#" id="west4" title="Total expenses">Total expended</a></td>
                    <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
                        430                     </td>
                </tr>
                <tr>
                    <td><a href="#" id="west5" title="Total available">Account Balance</a></td>
                    <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
                        840                     </td>
                </tr>
                <tr>
                    <td></td>
                    <td style="padding: 5px;">
                        <center>
                            <form id="request_bill" method="POST" action="index.php?page=dashboard">
                                <input type="hidden" name="secret_token" value="" />
                                <input type="hidden" name="request_payout" value="1" />
                                <input type="submit" class="btn blue large" value="Request Payout" />
                            </form>
                        </center>
                    </td>
                </tr>
            </table>
        </div>
    </div>
</div>

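As you can see, this is not well-formatted HTML, but I need to extract the elements and their values, for example "Weekly Earnings" and "500"…

I think the "id" attribute should help, but it crashes when I try to parse it.

Here is the Python code I am working with: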

def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    par = parsed_html.find('td', attrs={'id':'west1'}).string
    print par

archivohtml is the HTML file saved after logging in to the site.

When I run the script I only get errors.

I also tried this:

def parseo(archivohtml):
    soup = BeautifulSoup()
    html = archivohtml
    parsed_html = soup(html)
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string
    print par

But the result is the same.

score 1 · Accepted Answer

The tag with id="west1" is an <a> tag, not a <td>. What you are looking for is the <td> tag that comes after this <a> tag:

import BeautifulSoup as bs

content = '''<div class="widget_title clearfix">
        <h2>Account Balance</h2>
    </div>
    <div class="widget_body">
        <div class="widget_content">
            <table class="simple">
                <tr>
                    <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
                    <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
                        150                         
                    </td>
                </tr>
                <tr>
                    <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
                    <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
                        500                     </td>
                </tr>
                <tr>
                    <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td>
                    <td style="text-align: right; color: #119911; font-weight: bold;">
                        1500                        </td>
                </tr>
                <tr>
                    <td><a href="#" id="west4" title="Total expenses">Total expended</a></td>
                    <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
                        430                     </td>
                </tr>
                <tr>
                    <td><a href="#" id="west5" title="Total available">Account Balance</a></td>
                    <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
                        840                     </td>
                </tr>
                <tr>
                    <td></td>
                    <td style="padding: 5px;">
                        <center>
                            <form id="request_bill" method="POST" action="index.php?page=dashboard">
                                <input type="hidden" name="secret_token" value="" />
                                <input type="hidden" name="request_payout" value="1" />
                                <input type="submit" class="btn blue large" value="Request Payout" />
                            </form>
                        </center>
                    </td>
                </tr>
            </table>
        </div>
    </div>
</div>'''

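def parseo(archivohtml):
    html = archivohtml
    parsed_html = bs.BeautifulSoup(html)
    # The id is on the <a> tag; the value sits in the <td> that follows it.
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')
    # Strip the whitespace surrounding the number.
    print par.string.strip()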

parseo(content)

yields

150
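
For reference, the same lookup can be sketched with the newer bs4 package on Python 3 and extended to collect every row. This is only a sketch under assumptions: it presumes bs4 is installed, it uses the standard-library "html.parser" backend, and the names ROWS_HTML and read_earnings are made up for illustration. It also shows the likely cause of the original crash: find('td', attrs={'id': 'west1'}) matches nothing and returns None, so accessing .string on it raises AttributeError.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the table from the question (hypothetical name).
ROWS_HTML = """
<table class="simple">
  <tr>
    <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
    <td style="text-align: right;">
        150
    </td>
  </tr>
  <tr>
    <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
    <td style="text-align: right;">
        500  </td>
  </tr>
</table>
"""


def read_earnings(html):
    """Map each link label (e.g. 'Weekly Earnings') to the text of the next <td>."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # The id attributes sit on the <a> tags, so search for those.
    for link in soup.find_all("a", id=re.compile(r"^west\d+$")):
        value_cell = link.find_next("td")
        if value_cell is not None:
            result[link.get_text(strip=True)] = value_cell.get_text(strip=True)
    return result


soup = BeautifulSoup(ROWS_HTML, "html.parser")
# No <td> carries id="west1", so find() returns None here.
missing = soup.find("td", attrs={"id": "west1"})
earnings = read_earnings(ROWS_HTML)
print(missing)
print(earnings)
```

Checking for None before touching .string or calling find_next is what keeps the script from crashing when the page layout changes.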

score 0 · Accepted Answer

I can't tell from your question whether this applies to you, but here is another approach:

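from bs4 import BeautifulSoup  # stripped_strings is a bs4 feature

def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    # stripped_strings yields each text node with surrounding whitespace removed
    for line in parsed_html.stripped_strings:
        print line.strip()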

This gives the following result:

Account Balance
Daily Earnings
150
Weekly Earnings
500
Monthly Earnings
1500
Total expended
430
Account Balance
840

If you need the data in a list, do this:

data = [line.strip() for line in parsed_html.stripped_strings]

[u'Account Balance', u'Daily Earnings', u'150', u'Weekly Earnings', u'500', u'Monthly Earnings', u'1500', u'Total expended', u'430', u'Account Balance', u'840']
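
If label/value pairs are more useful than a flat list, the list can be paired up afterwards. A minimal sketch, assuming the output always has the shape shown above (one heading followed by alternating labels and values); the name pair_up is hypothetical, and because the input is the plain list, no parser is needed here.

```python
def pair_up(data):
    """Skip the leading heading, then pair each label with the value after it."""
    labels = data[1::2]
    values = data[2::2]
    return list(zip(labels, values))


data = ['Account Balance', 'Daily Earnings', '150', 'Weekly Earnings', '500',
        'Monthly Earnings', '1500', 'Total expended', '430',
        'Account Balance', '840']
pairs = pair_up(data)
print(pairs)
```

This relies entirely on position, so it would silently mispair entries if a row were missing its value; looking up each value relative to its label is the sturdier option.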
