python - Detecting numbers in an HTML file using Python 3.2

Question

I have an HTML file that I want to parse using Python 3.2. Sample:

<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>
<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>

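My task is to detect the numbers that are not part of a tag (only 15 in this case) and save them to a separate text file. I'm new to this, so I can't decide which HTML parser to use (lxml, Beautiful Soup). Please tell me how to approach this problem. Thanks in advance.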

score 0 · Accepted Answer

You can do this very easily with BeautifulSoup. Use the find_all method to locate the elements, then process them.

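from bs4 import BeautifulSoup

# html_doc holds the HTML source as a string
soup = BeautifulSoup(html_doc, "html.parser")
tds = soup.find_all("td", "ln")  # every <td> whose class is "ln"
for td in tds:
    pass  # do something here, e.g. read td.get_text()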
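To make the find_all approach concrete, here is a minimal sketch that covers the whole task from the question: collect the numbers and write them to a text file. It assumes Beautiful Soup 4 (the bs4 package) with the built-in "html.parser" backend; the helper names extract_numbers and save_numbers and the output file name numbers.txt are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
SAMPLE = (
    '<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>\n'
    '<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>\n'
)


def extract_numbers(html_doc):
    """Return the text of every <td class="ln"> cell that contains only digits."""
    soup = BeautifulSoup(html_doc, "html.parser")
    numbers = []
    for td in soup.find_all("td", class_="ln"):
        text = td.get_text(strip=True)
        if text.isdecimal():  # skip cells holding anything other than digits
            numbers.append(text)
    return numbers


def save_numbers(numbers, path):
    """Write one number per line to a plain-text file."""
    with open(path, "w") as out:
        for number in numbers:
            out.write(number + "\n")


if __name__ == "__main__":
    # "numbers.txt" is a hypothetical output path.
    save_numbers(extract_numbers(SAMPLE), "numbers.txt")
```

Filtering on the class "ln" relies on the sample being representative of the real file; if numbers also appear in other cells, the find_all arguments would need to be widened.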

score 0 · Accepted Answer

You can try something like this.

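# Beautiful Soup 4 (bs4) is the release line that runs on Python 3
from bs4 import BeautifulSoup

def getPrintUnicode(soup):
    """Recursively collect the text content of a soup node."""
    body = ''
    if isinstance(soup, str):
        # Text node: replace any entities the parser left undecoded
        soup = soup.replace('&#39;', "'")
        soup = soup.replace('&quot;', '"')
        soup = soup.replace('&nbsp;', ' ')
        soup = soup.replace('\xa0', ' ')  # bs4 decodes &nbsp; to U+00A0
        soup = soup.replace('&gt;', '>')
        soup = soup.replace('&lt;', '<')
        body = body + soup
    else:
        # Tag node: descend into its children
        if not soup.contents:
            return ''
        for con in soup.contents:
            body = body + getPrintUnicode(con)
    return body

print(getPrintUnicode(BeautifulSoup('<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>', 'html.parser')))

You can call this getPrintUnicode() function on the soup of an entire page; it returns the complete text content. Use exception handling when converting the string to an integer. For example:

print(int(getPrintUnicode(BeautifulSoup('<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>', 'html.parser'))))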
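The answer mentions exception handling for the integer conversion without showing it. A minimal sketch of that step, using only the standard library, could look like the following; the helper name to_int and the sample cell strings are assumptions made for illustration.

```python
def to_int(text):
    """Return text as an int, or None when it is not a whole number."""
    try:
        # strip() also removes the non-breaking space (U+00A0) left by &nbsp;
        return int(text.strip())
    except ValueError:
        return None


# Hypothetical cell contents: two numbers, an empty cell and stray text.
cells = ["15", "15\xa0", "\xa0", "abc"]
numbers = [n for n in map(to_int, cells) if n is not None]
```

Returning None instead of raising lets the caller filter out non-numeric cells in one pass, which suits a page where most cells are not numbers.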
