python - How to remove tags from HTML using robobrowser

Question

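http://robobrowser.readthedocs.org/en/latest/readme.html is a new Python library based on the Beautiful Soup library. With some help, I have managed to return an HTML page from inside a Django app, but I can't figure out how to strip the tags and keep only the text. My Django app contains the following: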

def index(request):    

    from django.utils.html import strip_tags
    p=str(request.POST.get('p', False)) # p='https://www.yahoo.com/'
    browser = RoboBrowser(history=True)
    browser.open(p)
    html = browser.response
    stripped = strip_tags(html)
    return HttpResponse(stripped )

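Looking at the output HTML, I can see that it is identical to the original HTML. Also, I don't think robobrowser has Beautiful Soup's text() method.

I also tried the following (from "Python code to remove HTML tags from a string"):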

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""    

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

Same result! How can I remove the HTML tags and return just the text?

score 2 · Accepted Answer

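BeautifulSoup provides the get_text() method for extracting text from a parsed HTML document (somewhat confusingly, this is equivalent to the getText method and the text property). You can access the parsed HTML of the current page using browser.parsed. So, to get the plain text of the current page, try: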

text = browser.parsed.get_text()
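
As a minimal sketch of what that call does, the example below runs get_text() on a small hand-written page instead of a live RoboBrowser session. It assumes, as stated above, that browser.parsed is an ordinary BeautifulSoup object; SAMPLE_HTML and the helper name html_to_text are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for whatever browser.open() fetched.
SAMPLE_HTML = """
<html>
  <head><title>Demo</title><style>p { color: red; }</style></head>
  <body>
    <h1>Hello</h1>
    <p>Plain <b>text</b> only.</p>
    <script>console.log("ignored");</script>
  </body>
</html>
"""


def html_to_text(soup):
    """Return the visible text of a parsed BeautifulSoup document."""
    # get_text() also returns the contents of <script> and <style>,
    # so drop those elements first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " keeps words from adjacent tags apart;
    # strip=True trims each text fragment and skips whitespace-only ones.
    return soup.get_text(separator=" ", strip=True)


# With RoboBrowser, browser.parsed would be passed here instead (assumption).
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
plain_text = html_to_text(soup)
print(plain_text)
```

In the Django view from the question, the same helper could presumably be called as html_to_text(browser.parsed) and its result passed to HttpResponse, rather than calling strip_tags on browser.response.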

score 1 · Accepted Answer

I prefer to use bleach.

Here is a code example:

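import bleach  # the package name is lowercase

# Assumes `result` is a parsed BeautifulSoup document; 'className' is a placeholder.
# bleach.clean() expects a string, so the find_all() result list is converted with str() first.
varName = bleach.clean(str(result.find_all(class_='className')),
                       strip=True        # remove disallowed tags instead of escaping them
                       ).strip('[])')    # trim the list brackets left over from str()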
