python - Parsing HTML with BeautifulSoup

Question

(The picture is small. Here is another link: http://i.imgur.com/OJC0A.png )

I'm trying to extract the review text at the bottom. I tried this:

y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text

The problem is that the unexpanded div tags contain unwanted text, and removing it from the review content is tedious. For the life of me, I can't figure this out. Can someone please help?

EDIT: The HTML is:

<div style="margin-left:0.5em;">
    <div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
    <div style="margin-bottom:0.5em;">
    <div style="margin-bottom:0.5em;">
    <div class="tiny" style="margin-bottom:0.5em;">
        <b>
    </div>
    That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.

And the div tag just above the text is:

<div class="tiny" style="margin-bottom:0.5em;">
    <b>
        <span class="h3color tiny">This review is from: </span>
        <a href="https://rads.stackoverflow.com/amzn/click/com/B005C7QVUE" rel="nofollow noreferrer">A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
    </b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.

score 2 · Accepted Answer

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings suggests that .strings is what you want. It is a generator that yields each string inside the object in turn. So if you turn it into a list and take the last item, you should get what you want. For example:

$ python
>>> import bs4
>>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
>>> soup = bs4.BeautifulSoup(text, "html.parser")
>>> soup.find_all("div", style="mine")[0].text
'unwantedwanted'
>>> list(soup.find_all("div", style="mine")[0].strings)[-1]
'wanted'
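
On markup shaped like the one in the question, the same idea might look like the sketch below. It swaps .strings for .stripped_strings, so whitespace-only text nodes between the nested divs are skipped. The HTML here is a trimmed-down, hypothetical stand-in for the real page, and the approach assumes the review is the last piece of text inside the outer div.

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical version of the markup from the question.
html = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">9 of 35 people found the following review helpful</div>
  <div class="tiny" style="margin-bottom:0.5em;"><b>This review is from: <a href="#">A Dance with Dragons</a></b></div>
  That is true. I tried it myself this morning.
</div>"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", style="margin-left:0.5em;")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are whitespace only, so the last item is the review body
# (assuming nothing else follows the review inside the container).
pieces = list(container.stripped_strings)
review = pieces[-1]
print(review)
```

If anything else follows the review inside the container (a footer link, for instance), the last item would no longer be the review, so this is only as robust as that assumption.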

score 2 · Accepted Answer

To get the text that trails div.tiny:

review = soup.find("div", "tiny").find_next_sibling(text=True)

Full example:

#!/usr/bin/env python
from bs4 import BeautifulSoup

html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
   9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
    <span class="h3color tiny">This review is from: </span>
    <a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
     A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""

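# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# The first text node among the siblings that follow div.tiny is the review
review = soup.find("div", "tiny").find_next_sibling(text=True)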
print(review)

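Output

That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.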
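
If the page holds several reviews, the sibling lookup could be wrapped in a small helper along these lines. This is only a sketch: extract_reviews and the sample markup are hypothetical, and it assumes every review follows a div.tiny the way the one in the question does. string=True is the newer spelling of text=True (Beautiful Soup 4.4 and later).

```python
from bs4 import BeautifulSoup


def extract_reviews(html):
    """Return the text that directly follows each div.tiny (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for tiny in soup.find_all("div", "tiny"):
        # string=True restricts the search to text nodes among the following siblings
        text = tiny.find_next_sibling(string=True)
        # Skip headers that have no trailing text or only whitespace after them
        if text is not None and text.strip():
            reviews.append(text.strip())
    return reviews


# Made-up markup with two reviews, each trailing its own div.tiny.
sample = (
    '<div><div class="tiny"><b>This review is from: Book A</b></div>First review.</div>'
    '<div><div class="tiny"><b>This review is from: Book B</b></div>Second review.</div>'
)
reviews = extract_reviews(sample)
print(reviews)
```

The None check matters because find_next_sibling returns None when a div.tiny has no text after it, and calling .strip() on that would raise.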

Here is the equivalent lxml code that produces the same output:

import lxml.html

doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
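
The snippet above reuses the html string from the full example. A self-contained variant, with a shortened, made-up copy of the markup and a guard for a missing element or empty tail, might look like this:

```python
import lxml.html

# Shortened, hypothetical copy of the markup from the question.
html = (
    '<div style="margin-left:0.5em;">'
    '<div class="tiny"><b>This review is from: A Dance with Dragons</b></div>'
    'That is true. I tried it myself this morning.'
    '</div>'
)

doc = lxml.html.fromstring(html)
tiny = doc.find(".//div[@class='tiny']")

# .tail is the text between this element's closing tag and the next tag.
# It is None when nothing follows, so guard before stripping.
review = (tiny.tail or "").strip() if tiny is not None else ""
print(review)
```

Note that @class='tiny' is an exact match on the attribute value, so a div with class="tiny other" would not be found by this expression.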
