python - Get string without tag

Question

I have the following HTML code:

<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>

I am currently using Python and trying to retrieve the date with BeautifulSoup. This is what I expect:

19th ~ 21st Apr 2013

I tried:

find("div", {"class":"content"}).text

Output:

Pretty Beauty Fiesta -1st Avenue Mall!19th ~ 21st Apr 2013

and:

find("div", {"class":"content"}).div.nextSibling

Output:

I also tried calling nextSibling repeatedly to reach the rest of the content, but I could not retrieve "st Apr 2013" correctly.

How can I get the data I need? Thank you.

score 0 · Accepted Answer

How about this? Use element.nextSiblingGenerator to walk through the elements that follow the div you are interested in, ignoring the None at the end.

d = s.find('div', {'class':'content'}).div

def all_text_after(element):
    for item in element.nextSiblingGenerator():
        if not item:
            continue
        elif hasattr(item, 'contents'):
            for c in item.contents:
                yield c
        else:
            yield item

text_parts = list(all_text_after(d))
# -> [u'\n    19', u'th', u' ~ 21', u'st', u' Apr 2013\n']

print ''.join(text_parts)
# ->     19th ~ 21st Apr 2013
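The snippet above is written for Python 2 and the older camelCase API. A minimal sketch of the same idea for Python 3 follows, assuming the bs4 package and the built-in "html.parser" backend; the names HTML, title_div and date_text are illustrative and do not come from the answer.

```python
from bs4 import BeautifulSoup, Tag

# The markup from the question, embedded so the example is self-contained.
HTML = """<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>"""


def all_text_after(element):
    """Yield the text of every sibling node that follows `element`."""
    for item in element.next_siblings:
        if isinstance(item, Tag):
            # A tag such as <sup>th</sup>: take the text nested inside it.
            yield item.get_text()
        else:
            # A bare string sitting directly in the parent div.
            yield str(item)


soup = BeautifulSoup(HTML, "html.parser")
title_div = soup.find("div", {"class": "content"}).div
date_text = "".join(all_text_after(title_div)).strip()
print(date_text)
```

Stripping once at the end removes the newline and indentation around the date while leaving the spaces inside it untouched.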

score 0 · Accepted Answer

What you want is all the text in the div that follows a specific tag.

You want to use .next_siblings in a loop here:

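content_div = soup.find('div', class_='content')
text = []
# Walk every node that follows the inner title div.
for elem in content_div.div.next_siblings:
    try:
        # Tags contribute all of their nested strings.
        text.extend(elem.strings)
    except AttributeError:
        # Fall back for nodes that have no .strings attribute.
        text.append(elem)
# Join without a separator so that '19' and 'th' stay together.
text = ''.join(text).strip()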

.next_siblings is a generator that simply yields the chain of .next_sibling attributes, including NavigableString elements.

Result:

>>> ''.join(text).strip()
u'19th ~ 21st Apr 2013'
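To see which nodes the loop receives, the siblings can be listed together with their types. This is a small sketch on a shortened version of the markup, assuming bs4 with the "html.parser" backend; the variables html and kinds are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened markup with no whitespace between nodes, for readability.
html = (
    '<div class="content"><div class="title">Title</div>'
    '19<sup>th</sup> Apr 2013</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Everything that follows the inner title div, in document order.
siblings = soup.find("div", class_="content").div.next_siblings

# Pair each node's class name with its string form.
kinds = [(type(node).__name__, str(node)) for node in siblings]
for kind, value in kinds:
    print(kind, repr(value))
```

Bare text shows up as NavigableString and the sup element as Tag, which is why the loop has to handle both kinds of node.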

How to handle whitespace here is a little tricky. For this particular example, stripping afterwards works best, but in other cases using elem.stripped_strings and elem.strip() may work well too.
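The difference between the two whitespace strategies can be sketched as follows, assuming bs4 with the "html.parser" backend. The helper pieces and the variables strip_at_end and strip_each are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup, Tag

html = """<div class="content">
    <div class="title">Title</div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>"""
soup = BeautifulSoup(html, "html.parser")
title_div = soup.find("div", class_="content").div


def pieces(node):
    """Return the raw strings held by a tag or by a bare string node."""
    return list(node.strings) if isinstance(node, Tag) else [str(node)]


parts = []
for elem in title_div.next_siblings:
    parts.extend(pieces(elem))

# Strategy 1: keep the inner whitespace and strip once at the end.
strip_at_end = "".join(parts).strip()

# Strategy 2: strip every piece first, then join with single spaces.
strip_each = " ".join(p.strip() for p in parts)

print(repr(strip_at_end))
print(repr(strip_each))
```

Stripping each piece separates the ordinal suffixes from their numbers, so stripping once at the end suits this markup better. Per-piece stripping fits documents where every piece is a separate word or field.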
