python - Parsing HTML with Python BeautifulSoup to get text

Question

I have an HTML page in the following format:

<section class="entry-content">
    <p>...</p>
    <p>...</p>
    <p>...</p>
</section>

I am trying to use BeautifulSoup/Python to get the text contained in the <p> tags. This is what I have so far, but I don't know how to "dig into" the tags to get the text. Any suggestions are welcome.

import urllib2
from BeautifulSoup import BeautifulSoup

def main():
    url = 'URL'
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup(data)

    ingreds = bs.find('section', {'class': 'entry-content'})

    fname = 'most.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()

score 2 · Accepted Answer

To "dig into" the tags, you can use the .stripped_strings iterable to get the text out of each one:

section = bs.find('section', {'class': 'entry-content'})
ingreds = [' '.join(ch.stripped_strings) for ch in section.find_all(True)]

Use .find_all(True) to loop over only the tags contained in section, rather than its direct text content (such as the newlines between tags).
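As a quick check, the snippet above can be run against a small inline document. This is a minimal sketch, assuming Python 3 and the bs4 package (the question's code uses the older BeautifulSoup 3 import and urllib2); the sample HTML and its ingredient text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the real URL in the question.
html = """
<section class="entry-content">
    <p>2 cups flour</p>
    <p>1 tsp salt</p>
    <p>3 eggs</p>
</section>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
bs = BeautifulSoup(html, "html.parser")

section = bs.find("section", {"class": "entry-content"})

# One entry per tag inside the section; the whitespace between tags is skipped
# because find_all(True) returns tags only, never bare strings.
ingreds = [" ".join(ch.stripped_strings) for ch in section.find_all(True)]

print(ingreds)  # ['2 cups flour', '1 tsp salt', '3 eggs']
```

Because ingreds is now a list of plain strings, '\n'.join(ingreds) can be written to a file as the question intended.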

Note that .find_all(True) also descends into nested tags, which can lead to duplicated strings. The following loops over only the direct child tags of section:

ingreds = [' '.join(ch.stripped_strings) for ch in section if hasattr(ch, 'stripped_strings')]
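The duplication shows up when a paragraph contains another tag. The sketch below again assumes Python 3 with bs4 and uses made-up sample HTML. To restrict the loop to direct children it swaps in find_all(True, recursive=False) instead of the hasattr check. That substitution is my own choice, not part of the answer: depending on the Beautiful Soup version, plain strings may also expose .stripped_strings, in which case the hasattr filter would let whitespace-only children through as empty entries.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first paragraph contains a nested <b> tag.
html = """
<section class="entry-content">
    <p>2 cups <b>flour</b></p>
    <p>3 eggs</p>
</section>
"""

bs = BeautifulSoup(html, "html.parser")
section = bs.find("section", {"class": "entry-content"})

# Every descendant tag is visited, so the <b> text appears twice:
# once as part of its parent <p> and once on its own.
all_tags = [" ".join(t.stripped_strings) for t in section.find_all(True)]

# Only the direct child tags of the section are visited.
direct = [
    " ".join(t.stripped_strings)
    for t in section.find_all(True, recursive=False)
]

print(all_tags)  # ['2 cups flour', 'flour', '3 eggs']
print(direct)    # ['2 cups flour', '3 eggs']
```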
