python - Extracting text between elements with Python BeautifulSoup

Question

I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

But this returns all the text inside all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

score 50 · Accepted Answer

Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree contains Tags and NavigableStrings (such as THIS IS MY TEXT in the question). An example:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

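To move down the parse tree, use contents and string.

contents is an ordered list of the Tag and NavigableString objects contained within a page element.
If a tag has only one child node, and that child node is a string, the child node is made available as tag.string as well as tag.contents[0].

For the document above, that means you can get: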

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
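
As a small sketch of the single-child rule above, the following assumes the newer bs4 package on Python 3 with the built-in html.parser (the answer itself uses the older BeautifulSoup 3 import). It shows that string is only populated when a tag has exactly one child; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# One paragraph taken from the answer's example document.
doc = '<p id="firstpara">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(doc, "html.parser")

bold = soup.b
print(bold.string)        # the single string child of <b>
print(bold.contents[0])   # the same node, reached through contents

paragraph = soup.p
print(paragraph.string)   # None, because <p> has more than one child
```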

With several child nodes you get, for instance:

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

So here you can work with contents and pick out the content at the index you need.
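
Applied to the question's markup, the index-based approach might look like the sketch below. It assumes bs4 on Python 3 with the built-in html.parser, and the HTML is a trimmed-down copy of the question's snippet. The index 6 is specific to this exact markup and parser, so it would shift if the surrounding nodes changed.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup (assumption: same node order).
html = """<table><td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td></table>"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="MYCLASS")

# List every direct child so the right index can be read off.
for index, node in enumerate(td.contents):
    print(index, type(node).__name__, repr(node))

# In this markup the bare text is the node right after the first <p>.
wanted = td.contents[6].strip()
print(wanted)
```

Printing the indexed listing first is a cheap way to see which whitespace-only strings and comments sit between the tags before hard-coding an index.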

You can also iterate over a Tag directly, which is a shortcut. For example:

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
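
Iteration can also be used to avoid a hard-coded index. The sketch below, which assumes bs4 on Python 3 with html.parser and a trimmed-down copy of the question's markup, walks the direct children of the cell and keeps only the bare strings. Comment is a subclass of NavigableString in bs4, so it has to be excluded explicitly.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Trimmed-down copy of the question's markup.
html = """<table><td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td></table>"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="MYCLASS")

# Keep direct string children only: skip tags, comments,
# and strings that are nothing but whitespace.
direct_text = [
    child.strip()
    for child in td.children
    if isinstance(child, NavigableString)
    and not isinstance(child, Comment)
    and child.strip()
]
print(direct_text)
```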

score 11 · Accepted Answer

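With your own soup object:

soup.p.next_sibling.strip()

You grab the <p> directly with soup.p* (this relies on it being the first <p> in the parse tree).
Then use next_sibling on the tag object that soup.p returns, since the desired text is nested at the same level of the parse tree as the <p>.
.strip() is just a Python str method that removes leading and trailing whitespace.

*Otherwise, find the element using the filter(s) of your choice.

In the interpreter this looks something like: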

In [4]: soup.p
Out[4]: <p>something</p>

In [5]: type(soup.p)
Out[5]: bs4.element.Tag

In [6]: soup.p.next_sibling
Out[6]: u'\n      THIS IS MY TEXT\n      '

In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString

In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'

In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
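
When the target <p> is not the first one in the document, the footnote's suggestion of using a filter could be sketched as follows. This assumes bs4 on Python 3 with html.parser and a trimmed-down copy of the question's markup; the extra leading paragraph is added here only to show why soup.p alone would not be enough.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup, with one extra <p>
# in front (hypothetical) so that soup.p would hit the wrong element.
html = """<p>unrelated paragraph</p>
<table><td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td></table>"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the cell first, then take its first <p>.
td = soup.find("td", class_="MYCLASS")
first_p = td.find("p")

# The bare text is the node immediately following that <p>.
text = first_p.next_sibling.strip()
print(text)
```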
