python - BeautifulSoup: grab only once within a given tag

Question

I want to grab a parent tag if it contains a marker, say MARKER. For example, I have:

<a>
 <b>
  <c>
  MARKER
  </c>
 </b>
 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>
 <b>
  <c>
  stuff
  </c>
 </b>
</a>

I want to grab:

 <b>
  <c>
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

My current code is:

# One iteration per matching text node, not per <b> tag
for stuff in soup.find_all(text=re.compile("MARKER")):
    post = stuff.find_parent("b")
    print(post)

This works, but it yields:

 <b>
  <c>
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

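It is obvious why this happens: for every MARKER found, I print the whole containing tag once, so a tag containing two MARKERs is printed twice. However, I don't know how to tell BeautifulSoup not to search inside a given tag once it is done with it (I suspect that specifically can't be done), or how to prevent this in some other way, short of indexing everything into a dictionary and rejecting duplicates.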
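The "reject duplicates" idea mentioned above can be sketched without a dictionary by remembering which parent tags have already been collected. This is a minimal illustration under stated assumptions, not a definitive fix: the sample HTML, the `<br/>` that splits the text into two nodes, and the helper name `unique_parents` are all made up for the example. It keys on `id()` because, as far as I know, bs4 compares tags by their content, so two distinct but identical-looking tags could otherwise be collapsed into one.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: the <br/> splits the second <c> into two text nodes,
# so a plain text search reports that <b> twice.
HTML = """
<a>
 <b><c>MARKER</c></b>
 <b><c>MARKER<br/>MARKER</c></b>
 <b><c>stuff</c></b>
</a>
"""

def unique_parents(soup, pattern, parent_name):
    """Return each enclosing parent_name tag once, in document order."""
    seen = set()      # ids of parents that were already collected
    parents = []
    for text_node in soup.find_all(string=re.compile(pattern)):
        parent = text_node.find_parent(parent_name)
        if parent is None or id(parent) in seen:
            continue
        seen.add(id(parent))
        parents.append(parent)
    return parents

soup = BeautifulSoup(HTML, "html.parser")
for post in unique_parents(soup, "MARKER", "b"):
    print(post)
```

The collected tags stay referenced in the returned list, so their `id()` values remain valid for the duration of the loop.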

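Edit: Here is the specific case I am working on, since for some reason the example above, despite being a stripped-down version, does not actually reproduce the problem. (In case anyone is curious, it is a particular forum thread that I am fetching post by post.)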

from bs4 import BeautifulSoup
import urllib.request
import re

url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'
html = urllib.request.urlopen(url).read()
# Name the parser explicitly so the result does not depend on what is installed
sbsoup = BeautifulSoup(html, "html.parser")

# One hit per matching text node, so a post with several markers is printed repeatedly
for stuff in sbsoup.find_all(text=re.compile(r"\[[Xx]\]")):
    post = stuff.find_parent("li")
    print(post.find("a", class_="username").string)
    print(post.find("blockquote", class_="messageText ugc baseHtml").get_text())

score 0 · Accepted Answer

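I wrote this with bs3; it may also work with bs4, but the concept is the same either way. Basically, the li tag holds the username under its "data-author" attribute, so there is no need to find a lower tag and then look for its parent li.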
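As a rough sketch of reading the username straight from the li, assuming bs4 and a made-up snippet of markup (only the `data-author` attribute name comes from the thread; the class names and authors are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the data-author attribute name comes from the thread.
HTML = """
<ol>
 <li class="message" data-author="alice"><blockquote>[X] Plan A</blockquote></li>
 <li class="sectionMain">not a post</li>
 <li class="message" data-author="bob"><blockquote>hello</blockquote></li>
</ol>
"""

soup = BeautifulSoup(HTML, "html.parser")
# attrs={"data-author": True} matches any <li> that carries the attribute at all.
posts = soup.find_all("li", attrs={"data-author": True})
authors = [li["data-author"] for li in posts]
print(authors)
```

Each li is visited exactly once here, so nothing has to be walked back up from a nested tag.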

You seem to be interested only in the blockquote tags that contain your "marker", so why not specify exactly that?

Lambda functions are generally the most versatile way to query BeautifulSoup.
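To illustrate the idea on something that runs offline, here is a small sketch assuming bs4; the markup, the author names, and the predicate name `has_marker` are hypothetical. A named function works the same way as a lambda when passed to `find_all`: it is called once per tag, so a blockquote with several markers is still returned only once.

```python
import re
from bs4 import BeautifulSoup

MARKER = re.compile(r"\[[Xx]\]")

# Hypothetical markup: the first post contains the marker twice.
HTML = """
<ol>
 <li data-author="alice"><blockquote>[X] Plan A<br/>[x] Plan B</blockquote></li>
 <li data-author="bob"><blockquote>no vote</blockquote></li>
</ol>
"""

def has_marker(tag):
    """True for a <blockquote> whose text contains the marker at least once."""
    return tag.name == "blockquote" and bool(MARKER.search(tag.get_text()))

soup = BeautifulSoup(HTML, "html.parser")
hits = soup.find_all(has_marker)
for blockquote in hits:
    # Walk up to the enclosing post only to read the author name
    print(blockquote.find_parent("li")["data-author"])
    print(blockquote.get_text(" "))
```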

# Import system libraries
import re
import urllib.request

# Import third-party libraries
from bs4 import BeautifulSoup

# The URL of the thread page to be searched
url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'
# Create a request object
request = urllib.request.Request(url)

# Attempt to open the request and read the response
try:
    response = urllib.request.urlopen(request)
    the_page = response.read()
except Exception:
    the_page = ""

# If a response was received, create a BeautifulSoup from it
if the_page:
    soup = BeautifulSoup(the_page, "html.parser")

    # Define the search predicates for the desired tags.
    # In bs4 the class attribute is a list of class names, so test membership.
    li_location = lambda x: x.name == "li" and "message" in x.get("class", [])
    x_location = lambda x: x.name == "blockquote" and bool(re.search(r"\[[Xx]\]", x.text))

    # Iterate through all the found lis
    for li in soup.find_all(li_location):
        # Print the author name (None if the attribute is missing)
        print(li.get("data-author"))
        # Iterate through the blockquotes in this li that contain the marker
        for xs in li.find_all(x_location):
            # Print the text of the found blockquote
            print(xs.text)
        print("")
