python - Removing nested tags with BeautifulSoup

Question

I'm trying to write a generic scraper using BeautifulSoup. To do that, I'm trying to detect the tags in which text is directly available.

Consider the following example:

<body>
<div class="c1">
    <div class="c2">
        <div class="c3">
            <div class="c4">
                <div class="c5">
                    <h1> A heading for section </h1>
                </div>
                <div class="c5">
                    <p> Some para </p>
                </div>
                <div class="c5">
                    <h2> Sub heading </h2>
                    <p> <span> Blah Blah </span> </p>
                </div>
            </div>
        </div>
    </div>
</div>
</body>

My goal here is to extract the div with class c4, since it holds all the text content. The divs before it, c1 through c3, are just wrappers as far as I'm concerned.

One way I came up with to identify the node is:

if node.find(re.compile("^h[1-6]"), recursive=False) is not None:
    return node.parent.parent

But this is too specific to this one case.

Is there an optimized way to find text at a single level of recursion? That is, if I do something like

node.find(text=True, recursion_level=1)

it should return the text while considering only the direct children.
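As far as I know, find() has no depth argument such as recursion_level; what it offers is recursive=False, which limits the search to direct children. A minimal sketch of "text from direct children only", iterating over .children instead. The helper name direct_text and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(node):
    """Return the non-blank strings that are direct children of node."""
    # Note: Comment is a NavigableString subclass, so comments are included too.
    return [
        child.strip()
        for child in node.children
        if isinstance(child, NavigableString) and child.strip()
    ]


soup = BeautifulSoup(
    "<div>intro <p>Some <span>Blah</span> para</p></div>", "html.parser"
)
print(direct_text(soup.div))  # ['intro']
print(direct_text(soup.p))    # ['Some', 'para']
```

Text inside the nested span is deliberately left out, because only the node's own children are inspected.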

Here is my solution so far; I'm not sure whether it holds for every case.

def check_for_text(node):
    return node.find(text=True, recursive=False)

def check_1_level_depth(node):
    if check_for_text(node):
        return check_for_text(node)

    return map(check_for_text, node.children)

In the code above, node is the soup element currently being checked, i.e. a div, span, etc. Please assume that I handle all exceptions in check_for_text() (AttributeError: 'NavigableString').
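A hedged variant of the same idea that skips string children up front, so nothing has to be caught, and that ignores whitespace-only strings. The names has_direct_text and text_tags_one_level are hypothetical, and the markup is one c5 block from the example written without indentation:

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def has_direct_text(tag):
    """True if tag has a non-whitespace string as a direct child."""
    return any(
        isinstance(child, NavigableString) and child.strip()
        for child in tag.children
    )


def text_tags_one_level(node):
    """Return [node] if it holds text directly, else its child tags that do."""
    if has_direct_text(node):
        return [node]
    return [
        child for child in node.children
        if isinstance(child, Tag) and has_direct_text(child)
    ]


soup = BeautifulSoup(
    '<div class="c5"><h2> Sub heading </h2>'
    "<p> <span> Blah Blah </span> </p></div>",
    "html.parser",
)
print([t.name for t in text_tags_one_level(soup.div)])  # ['h2']
```

The p is not reported because its only real text sits one level further down, inside the span. That is the kind of case a fixed one-level check can miss.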

score 0 · Accepted Answer

I think what you need is something like this:

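from bs4 import BeautifulSoup

# html is assumed to hold the markup from the question
bs = BeautifulSoup(html, "html.parser")
tags = bs.find_all()

previous_elements = []
found_element = None

# Walk the tags in document order and stop at the first one that has a .string
for tag in tags:
    if not tag.string:
        previous_elements.append(tag)
    else:
        found_element = tag
        break

print("previous:")
for tag in previous_elements:
    print(tag.attrs)

print("found:")
print(found_element)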

Output:

previous:
{}
{'class': ['c1']}
{'class': ['c2']}
{'class': ['c3']}
{'class': ['c4']}
{'class': ['c5']}
found:
<h1> A heading for section </h1>
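If the aim is the c4 wrapper itself rather than the first tag with text, one possible approach (not part of the answer above) is to take the deepest tag that is an ancestor of every non-blank string. A rough sketch, with text_container as a hypothetical helper name and the question's markup compacted:

```python
from bs4 import BeautifulSoup

HTML = """
<body><div class="c1"><div class="c2"><div class="c3"><div class="c4">
  <div class="c5"><h1> A heading for section </h1></div>
  <div class="c5"><p> Some para </p></div>
  <div class="c5"><h2> Sub heading </h2><p> <span> Blah Blah </span> </p></div>
</div></div></div></div></body>
"""


def text_container(soup):
    """Return the deepest tag that contains every non-blank string."""
    # Parent tag of each non-blank text node.
    holders = [s.parent for s in soup.find_all(string=True) if s.strip()]
    if not holders:
        return None
    # Ancestor chain of the first holder, nearest first, including itself.
    chain = [holders[0]] + list(holders[0].parents)
    for tag in chain:
        # Compare with "is": == on tags compares markup, not identity.
        if all(tag is h or any(p is tag for p in h.parents) for h in holders):
            return tag
    return None


soup = BeautifulSoup(HTML, "html.parser")
print(text_container(soup)["class"])  # ['c4']
```

This assumes all the interesting text lives under one subtree; a stray string elsewhere on the page (a footer, say) would pull the result up to a higher wrapper.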
