I have the following HTML code:

<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>

How can I extract only the text between the <li> tag and the <dl> tag?

I tried this:

from bs4 import BeautifulSoup

s = """<ol>
    <li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
    <dl>
    <dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
    </dl>
    </li>
    </ol>
"""

soup = BeautifulSoup(s, 'html.parser')

for line in soup.find_all('ol'):
    print(line.li.get_text())

This prints:

If someone is able to do something, they can do it.

I'm busy today, so I won't be able to see you.

I only want the first line:

If someone is able to do something, they can do it.

1 Answer

Loop over the descendants of the line.li object, collect every NavigableString text object, and stop as soon as you encounter the <dl> tag:

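from bs4 import NavigableString

for line in soup.find_all('ol'):
    result = []
    # Walk every node nested under the first <li>, in document order.
    for descendant in line.li.descendants:
        if isinstance(descendant, NavigableString):
            # Text node: keep it, minus surrounding whitespace.
            result.append(str(descendant).strip())
        elif descendant.name == 'dl':
            # Reached the <dl> tag: everything after it is unwanted.
            break

    print(' '.join(result))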

Demo:

>>> for line in soup.find_all('ol'):
...     result = []
...     for descendant in line.li.descendants:
...         if isinstance(descendant, NavigableString):
...             result.append(str(descendant).strip())
...         elif descendant.name == 'dl':
...             break
...     print(' '.join(result))
... 
If someone is able to do something, they can do it.
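
For reuse, the same loop could be wrapped in a small function. This is only a sketch: the helper name text_before and the HTML constant are made up for illustration and are not part of Beautiful Soup. It also joins the raw strings first and collapses whitespace afterwards, which is a small variation on the loop above (my choice, not a requirement) that avoids a stray space when inline markup is followed directly by punctuation.

```python
from bs4 import BeautifulSoup, NavigableString

# Same markup as in the question.
HTML = """<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>"""


def text_before(element, stop_tag):
    """Hypothetical helper: text of `element` up to its first `stop_tag` descendant."""
    parts = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            parts.append(str(descendant))
        elif descendant.name == stop_tag:
            break
    # Join the raw strings, then collapse runs of whitespace to single spaces.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(HTML, "html.parser")
first_line = text_before(soup.ol.li, "dl")
print(first_line)  # prints: If someone is able to do something, they can do it.
```

If the element contains no stop_tag at all, the loop simply runs to the end and the function returns all of the element's text.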

If you want to do this for all <li> tags (not just the first one), loop over the <li> tags found with .find_all() instead:

for line in soup.find_all('ol'):
    # Handle every <li> in the list, not only the first one.
    for item in line.find_all('li'):
        result = []
        for descendant in item.descendants:
            if isinstance(descendant, NavigableString):
                result.append(str(descendant).strip())
            elif descendant.name == 'dl':
                # Stop collecting text at this item's <dl> tag.
                break

        print(' '.join(result))
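
As a rough, self-contained illustration of the multi-item case, the sketch below runs the same idea over a list with two <li> entries. The sample HTML and the helper name leading_text are invented for this example. Using itertools.takewhile in place of an explicit break is just one possible way to write the loop, not something Beautiful Soup requires.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, NavigableString, Tag

# Invented sample data: two list items, each with a trailing <dl>.
HTML = """<ol>
<li>First <b>definition</b>.
<dl><dd>First example.</dd></dl>
</li>
<li>Second definition.
<dl><dd>Second example.</dd></dl>
</li>
</ol>"""


def leading_text(item, stop_tag="dl"):
    """Hypothetical helper: text of `item` before its first `stop_tag` descendant."""
    # Take nodes in document order until the stop tag shows up.
    before_stop = takewhile(
        lambda node: not (isinstance(node, Tag) and node.name == stop_tag),
        item.descendants,
    )
    strings = [str(node) for node in before_stop if isinstance(node, NavigableString)]
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join("".join(strings).split())


soup = BeautifulSoup(HTML, "html.parser")
lines = [leading_text(item) for item in soup.find_all("li")]
print(lines)  # prints: ['First definition.', 'Second definition.']
```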
Answered 2013-09-09T11:52:15.153