python - Extracting data from HTML content

Question

I want to download some HTML pages and extract information from them. Each HTML page contains a table tag like this:

<table class="sobi2Details" style='background-image: url(http://www.imd.ir/components/com_sobi2/images/backgrounds/grey.gif);border-style: solid; border-color: #808080' >
    <tr>
        <td><h1>Dr Jhon Doe</h1></td>
    </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td>
          <div id="sobi2outer">
             <br/>
             <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
             <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
             <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
          </div>
        </td>
    </tr>
</table>

I want to access the name (Jhon), family (Doe), and tel (33727464). I used Beautiful Soup and accessed these span tags by id:

name=soup.find(id="sobi2Details_field_name").__str__()
family=soup.find(id="sobi2Details_field_family").__str__()
tel=soup.find(id="sobi2Details_field_tel1").__str__()

But I don't know how to extract the data inside these tags. I tried the children and content attributes, but when I use the result as a tag, it returns None:

name=soup.find(id="sobi2Details_field_name")
for child in name.children:
    pass  # process content inside

But I get this error:

'NoneType' object has no attribute 'children'

When I call str() on it, it is not None! Any ideas?

Edit: my final solution

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, from_encoding="utf-8")
name_span = str(soup.find(id="sobi2Details_field_name"))
name = name_span.split(':')[-1]
result = re.sub('</span>', '', name)

score 3 · Accepted Answer

I found a few ways to do it.

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(path_to_html_file))

name_span = soup.find(id="sobi2Details_field_name")

# First way: split text over ':'
# This only works because there's always a ':' before the target field
name = name_span.text.split(':')[1]

# Second way: iterate over the span strings
# The element you look for is always the last one
name = list(name_span.strings)[-1]

# Third way: iterate over 'next' elements
name = name_span.next.next.next # you can create a function to do that, it looks ugly :)
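
For reference, here is a minimal self-contained sketch of the second approach, run against a trimmed copy of the markup from the question. The helper name field_value and the choice of the "html.parser" backend are assumptions made for illustration, not part of the original code. The helper also checks for None, since find() returns None when no element matches, and strips the leading space that appears before some values.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html = """
<div id="sobi2outer">
  <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
  <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
  <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""

# "html.parser" is the parser bundled with Python (an assumption; any backend should do)
soup = BeautifulSoup(html, "html.parser")


def field_value(soup, field_id):
    """Hypothetical helper: return the text that follows the label span, or None."""
    span = soup.find(id=field_id)
    if span is None:  # find() returns None when nothing matches
        return None
    # The value is the last string inside the outer span
    return list(span.strings)[-1].strip()


name = field_value(soup, "sobi2Details_field_name")
family = field_value(soup, "sobi2Details_field_family")
tel = field_value(soup, "sobi2Details_field_tel1")
missing = field_value(soup, "no_such_id")

print(name, family, tel, missing)
```

If field_value returns None for an id you expect to be present, the parsed document probably does not contain that element, so it is worth inspecting what was actually downloaded and parsed.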

Let me know if that helps.

score 1 · Accepted Answer

If you are familiar with XPath, use lxml with etree instead.

import urllib2
from lxml import etree

opener = urllib2.build_opener()
root = etree.HTML(opener.open("myUrl").read())

print root.xpath("//span[@id='sobi2Details_field_name']/text()")[0]
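
The snippet above is Python 2 (urllib2 and the print statement). Below is a rough Python 3 sketch of the same XPath idea, parsing an inline copy of the question's markup instead of fetching a URL; in Python 3, urllib.request takes over the role of urllib2. The helper name first_text is an assumption made for illustration. Because text() selects only the text nodes directly under the outer span, the label inside the nested span is skipped.

```python
from lxml import etree

# Trimmed copy of the markup from the question
html = """
<div id="sobi2outer">
  <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
  <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
  <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""

root = etree.HTML(html)


def first_text(root, field_id):
    """Hypothetical helper: first direct text node of the span with this id, or None."""
    # $fid is an XPath variable, filled in from the keyword argument
    matches = root.xpath("//span[@id=$fid]/text()", fid=field_id)
    return matches[0].strip() if matches else None


name = first_text(root, "sobi2Details_field_name")
family = first_text(root, "sobi2Details_field_family")
tel = first_text(root, "sobi2Details_field_tel1")
missing = first_text(root, "no_such_id")

print(name, family, tel, missing)
```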
