python - Extracting body tags from an sgm file with Beautiful Soup and Python

Question

I have an sgm file in the following format:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><BODY>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</BODY></TEXT>
</REUTERS>

The same file contains 1000 records, each with the root node REUTERS. I want to extract the body tag from each record and do something with it, but I can't get it to work. Here is my code:

from bs4 import BeautifulSoup,SoupStrainer
f = open('dataset/reut2-001.sgm', 'r')
data= f.read()
soup = BeautifulSoup(data)
topics= soup.findAll('body') # find all body tags
print len(topics)  # print number of body tags in sgm file
i=0
for link in topics:         #loop through each body tag and print its content 
    children = link.findChildren()
    for child in children:
        if i==0:
            print child
        else:
            print "none"
            i=i+1

print i

The problem is that the for loop does not print the contents of the body tag. Instead, it prints the record itself.

score 3 · Accepted Answer

As I said in the comments, for reasons unknown (to me), this does not work when the tag is named body.

So, first step: replace the body tag name with content, like this:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><CONTENT>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</CONTENT></TEXT>
</REUTERS>

The code is then:

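from bs4 import BeautifulSoup

# Read the file in which <BODY> has already been renamed to <CONTENT>
with open('dataset/reut2-001.sgm', 'r') as f:
    data = f.read()

# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(data, 'html.parser')
contents = soup.find_all('content')  # the parser lowercases tag names
for content in contents:
    print(content.text)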
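
The renaming step itself can also be done in Python before parsing, rather than by editing the file by hand. A minimal sketch using only the standard library; the helper name rename_body_tags and the shortened sample string are made up for illustration, and it assumes the tags appear as plain <BODY> and </BODY> with no attributes, as in the record shown in the question:

```python
import re


def rename_body_tags(sgml_text):
    """Rename <BODY>...</BODY> to <CONTENT>...</CONTENT>, leaving other tags alone."""
    # Match an opening or closing BODY tag, case-insensitively.
    # Group 1 captures the optional "/" so closing tags stay closing tags.
    return re.sub(r'<(/?)body>', r'<\1CONTENT>', sgml_text, flags=re.IGNORECASE)


# Shortened, hypothetical sample in the same shape as one record's TEXT element
sample = (
    '<TEXT><TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>'
    '<BODY>Sandoz AG said it planned a joint venture</BODY></TEXT>'
)

renamed = rename_body_tags(sample)
print(renamed)
```

The returned string can then be passed to BeautifulSoup and searched for content tags exactly as in the code above.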
