I tried the following code, but it doesn't seem to work as planned:
from beautifulsoup import BeautifulSoup
definition = """From encyclopedia:\n<i></i><p>Infobox Country<br>fullcountryname=Thailand ราชอาณาจักรไทยRaja-anachakra Thai <br>image_flag= Flag of Thailand.svg <br>image_coa= Coat of arms of Thailand.png <br>image_location= LocationThailand.png <br>nationalmotto= none <br>nationalsong= Phleng Chat <br>nationalflower= n/a <br>nationalanimal= n/a <br>officiallanguages= Thai (<r><i>Thai language</i></r>) <br>populationtotal= 65,444,371 <br>populationrank= 19 <br>populationdensity= 127 <br>countrycapital= <r>Bangkok</r> <br>countrylargestcity= <r>Bangkok</r> <br>areatotal= 514,000 <br>arearank= 49 <br>areawater= n/a <br>areawaterpercent= 0.4 <br>establishedin= <r>April 7</r>, <r>1782</r> <br>leadertitlename= <br>currency= <r>Baht</r> <br>utcoffset= +7 <br>dialingcode= 66 <br>internettld= .th<p><b>Thailand</b> is a <r>country</r> in Southeast <r>Asia</r>. Its edges touch <r>Laos</r>, <r>Cambodia</r>, <r>Malaysia</r>, and <r>Myanmar</r> (which is also called Burma.) Thailand was called Siam until 1949."""
print BeautifulSoup(definition).find('p[1]').text
This returns nothing. I'm sure it's a syntax error in the way I'm using BeautifulSoup. The output I'm hoping for is:
Infobox Country
fullcountryname=Thailand Raja-anachakra Thai
image_flag= Flag of Thailand.svg
image_coa= Coat of arms of Thailand.png
image_location= LocationThailand.png
nationalmotto= none
nationalsong= Phleng Chat
nationalflower= n/a
nationalanimal= n/a
officiallanguages= Thai (Thai language)
populationtotal= 65,444,371
populationrank= 19
populationdensity= 127
countrycapital= Bangkok
countrylargestcity= Bangkok
areatotal= 514,000
arearank= 49
areawater= n/a
areawaterpercent= 0.4
establishedin= April 7, 1782
leadertitlename=
currency= Baht
utcoffset= +7
dialingcode= 66
internettld= .th
Thanks :)
Edit: Actually, I'd prefer to get the text between the word "Infobox" and the last tag, so that I can use the script to parse live Wikipedia pages.
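
To make the goal in the edit concrete, here is a rough sketch of the extraction I have in mind. It uses only the Python 3 standard library rather than BeautifulSoup, so it is not a fix for the call above. The function name infobox_lines and the shortened sample string are made up for illustration, and it assumes the infobox ends at the next <p> tag, as it does in the definition string above; live Wikipedia pages may well be structured differently.

```python
import re


def infobox_lines(html):
    """Return the text lines from the word 'Infobox' up to the next <p> tag.

    Assumption: the infobox block ends where the next <p> begins,
    which holds for the sample string but may not hold elsewhere.
    """
    start = html.find("Infobox")
    if start == -1:
        return []
    end = html.find("<p>", start)
    if end == -1:
        end = len(html)
    block = html[start:end]

    lines = []
    # Each <br> (or <br/>, <br />) marks the end of one "key= value" line.
    for chunk in re.split(r"<br\s*/?>", block, flags=re.IGNORECASE):
        text = re.sub(r"<[^>]+>", "", chunk)  # drop leftover tags such as <r> and <i>
        text = " ".join(text.split())         # collapse runs of whitespace
        if text:
            lines.append(text)
    return lines


# Shortened, hypothetical stand-in for the definition string above.
sample = (
    "<i></i><p>Infobox Country<br>populationrank= 19 <br>"
    "countrycapital= <r>Bangkok</r> <br>leadertitlename= <br>"
    "internettld= .th<p><b>Thailand</b> is a <r>country</r>."
)
print("\n".join(infobox_lines(sample)))
```

Stripping tags with a regular expression is fragile on real HTML, so this is only meant to show the shape of the result, not a robust parser.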