python - Parsing meta tags with BeautifulSoup and Python

Question

I'm having trouble parsing an HTML page with BeautifulSoup 3 and Python 2.6.

The HTML content is as follows:

content='<div class="egV2_EventReportCardLeftBlockShortWidth">
<span class="egV2_EventReportCardTitle">When</span>
<span class="egV2_EventReportCardBody">
<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">
<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/>
<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">
<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span>
<span class="egV2_EventReportCardBody"></span>
<div class="egV2_div_cal" onclick=" showExportEvent()">
<div class="egV2_div_cal_outerFix">
<div class="egV2_div_cal_InnerAdjust"> Cal </div>
</div></div></div>'

I'd like to get the string 'Fri 23 Nov,10:00AM' into a variable so that I can concatenate it and send it back to a PHP page.

To read this content I use the following code (the content above comes from the HTML page I read, http://everguide.com.au/melbourne/event/2012-nov-23/life-with-bird-spring-warehouse-sale/):

import urllib2
req = urllib2.Request(URL)
response = urllib2.urlopen(req)
html = response.read()
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html.decode('utf-8'))
soup.prettify()
import re
for node in soup.findAll(itemprop="name"):
    n = ''.join(node.findAll(text=True)) 
for node in soup.findAll("div", { "class" : "egV2_EventReportCardLeftBlockShortWidth" }):
    d = ''.join(node.findAll(text=True))
print n,"|", d

Which returns:

[(ssh user)]# python testscrape.py

LIFE with BIRD Spring Warehouse Sale | 
When
<span class="egV2_EventReportCardDateTitle">STARTS</span>
STARTSFri 23 Nov,10:00AMENDSMon 03 Dec,6:00PM
<span class="egV2_EventReportCardDateTitle">ENDS</span>



 Cal 



[(ssh user)]#

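(and it includes all of those line breaks, etc.).

In the end I want to group both of these stripped strings into one printout with a delimiter in the middle, so that PHP can read the string back as one and then split it.

The problem is that the Python code can read the page and store the text, but the result includes all the junk, tags, and so on that confuse the PHP app.

All I really want back is: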

Fri 23 Nov,10:00AM

Is this because I'm using the findAll(text=True) method?

How do I drill down and get only the text in that div, without also getting the span tags?

Any help is much appreciated, thanks.

Rick - Melbourne.

score 4 · Accepted Answer

Why not try something like this:

In [95]: soup = BeautifulSoup(content)

In [96]: soup.find("span", {"class": "egV2_archivedDateEnded"})
Out[96]: <span class="egV2_archivedDateEnded">STARTS</span>

In [97]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next
Out[97]: u'STARTS'

In [98]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next.next
Out[98]: u'Fri 23 Nov,10:00AM'

Or:

In [99]: soup.find("span", {"class": "egV2_archivedDateEnded"}).nextSibling
Out[99]: u'Fri 23 Nov,10:00AM'
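
For readers who cannot install BeautifulSoup, the same idea (take the text node that sits directly after the STARTS span) can be sketched with the standard library alone. This is not the BeautifulSoup API used above: it assumes Python 3 and html.parser, and the class name TextAfterSpan and its attributes are hypothetical names made up for illustration. The HTML is a shortened copy of the snippet from the question.

```python
from html.parser import HTMLParser

# Shortened copy of the HTML from the question.
html = """<span class="egV2_EventReportCardBody">
<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">
<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/>
<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">
<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span>"""


class TextAfterSpan(HTMLParser):
    """Hypothetical helper: record the text that directly follows each matching <span>."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.in_span = False    # True while inside a matching <span>
        self.want_text = False  # True when the next text node is the one to keep
        self.label = None       # text of the most recent matching <span>
        self.found = {}         # span label -> text that follows it

    def handle_starttag(self, tag, attrs):
        # A tag between the span and the text means the text is not adjacent.
        self.want_text = False
        if tag == "span" and dict(attrs).get("class") == self.css_class:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span" and self.in_span:
            self.in_span = False
            self.want_text = True

    def handle_data(self, data):
        if self.in_span:
            self.label = data.strip()
        elif self.want_text:
            self.found[self.label] = data.strip()
            self.want_text = False


parser = TextAfterSpan("egV2_archivedDateEnded")
parser.feed(html)
parser.close()
print(parser.found["STARTS"])
```

Keying the results by the span's own text (STARTS, ENDS) keeps the start and end dates apart, which the single find() call above does not need to do because it stops at the first match.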

score 0 · Answer

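If you are trying to extract a single tag that is easily identified by a specific attribute, pyparsing makes this fairly simple (I would go after the meta tag, which holds an ISO 8601 time string value).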

from pyparsing import makeHTMLTags,withAttribute

meta = makeHTMLTags('meta')[0]
# only want matching <meta> tags if they have the attribute itemprop="startDate"
meta.setParseAction(withAttribute(itemprop="startDate"))

# scanString is a generator that yields (tokens,startloc,endloc) triples, we just 
# want the tokens
firstmatch = next(meta.scanString(content))[0]

Then convert it to a datetime object, which can be formatted any way you like, written to a database, or used to compute elapsed times.

from datetime import datetime
dt = datetime.strptime(firstmatch.content[:19], "%Y-%m-%dT%H:%M:%S")

print (firstmatch.content)
print (dt)

Prints:

2012-11-23T10:00:00.0000000
2012-11-23 10:00:00
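
As one possible follow-up, the datetime can be formatted back into the display string the question asks for and joined to the event name with a delimiter for PHP to split. This is only a sketch: it assumes Python 3 and the default English locale, the variable names are illustrative, and the "|" delimiter is taken from the print statement in the question.

```python
from datetime import datetime

# Value of the <meta itemprop="startDate"> content attribute from the question.
raw = "2012-11-23T10:00:00.0000000"

# The value carries seven fractional digits, one more than %f accepts,
# so keep only the first 19 characters (date and time down to the second).
dt = datetime.strptime(raw[:19], "%Y-%m-%dT%H:%M:%S")

# %a, %b and %p follow the current locale; the default gives English names.
start_text = dt.strftime("%a %d %b,%I:%M%p")

# Event name as printed in the question's output.
event_name = "LIFE with BIRD Spring Warehouse Sale"

# One line, one delimiter, no stray newlines for the PHP side to trip over.
line = "|".join([event_name, start_text])
print(line)  # LIFE with BIRD Spring Warehouse Sale|Fri 23 Nov,10:00AM
```

Note that %I zero-pads the hour, so an evening time such as 18:00 would come out as 06:00PM rather than the page's 6:00PM.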
