python - Python ElementTree find() using a wildcard?

Question

I am parsing an XML feed in Python to extract certain tags. My XML contains namespaces, so every parsed tag name is prefixed with its namespace.

The XML is as follows:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:rte="http://www.rte.ie/schemas/vod">
    <id>10038711/</id>
    <updated>2013-01-24T22:52:43+00:00</updated>
    <title type="text">Reeling in the Years</title>
    <logo>http://www.rte.ie/iptv/images/logo.gif</logo>
    <link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist?type=iptv&amp;showId=10038711" />
    <category term="feed"/>
    <author>
        <name>RTE</name>
        <uri>http://www.rte.ie</uri>
    </author>
    <entry>
        <id>10038711</id>
        <published>2012-07-04T12:00:00+01:00</published>
        <updated>2013-01-06T12:31:25+00:00</updated>
        <title type="text">Reeling in the Years</title>
        <content type="text">National and international events with popular music from the year 1989.First Broadcast: 08/11/1999</content>
        <category term="WEB Exclusive" rte:type="channel"/>
        <category term="Classics 1980" rte:type="genre"/>
        <category term="rte player" rte:type="source"/>
        <category term="" rte:type="transmision_details"/>
        <category term="False" rte:type="copyprotectionoptout"/>
        <category term="long" rte:type="form"/>
        <category term="3275" rte:type="progid"/>
        <link rel="site" type="text/html" href="http://www.rte.ie/tv50/"/>
        <link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist/?itemId=10038711&amp;type=iptv&amp;format=xml" />
        <link rel="alternate" type="text/html" href="http://www.rte.ie/player/#v=10038711"/>
        <rte:valid start="2012-07-23T15:56:04+01:00" end="2017-08-01T15:56:04+01:00"/>
        <rte:duration ms="842205" formatted="0:10"/>
        <rte:statistics views="19"/>
        <rte:bri id="na"/>
        <rte:channel id="13"/>
        <rte:item id="10038711"/>
        <media:title type="plain">Reeling in the Years</media:title>
        <media:description type="plain">National and international events with popular music from the year 1989. First Broadcast: 08/11/1999</media:description>
        <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg" height="288" width="512" time="00:00:00+00:00"/>
        <media:teaserimgref1x1 url="" time="00:00:00+00:00"/>
        <media:rating scheme="http://www.rte.ie/schemes/vod">NA</media:rating>
        <media:copyright>RTÉ</media:copyright>
        <media:group rte:format="single">
            <media:content url="http://vod.hds.rasset.ie/manifest/2012/0728/20120728_reelingint_cl10038711_10039316_260_.f4m" type="video/mp4" medium="video" expression="full" duration="842205" rte:format="content"/>
        </media:group>
        <rte:ads>
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre2&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre3&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
        </rte:ads>
    </entry>
<!-- playlist.xml -->
</feed>

Once the XML is parsed, each element looks like this:

{http://www.w3.org/2005/Atom}id
{http://www.w3.org/2005/Atom}published
{http://www.w3.org/2005/Atom}updated
.....
.....
{http://www.rte.ie/schemas/vod}valid
{http://www.rte.ie/schemas/vod}duration
....
....
{http://search.yahoo.com/mrss/}description
{http://search.yahoo.com/mrss/}thumbnail
....
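
The qualified names listed above are what ElementTree calls Clark notation: the namespace URI in braces, followed by the local name. A minimal sketch of how they arise, using the standard-library xml.etree.ElementTree; the SAMPLE string is a cut-down stand-in for the feed, assumed here for illustration only:

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the feed above (illustrative only, not the full feed).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:rte="http://www.rte.ie/schemas/vod">
  <entry>
    <id>10038711</id>
    <rte:duration ms="842205" formatted="0:10"/>
    <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg"/>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)
entry = root[0]  # the single <entry> child of <feed>

# Each tag is stored as "{namespace-uri}local-name".
tags = [child.tag for child in entry]
for tag in tags:
    print(tag)
```

The prefixes (media:, rte:) do not appear in the parsed tags; only the namespace URIs do, which is why a change of URI in the feed breaks hardcoded lookups.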

There are 3 different namespaces, and I can't guarantee they will always be the same, so I would prefer not to spell out each tag exactly, like this:

for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    stream = str(elem.find('{http://www.w3.org/2005/Atom}id').text)
    date_tmp = str(elem.find('{http://www.w3.org/2005/Atom}published').text)
    name_tmp = str(elem.find('{http://www.w3.org/2005/Atom}title').text)
    short_tmp = str(elem.find('{http://www.w3.org/2005/Atom}content').text)
    # Select the category whose rte:type attribute is "channel".
    channel_tmp = elem.find('{http://www.w3.org/2005/Atom}category[@{http://www.rte.ie/schemas/vod}type="channel"]')
    channel = str(channel_tmp.get('term'))
    icon_tmp = elem.find('{http://search.yahoo.com/mrss/}thumbnail')
    icon_url = str(icon_tmp.get('url'))

Is there a way to put a wildcard or something similar in the search so that it simply ignores the namespace?

stream = str(elem.find('*id').text)

I could hardcode the namespaces as above, but with my luck they would change and the queries would stop returning data.

Thanks for any help.

score 3 · Accepted Answer

With lxml, you can use an XPath expression with the local-name() function.

<?xml version="1.0"?>
<root xmlns="ns">
  <tag/>
</root>

Assuming "doc" is the ElementTree for the XML above:

import lxml.etree
doc = lxml.etree.parse(<some_file_like_object>)
root = doc.getroot()
root.xpath('//*[local-name()="tag"]')
# => [<Element {ns}tag at 0x7fcde6f7c960>]

Replace <some_file_like_object> as needed (or use lxml.etree.fromstring with an XML string to get the root element directly).
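
The local-name() approach relies on lxml's xpath() method; the find() of the standard-library xml.etree.ElementTree does not support XPath functions. If staying with the standard library is preferred, two options are sketched below. The first assumes Python 3.8 or later, where find() and findall() accept the '{*}name' wildcard. The second uses a small helper, find_local, which is a hypothetical name introduced here and not part of any library. SAMPLE is a cut-down stand-in for the feed in the question.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the feed in the question (illustrative only).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/">
  <entry>
    <id>10038711</id>
    <title type="text">Reeling in the Years</title>
    <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg"/>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)


def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rpartition('}')[2]


def find_local(elem, name):
    """Hypothetical helper: first direct child whose local name is `name`, else None."""
    for child in elem:
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(child.tag, str) and local_name(child.tag) == name:
            return child
    return None


# Option 1 (assumes Python 3.8+): '{*}name' matches 'name' in any namespace or none.
for entry in root.findall('{*}entry'):
    stream = entry.find('{*}id').text
    icon_url = entry.find('{*}thumbnail').get('url')

# Option 2: compare local names by hand; does not depend on the wildcard syntax.
entry = find_local(root, 'entry')
title = find_local(entry, 'title').text

print(stream, icon_url, title)
```

Either option ignores the namespace entirely, so two elements with the same local name in different namespaces (for example the Atom title and media:title in the full feed) are not distinguished; whichever comes first in document order is returned.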
