python - Iterating over XML to find URLs with a specific extension in Python

Question

I have an XML file that I downloaded from a URL. I now want to iterate over the XML to find links to files with a specific file extension.

My XML looks like this:

<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
    </bar>
</Foo>

I have written the following code to fetch the XML file:

import urllib2, re
from xml.dom.minidom import parseString

file = urllib2.urlopen('http://foobar.xml')
data = file.read()
file.close()
dom = parseString(data)
xmlTag = dom.getElementsByTagName('file')

And I would like to get something like this working:

i=0
url = ''
while( i < len(xmlTag)):
    if re.search('*.txt', xmlTag[i].toxml() ) is not None:
        url = xmlTag[i].toxml()
    i = i + 1;

** Some code that parses out the url **

But it throws an error. Does anyone have tips on a better approach?

Thanks!

score 4 · Accepted Answer

The last part of your code is, frankly, ugly. dom.getElementsByTagName('file') gives you a list of all the <file> elements in the tree... just iterate over that.

urls = []
for file_node in dom.getElementsByTagName('file'):
    url = file_node.getAttribute('url')
    if url.endswith('.txt'):
        urls.append(url)
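
For reference, here is a self-contained Python 3 sketch of the same loop, run against the sample XML from the question instead of a downloaded file. The SAMPLE_XML constant and the find_urls_with_extension helper are names made up for illustration; they are not part of the original code.

```python
from xml.dom.minidom import parseString

# Sample document copied from the question (stands in for the downloaded file).
SAMPLE_XML = """<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
    </bar>
</Foo>"""


def find_urls_with_extension(xml_text, extension):
    """Return the url attribute of every <file> element that ends with `extension`."""
    dom = parseString(xml_text)
    return [
        node.getAttribute('url')
        for node in dom.getElementsByTagName('file')
        if node.getAttribute('url').endswith(extension)
    ]


txt_urls = find_urls_with_extension(SAMPLE_XML, '.txt')
print(txt_urls)
```

Reading the attribute with getAttribute means there is no need to serialize each node with toxml() and then pick the URL back out of the markup.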

As an aside, you should never need to do indexing by hand in Python. Even in the rare case where you do need the index number, use enumerate:

mylist = ['a', 'b', 'c']
for i, value in enumerate(mylist):
    print i, value
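
As for the error itself: the question does not include the traceback, so this is a guess, but one likely cause is the pattern '*.txt'. That is a shell-style glob, not a regular expression. In a regex, * has to follow something it can repeat, so the re module rejects the pattern before it ever looks at the string. A small Python 3 sketch of that behaviour (the sample string is made up for illustration):

```python
import re

sample = '<file url="http://foo.txt"/>'

# '*.txt' is a glob, not a regex: a leading '*' has nothing to repeat,
# so compiling the pattern fails with re.error.
try:
    re.search('*.txt', sample)
    glob_pattern_failed = False
except re.error:
    glob_pattern_failed = True

# A regex that looks for a literal ".txt": the dot must be escaped,
# otherwise it matches any character.
match = re.search(r'\.txt', sample)

print(glob_pattern_failed, match is not None)
```

Even with a valid pattern, calling endswith on the attribute value is simpler than running a regex over the output of toxml().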

score 3 · Accepted Answer

An example using lxml, urlparse and os.path:

from lxml import etree
from urlparse import urlparse
from os.path import splitext

data = """
<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
    </bar>
</Foo>
"""

tree = etree.fromstring(data).getroottree()
for url in tree.xpath('//Foo/bar/file/@url'):
    spliturl = urlparse(url)
    # The sample URLs have no path, so the file name ends up in netloc.
    name, ext = splitext(spliturl.netloc)
    print url, 'is a', ext, 'file'
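
If lxml is not available, roughly the same thing can be sketched with the Python 3 standard library (xml.etree.ElementTree, urllib.parse and os.path). One assumption to flag: real URLs usually carry the file name in the path rather than in the host, so this version looks at the path first and only falls back to netloc for host-only URLs like the sample ones. The url_extension helper is a hypothetical name, not something from the answer above.

```python
import xml.etree.ElementTree as ET
from os.path import splitext
from urllib.parse import urlparse

data = """
<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
    </bar>
</Foo>
"""


def url_extension(url):
    """Return the extension of the URL's path, or of its host if the path is empty."""
    parts = urlparse(url)
    # Sample URLs such as http://foo.txt have an empty path, so fall back to netloc.
    target = parts.path if parts.path not in ('', '/') else parts.netloc
    return splitext(target)[1]


root = ET.fromstring(data)
# root is the <Foo> element, so the search path is relative to it.
extensions = {
    el.get('url'): url_extension(el.get('url'))
    for el in root.findall('./bar/file')
}
for url, ext in extensions.items():
    print(url, 'is a', ext, 'file')
```

The netloc fallback only makes sense for data shaped like the sample: a bare host such as example.com would be reported as having a .com extension, so drop the fallback if the URLs always include a path.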
