python - LXML: remove the xmlns namespace from tag names

Question

I am writing a sitemap parser with LXML and want to extract the tags and their values. However, the resulting tags always include the xmlns information, for example {http://www.sitemaps.org/schemas/sitemap/0.9}loc.

import cStringIO
from lxml import etree

body = cStringIO.StringIO(item['body'])
parser = etree.XMLParser(recover=True, load_dtd=True, ns_clean=True)
tree = etree.parse(body, parser)

for sitemap in tree.xpath('./*'):
    print sitemap.xpath('./*')[0].tag
    # prints: {http://www.sitemaps.org/schemas/sitemap/0.9}loc

The sitemap string:

<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>

I want to extract only the tag name, here 'loc', without the {http://www.sitemaps.org/schemas/sitemap/0.9} prefix. Is there a way to configure the parser or LXML to do that?

Note: I know I could use a simple regex replacement. A friend told me to ask for help whenever an implementation feels more complicated than it should be.

score 0 · Accepted Answer

I'm not sure whether you meant to remove the tag and keep its text, so here is another answer.

from ehp import *

data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''

html = Html()
dom  = html.feed(data)

for root, ind in dom.find_with_root('loc'):
    root.remove(ind)
    root.append(Data(ind.text()))


# This would give me:
print dom



""" <sitemap xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >

  <lastmod >2011-12-22T15:46:17+00:00</lastmod>
http://www.some_page.com/sitemap-page-2010-11.xml</sitemap>
"""

score 0 · Accepted Answer

I would try this with the following tool:

htmlparser.sourceforge.net/

A friend told me it is simple and reliable!! Much better than Beautiful Soup and the like.

from ehp import *

data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''

html = Html()
dom  = html.feed(data)
seq  = [ind.text() for ind in dom.find('loc')]

print seq

# It gives me.
# ['http://www.some_page.com/sitemap-page-2010-11.xml']
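For comparison, here is a minimal sketch that stays with the ElementTree-style API the question uses. It rests on one fact: a namespaced tag is reported as '{namespace}localname', so the prefix can be cut at the closing brace. The helper name local_name is hypothetical, and the fallback import is only there so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib ElementTree reports tags in the same
    # '{namespace}localname' form, so it can stand in for this sketch.
    import xml.etree.ElementTree as etree

SITEMAP = b"""<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""


def local_name(tag):
    """Return the tag without its '{namespace}' prefix (hypothetical helper)."""
    # Comments and processing instructions do not have a string tag in lxml,
    # so anything that is not a '{...}'-prefixed string is returned unchanged.
    if isinstance(tag, str) and tag.startswith("{"):
        return tag.split("}", 1)[1]
    return tag


root = etree.fromstring(SITEMAP)

# Collect (tag name, text) pairs for the direct children of <sitemap>.
pairs = [(local_name(child.tag), child.text) for child in root]

for name, value in pairs:
    print(name, value)
```

With lxml itself, etree.QName(child).localname should give the same tag name without any string handling. Either way the parser is left untouched and the namespace is only stripped when the tag is read.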
