python - Python でのマッチングパターン

Question

次の文字列を含む XML ファイルがあります。

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

XML の本文には、XML 仕様と互換性のない>および文字があります。<とが次のようになるように、それらを置き換える必要があり>ます<。

 ' "> ' 
 ' " > ' and 
 ' </ '

それぞれ、それらを置き換えてはなりません。他のすべての>and<は、「より大きい」および「より小さい」という文字列に置き換える必要があります。したがって、結果は次のようになります。

 <field name="id">abcdef</field>
 <field name="intro" > pqrst</field>
 <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

Pythonでそれを行うにはどうすればよいですか?

score 2 · Accepted Answer

lxml.etree.XMLParserオプションで使用できrecover=Trueます：

import sys
from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
root.getroottree().write(sys.stdout)

出力

<root>
<field name="id">abcdef</field>
<field name="intro"> pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field>
</root>

注:>はそのままドキュメントに残され、>完全<に削除されます (xml テキストの無効な文字として)。

正規表現ベースのソリューション

単純な xml のようなコンテンツの場合re.split()、タグをテキストから分離し、タグ以外のテキスト領域で置換を行うために使用できます。

import re
from itertools import izip_longest
from xml.sax.saxutils import escape  # '<' -> '&lt;'

# assumptions:
#   doc = *( start_tag / end_tag / text )
#   start_tag = '<' name *attr [ '/' ] '>'
#   end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'  # allow ws between any token
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'

assert '{{' not in tag
while '{' in tag: # unwrap definitions
    tag = tag.format(**vars())

tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = izip_longest(*iters, fillvalue='')  # iterate 2 items at a time
print(''.join(escape(text) + tag for text, tag in pairs))

タグの誤検知を避けるために、'{ws}'上記のいくつかを削除できます。

出力

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>

注:<>テキストでは両方ともエスケープされています。

escape(text)上記の代わりに任意の関数を呼び出すことができます。

def escape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')

score 2 · Accepted Answer

私はそれをやったようです>：

re.sub('(?<! " )(?<! ")(?! )>','greater than', xml_string)

?<!- 否定後読みアサーション、

?!- 否定先読みアサーション、

(...)(...)論理積、

したがって、式全体は、「'>' (' " ' で始まらない) および (' "' で始まらない) および (' ' で終わらない) のすべての発生を置換することを意味します。

ケース<が似ている

score -2 · Accepted Answer

-2

XML 解析にはElementTreeを使用します。

于 2012-11-10T04:03:22.653 に答える

python - Python でのマッチング パターン

3 に答える 3

出力

正規表現ベースのソリューション

出力

Related

Reference

python - Python でのマッチングパターン