python - How do I use the Python SAX parser to get the text between XML tags as a string and store it?

Question

I have an XML file like this:

<TAG1>
   <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>

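What I need is to somehow get "Text I want to use" and "More text I want to use", but not "Text I'm not interested in", as strings that I can use later in an arbitrary function. I also need to get "attribute_i_need" as a string. I have never really used a SAX parser before and I am completely stuck. I was able to print all the text in the document using the following: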

import xml.sax

class myHandler(xml.sax.ContentHandler):

    def characters(self, content):
        print (content)

parser = xml.sax.make_parser()
parser.setContentHandler(myHandler())
parser.parse(open("sample.xml", "r"))

This basically gives me the output:

Text I want to use
Text I'm not interested in
More text I want to use

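But the problem is twofold. First, this includes the text I am not interested in. Second, all it does is print the text. I cannot figure out how to print only specific text, or how to write code that returns the text as a string I can assign to a variable and use later. And I don't even know where to start with extracting the attribute I am interested in.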

Does anyone know how to solve this? I would prefer a solution that uses the SAX parser, because I at least have a vague understanding of how it works.

score 0 · Accepted Answer

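The idea is to start saving all characters after encountering TAG2 or TAG4, and to stop whenever an element ends. The start of an element is also your opportunity to inspect and save any interesting attributes.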

import xml.sax

class myHandler(xml.sax.ContentHandler):
    def __init__(self):
        # initialise the base handler before adding our own state
        xml.sax.ContentHandler.__init__(self)
        self.text = []
        self.keeping_text = False
        self.attributes = []

    def startElement(self, name, attrs):
        if name.lower() in ('tag2', 'tag4'):
            self.keeping_text = True

        try:
            # must attribute1 be on a tag2 or anywhere?
            attr = attrs.getValue('attribute1')
            self.attributes.append(attr)
        except KeyError:
            pass

    def endElement(self, name):
        self.keeping_text = False

    def characters(self, content):
        if self.keeping_text:
            self.text.append(content)

parser = xml.sax.make_parser()
handler = myHandler()
parser.setContentHandler(handler)
parser.parse(open("sample.xml", "r"))

print(handler.text)
print(handler.attributes)

# Output is along these lines (chunk boundaries depend on the parser):
# ['\n', '      Text I want to use', '\n', '   ',
#  '\n', '      More text I want to use', '\n', '   ']
# ['attribute_i_need']
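
The handler above leaves the text as a list of raw chunks, whitespace included. If what you want is one clean string per element, a possible variant is sketched below. It is only a sketch: the class name CollectingHandler, the inline SAMPLE string (which assumes the file ends with a closing </TAG1>), and the choice to read attribute1 only from TAG2 are my assumptions, not something the question specifies.

```python
import xml.sax

# Assumed sample document, inlined so the sketch runs without sample.xml.
SAMPLE = """<TAG1>
   <TAG2 attribute1="attribute_i_need" attribute2="attribute_i_dont_need">
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>"""


class CollectingHandler(xml.sax.ContentHandler):
    """Hypothetical variant: collects one stripped string per wanted element."""

    WANTED = ("TAG2", "TAG4")

    def __init__(self):
        super().__init__()
        self.texts = []       # one cleaned string per wanted element
        self.attributes = []  # attribute1 values found on TAG2
        self._chunks = None   # None means "not inside a wanted element"

    def startElement(self, name, attrs):
        if name in self.WANTED:
            self._chunks = []  # start buffering character data
        if name == "TAG2" and "attribute1" in attrs:
            self.attributes.append(attrs["attribute1"])

    def characters(self, content):
        # characters() may be called several times per element,
        # so buffer the pieces instead of assuming a single call.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name in self.WANTED and self._chunks is not None:
            self.texts.append("".join(self._chunks).strip())
            self._chunks = None


handler = CollectingHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)

# handler.texts and handler.attributes are ordinary lists of str,
# so they can be passed to any function afterwards.
print(handler.texts)
print(handler.attributes)
```

Joining the chunks at endElement matters because the parser is free to split one text node across several characters() calls. Note that this sketch does not handle a wanted element nested inside another wanted element.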

I think this would be easier with BeautifulSoup, or even with bare lxml.
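
For comparison, here is roughly what a tree-based approach looks like. This sketch swaps in the standard library's xml.etree.ElementTree rather than lxml or BeautifulSoup, so that it runs without third-party packages; lxml.etree offers a broadly similar find/text/get interface. The inline SAMPLE string and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample document, inlined so the sketch runs without sample.xml.
SAMPLE = """<TAG1>
   <TAG2 attribute1="attribute_i_need" attribute2="attribute_i_dont_need">
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>"""

root = ET.fromstring(SAMPLE)   # root is the TAG1 element

tag2 = root.find("TAG2")       # first direct child named TAG2
tag4 = root.find("TAG4")

text2 = tag2.text.strip()      # text directly inside TAG2
text4 = tag4.text.strip()
attribute = tag2.get("attribute1")

print(text2)
print(text4)
print(attribute)
```

The trade-off is that this loads the whole document into memory, whereas the SAX handler processes it as a stream, which matters mainly for very large files.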
