python - Searching for a list of words in an XML file in Python?

Question

I have this XML file, which contains more than 2000 phrases. Below is a small sample.

<TEXT>

<PHRASE>
<V>played</V>
<N>John</N>
<PREP>with</PREP>
<en x='PERS'>Adam</en>
<PREP>in</PREP>
<en x='LOC'>ASL school</en>
</PHRASE>

<PHRASE>
<V y='0'>went</V>
<en x='PERS'>Mark</en>
<PREP>to</PREP>
<en x='ORG'>United Nations</en>
<PREP>for</PREP>
<PREP>a</PREP>
<N>visit</N>
</PHRASE>

<PHRASE>
<PREP>in</PREP>
<en x='DATE'>1987</en>
<en x='PERS'>Nick</en>
<V>founded</V>
<en x='ORG'>XYZ company</en>
</PHRASE>

<PHRASE>
<en x='ORG'>Google's</en>
<en x='PERS'>Frank</en>
<V>went</V>
<N>yesterday</N>
<PREP>to</PREP>
<en x='LOC'>San Fransisco</en>
</PHRASE>
</TEXT>

And I have a list of patterns:

 finalPatterns=['went \n to \n','created\n  the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

What I want is to take each finalPattern (for example, went to), search for it in each phrase of the text, and, if a phrase contains both words, print that phrase's two <en> tags. [If the en tags are not PERS & ORG, print nothing.]

When searching for:

-"went" & "to" --> this is the output: Frank -San Fransisco
-"founded" & "in" --> output: Nick-XYZ Company

This is what I tried, but it didn't work. Nothing was printed.

for phrase in root.findall('./PHRASE'):
 ens = {en.get('x'): en.text for en in phrase.findall('en')}
 if 'ORG' in ens and 'PERS' in ens:
   if all(word in phrase for word in finalPatterns):
      x="".join(phrase.itertext())   #print whats in between [since I would also like to print the whole sentence]
      print("ORG is: {}, PERS is: {} /".format(ens["ORG"],ens["PERS"]))

score 1 · Accepted Answer

For a search that rewrites the original XML according to the matched values, consider XSLT (a special-purpose language for manipulating XML documents).

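The XSLT below is embedded in Python and uses the finalPatterns list to dynamically remove the elements that do not match. From there, Python (using the lxml module) transforms the original document, and you can use that output for your end-use needs.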

Python script

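import os
import lxml.etree as ET

# Directory that holds the input XML (assumed here to be the script's own directory)
cd = os.path.dirname(os.path.abspath(__file__))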

finalPatterns=['went \n to \n','created\n  the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

# BUILDING XSLT FILTER STRING
contains = ''
for p in finalPatterns:
    contains += "("
    for i in p.split('\n '):
        contains += "contains(., '{}') and \n".format(i.replace('\n', '').strip(' '))    
    contains += ")"
    contains = contains.replace(' and \n)', ') or ')

xslstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
            <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
            <xsl:strip-space elements="*"/>

              <!-- Identity Transform -->
              <xsl:template match="@*|node()">
                <xsl:copy>
                  <xsl:apply-templates select="@*|node()"/>
                </xsl:copy>
              </xsl:template>

               <!-- Rewrites Matching Phrase elements -->
               <xsl:template match="PHRASE">
                <xsl:copy>      
                  <wholetext>
                    <xsl:call-template name="join">
                      <xsl:with-param name="valueList" select="*"/>
                      <xsl:with-param name="separator" select="' '"/>
                    </xsl:call-template>
                  </wholetext>

                  <!-- XPath 1.0 has no True literal, and xsl:otherwise takes no test attribute -->
                  <xsl:choose>
                      <xsl:when test="contains(., 'went') and contains(., 'to')">
                        <match>went to</match>
                      </xsl:when>
                      <xsl:when test="contains(., 'founded') and contains(., 'in')">
                        <match>founded in</match>
                      </xsl:when>
                      <xsl:when test="contains(., 'created') and contains(., 'the')">
                        <match>created the</match>
                      </xsl:when>
                      <xsl:when test="contains(., 'a') and contains(., 'visit')">
                        <match>a visit</match>
                      </xsl:when>
                  </xsl:choose>
                  <person><xsl:value-of select="en[@x='PERS']"/></person>
                  <organization><xsl:value-of select="en[@x='ORG']"/></organization>
                  <location><xsl:value-of select="en[@x='LOC']"/></location>
                </xsl:copy>
              </xsl:template>

              <!-- Rewrites Unmatched Phrase elements -->
              <xsl:template match="PHRASE[not({0})]"/>

              <!-- Join Text values -->
              <xsl:template name="join">
                <xsl:param name="valueList" select="''"/>
                <xsl:param name="separator" select="','"/>
                <xsl:for-each select="$valueList">
                  <xsl:choose>
                    <xsl:when test="position() = 1">
                      <xsl:value-of select="."/>
                    </xsl:when>
                    <xsl:otherwise>
                      <xsl:value-of select="concat($separator, .) "/>
                    </xsl:otherwise>
                  </xsl:choose>
                </xsl:for-each>
              </xsl:template>

            </xsl:transform>'''.format(contains[:-4])    

dom = ET.parse(os.path.join(cd, 'SearchWords.xml'))
xslt = ET.fromstring(xslstr)

transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
print(tree_out.decode("utf-8"))

for phrase in newdom.findall('PHRASE'):    
    print("Text: {} \n ORG is: {}, PERS is: {} /".format(phrase.find('wholetext').text,
                                                         phrase.find('organization').text,
                                                          phrase.find('person').text))
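
The filter-building loop in the script above can be hard to follow, so here is a minimal sketch of the same idea in isolation. It produces one `and` group per pattern and joins the groups with `or`. The helper name `build_filter` is hypothetical (it is not part of the answer), it splits on any whitespace with `str.split()` instead of `split('\n ')`, and it assumes the pattern words contain no apostrophes, since those would break the quoted XPath string literals.

```python
finalPatterns = ['went \n to \n', 'created\n  the\n', 'founded\n a\n',
                 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

def build_filter(patterns):
    """Return an XPath 1.0 boolean expression: one and-group per pattern, joined by or."""
    groups = []
    for pattern in patterns:
        words = pattern.split()  # splits on any whitespace, including '\n'
        tests = ["contains(., '{}')".format(word) for word in words]
        groups.append("(" + " and ".join(tests) + ")")
    return " or ".join(groups)

xpath_filter = build_filter(finalPatterns)
print(xpath_filter)
```

The resulting string is what would be substituted for `{0}` in `PHRASE[not({0})]`. Note that `contains()` is a substring test, so a short word such as `a` also matches inside longer words like `company`.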

Output

The transformed XML is shown below for demonstration. The tree_out string can be saved externally as a new XML file.

<TEXT>
  <PHRASE>
    <wholetext>went Mark to United Nations for a visit</wholetext>
    <person>Mark</person>
    <organization>United Nations</organization>
    <location/>
  </PHRASE>
  <PHRASE>
    <wholetext>in 1987 Nick founded XYZ company</wholetext>
    <person>Nick</person>
    <organization>XYZ company</organization>
    <location/>
  </PHRASE>
  <PHRASE>
    <wholetext>Google's Frank went yesterday to San Fransisco</wholetext>
    <person>Frank</person>
    <organization>Google's</organization>
    <location>San Fransisco</location>
  </PHRASE>
</TEXT>

Text: went Mark to United Nations for a visit 
 ORG is: United Nations, PERS is: Mark /
Text: in 1987 Nick founded XYZ company 
 ORG is: XYZ company, PERS is: Nick /
Text: Google's Frank went yesterday to San Fransisco 
 ORG is: Google's, PERS is: Frank /

List comprehension

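See below for a list comprehension attempt using xpath. The challenge, however, is that the entries in finalPatterns do not match the text exactly. For example, the text can have other words between went \n to, as in went \n Mark \n to. If each element of the list contains only one keyword, the code below works. Otherwise, consider regular expressions for pattern recognition.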

dom = ET.parse(os.path.join(cd, 'Input.xml'))

phraselist = dom.xpath('//PHRASE')    
for phrase in phraselist:    
    if any(word in p for p in phrase.xpath('./*/text()') for word in finalPatterns):
        print(' '.join(phrase.xpath('./*/text()')))
        print('ORG is: {0}, PERS is: {1}'.format(phrase.xpath("./en[@x='ORG']")[0].text, \
                                                 phrase.xpath("./en[@x='PERS']")[0].text))
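
As one possible reading of the regular-expression suggestion, the sketch below turns each pattern into a regex that matches its words as whole words, in order, with any other words allowed in between. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs, and the helper name `pattern_to_regex` and the shortened sample are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

sample = """<TEXT>
<PHRASE><V y='0'>went</V><en x='PERS'>Mark</en><PREP>to</PREP><en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en><V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
</TEXT>"""

finalPatterns = ['went \n to \n', 'founded\n in\n', 'a\n visit\n']

def pattern_to_regex(pattern):
    """Match the pattern's words as whole words, in order, allowing other words in between."""
    words = [re.escape(word) for word in pattern.split()]
    return re.compile(r"\b" + r"\b.*?\b".join(words) + r"\b")

root = ET.fromstring(sample)
hits = []
for phrase in root.findall('PHRASE'):
    # Join the text of every child element into one sentence
    text = ' '.join(child.text.strip() for child in phrase if child.text)
    for pattern in finalPatterns:
        if pattern_to_regex(pattern).search(text):
            hits.append((' '.join(pattern.split()), text))

for name, text in hits:
    print('{!r} matched: {}'.format(name, text))
```

This version is order-sensitive: `founded in` does not match "in 1987 Nick founded XYZ company" because `in` comes before `founded` there. If word order should not matter, a per-word membership test is the simpler choice.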

score 1 · Accepted Answer

This should work:

phrasewords = [w.text for w in phrase.findall('V')+phrase.findall('N')+phrase.findall('PREP')]
for words in finalPatterns:
    if all(word in phrasewords for word in words.split()):
        print("found")
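
For context, here is a minimal end-to-end sketch of how that snippet might sit inside the question's loop. It combines the word check with the asker's own `ens` dictionary, uses the standard library's `xml.etree.ElementTree`, and runs against a shortened copy of the sample with the closing tags repaired. The `results` list and the output format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

sample = """<TEXT>
<PHRASE><V>played</V><N>John</N><PREP>with</PREP><en x='PERS'>Adam</en><PREP>in</PREP><en x='LOC'>ASL school</en></PHRASE>
<PHRASE><V y='0'>went</V><en x='PERS'>Mark</en><PREP>to</PREP><en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en><V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
</TEXT>"""

finalPatterns = ['went \n to \n', 'created\n  the\n', 'founded\n a\n',
                 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

root = ET.fromstring(sample)
results = []
for phrase in root.findall('PHRASE'):
    # Map each en tag's x attribute to its text, as in the question
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if 'PERS' not in ens or 'ORG' not in ens:
        continue  # the question asks for no output unless both are present
    # Collect the words tagged V, N, or PREP in this phrase
    phrasewords = {el.text for tag in ('V', 'N', 'PREP') for el in phrase.findall(tag)}
    for words in finalPatterns:
        if all(word in phrasewords for word in words.split()):
            results.append((' '.join(words.split()), ens['PERS'], ens['ORG']))

for pattern, pers, org in results:
    print('{}: PERS is {}, ORG is {}'.format(pattern, pers, org))
```

Because the check compares whole words against a set, it ignores word order and does not match substrings, so `a` no longer matches inside `company`.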
