python - Text extraction using Python lxml: looping issue

Question

Here is part of my XML file:

<a:p>
    <a:pPr lvl="2">
        <a:spcBef>
            <a:spcPts val="200" />
        </a:spcBef>
    </a:pPr>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" />
        <a:t>The</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" />
        <a:t>world</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" />
        <a:t>is small</a:t>
    </a:r>
</a:p>
<a:p>
    <a:pPr lvl="2">
        <a:spcBef>
            <a:spcPts val="200" />
        </a:spcBef>
    </a:pPr>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0" />
        <a:t>The</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" b="0" />
        <a:t>world</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0" />
        <a:t>is too big</a:t>
    </a:r>
</a:p>

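I wrote some code that uses lxml to extract the text. However, the sentence is broken up into separate pieces, and I want to join them into a single sentence such as "The world is small...". Here is the code I wrote: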

ns = {'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
      'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# file is the parsed slide XML (an lxml tree)
path4 = file.xpath('/p:sld/p:cSld/p:spTree/p:sp/p:txBody/a:p/a:r/a:rPr', namespaces=ns)
if path4:
    for a in path4:
        if a.get('sz') == '1400' and a.xpath('node()') == [] and a.get('b') != '0':
            b = a.getparent()  # the <a:r> run
            c = b.getparent()  # the enclosing <a:p> paragraph
            d = c.xpath('./a:r/a:t/text()', namespaces=ns)
            print(''.join(d))
        elif a.get('sz') == '1400' and a.xpath('node()') == [] and a.get('b') == '0':
            b = a.getparent()
            c = b.getparent()
            d = c.xpath('./a:r/a:t/text()', namespaces=ns)
            print(''.join(d))

This is the output I get:

The world is small...
The world is small...
The world is small...

Expected output:

the world is small...

Any advice?
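
The repeated lines seem to come from the loop itself: it visits every a:rPr, and each visit prints the text of the whole parent a:p, so a paragraph with three runs is printed three times. Below is a minimal sketch of one possible fix, which loops over the a:p elements instead so that each paragraph is joined and printed once. The SAMPLE string and the paragraph_texts helper are hypothetical names made up for this illustration. The p:txBody wrapper around the sample, the single-space separator between runs, and the rule that every run in a paragraph must have sz="1400" are assumptions as well. The import falls back to the standard library's ElementTree only so that the sketch still runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}

# Hypothetical sample mirroring the XML in the question (not the real slide file).
SAMPLE = """\
<p:txBody xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
          xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <a:p>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>The</a:t></a:r>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0"/><a:t>world</a:t></a:r>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>is small</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/><a:t>The</a:t></a:r>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0" b="0"/><a:t>world</a:t></a:r>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/><a:t>is too big</a:t></a:r>
  </a:p>
</p:txBody>
"""


def paragraph_texts(root, size="1400"):
    """Return one joined string per a:p whose runs all use the given sz."""
    lines = []
    for para in root.findall(".//a:p", namespaces=NS):
        runs = para.findall("a:r", namespaces=NS)
        props = [run.find("a:rPr", namespaces=NS) for run in runs]
        # Skip paragraphs with no runs, or with a run that lacks the wanted size.
        if not runs or any(p is None or p.get("sz") != size for p in props):
            continue
        words = [run.findtext("a:t", default="", namespaces=NS) for run in runs]
        # Assumption: runs are separated by a single space when joined.
        lines.append(" ".join(w.strip() for w in words if w.strip()))
    return lines


root = etree.fromstring(SAMPLE.encode("utf-8"))
for line in paragraph_texts(root):
    print(line)
```

If bold and non-bold paragraphs need different handling, the b attribute could be checked once per paragraph inside the same loop rather than once per run.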
