4

xml ドキュメント内のすべてのテキストを抽出しています。タグと説明を探してから、すべての子と孫を検索すると、さらに多くの要素が存在する可能性があり、テキストを抽出します。

これが私のコードですが、孫タグ内のテキストを取得することはできません:

for element in root.find('description'):
    print 'parent: ', element.tag, '|', element.attrib
    try:
        data.write(element.text)
        for all_tags in element.findall('./'):
            print 'child: ', all_tags.tag, '|', all_tags.attrib
            if all_tags.text:
                data.write('\n')
                data.write(all_tags.text)
                if all_tags.tail:
                    data.write('\n')
                    data.write(all_tags.tail)
                    data.write('\n')
        data.write('\n')
    except TypeError:
        pass
    except UnicodeEncodeError:
        unicodestr = element.text.encode("utf-8")
        data.write(unicodestr)

    data.write('\n')

問題はfor all_tagsループにあります。

サンプル入力:

<description>
<p num="p-0003">
Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,
<i>Nature</i>
33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,
<i>Trends in Biochemical Sciences</i>
26, 675-664.
</p>
<p num="p-0004">
A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,
<i>Nature Reviews in Cancer</i>
(2002) 2: 489-501.
</p>
<p num="p-0005">
The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.
</p>
</description>

この入力では、後のテキスト<i> Nature </i> が欠落しており、最初の行のテキストに置き換えられています。これはall_tags.tail、子タグや孫タグからではなく、親タグからテキストを取得しているためだと思います。

4

2 に答える 2

6

element.findall('./')タグの直接の子のみを明示的に検索します。すべての子孫を検索する式は.//(ダブル スラッシュ) です。

与えられたサンプルに対するループの簡略化されたバージョンは、次のようになります。

>>> for element in root:
...     print 'parent: ', element.tag, '|', element.attrib
...     print element.text
...     for all_tags in element.findall('.//'):
...         print 'child: ', all_tags.tag, '|', all_tags.attrib
...         if all_tags.text:
...             print all_tags.text, '|', all_tags.tail
... 
parent:  p | {'num': 'p-0003'}

Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,

child:  i | {}
Nature | 
33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,

child:  i | {}
Trends in Biochemical Sciences | 
26, 675-664.

parent:  p | {'num': 'p-0004'}

A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,

child:  i | {}
Nature Reviews in Cancer | 
(2002) 2: 489-501.

parent:  p | {'num': 'p-0005'}

The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.

または、repr()代わりに文字列リテラルを表示するために使用します。

parent:  p | {'num': 'p-0003'}
'\nProtein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,\n'
child:  i | {}
'Nature' | u'\n33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKB\u03b1, Akt2/PKB\u03b2, and Akt3/PKB\u03b3. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,\n'
child:  i | {}
'Trends in Biochemical Sciences' | '\n26, 675-664.\n'
parent:  p | {'num': 'p-0004'}
u'\nA number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110\u03b1, or mutations in the PI3-K regulatory subunit, p85\u03b1, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,\n'
child:  i | {}
'Nature Reviews in Cancer' | '\n(2002) 2: 489-501.\n'
parent:  p | {'num': 'p-0005'}
'\nThe tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.\n'
于 2013-05-23T16:48:52.000 に答える
-1

おそらく itertext() を使用したいと思うでしょうが、ゲームを少し強化したい場合は、xpathを試してみてください。このような状況で本当に輝きます。

ここに例があります - 私が与えたxpathは基本的に次のように言っています:

XML ドキュメント内のすべてのタグを検索し、そのタグとすべての子のテキストを返します。

#!/usr/bin/python

from lxml import etree

tree = etree.fromstring(open('t.xml').read())

for el in tree.xpath('//description/descendant-or-self::*/text()'):
    print el
于 2013-05-23T17:04:04.007 に答える