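I am extracting all the text in an XML document. I look for the description tag and then search through all of its children and grandchildren, where more elements may be present, and extract their text.
This is my code, but it fails to get the text inside grandchild tags: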
for element in root.find('description'):
    print 'parent: ', element.tag, '|', element.attrib
    try:
        data.write(element.text)
        for all_tags in element.findall('./'):
            print 'child: ', all_tags.tag, '|', all_tags.attrib
            if all_tags.text:
                data.write('\n')
                data.write(all_tags.text)
            if all_tags.tail:
                data.write('\n')
                data.write(all_tags.tail)
        data.write('\n')
        data.write('\n')
    except TypeError:
        pass
    except UnicodeEncodeError:
        unicodestr = element.text.encode("utf-8")
        data.write(unicodestr)
        data.write('\n')
The problem is in the for all_tags loop.
Sample input:
<description>
<p num="p-0003">
Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,
<i>Nature</i>
33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,
<i>Trends in Biochemical Sciences</i>
26, 675-664.
</p>
<p num="p-0004">
A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,
<i>Nature Reviews in Cancer</i>
(2002) 2: 489-501.
</p>
<p num="p-0005">
The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.
</p>
</description>
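With this input, the text after <i>Nature</i> is missing and is replaced by the text from the first line. I think this is because all_tags.tail is taking text from the parent tag rather than from the child or grandchild tags.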
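For reference, here is a minimal sketch of how ElementTree stores mixed content like the sample input above. It is not a diagnosis of the loop in the question. It assumes Python 3 and the standard xml.etree.ElementTree module, and SAMPLE is a shortened, hypothetical stand-in for the real document; extract_text is an illustrative helper name, not something from the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the sample input (an assumption).
SAMPLE = (
    '<description>'
    '<p num="p-0003">apoptosis (Khwaja, <i>Nature</i> 33-34 (1990)). '
    'See Brazil and Hemmings, '
    '<i>Trends in Biochemical Sciences</i> 26, 675-664.</p>'
    '</description>'
)

root = ET.fromstring(SAMPLE)
p = root.find('p')
first_i = p.find('i')

# p.text holds the text before the first child of <p>.
# first_i.tail holds the text that follows </i>; it is stored on the
# <i> element itself, not on its parent <p>.
pieces = (p.text, first_i.text, first_i.tail)


def extract_text(element):
    """Join every text and tail fragment below element, in document order."""
    return ''.join(element.itertext())


full_text = extract_text(p)
print(full_text)
```

Element.itertext() walks all descendants, so it may be a simpler way to collect text from children and grandchildren than handling text and tail by hand at each level.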