html - Strange bug in the R XML package while parsing XML and HTML files

Question

I am using R's XML package to extract all possible data from a variety of HTML and XML files. These files are mostly documentation, build properties, or readme files.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE chapter PUBLIC '-//OASIS//DTD DocBook XML V4.1.2//EN'
                      'http://www.oasis-open.org/docbook/xml/4.0 docbookx.dtd'>

<chapter lang="en">
<chapterinfo>
<author>
<firstname>Jirka</firstname>
<surname>Kosek</surname>
</author>
<copyright>
<year>2001</year>
<holder>Ji&rcaron;&iacute; Kosek</holder>
</copyright>
<releaseinfo>$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp $</releaseinfo>
</chapterinfo>
<title>Using XSL stylesheets to generate HTML Help</title>
<?dbhtml filename="htmlhelp.html"?>

<para>HTML Help (HH) is help-format used in newer versions of MS
Windows and applications written for this platform. This format allows
to pack several HTML files together with images, table of contents and
index into single file. Windows contains browser for this file-format
and full-text search is also supported on HH files. If you want know
more about HH and its capabilities look at <ulink
url="http://msdn.microsoft.com/library/tools/htmlhelp/chm/HH1Start.htm">HTML
Help pages</ulink>.</para>

<section>
<title>How to generate first HTML Help file from DocBook sources</title>

<para>Working with HH stylesheets is same as with other XSL DocBook
stylesheets. Simply run your favorite XSLT processor on your document
with stylesheet suited for HH:</para>

</section>

</chapter>

My goal is to parse the tree with htmlTreeParse or xmlTreeParse and then call xmlValue on it, using something like this (for XML files):

Text = xmlValue(xmlRoot(xmlTreeParse(XMLFileName)))

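However, doing this for both XML and HTML files produces one error: when there are child nodes at level 2 or deeper, their text fields are pasted together with no space between them.

For example, in the sample above,

xmlValue(chapterInfo) gives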

JirkaKosek2001JiKosek$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp

The xmlValues of each child node (recursively) are pasted together without any space added between them. How can I get xmlValue to add whitespace while extracting this data?

Thanks in advance for your help.

Shivani

score 3 · Accepted Answer

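According to the documentation, xmlValue only works on a single text node, or on "XML nodes that contain a single text node". Whitespace in non-text nodes is apparently not preserved.

However, even in the single-text-node case, the code below strips the whitespace.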

library(XML)
doc <- xmlTreeParse("<a> </a>")
xmlValue(xmlRoot(doc))
# [1] ""

You can preserve all whitespace by adding the ignoreBlanks = FALSE and useInternalNodes = TRUE arguments to xmlTreeParse.

doc <- xmlTreeParse(
  "<a> </a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] " "

# Spaces inside text nodes are preserved
doc <- xmlTreeParse(
  "<a>foo <b>bar</b></a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foo bar"

# Spaces between text nodes (inside non-text nodes) are not preserved
doc <- xmlTreeParse(
  "<a><b>foo</b> <b>bar</b></a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foobar"
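
The examples above show where the whitespace goes missing, but not how to get separators back. One possible workaround, offered here only as a sketch and not as part of the original answer, is to select the text nodes individually with XPath and join them yourself. The fragment below is a shortened, hypothetical copy of the chapterinfo element from the question, with the entity references in holder replaced by plain text so that it parses without the DTD. The variable names xml_text, pieces and joined are illustrative.

```r
library(XML)

# Shortened copy of the <chapterinfo> fragment from the question.
# Assumption: entity references are replaced with plain ASCII text.
xml_text <- "<chapterinfo>
<author>
<firstname>Jirka</firstname>
<surname>Kosek</surname>
</author>
<copyright>
<year>2001</year>
<holder>Jiri Kosek</holder>
</copyright>
</chapterinfo>"

doc <- xmlParse(xml_text, asText = TRUE)

# Select every text node that is not whitespace-only, in document order,
# and take the value of each one separately.
pieces <- xpathSApply(doc, "//text()[normalize-space()]", xmlValue)

# Trim each piece and join the pieces with a single space.
joined <- paste(trimws(pieces), collapse = " ")
joined
```

Because the separator is chosen in paste, the same approach works with a newline or any other delimiter. To restrict the extraction to one subtree of a larger document, the XPath expression can be narrowed, for example to "//chapterinfo//text()[normalize-space()]".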
