xml - Rで最上位のXMLノードを変更するにはどうすればよいですか?

Question

xml ファイルの一番上のノードに属性を追加して、ファイルを保存したいと思います。考えられるxpathとサブセット化のすべての組み合わせを試しましたが、うまくいかないようです。簡単な例を使用するには:

xml_string = c(
 '<?xml version="1.0" encoding="UTF-8"?>',
 '<retrieval-response status = "found">',
      '<coredata>',
           '<id type = "author" >12345</id>',
      '</coredata>',
      '<author>',
           '<first>John</first>',
           '<last>Doe</last>',
      '</author>',
 '</retrieval-response>')

# parse xml content
xml = xmlParse(xml_string)

やってみると

xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)

エラーが発生します：

object of type 'externalptr' is not subsettable

ただし、属性が挿入されているため、何が間違っているのかわかりません。

(詳細な背景: これは Scopus の API からのデータの簡略化されたバージョンです。同様に構造化された何千もの xml ファイルを結合していますが、id は、すべてを含む「author」ノードの兄弟である「coredata」ノードにあります。そのため、SAS を使用して結合された XML ドキュメントをデータセットにコンパイルすると、ID とデータの間にリンクがありません.ID を階層の最上位に追加すると、それがすべての階層に伝播することを期待しています.他のレベル）。

score 2 · Accepted Answer

データセットとデータフレームの構造に従って XML データを 2 次元の行と列に移行するには、すべてのネストを削除して、反復する親レベルと 1 つの子レベルのみにする必要があります。したがって、最終用途のニーズに合わせて XML データを再構築するには、XML ドキュメントをあらゆる微妙なニーズに合わせて再構築する専用の宣言型プログラミング言語であるXSLTが便利です。

XMLの例を考えると、以下は実行できるXSLTであり、結果のXMLはSASに正常にインポートされます。SAS コードをループさせて、何千もの XML ファイルをすべて再構築します。

XSLT (.xsl または .xslt 形式で保存)

 <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
       xmlns:ait="http://www.elsevier.com/xml/ani/ait"
       xmlns:ce="http://www.elsevier.com/xml/ani/common"
       xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
       xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
       xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
       xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
       exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />

 <xsl:template match="author-retrieval-response">
  <xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/>
  <root>
      <coredata>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="coredata/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="concat(.,@href)"/>
          </xsl:element>
        </xsl:for-each>
      </coredata>

      <subjectAreas>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="subject-areas/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </subjectAreas>

      <authorname>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/preferred-name/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </authorname>

      <classifications>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/classificationgroup/classifications/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </classifications>

      <journals>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/journal-history/journal/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </journals>

      <ipdoc>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </ipdoc>

      <address>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </address>  
  </root>
 </xsl:template>

</xsl:transform>

SAS (上記のスクリプトを使用)

proc xsl 
    in="C:\Path\To\Original.xml"
    out="C:\Path\To\Output.xml"
    xsl="C:\Path\To\XSLT.xsl";
run;

** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml'; 

** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata; 
    retain authorid;
    set temp.Coredata;  ** NAME OF PARENT NODE IN XML;
run;

data Work.SubjectAreas; 
    retain authorid;
    set temp.SubjectAreas;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Authorname;   
    retain authorid;
    set temp.Authorname;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Classifications;
    retain authorid;
    set temp.Classifications;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Journals; 
    retain authorid;
    set temp.Journals;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Ipdoc;    
    retain authorid;
    set temp.Ipdoc;  ** NAME OF PARENT NODE IN XML;
run;

XML OUTPUT (1 行と 40 個の変数の Authorsdata データセットとしてインポートされます)

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <coredata>
      <authorid>1234567</authorid>
      <url>http://api.elsevier.com/content/author/author_id/1234567</url>
      <identifier>AUTHOR_ID:1234567</identifier>
      <eid>9-s2.0-1234567</eid>
      <document-count>3</document-count>
      <cited-by-count>95</cited-by-count>
      <citation-count>97</citation-count>
      <link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
      <link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
      <link>http://api.elsevier.com/content/author/author_id/1234567</link>
      <link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
   </coredata>
   <subjectAreas>
      <authorid>1234567</authorid>
      <subject-area>Human-Computer Interaction</subject-area>
      <subject-area>Control and Systems Engineering</subject-area>
      <subject-area>Software</subject-area>
      <subject-area>Computer Vision and Pattern Recognition</subject-area>
      <subject-area>Artificial Intelligence</subject-area>
   </subjectAreas>
   <authorname>
      <authorid>1234567</authorid>
      <initials>A.</initials>
      <indexed-name>John A.</indexed-name>
      <surname>John</surname>
      <given-name>Doe</given-name>
   </authorname>
   <classifications>
      <authorid>1234567</authorid>
      <classification>1709</classification>
      <classification>2207</classification>
      <classification>1712</classification>
      <classification>1707</classification>
      <classification>1702</classification>
   </classifications>
   <journals>
      <authorid>1234567</authorid>
      <sourcetitle>Very Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
      <issn>10504729</issn>
      <sourcetitle>2005 Another Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
   </journals>
   <ipdoc>
      <authorid>1234567</authorid>
      <afnameid>Prestigious University#1111111</afnameid>
      <afdispname>Prestigious University University</afdispname>
      <preferred-name>Prestigious University University</preferred-name>
      <sort-name>Prestigious University</sort-name>
      <org-domain>pu.edu</org-domain>
      <org-URL>http://www.pu.edu/index.shtml</org-URL>
   </ipdoc>
   <address>
      <authorid>1234567</authorid>
      <address-part>1234 Prestigious Lane</address-part>
      <city>City</city>
      <state>ST</state>
      <postal-code>12345</postal-code>
      <country>United States</country>
   </address>
</root>

R代替

包括的な R XSLT ライブラリが存在しないため、解析は R 言語で直接行う必要があります。ただし、R は、コマンドライン、 RCOMClientパッケージ、およびその他のインターフェイスを介して、他の実行可能ファイル (つまり、Python、Saxon、VBA) の XSLT プロセッサを呼び出すことができます。

xmlToDataFrame()それにもかかわらず、R は、およびxpathSApply()(後者はXPathに似ています)によって XML データを抽出できますauthorid。

library(XML)

coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                          xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                              xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

authorname <-  xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                            xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                                 xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                         xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                       xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                         xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

score 1 · Accepted Answer

編集: トップノードを編集するアプローチを試した後 (以下の古い回答を参照)、SAS XML マッパーがすべての ID を保持していないため、トップノードを編集しても問題が解決しないことに気付きました。

完全に機能する各サブノードに作成者 ID を追加する新しいアプローチを試みました。また、XPath を使用して複数のノードを選択するには、次のようにベクターに入れることも学びました。

c("//coredata",
  "//affiliation-current",
  "affiliation-history",
  "subject-areas",
  "//author-profile")

したがって、私が使用した最終的なプログラムは次のとおりです。

files <- list.files()

for (i in 1:length(files)) {
     author_record <- xmlParse(files[i])

     xpathApply(
          author_record, c(
               "//coredata",
               "//affiliation-current",
               "affiliation-history",
               "subject-areas",
               "//author-profile"
          ),
          addAttributes,
          auth_id = gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))
     )

     saveXML(author_record, file = files[i])
}

古い回答: 多くの実験の後、私は自分の問題に対するかなり単純な解決策を見つけました。

を使用するだけで、最上位ノードに属性を追加できます。

addAttributes(xmlRoot(xmlfile), attribute = "attributeValue")

私の特定のケースでは、最も簡単な解決策は単純なループです。

setwd("C:/directory/with/individual/xmlfiles")

files <- list.files()

for (i in 1:length(files)) {

 author_record <- xmlParse(files[i])

 addAttributes(node = xmlRoot(author_record), 
               id   = gsub   (pattern = "AUTHOR_ID:", 
                              replacement = "", 
                              x = xmlValue(auth[["//dc:identifier"]])
               )
 )

  saveXML(author_record, file = files[i])
}

もっと良い方法があると確信しています。明らかに、XLST を学ぶ必要があります。これは非常に強力なアプローチでした。

xml - Rで最上位のXMLノードを変更するにはどうすればよいですか?

2 に答える 2

Related

Reference