xml - 混合コンテンツと文字列操作のクリーンアップ

Question

私は、Word ベースの文書を XML に変換するという非常に骨の折れる作業の真っ最中です。次の問題が発生しました。

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>
        <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">Is this a
            quote</hi>?” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is a
            quote</hi>” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is
            definitely a quote</hi>!” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text.„&lt;hi rend="italics">This is a
            first quote</hi>” (Source). „&lt;hi rend="italics">Sometimes there is a second quote as
            well</hi>!?” (Source). </p>

</root>

<p>ノードには混合コンテンツがあります。<element>前回の繰り返しでお世話になりました。しかし、問題は、部分的にテキストノード内に表示され、部分的にテキストノードとして表示される引用符とソースにあり<hi rend= "italics"/>ます。

XSLT 2.0 を使用して次のことを行うにはどうすればよいですか。

<hi rend="italics">最後の文字が "„" であるテキストノードの直前にあるすべてのノードに一致しますか?
<hi rend="italics">asの内容を出力し<quote>...</quote>、引用符 ("„" と """) を取り除きますが、 ?<quote/>の兄弟の直後に現れる疑問符と感嘆符を含めます。<hi rend="italics">
<hi rend="italics">ノードに続く "(" と ")" の間のテキストノードを<source>...</source>角かっこなしで変換します。
最後のピリオドを含めます。

つまり、出力は次のようになります。

<root>
<p>
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</hi> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</hi> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>. 
</p>

</root>

このような混合コンテンツと文字列操作を扱ったことは一度もありません。私はあなたのヒントに非常に感謝しています。

score 2 · Accepted Answer

この変換:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "hi[@rend='italics'
     and
      preceding-sibling::node()[1][self::text()[ends-with(., '„')]]
      ]">

  <quote>
    <xsl:value-of select=
     "concat(.,
             if(matches(following-sibling::text()[1], '^[?!]+'))
              then replace(following-sibling::text()[1], '^([?!]+).*$', '$1')
              else()
             )
      "/>
  </quote>
 </xsl:template>

 <xsl:template match="text()[true()]">
  <xsl:variable name="vThis" select="."/>
  <xsl:variable name="vThis2" select="translate($vThis, '„”?!', '')"/>

  <xsl:value-of select="substring-before(concat($vThis2, '('), '(')"/>
  <xsl:if test="contains($vThis2, '(')">
   <source>
    <xsl:value-of select=
      "substring-before(substring-after($vThis2, '('), ')')"/>
   </source>
   <xsl:value-of select="substring-after($vThis2, ')')"/>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>

提供された XML ドキュメントに適用した場合:

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">Is this a
                quote</hi>?” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is a
                quote</hi>” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is
                definitely a quote</hi>!” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text.„&lt;hi rend="italics">This is a
                first quote</hi>” (Source). „&lt;hi rend="italics">Sometimes there is a second quote as
                well</hi>!?” (Source). </p>

</root>

必要な正しい結果を生成します。

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. <quote>Is this a
                quote?</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is a
                quote</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is
                definitely a quote!</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text.<quote>This is a
                first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as
                well!?</quote> <source>Source</source>. </p>

</root>

score 1 · Accepted Answer

これが代替ソリューションです。これにより、よりナラティブなスタイルの入力ドキュメントが可能になります（引用符内の引用符、1つのテキストノード内の複数の（ソース）フラグメント、hi要素が続かない場合のデータとしての「„」）。

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:so="http://stackoverflow.com/questions/12690177"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsl xs so">
<xsl:output omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*" />  

<xsl:template match="@*|comment()|processing-instruction()">
  <xsl:copy />
</xsl:template>

<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>

<xsl:function name="so:clip-start" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring($in-text,1,string-length($in-text)-1)" />
</xsl:function>

<xsl:function name="so:clip-end" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring-after($in-text,'”')" />
</xsl:function>

<xsl:function name="so:matches-start" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/following-sibling::node()/self::hi[@rend='italics'] and
                        ends-with($text-node, '„')" />
</xsl:function>

<xsl:template match="text()[so:matches-start(.)]"    priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-start(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:function name="so:matches-end" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/preceding-sibling::node()/self::hi[@rend='italics'] and
                        matches($text-node,'^[!?]*”')" />
</xsl:function>

<xsl:template match="text()[so:matches-end(.)]"   priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()[so:matches-start(.)][so:matches-end(.)]" priority="3">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(so:clip-start(.))" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()" name="parse-text" priority="1">
  <xsl:param name="text" select="." />
  <xsl:analyze-string select="$text" regex="\(([^)]*)\)">
    <xsl:matching-substring>
      <source>
        <xsl:value-of select="regex-group(1)" />
      </source>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

<xsl:template match="hi[@rend='italics']">
  <quote>
    <xsl:apply-templates select="(@* except @rend) | node()" />
    <xsl:for-each select="following-sibling::node()[1]/self::text()[matches(.,'^[!?]')]">
      <xsl:value-of select="replace(., '^([!?]+).*$', '$1')" />
    </xsl:for-each>   
  </quote>
</xsl:template>

</xsl:stylesheet>

xml - 混合コンテンツと文字列操作のクリーンアップ

2 に答える 2

Related

Reference