c# - Removing styles from strings retrieved from a Word document using the Open XML Office SDK

Question

I am using the Open XML Office SDK 2.0 to search for strings in a Word document and list them.

    MatchCollection Matches;
    using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(txtLocation.Text, true))
    {
        string docText = null;
        using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
        {
            docText = sr.ReadToEnd();
        }
        Regex regex = new Regex(@"\(.*?\)");
        Matches = regex.Matches(docText);
    }
    int i = 0;
    while (i < Matches.Count)
    {
        Label lb = new Label();
        lb.Text = Matches[i].ToString();
        lb.Location = new System.Drawing.Point(24, (28 + i * 24));
        this.panel1.Controls.Add(lb);
        i++;
    }

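The problem is that it sometimes returns the correct string, such as (HelloWorld), but at other times the match also contains style markup such as <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>.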

How can I get rid of this markup?

score 0 · Accepted Answer

Presumably all of the formatting tags are XML-style (enclosed in angle brackets). If so, you can use the String.StartsWith and String.EndsWith methods to determine whether a string is an XML tag.

    // ...
    while (i < Matches.Count)
    {
        String str = Matches[i].ToString();
        if (!(str.StartsWith("<") && str.EndsWith(">")))
        {
            // ...
        }
        i++;
    }
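
As a rough sketch of this check, the helper below wraps the test in a reusable method. The names `MatchFilter`, `IsXmlTag`, and `ContainsMarkup` are hypothetical and do not come from the original post. One caveat: every match of the pattern `\(.*?\)` starts with `(` and ends with `)`, so a StartsWith/EndsWith test applied to the whole match never identifies it as a tag. The second method, an assumption about what is actually wanted, looks for angle brackets anywhere in the string instead.

```csharp
using System;

public static class MatchFilter
{
    // True when the whole string looks like a single XML tag, e.g. "<w:b/>".
    public static bool IsXmlTag(string s) =>
        s.StartsWith("<", StringComparison.Ordinal) &&
        s.EndsWith(">", StringComparison.Ordinal);

    // True when an angle bracket appears anywhere in the string,
    // which also catches markup embedded inside a "(...)" match.
    public static bool ContainsMarkup(string s) =>
        s.IndexOf('<') >= 0 || s.IndexOf('>') >= 0;
}
```

Inside the loop this could be used as `if (!MatchFilter.ContainsMarkup(str)) { /* add the label */ }`. This discards any match that contains markup; it does not clean it.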

score 0 · Accepted Answer

I found what I needed to do: I ran the string through another Regex.Replace call, which removes all <> tags (i.e. XML/HTML).

    String str = Matches[i].ToString();
    str = Regex.Replace(str, @"<(.|\n)*?>", "");
    lb.Text = str;
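
To show how the two regular expressions fit together, here is a minimal sketch that works on a plain string, not on an open document. The class name `TagStripper` and the method `ExtractClean` are hypothetical. The sample input in the usage note is an invented fragment, not text taken from a real file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Same pattern as in the question: shortest run of text between "(" and ")".
    private static readonly Regex Parenthesized = new Regex(@"\(.*?\)");

    // Same pattern as in the answer: anything between "<" and the next ">".
    private static readonly Regex Tag = new Regex(@"<(.|\n)*?>");

    // Returns every parenthesized match with embedded tags removed.
    public static List<string> ExtractClean(string docText)
    {
        var results = new List<string>();
        foreach (Match m in Parenthesized.Matches(docText))
        {
            results.Add(Tag.Replace(m.Value, ""));
        }
        return results;
    }
}
```

For the input `<w:t>(Hello</w:t><w:rFonts w:ascii="Arial"/><w:t>World)</w:t> and (Plain)`, `ExtractClean` returns two strings, `(HelloWorld)` and `(Plain)`. Text that contains a literal `<` or `>` outside of markup would also be affected, so this should be treated as a quick cleanup and not as an XML parser.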
