c# - Remove Word HTML tags

Question

I need to remove Word HTML tags in certain places. At the moment I am doing this:

public string CleanWordStyle(string html)
{
    StringCollection sc = new StringCollection();
    sc.Add(@"<table\b[^>]*>(.*?)</table>");
    sc.Add(@"(<o:|</o:)[^>]+>");
    sc.Add(@"(<v:|</v:)[^>]+>");
    sc.Add(@"(<st1:|</st1:)[^>]+>");
    sc.Add(@"(mso-bidi-|mso-fareast|mso-spacerun:|mso-list: ign|mso-ascii|mso-hansi|mso-ansi|mso-element|mso-special|mso-highlight|mso-border|mso-yfti|mso-padding|mso-background|mso-tab|mso-width|mso-height|mso-pagination|mso-theme|mso-outline)[^;]+;");
    sc.Add(@"(font-size|font-family):[^;]+;");
    sc.Add(@"font:[^;]+;");
    sc.Add(@"line-height:[^;]+;");
    sc.Add(@"class=""mso[^""]+""");
    sc.Add(@"times new roman&quot;,&quot;serif&quot;;");
    sc.Add(@"verdana&quot;,&quot;sans-serif&quot;;");
    sc.Add(@"<p> </p>");
    sc.Add(@"<p>&nbsp;</p>");
    foreach (string s in sc)
    {
        html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
    }
    // &nbsp; is not a predefined XML entity, so XmlDocument cannot load the result unless it is replaced.
    html = Regex.Replace(html, @"&nbsp;", @"&#160;");
    return html;
}

Right now I am removing the tag's entire HTML with sc.Add(@"<p> </p>");. What I need is that, once the replacement hits a table tag, it stops replacing until it reaches the table end tag. Is that possible?

score 0 · Accepted Answer

Regular expressions work for single lines or very simple HTML structures.
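
To illustrate the limitation, here is a minimal sketch that reuses the table pattern from the question. The class name RegexLimitDemo, the method name StripTables, and the added RegexOptions.Singleline flag are assumptions made for this example and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexLimitDemo
{
    // Table pattern from the question. Singleline is added here (an assumption)
    // so that '.' also matches newlines inside a table.
    public static string StripTables(string html)
    {
        return Regex.Replace(
            html,
            @"<table\b[^>]*>(.*?)</table>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

With a nested table such as "<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>", the lazy match ends at the first </table>, so StripTables leaves the fragment "</td></tr></table>" behind. A regular expression of this kind cannot pair opening and closing tags once they nest.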

If you want to actually get the job done with minimal code, get HTMLAgilityPack from http://htmlagilitypack.codeplex.com/ and take all the text from the inner values of all the tags.

It is as simple as this:

public string CleanWordStyle(string htmlPage)
{
  HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
  doc.LoadHtml(htmlPage);

  return doc.DocumentNode.InnerText;
}
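
Note that InnerText drops every tag, tables included. If the tables have to stay untouched while everything around them is cleaned, as the question asks, one possible regex-only sketch is to locate the table blocks first and run the replacements only on the text between them. The names WordHtmlCleaner, CleanOutsideTables and CleanFragment are hypothetical, the pattern list is shortened to three entries from the question, and the sketch assumes tables are not nested.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class WordHtmlCleaner
{
    // Matches one whole table block. Assumes tables are not nested:
    // the lazy match stops at the first closing tag.
    private static readonly Regex TableBlock = new Regex(
        @"<table\b[^>]*>.*?</table>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // A shortened subset of the patterns from the question (without the table pattern).
    private static readonly string[] Patterns =
    {
        @"(<o:|</o:)[^>]+>",
        @"class=""mso[^""]+""",
        @"<p>&nbsp;</p>"
    };

    // Applies every cleanup pattern to a piece of HTML that contains no table.
    private static string CleanFragment(string fragment)
    {
        foreach (string pattern in Patterns)
        {
            fragment = Regex.Replace(fragment, pattern, "", RegexOptions.IgnoreCase);
        }
        return fragment;
    }

    // Cleans the text between tables and copies each table through unchanged.
    public static string CleanOutsideTables(string html)
    {
        var result = new StringBuilder();
        int position = 0;
        foreach (Match table in TableBlock.Matches(html))
        {
            result.Append(CleanFragment(html.Substring(position, table.Index - position)));
            result.Append(table.Value);
            position = table.Index + table.Length;
        }
        result.Append(CleanFragment(html.Substring(position)));
        return result.ToString();
    }
}
```

Because the table blocks never pass through CleanFragment, any Word markup inside them survives as-is. For nested tables or otherwise irregular markup, a parser such as HTMLAgilityPack remains the more robust choice.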
