c# - Sanitizing HTML-encoded text (#decimal notation) from AntiXSS v3 output

Question

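I'm trying to build XSS-safe comments for a blog engine. I've tried a number of different approaches, but it has turned out to be very difficult.

When displaying a comment, I first use Microsoft AntiXss 3.0 to HTML-encode everything. Then I try to HTML-decode the safe tags using a whitelist approach.

I've been looking at Steve Downing's example in Atwood's "Sanitize HTML" thread on refactormycode.

My problem is that the AntiXss library encodes values to &#DECIMAL; notation, and with my limited regex knowledge I don't know how to rewrite Steve's example.

I tried the following code, in which I simply replaced the entities with their decimal form, but it doesn't work properly.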

&lt; with &#60;
&gt; with &#62;

My rewrite:

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like a HTML tag after HtmlEncoding.  Splits the input so we can get discrete
    /// chunks that start with &lt; and ends with either end of line or &gt;
    /// </summary>
    private static Regex _tags = new Regex("&#60;(?!&#62;).+?(&#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &gt; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>
    private static Regex _whitelist = new Regex(@"
^&#60;/?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&#62;$
|^&#60;(b|h)r\s?/?&#62;$
|^&#60;a(?!&#62;).+?&#62;$
|^&#60;img(?!&#62;).+?/?&#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags Encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {

        string tagname = "";
        Match tag;
        MatchCollection tags = _tags.Matches(html);
        string safeHtml = "";

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }

        }

        return html;
    }

}

My test input HTML is:

<p><script language="javascript">alert('XSS')</script><b>bold should work</b></p>

After AntiXss it becomes:

&#60;p&#62;&#60;script language&#61;&#34;javascript&#34;&#62;alert&#40;&#39;XSS&#39;&#41;&#60;&#47;script&#62;&#60;b&#62;bold should work&#60;&#47;b&#62;&#60;&#47;p&#62;

Running the version of Sanitize(string html) above gives:

<p><script language="javascript">alert&#40;&#39;XSS&#39;&#41;</script><b>bold should work</b></p>

The regex is matching script against the whitelist, which I don't want. Any help with this would be much appreciated.

score 1 · Accepted Answer

Have you considered having users mark up their comments with Markdown, VBCode, or a similar approach? Then you could disallow all HTML.

If you must allow HTML, consider using an HTML parser (in the spirit of HTMLTidy) and doing the whitelisting there.
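As a rough illustration of whitelisting inside a parser, here is a minimal sketch that uses Html Agility Pack (a third-party NuGet package, assumed to be installed). The class name ParserWhitelist and the decision to drop every attribute are assumptions made for this example, and the tag list simply mirrors the whitelist from the question. It has not been tested against real XSS payloads.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed to be referenced

// Hypothetical helper, not part of any library.
static class ParserWhitelist
{
    // Tag names taken from the whitelist regex in the question.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "b", "blockquote", "br", "code", "em", "h1", "h2", "h3",
            "hr", "i", "li", "ol", "p", "pre", "s", "strike", "strong",
            "sub", "sup", "ul"
        };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the descendants first, because the tree is modified
        // while we walk over it.
        foreach (HtmlNode node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                continue; // plain text is kept as-is
            }

            if (node.NodeType != HtmlNodeType.Element ||
                !AllowedTags.Contains(node.Name))
            {
                // Comments and non-whitelisted elements are removed
                // together with everything inside them.
                node.Remove();
                continue;
            }

            // Assumption for this sketch: strip all attributes, which
            // also removes href from links and any on* event handlers.
            node.Attributes.RemoveAll();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling ParserWhitelist.Sanitize on the raw comment would replace the encode-then-decode round trip. A real implementation would probably want to keep a vetted href on a tags instead of removing every attribute.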

score 1 · Accepted Answer

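Yes, I'm using the WMD editor with Markdown, but I want users to be able to post HTML and code samples the way they can on Stack Overflow, so I don't want to disallow HTML entirely.

I've looked at HTML Tidy but haven't tried it yet. I do, however, use Html Agility Pack to make sure the HTML is valid (no orphaned tags). I do this before running AntiXss.

If I can't get my current solution to work the way I want, I'll give HTML Tidy a try. Thanks for the suggestion.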

score 1 · Accepted Answer

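Your problem is that C# is not interpreting the regex the way you intended: you need to escape the # sign. With RegexOptions.IgnorePatternWhitespace set, an unescaped # starts a comment that runs to the end of the line, so each alternative in your pattern is cut off right after ^& and the whitelist matches far too much.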

private static Regex _whitelist = new Regex(@"
    ^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
    |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
    |^&\#60;a(?!&\#62;).+?&\#62;$
    |^&\#60;img(?!&\#62;).+?(&\#47;)?&\#62;$",

    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture |
    RegexOptions.Compiled
 );
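To see the effect of the escape in isolation, here is a small sketch that compares an unescaped and an escaped pattern under RegexOptions.IgnorePatternWhitespace. The class name HashCommentDemo and the single-tag patterns are made up for this example. They are reduced versions of the whitelist, not the whitelist itself.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, only meant to illustrate the '#' behaviour.
static class HashCommentDemo
{
    static void Main()
    {
        const string encodedScriptTag = "&#60;script&#62;";
        const RegexOptions options = RegexOptions.IgnorePatternWhitespace;

        // Unescaped '#': everything from '#' to the end of the pattern
        // line is treated as a comment, so the effective pattern is "^&".
        var unescaped = new Regex(@"^&#60;b&#62;$", options);

        // Escaped '#': matched literally, so only an encoded <b> tag
        // can match.
        var escaped = new Regex(@"^&\#60;b&\#62;$", options);

        Console.WriteLine(unescaped.IsMatch(encodedScriptTag)); // True
        Console.WriteLine(escaped.IsMatch(encodedScriptTag));   // False
        Console.WriteLine(escaped.IsMatch("&#60;b&#62;"));      // True
    }
}
```

The first line of output shows why the script tag slipped through the whitelist in the question: any string that starts with & matches once the rest of the pattern has been swallowed as a comment.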

Update 2: You might be interested in this XSS and regexp site.

score 0 · Accepted Answer

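In case anyone is interested in using this, I'm posting the complete code here again (slightly refactored, with updated comments).

I also decided to remove the img tag from the whitelist, since @Pez and @some pointed out that allowing it could be dangerous.

I should also point out that I have not properly tested this against possible XSS attacks. It is only meant to give an idea of how well this approach works.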

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like an HTML tag after HtmlEncoding to &#DECIMAL; notation.
    /// Microsoft AntiXSS 3.0 can be used to perform this. Splits the input so we can get discrete
    /// chunks that start with &#60; and end with either end of line or &#62;
    /// </summary>
    private static readonly Regex _tags = new Regex(@"&\#60;(?!&\#62;).+?(&\#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &#60; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>

    private static readonly Regex _whitelist = new Regex(@"
^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
|^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
|^&\#60;a(?!&\#62;).+?&\#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags Encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {
        Match tag;
        MatchCollection tags = _tags.Matches(html);

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            string tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                string safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }
        }
        return html;
    }
}

score 0 · Accepted Answer

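I'm on a Mac, so I can't test C# code. But it seems to me that you should make the _whitelist regex work on the tag names only. That might mean making two passes, one for opening tags and one for closing tags, but it would make things a lot simpler.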
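One way to read the tag-name idea is sketched below. It is a variation rather than a literal two-pass version: a single regex captures an optional encoded slash plus the tag name, and the name is then looked up in a set. The class name TagNameWhitelist and the group names slash and name are assumptions made for this example. Tags that carry attributes (such as a with an href) never match this pattern, so they simply stay encoded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating a whitelist keyed on tag names only.
static class TagNameWhitelist
{
    // Matches an encoded tag without attributes:
    // &#60; [&#47;] name &#62;
    // The '#' is escaped so the pattern also stays correct if
    // RegexOptions.IgnorePatternWhitespace is added later.
    private static readonly Regex EncodedTag = new Regex(
        @"&\#60;(?<slash>&\#47;)?(?<name>[a-zA-Z][a-zA-Z0-9]*)&\#62;",
        RegexOptions.Compiled);

    // Tag names taken from the whitelist in the question (img left out).
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "b", "blockquote", "code", "em", "h1", "h2", "h3", "i", "li",
            "ol", "p", "pre", "s", "strike", "strong", "sub", "sup", "ul"
        };

    public static string Sanitize(string encodedHtml)
    {
        return EncodedTag.Replace(encodedHtml, m =>
        {
            string name = m.Groups["name"].Value;
            if (!Allowed.Contains(name))
            {
                return m.Value; // not whitelisted: leave it encoded
            }

            // Rebuild the tag from the validated name instead of
            // decoding the raw match.
            string open = m.Groups["slash"].Success ? "</" : "<";
            return open + name.ToLowerInvariant() + ">";
        });
    }

    static void Main()
    {
        string input = "&#60;b&#62;ok&#60;&#47;b&#62;&#60;script&#62;x&#60;&#47;script&#62;";
        Console.WriteLine(Sanitize(input));
        // prints: <b>ok</b>&#60;script&#62;x&#60;&#47;script&#62;
    }
}
```

Because the output tag is rebuilt from the captured name, nothing from the user's input other than a whitelisted name ever reaches the decoded markup.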
