c# - Is there a way to remove all unnecessary MS Word formatting from FCKEditor?

Question

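I have installed fckeditor, and when I paste from MS Word it adds a lot of unnecessary formatting. I want to keep certain things such as bold, italics, and bullets. I searched the web and came up with solutions that strip everything, even the things I want to keep like bold and italics. Is there a way to remove only the unnecessary Word formatting?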

score 10 · Accepted Answer

In case anyone needs a C# version of the accepted answer:

// Requires: using System.Text.RegularExpressions;
public string CleanHtml(string html)
{
    // Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
    // Only returns acceptable HTML, and converts line breaks to <br />
    // Acceptable HTML includes HTML-encoded entities.

    html = html.Replace("&" + "nbsp;", " ").Trim(); // concat here due to SO formatting

    // Does this have HTML tags?
    if (html.IndexOf("<") >= 0)
    {
        // Make all tags lowercase
        html = Regex.Replace(html, "<[^>]+>", delegate(Match m) {
            return m.ToString().ToLower();
        });
        // Filter out anything except allowed tags
        // Problem: this strips attributes, including href from a
        // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
        string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
        string WhiteListPattern = "</?(?(?=" + AcceptableTags + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);
        // Make all BR/br tags look the same, and trim them of whitespace before/after
        html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
    }

    // No CRs
    html = html.Replace("\r", "");
    // Convert remaining LFs to line breaks
    html = html.Replace("\n", "<br />");
    // Trim BRs at the end of any string, and spaces on either side
    return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
}

score 7 · Accepted Answer

Here is the solution I use to scrub the HTML received from rich text editors. It is written in VB.NET and I don't have time to convert it to C#, but it is fairly straightforward.

 Public Shared Function CleanHtml(ByVal html As String) As String
     '' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
     '' Only returns acceptable HTML, and converts line breaks to <br />
     '' Acceptable HTML includes HTML-encoded entities.
     html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
     '' Does this have HTML tags?
     If html.IndexOf("<") >= 0 Then
         '' Make all tags lowercase
         html = RegEx.Replace(html, "<[^>]+>", AddressOf LowerTag)
         '' Filter out anything except allowed tags
         '' Problem: this strips attributes, including href from a
         '' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
         Dim AcceptableTags      As String   = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
         Dim WhiteListPattern    As String   = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
         html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
         '' Make all BR/br tags look the same, and trim them of whitespace before/after
         html = RegEx.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegExOptions.Compiled)
     End If
     '' No CRs
     html = html.Replace(controlChars.CR, "")
     '' Convert remaining LFs to line breaks
     html = html.Replace(controlChars.LF, "<br />")
     '' Trim BRs at the end of any string, and spaces on either side
     Return RegEx.Replace(html, "(<br />)+$", "", RegExOptions.Compiled).Trim()
 End Function

 Public Shared Function LowerTag(m As Match) As String
   Return m.ToString().ToLower()
 End Function

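In your case, you will want to modify the list of "approved" HTML tags in "AcceptableTags". The code will still strip all the useless attributes (and, unfortunately, useful ones such as HREF and SRC; hopefully those aren't important to you).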
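If HREF and SRC do matter, one option is to filter attributes per tag instead of dropping them all. The following is only a rough sketch of that idea, not part of the answer's code: the class name `WhitelistSketch`, the `Allowed` table, and the particular tags and attributes in it are assumptions chosen for illustration. It handles quoted attribute values only, and it is formatting cleanup rather than a security sanitizer (for example, `javascript:` URLs in href pass through unchanged).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WhitelistSketch
{
    private static readonly string[] None = new string[0];

    // Hypothetical whitelist: tag name -> attributes allowed to survive on that tag.
    private static readonly Dictionary<string, string[]> Allowed =
        new Dictionary<string, string[]>
        {
            { "b", None }, { "i", None }, { "u", None },
            { "ul", None }, { "ol", None }, { "li", None },
            { "p", None }, { "br", None },
            { "a", new[] { "href" } },
            { "img", new[] { "src", "alt" } },
        };

    // HTML comments (including Word's conditional comments) and <![...]> markers.
    private static readonly Regex CommentPattern =
        new Regex(@"<!--.*?-->|<![^>]*>", RegexOptions.Singleline | RegexOptions.Compiled);

    // One tag: group 1 = "/" for closing tags, group 2 = tag name, group 3 = raw attribute text.
    private static readonly Regex TagPattern =
        new Regex(@"<(/?)([a-zA-Z][a-zA-Z0-9:]*)([^>]*)>", RegexOptions.Compiled);

    // name="value" or name='value'; unquoted values are deliberately not matched.
    private static readonly Regex AttrPattern =
        new Regex(@"([a-zA-Z\-:]+)\s*=\s*(""[^""]*""|'[^']*')", RegexOptions.Compiled);

    public static string Clean(string html)
    {
        html = CommentPattern.Replace(html, "");
        return TagPattern.Replace(html, m =>
        {
            string name = m.Groups[2].Value.ToLowerInvariant();
            string[] keep;
            if (!Allowed.TryGetValue(name, out keep))
                return "";                        // drop the tag itself, keep its inner text
            if (m.Groups[1].Value == "/")
                return "</" + name + ">";

            // Rebuild the opening tag with only the whitelisted attributes.
            string attrs = "";
            foreach (Match a in AttrPattern.Matches(m.Groups[3].Value))
            {
                string attrName = a.Groups[1].Value.ToLowerInvariant();
                if (Array.IndexOf(keep, attrName) >= 0)
                    attrs += " " + attrName + "=" + a.Groups[2].Value;
            }
            return "<" + name + attrs + ">";
        });
    }
}
```

With this table, `WhitelistSketch.Clean` applied to `<p class=MsoNormal><b>Bold</b> <a href="http://example.com" target="_blank">link</a></p>` should return `<p><b>Bold</b> <a href="http://example.com">link</a></p>`. Self-closing tags come back without the trailing slash (`<br />` becomes `<br>`), so normalize afterwards if that matters.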

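Of course, this requires a trip to the server. If you don't want that, you'll need to add some sort of "clean up" button to the toolbar that calls JavaScript to rework the editor's current text. Unfortunately, "pasting" is not an event that can be trapped to clean up the markup automatically, and cleaning after every OnChange would make the editor unusable (since changing the markup moves the text cursor).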

score 4 · Accepted Answer

I tried the accepted solution, but it didn't clean up the tags generated by Word.

But this code worked for me:

// Requires: using System.Collections.Specialized; using System.Text.RegularExpressions;
static string CleanWordHtml(string html)
{
    StringCollection sc = new StringCollection();
    // get rid of unnecessary tag spans (comments and title)
    sc.Add(@"<!--(\w|\W)+?-->");
    sc.Add(@"<title>(\w|\W)+?</title>");
    // Get rid of classes and styles
    sc.Add(@"\s?class=\w+");
    sc.Add(@"\s+style='[^']+'");
    // Get rid of unnecessary tags
    sc.Add(
        @"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
    // Get rid of empty paragraph tags
    sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
    // remove bizarre v: element attached to <img> tag
    sc.Add(@"\s+v:\w+=""[^""]+""");
    // remove extra lines
    // (as written this matches LF followed by CR; Windows line endings are CR LF)
    sc.Add(@"(\n\r){2,}");
    foreach (string s in sc)
    {
        html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
    }
    return html;
}
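One thing to watch with the class and style patterns above: they only match unquoted class values (`class=MsoNormal`) and single-quoted style values, which is how Word itself tends to write them. If the markup has already been normalized to double-quoted attributes by the time it reaches the server, those two patterns will not fire. A possible variation, purely as a sketch (the class and method names here are made up for illustration), accepts all three quoting styles:

```csharp
using System.Text.RegularExpressions;

public static class WordAttributeSketch
{
    // Assumed variant of the class/style patterns: the value may be
    // double-quoted, single-quoted, or unquoted.
    private static readonly Regex PresentationalAttrs = new Regex(
        @"\s+(?:class|style)\s*=\s*(?:""[^""]*""|'[^']*'|[^\s>]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string StripPresentationalAttributes(string html)
    {
        // Removes each matching attribute together with its leading whitespace.
        return PresentationalAttrs.Replace(html, "");
    }
}
```

For example, `<p class="MsoNormal" style="margin:0cm"><b style='mso-bidi-font-weight:normal'>Hi</b></p>` should come back as `<p><b>Hi</b></p>`. Like the original patterns, this runs over the whole string, so literal text such as ` class=foo` inside the content would be removed as well.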

score 2 · Accepted Answer

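I understand the problem well. When you copy from MS-Word (or any word processor or rich-text-aware text area) and paste into FCKEditor (the same problem occurs with TinyMCE), the original markup is included in the clipboard contents and gets processed. That markup does not always fit well with the markup already embedded in the target of the paste operation.

I don't know of a solution other than becoming a contributor to FCKEditor, studying the code, and modifying it. What I usually do is instruct users to perform a two-stage clipboard operation: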

Copy from MS-Word
Paste into Notepad
Select all
Copy from Notepad
Paste into FCKEditor
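If users can't be relied on to do the Notepad round trip, roughly the same effect (everything reduced to plain text, bold and italics included) can be approximated on the server. This is only a sketch under that assumption; `PlainTextSketch` and `ToPlainText` are hypothetical names, and the choice of which closing tags count as line breaks is arbitrary.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextSketch
{
    public static string ToPlainText(string html)
    {
        // Treat <br> and the end of common block elements as line breaks.
        string text = Regex.Replace(
            html, @"<br\s*/?>|</(?:p|div|li|h[1-6])>", "\n", RegexOptions.IgnoreCase);

        // Drop comments and every remaining tag, keeping the text between them.
        text = Regex.Replace(text, @"<!--.*?-->|<[^>]+>", "", RegexOptions.Singleline);

        // Turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

For example, `<p class=MsoNormal><b>Hello</b> &amp; bye</p><p>Next</p>` should become `Hello & bye`, a line break, then `Next`. As with the Notepad route, this throws away the formatting the question wants to keep, so it is a fallback rather than an answer to the original problem.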

score 0 · Accepted Answer

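But fckeditor is, as its name and website suggest, a text editor. To me, that means it only shows the characters in the file.

You can't have bold or italic formatting without extra characters.

Edit: Ah, I see. Looking more closely at the Fckeditor website, it is an HTML editor, not one of the simple text editors I am used to.

It lists "Paste from Word cleanup with autodetection" as a feature.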
