java - Java XML parser error: invalid character (Unicode: 0x1A) when copying/pasting from Word

Question

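Sorry for the double post, but my earlier post was framed around Flex:

Flex TextArea - copy/paste from Word - invalid Unicode characters on XML parsing

Now I am posting it from the Java side.

The problem is as follows.

We have an email feature (part of our application) that builds an XML string and puts it on a queue. Another application picks it up, parses the XML, and sends the email.

When the email text (<BODY>....</BODY>) is copied and pasted from Word, the XML parser throws an exception: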

Invalid character in attribute value BODY (Unicode: 0x1A)

Since the application also uses Java, I am trying to remove the invalid characters from the string with the following:

body = body.replaceAll("‘", "");
body = body.replaceAll("’", "");

// Remove invalid characters

public String stripNonValidXMLCharacters(String in) {
        StringBuffer out = new StringBuffer(); // Used to hold the output.
        char current; // Used to reference the current character.

        if (in == null || ("".equals(in))) {
            return ""; // vacancy test.
        }
        for (int i = 0; i < in.length(); i++) {
            //NOTE: No IndexOutOfBoundsException caught here; it should not happen.
            current = in.charAt(i); 
            if ((current == 0x9) 
                    || (current == 0xA) 
                    || (current == 0xD) 
                    || ((current >= 0x20) && (current <= 0xD7FF)) 
                    || ((current >= 0xE000) && (current <= 0xFFFD)) 
                    || ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.append(current);
        }
        return out.toString();
    }

// Strip again

private String stripNonValidXMLCharacter(String in) {      
        if (in == null || ("".equals(in))) { 
            return null;
        }
        StringBuffer out = new StringBuffer(in);
        for (int i = 0; i < out.length(); i++) {
            if (out.charAt(i) == 0x1a) {
                out.setCharAt(i, '-');
            }
        }
        return out.toString();
    }

// Replace special characters, if any

 emailText = emailText.replaceAll("[\\u0000-\\u0008\\u000B\\u000C" 
                        + "\\u000E-\\u001F" 
                        + "\\uD800-\\uDFFF\\uFFFE\\uFFFF\\u00C5\\u00D4\\u00EC"
                        + "\\u00A8\\u00F4\\u00B4\\u00CC\\u2211]", " ");
            emailText = emailText.replaceAll("[\\x00-\\x1F]", "");
            emailText = emailText.replaceAll(
                                    "[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");
            emailText = emailText.replaceAll("\\p{C}", "");

But it still does not work. Also, the XML string starts with:

 <?xml version="1.0" encoding="UTF-8"?>  
                    <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\\SMTPSchema.xsd\">

I think the problem occurs when the Word document contains multiple tabs. For example:

Text......text
<newLine>
<tab><tab><tab> text...text
<newLine>

The resulting XML string is:

<?xml version="1.0" encoding="UTF-8"?> <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd"> <EMAIL SOURCE="t@t.com" DEST="t@t.com" CC="" BCC="t@t.com" SUBJECT="test 61" BODY="As such there was no mechanism constructed to migrate the enrollment user base to Data Collection or to keep security attributes for common users in sync between the two systems.  The purpose of this document is to outline two strategies for bring the user base between the two applications into sync.?  It still is the same.  ** Please note: This e-mail message was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message."/> </EMAILS>

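Note the "?" in the BODY text ("...into sync.?"): that is where the Word document had multiple tabs. I hope my question is clear and that someone can help solve the problem.

Thanks.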

score 0 · Accepted Answer

Have you tried sanitizing the XML with a library such as TagSoup, JSoup, or JTidy?
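If adding a library is not an option, the same kind of sanitizing can be sketched with plain JDK code. The following is a minimal sketch, not a drop-in fix: the class name `XmlSanitizer` and its method names are made up for illustration. It keeps only the characters allowed by the XML 1.0 `Char` production (the same ranges the question checks) but walks the string by code point rather than by `char`, so valid supplementary characters are kept and unpaired surrogates are dropped.

```java
// Hypothetical helper; the class and method names are assumptions for illustration.
final class XmlSanitizer {
    private XmlSanitizer() {}

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input without characters that XML 1.0 forbids.
    static String stripInvalidXmlChars(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and is rejected below.
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, `XmlSanitizer.stripInvalidXmlChars(body)` could be applied just before the BODY attribute is written. The result still needs the usual attribute escaping for `&`, `<`, and `"`.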

score 0 · Accepted Answer

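The invalid (hidden) characters were coming from the UI (the Flex TextArea), so they had to be dealt with in the UI so that they are never passed on to Java. I handled it in the Flex TextArea's changingHandler, which removes them and restricts the characters that can be entered.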
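To check on the Java side whether such hidden characters are still arriving from the UI, a small diagnostic along these lines can be run on the incoming text before the XML is built. This is only a sketch under assumptions: the class name `InvalidXmlCharReport` and the method `findInvalidXmlChars` are hypothetical, and the report format is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic; names and output format are assumptions for illustration.
final class InvalidXmlCharReport {
    private InvalidXmlCharReport() {}

    // Lists every code point that XML 1.0 does not allow, with its char index.
    static List<String> findInvalidXmlChars(String text) {
        List<String> hits = new ArrayList<>();
        if (text == null) {
            return hits;
        }
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!valid) {
                hits.add(String.format("index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }
}
```

An empty list means the text contains nothing the XML 1.0 character rules reject; a non-empty list shows where the offending characters sit, which helps confirm whether the UI-side filter is doing its job.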
