c# - How do I convert a UTF-8 string to Unicode?

Question

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}

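I am playing with the word "déjà". I converted it to UTF-8 with an online tool, so I started testing my method with the string "dÃ©jÃ".

Unfortunately, with this implementation the string just stays the same.

Where am I going wrong?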

score 18 · Accepted Answer

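So the problem is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.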

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
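The range check in the method above is commented out, so a char above 255 would be silently truncated by the cast. A minimal sketch of a variant that enforces the check follows; the class and method names (StrictByteCopy, DecodeFromUtf8Checked) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text;

static class StrictByteCopy
{
    // Hypothetical variant of DecodeFromUtf8 that rejects chars outside the byte range
    // instead of truncating them.
    public static string DecodeFromUtf8Checked(string utf8String)
    {
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; i++)
        {
            char c = utf8String[i];
            if (c > 0xFF)
            {
                throw new ArgumentException(
                    $"Char U+{(int)c:X4} at index {i} does not fit in a byte.",
                    nameof(utf8String));
            }
            utf8Bytes[i] = (byte)c;
        }

        // Decode the recovered bytes as UTF-8 into a normal (UTF-16) string.
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Calling StrictByteCopy.DecodeFromUtf8Checked("d\u00C3\u00A9j\u00C3\u00A0") should give the same result as the unchecked version, while a string containing, say, U+2013 is rejected with an ArgumentException rather than decoded into garbage.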

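This is easy, but it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).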
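To make that root cause concrete, here is a small sketch that reproduces the damage on purpose. It assumes the wrong decoder is ISO-8859-1, which maps every byte to the code point with the same value; the names MojibakeDemo and Mangle are hypothetical.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with the wrong
    // single-byte encoding (ISO-8859-1), one char per byte.
    public static string Mangle(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

MojibakeDemo.Mangle("d\u00E9j\u00E0") yields the six-char string "d\u00C3\u00A9j\u00C3\u00A0", which is exactly the kind of input the question's method receives.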

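Alternatively, if you know which incorrect encoding was used to produce the string, and that incorrect conversion was lossless (usually the case when the incorrect encoding is a single-byte encoding), you can simply perform the inverse encoding step to get the original UTF-8 data back, and then do the correct conversion from the UTF-8 bytes: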

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8);

score 9 · Accepted Answer

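I have a string that displays UTF-8 encoded characters

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes in a string will not end well; UTF-8 uses byte values that don't have a valid Unicode code point. The content will be destroyed when the string is normalized. So by the time your DecodeFromUtf8() starts running, it is already too late to recover the string.

Only handle UTF-8 encoded text with byte[], and use UTF8Encoding.GetString() to convert it.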
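As a rough sketch of that advice: keep the data as byte[] until the moment of decoding, and let a strict UTF8Encoding report malformed input instead of silently substituting replacement characters. The wrapper names (Utf8Bytes, Decode, TryDecode) are illustrative only.

```csharp
using System;
using System.Text;

static class Utf8Bytes
{
    // throwOnInvalidBytes: true makes GetString throw DecoderFallbackException
    // on malformed UTF-8 rather than emitting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static string Decode(byte[] utf8Bytes)
    {
        return Strict.GetString(utf8Bytes);
    }

    public static bool TryDecode(byte[] utf8Bytes, out string text)
    {
        try
        {
            text = Strict.GetString(utf8Bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }
}
```

For example, Utf8Bytes.Decode(new byte[] { 0x64, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0 }) decodes the UTF-8 bytes of "déjà", and TryDecode returns false for input containing a byte such as 0xFF, which never occurs in valid UTF-8.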

score 9 · Accepted Answer

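If you have a UTF-8 string in which every byte is intact, that is, each UTF-8 byte sits unchanged in its own 16-bit char ('Ö' -> [195, 0], [150, 0]), you can use the following: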

public static string Utf8ToUtf16(string utf8String)
{
    /***************************************************************
     * Every .NET string will store text with the UTF-16 encoding, *
     * known as Encoding.Unicode. Other encodings may exist as     *
     * Byte-Array or incorrectly stored with the UTF-16 encoding.  *
     *                                                             *
     * UTF-8 = 1 to 4 bytes per char                               *
     *    ["100" for the ansi 'd']                                 *
     *    ["206" and "186" for the Greek 'κ']                      *
     *                                                             *
     * UTF-16 = 2 or 4 bytes per char                              *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["186, 3" for the Greek 'κ']                             *
     *                                                             *
     * UTF-8 inside UTF-16                                         *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["206, 0" and "186, 0" for the Greek 'κ']                *
     *                                                             *
     * First we need to get the UTF-8 Byte-Array and remove all    *
     * 0 byte (binary 0) while doing so.                           *
     *                                                             *
     * Binary 0 means end of string on UTF-8 encoding while on     *
     * UTF-16 one binary 0 does not end the string. Only if there  *
     * are 2 binary 0, then the UTF-16 encoding will end the       *
     * string. Because of .NET we don't have to handle this.       *
     *                                                             *
     * After removing binary 0 and receiving the Byte-Array, we    *
     * can use the UTF-8 encoding to string method now to get a    *
     * UTF-16 string.                                              *
     *                                                             *
     ***************************************************************/

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }

    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}

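In my case the DLL result is a UTF-8 string too, but unfortunately the UTF-8 bytes were interpreted with the ANSI encoding before being stored as UTF-16 ('Ö' -> [195, 0], [19, 32]). So the ANSI '–', which is 150, was converted to the UTF-16 '–', which is 8211. If you are in this situation too, you can use the following instead: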

public static string Utf8ToUtf16(string utf8String)
{
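    // Get UTF-8 bytes by re-encoding each char with the ANSI encoding.
    // Note: Encoding.Default is the system ANSI code page on .NET Framework only;
    // on .NET Core and later it is UTF-8, so name the code page explicitly there.
    byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);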

    // Convert UTF-8 bytes to UTF-16 bytes
    byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    // Return UTF-16 bytes as UTF-16 string
    return Encoding.Unicode.GetString(utf16Bytes);
}

Or the native method:

[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);

public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf16ToUtf8. Hope this helps.

score 4 · Accepted Answer

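What you have seems to be a string that was incorrectly decoded from another encoding, likely code page 1252, which is the US Windows default. Here's how to reverse it, assuming no other loss. One loss that is not immediately apparent is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.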

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "dÃ©jÃ\xa0";  // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}

Result:

déjà
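One caveat when trying this on newer runtimes: on .NET Core and .NET 5+, code page 1252 is, as far as I know, not available until the code pages provider is registered. The sketch below assumes such a runtime (CodePagesEncodingProvider is not built into .NET Framework) and uses exception fallbacks so that a lossy round trip fails loudly instead of producing '?' characters. The names Cp1252Repair and Repair are hypothetical.

```csharp
using System;
using System.Text;

static class Cp1252Repair
{
    public static string Repair(string mangled)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(1252) works.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks: throw if a char has no cp1252 byte,
        // or if the recovered bytes are not valid UTF-8.
        Encoding cp1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);

        // Undo the wrong decode, then decode correctly.
        byte[] originalBytes = cp1252.GetBytes(mangled);
        return strictUtf8.GetString(originalBytes);
    }
}
```

With this, Cp1252Repair.Repair("d\u00C3\u00A9j\u00C3\u00A0") recovers "déjà", and "\u00C3\u2013" (the 'Ö' case where byte 150 had become U+2013) recovers "Ö".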
