c# - Read UTF8/UNICODE characters from an escaped ASCII sequence

Question

I have the following name in a file, and I need to read it as a UTF8-encoded string, so from this:

test_\303\246\303\270\303\245.txt

I need to get this:

test_æøå.txt

Do you know how to achieve this in C#?

score 4 · Accepted Answer

Assuming you have this string:

string input = "test_\\303\\246\\303\\270\\303\\245.txt";

i.e. literally:

test_\303\246\303\270\303\245.txt

You can do this:

using System;
using System.Text;
using System.Text.RegularExpressions;

string input = "test_\\303\\246\\303\\270\\303\\245.txt";
Encoding iso88591 = Encoding.GetEncoding(28591); //See note at the end of answer
Encoding utf8 = Encoding.UTF8;

//Turn the octal escape sequences into characters having codepoints 0-255
//this results in a "binary string"
string binaryString = Regex.Replace(input, @"\\(?<num>[0-7]{3})", delegate(Match m)
{
    //Parse the three octal digits as a base-8 number, e.g. "303" -> 195
    String oct = m.Groups["num"].ToString();
    return Char.ConvertFromUtf32(Convert.ToInt32(oct, 8));
});

//Turn the "binary string" into bytes
byte[] raw = iso88591.GetBytes(binaryString);

//Read the bytes into C# string
string output = utf8.GetString(raw);
Console.WriteLine(output);
//test_æøå.txt

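By "binary string" I mean a string consisting only of characters with code points 0-255. It therefore amounts to a poor man's byte[], where you retrieve the code point of the character at index i instead of the byte value at index i of a byte[] (this is what we did in JavaScript a few years ago). Because iso-8859-1 maps exactly the first 256 Unicode code points to a single byte each, it is well suited to converting a "binary string" into a byte[].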
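
To make the iso-8859-1 property concrete, here is a minimal sketch that builds a "binary string" from every byte value 0-255 and checks that encoding it with code page 28591 gives back the same bytes. The names BinaryStringDemo, ToBinaryString and RoundTripsAllBytes are made up for this illustration and are not part of the answer above or of any library.

```csharp
using System;
using System.Text;

public static class BinaryStringDemo
{
    // Hypothetical helper: builds a "binary string" in which the character
    // at index i has the code point data[i].
    public static string ToBinaryString(byte[] data)
    {
        var sb = new StringBuilder(data.Length);
        foreach (byte b in data)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }

    // Hypothetical check: returns true if every byte value 0-255 survives
    // the trip byte -> char -> iso-8859-1 byte unchanged.
    public static bool RoundTripsAllBytes()
    {
        Encoding iso88591 = Encoding.GetEncoding(28591);

        byte[] allBytes = new byte[256];
        for (int i = 0; i < allBytes.Length; i++)
        {
            allBytes[i] = (byte)i;
        }

        byte[] roundTripped = iso88591.GetBytes(ToBinaryString(allBytes));
        if (roundTripped.Length != allBytes.Length)
        {
            return false;
        }

        for (int i = 0; i < allBytes.Length; i++)
        {
            if (roundTripped[i] != allBytes[i])
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(RoundTripsAllBytes());
    }
}
```

If the check holds, no byte is lost or altered on the way from the "binary string" to the byte[], which is what the UTF8 decoding step in the answer relies on.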
