c++ - UTF-8, CString, CFile? (C++, MFC)

Question

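I'm currently working on an MFC program that specifically has to work with UTF-8. At some point I have to write UTF-8 data into a file; to do that, I'm using CFile and CString.

When I write the UTF-8 data (Russian characters, to be more precise) into the file, the output looks like this: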

Ðàñïå÷àòàíî:
Ñèñòåìà
Ïðîèçâîäñòâî

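and so on. This is most assuredly not UTF-8. To read this data properly, I have to change my system settings. Switching the non-ASCII characters to a Russian encoding table does work, but then all Latin-based non-ASCII characters fail. Anyway, this is how I do it: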

CFile CSVFile( m_sCible, CFile::modeCreate|CFile::modeWrite);
CString sWorkingLine;
//Add stuff into sWorkingline
CSVFile.Write(sWorkingLine,sWorkingLine.GetLength());
//Clean sWorkingline and start over
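A note on the symptom: the garbled output above is consistent with Windows-1251 (Cyrillic) bytes being displayed as Latin-1 / Windows-1252, which would suggest the CString holds text in an ANSI code page and CFile::Write copies those bytes unchanged. The sketch below is portable C++, not MFC; the function names are hypothetical, and the Windows-1251 table covers only ASCII plus the 0xC0..0xFF range. It decodes the same raw bytes both ways so the two readings can be compared.

```cpp
#include <cassert>
#include <string>

// Append one code point below U+0800 to `out` as UTF-8.
static void appendUtf8(std::string& out, unsigned cp) {
    assert(cp < 0x800);  // enough for Latin-1 and basic Cyrillic
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Interpret raw bytes as ISO-8859-1 (Latin-1): byte value == code point.
// Windows-1252 agrees with Latin-1 for bytes 0xA0..0xFF.
std::string latin1ToUtf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) appendUtf8(out, b);
    return out;
}

// Interpret raw bytes as Windows-1251. Only ASCII and the contiguous
// block 0xC0..0xFF (U+0410..U+044F) are handled in this sketch;
// every other byte becomes '?'.
std::string cp1251ToUtf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80)       appendUtf8(out, b);
        else if (b >= 0xC0) appendUtf8(out, 0x0410u + (b - 0xC0u));
        else                out += '?';
    }
    return out;
}
```

Feeding the bytes D0 E0 F1 EF E5 F7 E0 F2 E0 ED EE 3A to latin1ToUtf8 gives "Ðàñïå÷àòàíî:", while cp1251ToUtf8 gives "Распечатано:". If that matches the real data, the file never contained UTF-8 in the first place.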

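Am I missing something? Should I use something else instead? Is there some kind of catch I've missed? I'll be tuned in for your wisdom and experience, fellow programmers.

Edit: Of course, just as I asked the question, I finally found something that might be interesting, which can be found here. Thought I might share it.

Edit 2:

Okay, so I added the BOM to my file, which now contains Chinese characters, probably because I didn't convert my line to UTF-8. To add the BOM I did...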

char BOM[3]={0xEF, 0xBB, 0xBF};
CSVFile.Write(BOM,3);

And after that, I added...

    TCHAR TestLine;
    //Convert the line to UTF-8 multibyte.
    WideCharToMultiByte (CP_UTF8,0,sWorkingLine,sWorkingLine.GetLength(),TestLine,strlen(TestLine)+1,NULL,NULL);
    //Add the line to file.
    CSVFile.Write(TestLine,strlen(TestLine)+1);

But I can't compile, as I don't really know how to get the length of TestLine; strlen doesn't seem to accept TCHAR. Fixed: I used a static length of 1000 instead.

Edit 3:

So, I added this code...

    wchar_t NewLine[1000];
    wcscpy( NewLine, CT2CW( (LPCTSTR) sWorkingLine ));
    TCHAR* TCHARBuf = new TCHAR[1000];

    //Convert the line to UTF-8 multibyte.
    WideCharToMultiByte (CP_UTF8,0,NewLine,1000,TCHARBuf,1000,NULL,NULL);

    //Find how many characters we have to add
    size_t size = 0;
    HRESULT hr = StringCchLength(TCHARBuf, MAX_PATH, &size);

    //Add the line to the file
    CSVFile.Write(TCHARBuf,size);

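It compiles just fine, but when I look at the new file, it's exactly the same as when I didn't have all this new code (e.g. Ðàñïå÷àòàíî:). It feels like I haven't moved a step forward, although I guess only a small detail separates me from victory.

Edit 4:

I removed the previously added code, as Nate asked, and decided to use his code instead: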

    CT2CA outputString(sWorkingLine, CP_UTF8);

    //Add line to file.
    CSVFile.Write(outputString,::strlen(outputString));

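Everything compiles just fine, but the Russian characters are shown as ???????. Getting closer, but still not there. By the way, I'd like to thank everyone who tried or is trying to help me. I've been stuck on this for a while now, and I can't wait for this problem to be gone.

Final edit (I hope): I changed the way I first obtained the UTF-8 characters (I had been re-encoding them without really knowing it), which had been going wrong with my new way of outputting the text, and I now get acceptable results. With the UTF-8 BOM added at the beginning of the file, other programs such as Excel can read it as Unicode.

Hurray! Thank you everyone!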

score 28 · Accepted Answer

When you output data you need to do (this assumes you are compiling in Unicode mode, which is highly recommended):

CString russianText = L"Привет мир";

CFile yourFile(_T("yourfile.txt"), CFile::modeWrite | CFile::modeCreate);

CT2CA outputString(russianText, CP_UTF8);
yourFile.Write(outputString, ::strlen(outputString));

If _UNICODE is not defined (you are working in multi-byte mode instead), you need to know what code page your input text is in and convert it to something you can use. This example shows working with Russian text that is in UTF-16 format, saving it to UTF-8:

// Example 1: convert from Russian text in UTF-16 (note the "L"
// in front of the string), into UTF-8.
CW2A russianTextAsUtf8(L"Привет мир", CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

More likely, your Russian text is in some other code page, such as KOI8-R. In that case, you need to convert from the other code page into UTF-16, and then convert the UTF-16 into UTF-8. You cannot convert directly from KOI8-R to UTF-8 using the conversion macros because they always try to convert narrow text to the system code page. So the easy way is to do this:

// Example 2: convert from Russian text in KOI8-R (code page 20866)
// to UTF-16, and then to UTF-8. Conversions between UTFs are
// lossless.
CA2W russianTextAsUtf16("\xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2", 20866);
CW2A russianTextAsUtf8(russianTextAsUtf16, CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

You don't need a BOM (it's optional; I wouldn't use it unless there was a specific reason to do so).
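If a consumer does turn out to need the BOM (the question's final edit mentions Excel), it is just the three bytes EF BB BF written once, before any text. Below is a minimal portable sketch using std::ofstream rather than CFile; writeUtf8File and readAllBytes are hypothetical helper names, and the text passed in is assumed to be UTF-8 already. The same byte array could be passed to CFile::Write instead.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not part of MFC): writes already-encoded UTF-8
// bytes to `path`, optionally preceded by the UTF-8 BOM.
bool writeUtf8File(const std::string& path, const std::string& utf8, bool withBom) {
    assert(!path.empty());
    // Binary mode so that no newline translation alters the bytes.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    if (withBom) {
        // unsigned char: 0xEF etc. do not fit in a signed char.
        static const unsigned char kBom[3] = {0xEF, 0xBB, 0xBF};
        out.write(reinterpret_cast<const char*>(kBom), sizeof kBom);
    }
    out.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
    return static_cast<bool>(out);
}

// Reads a whole file back as raw bytes, for checking the result.
std::string readAllBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The BOM belongs only at the very start of the file, so when lines are written in a loop it should be emitted once before the loop.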

Make sure you read this: http://msdn.microsoft.com/en-us/library/87zae4a3(VS.80).aspx. If you incorrectly use CT2CA (for example, using the assignment operator) you will run into trouble. The linked documentation page shows examples of how to use and how not to use it.

Further information:

The C in CT2CA indicates const. I use it when possible, but some conversions only support the non-const version (e.g. CW2A).
The T in CT2CA indicates that you are converting from an LPCTSTR. Thus it will work whether your code is compiled with the _UNICODE flag or not. You could also use CW2A (where W indicates wide characters).
The A in CT2CA indicates that you are converting to an "ANSI" (8-bit char) string.
Finally, the second parameter to CT2CA indicates the code page you are converting to.

To do the reverse conversion (from UTF-8 to LPCTSTR), you could do:

CString myString(CA2CT(russianText, CP_UTF8));

In this case, we are converting from an "ANSI" string in UTF-8 format, to an LPCTSTR. The LPCTSTR is always assumed to be UTF-16 (if _UNICODE is defined) or the current system code page (if _UNICODE is not defined).
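To make the reverse direction concrete, here is a portable sketch of what a UTF-8 to UTF-16 conversion produces. It is not how CA2CT is implemented; utf8ToUtf16 is a hypothetical name, and the error handling (U+FFFD for malformed input, no rejection of overlong forms) is an assumption chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch of UTF-8 -> UTF-16. Malformed or truncated sequences yield
// U+FFFD; overlong encodings are NOT rejected here, which a production
// decoder should do.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += u'\uFFFD'; ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or not a continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += u'\uFFFD';
            continue;
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    assert(out.size() <= in.size());
    return out;
}
```

The final assertion also gives a safe upper bound when sizing an output buffer by hand: one UTF-16 code unit per input byte is always enough.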

score 6 · Answer

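You need to convert sWorkingLine to UTF-8 and then write it into the file.

WideCharToMultiByte can convert a Unicode (UTF-16) string to UTF-8 if you select the CP_UTF8 code page. MultiByteToWideChar goes the other way and can convert ASCII (or, more generally, code-page) text to Unicode.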
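For readers who want to see what such a conversion does to the bytes, here is a portable sketch of UTF-16 to UTF-8 encoding. It is an illustration under stated assumptions, not the Win32 implementation: utf16ToUtf8 is a hypothetical name, and replacing unpaired surrogates with U+FFFD is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch of UTF-16 -> UTF-8. Unpaired surrogates become U+FFFD.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {         // high surrogate
            if (i + 1 < in.size() &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the pair into one code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            cp = 0xFFFD;
        }
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {                            // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                    // 2 bytes (Cyrillic lands here)
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                  // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                    // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Each Cyrillic letter becomes two bytes, so the UTF-8 byte count differs from the character count of the source string. That is why the length passed to Write has to be measured on the converted buffer, not on the original CString.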

score 0 · Answer

Make sure you are using Unicode (TCHAR is wchar_t). Then, before you write the data, convert it using the WideCharToMultiByte Win32 API function.
