c++ - C++ UTF-16 to char conversion (Linux/Ubuntu)

Question

I am trying to help a friend with a project that was supposed to take one hour and has now taken three days. Needless to say, I feel very frustrated and angry ;-) ooooouuuu... I breathe.

The program, written in C++, just reads a bunch of files and processes them. The problem is that the files use a UTF-16 encoding (because they contain words written in different languages), and simply using ifstream doesn't seem to work (it reads and outputs garbage). It took me a while to realise that this was because the files were in UTF-16.

I have now spent literally the whole afternoon on the web trying to find info about READING UTF-16 files and converting the content of a UTF-16 line to char! I just can't seem to find any! It's a nightmare. I am trying to learn about <locale>, <codecvt>, wstring, etc., which I have never used before (I specialise in graphics apps, not desktop apps). I just can't get it.

This is what I have done so far (but it doesn't work):

std::wifstream file2(fileFullPath);
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}

That's the best I could come up with, but it doesn't work: the output is no better than before. The real problem is that I don't understand what I am doing in the first place.

SO PLEASE PLEASE HELP! It is really driving me crazy that I can't even read a G*** D*** text file.

On top of that, my friend uses Ubuntu (I use clang++) and this code needs -stdlib=libc++, which doesn't seem to be supported by gcc on his side (even though he uses a pretty recent version of gcc, 4.6.3 I believe). So I am not even sure that using codecvt and locale is a good idea (as in "possible"). Is there a better (or another) option?

If I convert all the files to utf-8 just from the command line (using a linux command) am I going to potentially lose information?

Thanks a lot, I will be forever grateful to you if you help me with this.

score 3 · Accepted Answer

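"If I convert all the files to utf-8 just from the command line (using a linux command) am I going to potentially lose information?"

No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.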
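
To see why nothing is lost: UTF-16 and UTF-8 encode the same set of Unicode code points, so a converter only has to decode each code unit (or surrogate pair) to a code point and re-encode it. Below is a minimal hand-written sketch of that idea, which also avoids <codecvt> entirely. The name utf16_to_utf8 is made up for this example, it assumes a C++11 compiler (for std::u16string), and replacing unpaired surrogates with U+FFFD is a choice of this sketch, so the round trip is only lossless for well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration: converts UTF-16 code units to UTF-8.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // Lead surrogate: combine it with the trail surrogate that follows.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD; // unpaired lead surrogate (assumption: replace it)
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;     // unpaired trail surrogate (assumption: replace it)
        }
        // Re-encode the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage sketch: the euro sign U+20AC is one UTF-16 code unit and three
// UTF-8 bytes.
inline void utf16_to_utf8_demo()
{
    assert(utf16_to_utf8(u"\u20AC") == "\xE2\x82\xAC");
}
```

In practice a tool such as iconv does the same job from the command line; the sketch is only meant to show that the mapping is mechanical.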

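When wide characters were introduced, they were intended to be a text representation used exclusively inside a program, never written to disk as wide characters. The wide streams reflect this: they convert the wide characters you write into narrow characters in the output file, and convert the narrow characters in a file into wide characters in memory when reading.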

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ascii in the file is converted to wide characters.

Of course, the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that codecvt facet to convert from wchar_t to char when writing, and from char to wchar_t when reading.
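
As a rough illustration of that conversion, a codecvt facet can be driven directly, outside of any stream, with std::wstring_convert. The sketch below uses std::codecvt_utf8<wchar_t> as the facet; the helper names wide_to_utf8 and utf8_to_wide are made up for this example. Note that <codecvt> and std::wstring_convert are C++11 features that were deprecated in C++17, so treat this as a demonstration rather than a recommendation.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// wchar_t -> char: the direction a wide stream uses when writing.
inline std::string wide_to_utf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.to_bytes(wide);
}

// char -> wchar_t: the direction a wide stream uses when reading.
inline std::wstring utf8_to_wide(const std::string& narrow)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.from_bytes(narrow);
}

// Usage sketch: U+00E9 is a single wchar_t in memory but two chars (bytes)
// once the facet has encoded it as UTF-8.
inline void codecvt_facet_demo()
{
    assert(wide_to_utf8(L"\u00E9") == "\xC3\xA9");
    assert(utf8_to_wide("\xC3\xA9") == L"\u00E9");
}
```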

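However, since some people started writing files out in UTF-16, other people have had to deal with it. The way they do that with C++ streams is by creating codecvt facets that treat char as holding half of a UTF-16 code unit, which is what codecvt_utf16 does.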
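
To make the "half a code unit" idea concrete, here is a small sketch that feeds raw bytes to codecvt_utf16 through std::wstring_convert. Each pair of chars is assembled into one 16-bit code unit, big-endian by default or little-endian when std::little_endian is passed. The function names are made up for this example.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Big-endian is the default byte order for codecvt_utf16.
inline std::wstring utf16be_bytes_to_wide(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf16<wchar_t> > conv;
    return conv.from_bytes(bytes);
}

// std::little_endian selects the byte order typically produced on Windows.
inline std::wstring utf16le_bytes_to_wide(const std::string& bytes)
{
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10FFFF, std::little_endian> > conv;
    return conv.from_bytes(bytes);
}

// Usage sketch: "Hi" takes four chars in UTF-16, two per code unit. The
// explicit length is needed because the byte strings contain zero bytes.
inline void codecvt_utf16_demo()
{
    assert(utf16be_bytes_to_wide(std::string("\x00\x48\x00\x69", 4)) == L"Hi");
    assert(utf16le_bytes_to_wide(std::string("\x48\x00\x69\x00", 4)) == L"Hi");
}
```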

With that explanation in mind, here are the problems with your code:

std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}
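
On the `while (!file2.eof())` remark: eof() only becomes true after a read has already run into the end of the input, so the loop body can execute once more with a getline that failed. The sketch below counts iterations for both loop styles on an in-memory string; the function names are made up for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Counts iterations of the `while (!stream.eof())` pattern.
inline int count_lines_eof_loop(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    int count = 0;
    while (!in.eof()) {
        std::getline(in, line); // may fail, yet the body still runs
        ++count;
    }
    return count;
}

// Counts iterations when the result of getline is the loop condition.
inline int count_lines_getline_loop(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        ++count;
    }
    return count;
}

// Usage sketch: two lines, each terminated by a newline. The eof loop
// processes a third, empty "line" because eof is detected too late.
inline void getline_loop_demo()
{
    assert(count_lines_eof_loop("a\nb\n") == 3);
    assert(count_lines_getline_loop("a\nb\n") == 2);
}
```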

Here is one way to rewrite the above:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work

std::wstring line;
while (std::getline(file2, line)) {
  std::wcout << line << std::endl;
}
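
If what you want in the end is char data rather than console output, one option is to convert each wide line to UTF-8 explicitly, which also sidesteps the caveat about mixing the wide encodings of different locales. The following is only a sketch under stated assumptions: the function name read_utf16le_file_as_utf8 and the demo file name are made up, the input is assumed to be little-endian UTF-16 (optionally starting with a BOM), wchar_t is assumed to be 32 bits wide as on Linux, and <codecvt> must be available (it is deprecated since C++17).

```cpp
#include <cassert>
#include <codecvt>
#include <cstdio>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// Hypothetical helper: reads a UTF-16LE file and returns its lines as UTF-8.
inline std::vector<std::string> read_utf16le_file_as_utf8(const std::string& path)
{
    typedef std::codecvt_utf16<
        wchar_t, 0x10FFFF,
        static_cast<std::codecvt_mode>(std::consume_header | std::little_endian)>
        Utf16Facet;

    // Binary mode, then imbue the file stream before the first read.
    std::wifstream file(path.c_str(), std::ios::binary);
    file.imbue(std::locale(file.getloc(), new Utf16Facet));

    std::wstring_convert<std::codecvt_utf8<wchar_t> > to_utf8;
    std::vector<std::string> lines;
    std::wstring line;
    while (std::getline(file, line)) {
        // Drop the carriage return left over from Windows line endings.
        if (!line.empty() && line[line.size() - 1] == L'\r') {
            line.erase(line.size() - 1);
        }
        lines.push_back(to_utf8.to_bytes(line));
    }
    return lines;
}

// Usage sketch: writes a tiny sample file, reads it back, then deletes it.
inline void read_utf16le_demo()
{
    // BOM, "Hi\n", then U+00E9 and "\n", all as UTF-16LE bytes.
    const char bytes[] = "\xFF\xFE" "H\x00" "i\x00" "\n\x00" "\xE9\x00" "\n\x00";
    const char* path = "utf16le_demo.txt";
    {
        std::ofstream out(path, std::ios::binary);
        out.write(bytes, sizeof(bytes) - 1);
    }
    std::vector<std::string> lines = read_utf16le_file_as_utf8(path);
    std::remove(path);
    assert(lines.size() == 2);
    assert(lines[0] == "Hi");
    assert(lines[1] == "\xC3\xA9");
}
```

The resulting std::string values can be printed with plain std::cout on a UTF-8 terminal, with no locale imbued on the output side.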

score 0 · Accepted Answer

I took Mats Petersson's impressive solution, then adapted, corrected, and tested it.

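#include <cstddef>
#include <vector>

// UTF16 and UTF32 are assumed to be unsigned 16-bit and 32-bit integer
// typedefs defined elsewhere.

// Decode one code point from one or two UTF-16 code units.
int utf16_to_utf32(const std::vector<int> &coded)
{
    int t = coded[0];
    // The mask must be parenthesized: != binds more tightly than &.
    if ((t & 0xFC00) != 0xD800 || coded.size() < 2)
    {
        return t; // not a lead surrogate, or no trail unit available
    }
    // The lead surrogate supplies the high 10 bits, the trail the low 10.
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}

#ifdef __cplusplus    // If used by C++ code,
extern "C" {          // we need to export the C interface
#endif
// output must have room for at least input_size elements.
void convert_utf16_to_utf32(UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
     const UTF16 * const end = input + input_size;
     while (input < end){
       std::vector<int> vec;
       vec.push_back(*input++);             // first code unit
       if ((vec[0] & 0xFC00) == 0xD800 && input < end
           && (*input & 0xFC00) == 0xDC00) {
         vec.push_back(*input++);           // trail surrogate of the pair
       }
       *output++ = utf16_to_utf32(vec);
     }
}
#ifdef __cplusplus
}
#endif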
