c++ - From a C++ string to a valid UTF-8 string using utf8proc

Question

I have a std::string output. I would like to convert it to a valid UTF-8 string using utf8proc. http://www.public-software-group.org/utf8proc-documentation

typedef int int32_t;
#define ssize_t int
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options)
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned.
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!

First, how do I add the extra byte at the end? Second, how do I convert from std::string to int32_t *buffer?

This does not work:

std::string g = output();
fprintf(stdout,"str: %s\n",g.c_str());
g += " ";   //add an extra byte?? 
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0);
fprintf(stdout,"strutf8: %s\n",g.c_str());

score 0 · Accepted Answer

You very likely do not actually want utf8proc_reencode(). That function takes a valid UTF-32 buffer and converts it to a valid UTF-8 buffer, but since you do not know what encoding your data is in, you cannot use that function.
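
To make the UTF-32 point concrete, here is a minimal sketch of what turning a sequence of code points into UTF-8 involves. It is not utf8proc's implementation; append_utf8 and code_points_to_utf8 are hypothetical helpers written only for illustration. The input is one code point per 32-bit element, which is why casting g.c_str() to int* cannot work on byte-oriented data.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of utf8proc): appends the UTF-8 encoding of
// one Unicode code point to `out`. Returns false for values that are not
// valid scalar values (surrogates, or anything above 0x10FFFF).
bool append_utf8(std::uint32_t cp, std::string &out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Hypothetical helper: encodes a whole sequence of code points.
// Stops and returns false at the first invalid value; `out` then holds
// only the code points encoded before the failure.
bool code_points_to_utf8(const std::vector<std::uint32_t> &cps, std::string &out) {
    for (std::uint32_t cp : cps) {
        if (!append_utf8(cp, out)) return false;
    }
    return true;
}
```

Calling code_points_to_utf8 with the code points U+0048, U+00E9 and U+20AC should append 1 + 2 + 3 = 6 bytes to the output string. A std::string holding bytes in some unknown encoding is not such a sequence, so it has to be decoded first.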

So, first you need to figure out what encoding your data is actually in. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8 with utf8::is_valid(g.begin(), g.end()). If that returns true, you are done!
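
For readers who want to see what such a validity check does, the following is a rough, dependency-free sketch. looks_like_valid_utf8 is a hypothetical name and this is not how utfcpp is implemented; it only illustrates the rules a validator has to enforce (correct lead and continuation bytes, no truncated or overlong sequences, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a library validator such as utf8::is_valid.
// Returns true only if `s` is well-formed UTF-8.
bool looks_like_valid_utf8(const std::string &s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
    // anything smaller in a longer sequence is an overlong encoding.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;            // stray continuation or invalid lead byte

        if (n - i < len) return false;  // sequence is cut off at end of input

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Note that passing this check only shows the bytes are well-formed UTF-8. Pure ASCII text passes too, and validity alone does not prove the producer intended UTF-8.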

If it returns false, things get complicated, but ICU ( http://icu-project.org/ ) can help you; see http://userguide.icu-project.org/conversion/detection

Once you know, with some certainty, what encoding your data is in, ICU can help you convert it to UTF-8. For example, assuming your source data g is in ISO-8859-1:

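// Requires #include <unicode/ucnv.h> and <vector>; link against ICU's common library (icuuc).
UErrorCode err = U_ZERO_ERROR; // check U_FAILURE(err) after every call...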
// CONVERT FROM ISO-8859-1 TO UChar
UConverter *conv_from = ucnv_open("ISO-8859-1", &err);
std::vector<UChar> converted(g.size()*2); // *2 is usually more than enough
int32_t conv_len = ucnv_toUChars(conv_from, &converted[0], converted.size(), g.c_str(), g.size(), &err);
converted.resize(conv_len);
ucnv_close(conv_from);
// CONVERT FROM UChar TO UTF-8
g.resize(converted.size()*4);
UConverter *conv_u8 = ucnv_open("UTF-8", &err);
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], g.size(), &converted[0], converted.size(), &err);
g.resize(u8_len);
ucnv_close(conv_u8);

After that, g holds UTF-8 data.
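
For the specific ISO-8859-1 case, the conversion is simple enough to sketch without ICU, because every ISO-8859-1 byte value is also the Unicode code point of the character it represents (U+0000 to U+00FF). The helper below, latin1_to_utf8, is a hypothetical illustration and not an ICU function; for any other source encoding a real converter such as ICU's is still needed.

```cpp
#include <string>

// Hypothetical helper (not part of ICU): converts ISO-8859-1 bytes to UTF-8.
// Bytes below 0x80 are copied unchanged; bytes 0x80..0xFF become the
// two-byte UTF-8 sequence for the code point with the same value.
std::string latin1_to_utf8(const std::string &in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += ch;
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With this sketch, g = latin1_to_utf8(g); would stand in for the two ICU conversion steps, but only under the assumption that g really is ISO-8859-1.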
