c++ - How do I decode multibyte UTF-8 strings? (C++)

Question

I'm trying to build a set of helper functions for decoding and modifying multibyte UTF-8 strings. For example, finding the number of characters in the string, and finding the byte offset of a particular character.

I've been looking for a solution for a while, but haven't been able to figure it out. If anyone could show me a cross-platform, portable way to do this using only the STL, I would really appreciate it. If there is a C++11 way to do it, I'm open to that as well.

score 3 · Accepted Answer

You should read up on the Wikipedia page about UTF-8. The encoding is clearly described there: https://en.wikipedia.org/wiki/UTF-8

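To decode UTF-8, read the first byte. Its high bits tell you how many subsequent bytes make up the character. Then read that many more bytes and concatenate the "data" bits, which gives you the code point number.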
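The decoding step can be sketched roughly as follows. This is only an illustration using the standard library: the name `decode_utf8_at` and its signature are made up for this example, and it assumes you only need structural checks (valid lead byte, enough bytes, `10xxxxxx` continuation bytes). It does not reject overlong forms, surrogates, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from any library): decodes the code point that
// starts at byte offset `pos` in `s`.
// On success returns true, stores the code point in `cp` and the number of
// bytes it occupies in `len`.
// Returns false if `pos` is out of range, the lead byte is invalid, the
// sequence is truncated, or a continuation byte is malformed.
// Assumption: overlong encodings, surrogates and values > U+10FFFF are NOT
// rejected by this sketch.
bool decode_utf8_at(const std::string& s, std::size_t pos,
                    char32_t& cp, std::size_t& len) {
    if (pos >= s.size()) return false;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);

    // The high bits of the first byte give the sequence length;
    // the remaining low bits are the first "data" bits.
    if (lead < 0x80)                { cp = lead;        len = 1; }  // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }  // 11110xxx
    else return false;  // stray continuation byte or invalid lead byte

    if (len > s.size() - pos) return false;  // sequence runs past the end

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) return false;  // must look like 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);           // append six data bits
    }
    return true;
}
```

For the three bytes `E2 82 AC`, the lead byte contributes `0010` and the two continuation bytes contribute `000010` and `101100`, which concatenate to U+20AC.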

Do this until you reach the end of the string, and you can count how many code points the string contains.
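A minimal sketch of that counting loop, assuming the input is mostly well-formed UTF-8. The names `utf8_sequence_length` and `count_code_points` are invented for this example, and returning `std::string::npos` on bad input is just one possible error convention. Since only the count is needed, the sketch looks at each lead byte and skips the rest of the sequence without assembling the code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the sequence introduced by `lead`,
// or 0 if `lead` cannot start a UTF-8 sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

// Hypothetical helper: counts code points by hopping from lead byte to lead
// byte. Returns std::string::npos if a lead byte is invalid or the final
// sequence is truncated.
// Assumption: continuation bytes are not inspected, so this is not a validator.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    std::size_t pos = 0;
    while (pos < s.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[pos]));
        if (len == 0 || len > s.size() - pos) return std::string::npos;
        pos += len;  // skip the whole sequence
        ++count;
    }
    return count;
}
```

Note that this counts code points, not user-perceived characters: a base letter followed by a combining accent is two code points.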

Do this until you reach a particular code point index, and you have the byte offset of that code point.
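The offset lookup might look something like this. Again this is a sketch rather than a definitive implementation: `byte_offset_of_code_point` is a made-up name, indices are assumed to be 0-based, and `std::string::npos` is used as an arbitrary "not found or malformed" result.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset at which code point number
// `index` (0-based) starts in `s`.
// Returns std::string::npos if the string has fewer than index + 1 code
// points, a lead byte is invalid, or a sequence is truncated.
// Assumption: continuation bytes are skipped without being checked.
std::size_t byte_offset_of_code_point(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    for (std::size_t seen = 0; pos < s.size(); ++seen) {
        if (seen == index) return pos;  // reached the requested code point

        const unsigned char lead = static_cast<unsigned char>(s[pos]);
        std::size_t len = 0;
        if (lead < 0x80)                len = 1;  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return std::string::npos;            // invalid lead byte

        if (len > s.size() - pos) return std::string::npos;  // truncated
        pos += len;  // jump to the next lead byte
    }
    return std::string::npos;  // index is past the last code point
}
```

The lookup is linear in the length of the string, because UTF-8 is a variable-width encoding and there is no way to jump straight to the n-th code point without scanning.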

I don't think there are any STL features that will really help you with this, other than the basic std::string::const_iterator.

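As for non-standard libraries, I would strongly recommend using a Unicode library such as ICU instead of writing the code yourself. The .NET libraries sort of work if you're careful, but I don't think Windows has any other API that will help with this.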
