c++ - Writing utf16 to a file in binary mode

Question

I'm trying to write a wstring to a file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:

ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();

When I open test.txt in, for example, Firefox with the encoding set to UTF16, it shows up as:

h�e�l�l�o�

Could anyone tell me why this happens?

EDIT:

Opening the file in a hex editor, I get:

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00

For some reason there seem to be two extra bytes between every character?

score 14 · Accepted Answer

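Here we run into the little-used locale properties. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion automatically.

N.B. This code does not take into account the endianness of the wchar_t character.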

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // Construct a custom unicode facet and add it to a locale.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB: if you open the file stream first, any attempt to imbue it with a locale will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}

File: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output arrays.
       for(;(from < from_end) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the input 16 bits may not fill the wchar_t object,
            * initialise it so that all its bits are zeroed out. This
            * is important on systems with 32-bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into the
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size
            * the wchar_t object is.
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       return((from > from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       for(;(from < from_end) && (to < to_limit);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object.
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       return((to > to_limit)?partial:ok);
   }
};

score 6 · Accepted Answer

It is easy if you use the C++11 standard (because it brings a lot of additional includes, such as "utf8", that solve this problem for good).
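The answer does not say which C++11 facility it has in mind. One plausible reading is the <codecvt> header, so here is a minimal sketch under that assumption. The helper name wide_to_utf16le is made up for illustration. Note that <codecvt> and std::wstring_convert were deprecated in C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: converts a wide string to UTF-16LE bytes (no BOM)
// using the C++11 <codecvt> facets. Deprecated since C++17.
std::string wide_to_utf16le(const std::wstring& text) {
    // 0x10FFFF is the largest code point accepted; std::little_endian
    // selects the byte order of the produced UTF-16 code units.
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10FFFF, std::little_endian>,
        wchar_t> converter;
    return converter.to_bytes(text);
}
```

The returned std::string is only a byte buffer, so it should be written with ofstream::write on a stream opened in binary mode, not with a wide stream.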

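But if you want multiplatform code that works with older standards, you can use this method to write with streams:

Read the article about the UTF converter for streams
Add stxutif.h to your project from the sources above

Open the file in ANSI mode and add the BOM to the start of the file, like this: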

std::ofstream fs;
fs.open(filepath, std::ios::out|std::ios::binary);

unsigned char smarker[3];
smarker[0] = 0xEF;
smarker[1] = 0xBB;
smarker[2] = 0xBF;

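// smarker is not null-terminated, so write exactly 3 bytes instead of using operator<<
fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker));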
fs.close();

Then open the file as UTF and write your content there:

std::wofstream fs;
fs.open(filepath, std::ios::out|std::ios::app);

std::locale utf8_locale(std::locale(), new utf8cvt<false>);
fs.imbue(utf8_locale); 

fs << .. // Write anything you want...

score 6 · Accepted Answer

I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)), but I'm pretty sure it's what's going on.
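To make that check concrete, here is a small sketch that dumps the in-memory bytes of a wstring, which is exactly what the write() call in the question sends to the file. The helper name raw_bytes_hex is made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the raw in-memory bytes of a wide string
// as space-separated hex, i.e. what ofstream::write() would emit.
std::string raw_bytes_hex(const std::wstring& text) {
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(text.data());
    const std::size_t count = text.size() * sizeof(wchar_t);

    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < count; ++i) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(bytes[i]));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

On a little-endian platform where sizeof(wchar_t) is 4, raw_bytes_hex(L"he") returns 68 00 00 00 65 00 00 00, which matches the spacing in the question's hex dump. Where sizeof(wchar_t) is 2, each character takes only two bytes.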

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, because surrogate pairs come into play.
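As a rough sketch of what "a proper encoding" means here, the function below turns each code point into one or two 16-bit units and emits them low byte first. The name to_utf16le and the choice to replace out-of-range values with U+FFFD are my own assumptions, not part of the answer.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-16LE bytes (no BOM).
// Assumes each wchar_t holds one code point (4-byte wchar_t); values that
// already fit in 16 bits are passed through unchanged.
std::string to_utf16le(const std::wstring& text) {
    std::string out;
    auto put_unit = [&out](std::uint32_t unit) {
        out += static_cast<char>(unit & 0xFF);         // low byte first
        out += static_cast<char>((unit >> 8) & 0xFF);  // then high byte
    };
    for (wchar_t wc : text) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp > 0x10FFFF) cp = 0xFFFD;  // not a valid code point
        if (cp < 0x10000) {
            put_unit(cp);                // BMP: a single code unit
        } else {
            cp -= 0x10000;               // 20 bits remain
            put_unit(0xD800 + (cp >> 10));    // high surrogate: top 10 bits
            put_unit(0xDC00 + (cp & 0x3FF));  // low surrogate: bottom 10 bits
        }
    }
    return out;
}
```

For example, U+1F600 becomes the surrogate pair D83D DE00, which is stored as the bytes 3D D8 00 DE. The result can be written with ofstream::write in binary mode after an FF FE byte order mark.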

score 2 · Accepted Answer

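On Windows, using wofstream with the utf16 facet defined above fails, because wofstream converts every byte with the value 0A into the two bytes 0D 0A. It doesn't matter how you pass the 0A in: L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with an ofstream (not a wofstream) in binary mode and write the output just as in the original post.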
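A minimal sketch of that approach, assuming the text is already held as UTF-16 code units in a std::u16string (C++11) rather than a wstring. The function name write_utf16le_file is made up for illustration.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-16LE with a leading
// BOM. Bytes are emitted one at a time through a binary-mode ofstream, so
// the output does not depend on host endianness or newline translation.
// Returns false if the file could not be opened or written.
bool write_utf16le_file(const std::string& path, const std::u16string& text) {
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out) return false;

    out.put('\xFF').put('\xFE');  // UTF-16LE byte order mark
    for (char16_t unit : text) {
        out.put(static_cast<char>(unit & 0xFF));         // low byte
        out.put(static_cast<char>((unit >> 8) & 0xFF));  // high byte
    }
    return static_cast<bool>(out.flush());
}
```

Because the stream is opened with std::ios::binary, a u'\n' in the text is written as the two bytes 0A 00 and no 0D is inserted.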

score 1 · Accepted Answer

The provided Utf16Facet didn't work in gcc for big strings. Here is the version that worked for me... This way the file will be saved as UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0].

#include <locale>
// std::codecvt is declared in <locale>; the libstdc++-internal <bits/codecvt.h> need not be included directly.


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

       for(;from < from_end;from += 2,++to)
       {
           if(to<=to_limit){
               (*to)                               = L'\0';

               reinterpret_cast<char*>(to)[0]  = from[0];
               reinterpret_cast<char*>(to)[1]  = from[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

       for(;(from < from_end);++from, to += 2)
       {
           if(to <= to_limit){

               to[0]     = reinterpret_cast<const char*>(from)[0];
               to[1]     = reinterpret_cast<const char*>(from)[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }
};

score 0 · Accepted Answer

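You should look at the output file in a hex editor such as WinHex to see the actual bits and bytes and verify that the output really is UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.

But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character. Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally use 1 byte per character.

When you say "Firefox with encoding set to UTF16", what do you mean? I'm skeptical that that would work.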
