c - wchar_t variable stores only half of an Urdu character in C

Question

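I'm trying to read Urdu text from a file and manipulate it. However, it seems that a whole character is not being read into the wchar_t variable. Here is my code, which reads the text and prints each character on a new line: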

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

void main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");
    printf("This program tests Urdu reading:\n");
    wchar_t c;
    FILE *f = fopen("urdu.txt", "r");
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
    fclose(f);
}

And here is my sample text:

میرا نام ابراھیم ھے۔

میں وینڈربلٹ یونیورسٹی میں پڑھتا ھوں۔

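However, there seem to be twice as many characters printed as there are letters in the text. I understand that wide or multi-byte characters use multiple bytes, but I thought the wchar_t type would store all the bytes corresponding to one letter of the alphabet together.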

How can I read the text so that, at any one time, I have a whole character stored in a variable?

Details about my environment:
gcc: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 5.3.0
OS: Windows 10 64-bit
Text file encoding: UTF-8

This is how my text looks in hex format:

d9 85 db 8c d8 b1 d8 a7 20 d9 86 d8 a7 d9 85 20 d8 a7 d8 a8 d8 b1 d8 a7 da be db 8c d9 85 20 da be db 92 db 94 0a d9 85 db 8c da ba 20 d9 88 db 8c d9 86 da 88 d8 b1 d8 a8 d9 84 d9 b9 20 db 8c d9 88 d9 86 db 8c d9 88 d8 b1 d8 b3 d9 b9 db 8c 20 d9 85 db 8c da ba 20 d9 be da 91 da be d8 aa d8 a7 20 da be d9 88 da ba db 94 0a

score 1 · Accepted Answer

Windows support for Unicode is mostly proprietary, and it is impossible to write portable software that uses UTF-8 and works on Windows using the Windows native libraries. If you are willing to consider a non-portable solution, here is one:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <fcntl.h>
#include <io.h>   // declares _setmode

int main(void) {
    setlocale(LC_ALL, "");

    // Next line is needed to output wchar_t data to the console. Note that 
    // Urdu characters are not supported by standard console fonts. You may
    // have to install appropriate fonts to see Urdu on the console.
    // Failing that, redirecting to a file and opening with a text editor
    // should show Urdu characters.

    _setmode(_fileno(stdout), _O_U16TEXT);

    // Mixing wide-character and narrow-character output to stdout is not
    // a good idea. Using wprintf throughout. (Not Windows-specific)

    wprintf(L"This program tests UTF-8 reading:\n");

    // WEOF is not guaranteed to fit into wchar_t. It is necessary
    // to use wint_t to keep a result of fgetwc, or to print with
    // %lc. (Not Windows-specific)

    wint_t c;

    // Next line has a non-standard parameter passed to fopen, ccs=...
    // This is a Windows way to support different file encodings.
    // There are no UTF-8 locales in Windows. 

    FILE *f = fopen("urdu.txt", "r,ccs=UTF-8");
    if (f == NULL) {
        // Could not open the file; nothing to read.
        return 1;
    }

    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc", c);
    }
    fclose(f);
    return 0;
}

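OTOH, with glibc (or a POSIX-style environment such as Cygwin, which ships its own C library) you don't need any of these Windows extensions, because the C library handles these things internally.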
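If depending on either C library's locale support is not an option, another route is to open the file in binary mode and decode the UTF-8 bytes by hand. The following is only a minimal sketch of that idea: `utf8_decode` is a hypothetical helper written for this illustration, not a function from any library, and it is simplified (it does not reject overlong encodings or surrogate code points).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): decodes one UTF-8 sequence
 * starting at s[0], with n bytes available. On success it stores the code
 * point in *out and returns the number of bytes consumed; on malformed or
 * truncated input it returns 0 and leaves *out untouched.
 * Simplified: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t len;
    uint32_t cp;

    if (n == 0) return 0;

    /* The lead byte tells us how long the sequence is. */
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* sequence is cut off */

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

Applied to the first two bytes of the hex dump in the question, d9 85, this yields the single code point U+0645. A caller would read the file into a byte buffer and call the function repeatedly, advancing by the returned length each time.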

score 0 · Accepted Answer

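UTF-8 is a Unicode encoding that takes 1 to 4 bytes per character. I was able to store each Unicode character in a uint32_t (or u_int32_t on some UNIX platforms) variable. The library I used is ( utf8.h | utf8.c ). It provides conversion and manipulation functions for UTF-8 strings.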

So if a file is n bytes long in UTF-8, it contains at most n Unicode characters. That means I need 4*n bytes of memory (4 bytes per u_int32_t variable) to store the contents of the file.

#include "utf8.h"

// here read contents of file into a char* => buff
// keep count of # of bytes read => N

u_int32_t *ubuff = (u_int32_t*) calloc(N, sizeof(u_int32_t));  // calloc initializes to 0
u8_toucs(ubuff, N, buff, N);

// ubuff now is an array of 4-byte integers representing
// a Unicode character each

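Of course, it is quite possible that there are fewer than n Unicode characters in the file, since multiple bytes may represent a single character. That means the 4*n memory allocation is too large. In that case, the tail of ubuff will be 0 (the Unicode null character). So I simply scan the array and reallocate the memory as required.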

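u_int32_t* original = ubuff;
int sz = 0;
// Count characters up to the first 0, never reading past the N allocated slots
while (sz < N && original[sz] != 0) {
    sz++;
}
// Shrink to the used length (check the result of realloc in real code)
ubuff = realloc(original, sizeof(*original) * sz);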

Note: If you get type errors about u_int32_t, put typedef uint32_t u_int32_t; at the start of your code.
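As an alternative to over-allocating and shrinking afterwards, the number of characters can be counted before allocating, because every UTF-8 character contributes exactly one byte that is not a continuation byte (10xxxxxx). The sketch below assumes the input is well-formed UTF-8; `utf8_count` is a hypothetical helper written for this illustration and is not part of the utf8.h library mentioned above.

```c
#include <stddef.h>

/* Hypothetical helper: counts the code points in a well-formed UTF-8
 * buffer of n bytes by counting every byte that is NOT a continuation
 * byte (continuation bytes have the bit pattern 10xxxxxx). */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

The result never exceeds n, which matches the "at most n characters" bound, and for text such as Urdu, where most letters take two bytes, it comes out at roughly half of n.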
