objective-c - Decoding partial UTF-8 into NSString

Question

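While fetching a UTF-8-encoded file over the network using the NSURLConnection class, there is a good chance that the delegate's connection:didReceiveData: message will be sent with an NSData that cuts the file off in the middle of a character, since UTF-8 is a multi-byte encoding and a single character can end up split across two separate NSData objects.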

In other words, if I join all the data I get from connection:didReceiveData:, I will have a valid UTF-8 file, but each separate piece of data is not necessarily valid UTF-8 on its own.

I do not want to keep the whole downloaded file in memory.

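What I want is: given an NSData, decode whatever can be decoded into an NSString. If the last few bytes of the NSData are an unclosed surrogate (that is, an incomplete multi-byte sequence), tell me, so I can save them for the next NSData.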

One obvious solution is to try decoding repeatedly with initWithData:encoding:, truncating the last byte each time, until it succeeds. Unfortunately, this can be very wasteful.

score 2 · Accepted Answer

If you want to make sure that you don't stop in the middle of a UTF-8 multi-byte sequence, you need to look at the end of the byte array and check the top 2 bits.

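If the top bit is 0, the byte is one of the ASCII-style unescaped UTF-8 codes, and you're done.
If the top bit is 1 and the second-from-top bit is 0, the byte is a continuation of an escape sequence and might be the last byte of that sequence, so you need to buffer the character for later use and then look at the preceding byte.
If the top bit is 1 and the second-from-top bit is also 1, the byte is the beginning of a multi-byte sequence, and you need to determine how many bytes are in the sequence by looking for the first 0 bit.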

Look at the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8
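The three cases above amount to classifying a byte by its leading bits. A minimal sketch in plain C (which also compiles inside an Objective-C file); the name `utf8_sequence_length` is made up for illustration, and the function only recognizes the 1- to 4-byte forms that current UTF-8 (RFC 3629) allows:

```c
#include <stddef.h>

/* Hypothetical helper, not part of any Apple API.
 * Returns the total length in bytes of the sequence that `byte` starts,
 * or 0 if `byte` is a continuation byte or matches no lead-byte bit pattern.
 * It does not reject lead bytes that are well-formed but unused (e.g. 0xC0). */
static size_t utf8_sequence_length(unsigned char byte)
{
    if ((byte & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII, complete on its own */
    if ((byte & 0xC0) == 0x80) return 0;  /* 10xxxxxx: continuation byte */
    if ((byte & 0xE0) == 0xC0) return 2;  /* 110xxxxx: lead of a 2-byte sequence */
    if ((byte & 0xF0) == 0xE0) return 3;  /* 1110xxxx: lead of a 3-byte sequence */
    if ((byte & 0xF8) == 0xF0) return 4;  /* 11110xxx: lead of a 4-byte sequence */
    return 0;                             /* 11111xxx: no such pattern in UTF-8 */
}
```

The position of the first 0 bit decides the result, which is the "look for the first 0 bit" rule from the last case.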

// assumes that receivedData contains both the leftovers and the new data

const unsigned char *data = [receivedData bytes];
NSUInteger byteCount = [receivedData length];

if (byteCount < 1)
    return nil;  // or @"";

unsigned char lastByte = data[byteCount - 1];
if ((lastByte & 0x80) == 0) {
    // the last byte is plain ASCII, so the buffer ends on a character boundary
    NSString *newString = [[NSString alloc] initWithBytes:data
                                                   length:byteCount
                                                 encoding:NSUTF8StringEncoding];
    // verify success
    // remove bytes from mutable receivedData, or set overflow to empty
    return newString;
}

// now eat all of the continuation bytes
NSUInteger backCount = 0;
while ((byteCount > 1) && ((lastByte & 0xc0) == 0x80)) {
    backCount++;
    byteCount--;
    lastByte = data[byteCount - 1];
}
// at this point, lastByte is either the initial byte of the sequence or, if we
// ran out of data, still a continuation byte. In the latter case we're probably
// in an illegal sequence, as we should always have the initial character in
// the receivedData

if ((lastByte & 0xc0) == 0x80) {
    // error!
    return nil;
}

// at this point, you can either use just byteCount, or you can compute the
// length of the sequence from the lastByte in order
// to determine if you have exactly the right number of characters to decode UTF-8.

NSUInteger requiredBytes = 0;
if ((lastByte & 0xe0) == 0xc0) {          // 110xxxxx
    // 2 byte sequence
    requiredBytes = 1;
} else if ((lastByte & 0xf0) == 0xe0) {   // 1110xxxx
    // 3 byte sequence
    requiredBytes = 2;
} else if ((lastByte & 0xf8) == 0xf0) {   // 11110xxx
    // 4 byte sequence
    requiredBytes = 3;
} else if ((lastByte & 0xfc) == 0xf8) {   // 111110xx
    // 5 byte sequence (no longer legal UTF-8 since RFC 3629, so the
    // decode below is expected to fail for such data)
    requiredBytes = 4;
} else if ((lastByte & 0xfe) == 0xfc) {   // 1111110x
    // 6 byte sequence (likewise no longer legal)
    requiredBytes = 5;
} else {
    // shouldn't happen, illegal UTF8 seq
}

// now we know how many continuation bytes we need and we know how many
// (backCount) we have, so either use them, or take the
// introductory character away.
if (requiredBytes == backCount) {
    // we have the right number of bytes
    byteCount += backCount;
} else {
    // we don't have the right number of bytes, so remove the intro character
    byteCount -= 1;
}

NSString *newString = [[NSString alloc] initWithBytes:data
                                               length:byteCount
                                             encoding:NSUTF8StringEncoding];
// verify success
// remove byteCount bytes from mutable receivedData, or set overflow to the
// bytes between byteCount and [receivedData length]
return newString;
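To try the boundary logic outside of Foundation, the same idea can be sketched as a plain C function (it compiles in an Objective-C file as well). The name `utf8_complete_prefix` is hypothetical. The sketch assumes only the 1- to 4-byte forms of RFC 3629 and does not validate the data; anything it cannot make sense of is passed through so that the real decoder reports the error.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many leading bytes of buf[0..len) can be
 * handed to the decoder without cutting a multi-byte sequence in half.
 * The remaining len - result bytes (at most 3) are the overflow to keep. */
static size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;     /* index just past the byte being inspected */
    size_t back = 0;    /* continuation bytes found at the tail */

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0) {
        return len;     /* no lead byte in sight: let the decoder complain */
    }

    unsigned char lead = buf[i - 1];
    size_t need;        /* continuation bytes this lead byte requires */
    if ((lead & 0x80) == 0x00)      need = 0;   /* ASCII */
    else if ((lead & 0xE0) == 0xC0) need = 1;   /* 2-byte sequence */
    else if ((lead & 0xF0) == 0xE0) need = 2;   /* 3-byte sequence */
    else if ((lead & 0xF8) == 0xF0) need = 3;   /* 4-byte sequence */
    else return len;                            /* not a lead byte: pass through */

    /* Incomplete tail: stop just before the lead byte. Otherwise take it all. */
    return (back < need) ? i - 1 : len;
}
```

In connection:didReceiveData: one would decode the first `utf8_complete_prefix(bytes, length)` bytes with initWithBytes:length:encoding: and prepend the remaining bytes to the next chunk.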

score 0 · Accepted Answer

I had a similar problem: UTF-8 was only being partially decoded.

Before

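    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    // problem: length counts UTF-16 code units, not UTF-8 bytes, so the
    // buffer is too small for non-ASCII text and the copy gets truncated
    adsInfo->adsTopic = malloc(sizeof(char) * adsTopic.length + 1);
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], adsTopic.length + 1);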

After (fixed)

    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    NSUInteger byteCount = [adsTopic lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"number of UTF-8 bytes in the string topic == %lu", (unsigned long)byteCount);

    adsInfo->adsTopic = malloc(byteCount + 1);
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], byteCount + 1);

    NSString *text = [NSString stringWithCString:adsInfo->adsTopic encoding:NSUTF8StringEncoding];
    NSLog(@"=== %@", text);
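The difference between the two versions is the unit being counted: NSString's length is a number of UTF-16 code units, while the C buffer has to hold UTF-8 bytes. The following plain C sketch computes both numbers for a UTF-8 C string. The function names are hypothetical, and the code assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of UTF-16 code units that a valid UTF-8
 * string occupies, i.e. the value NSString's `length` would report. */
static size_t utf16_units_for_utf8(const char *s)
{
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80) {
            continue;                    /* continuation byte: counted with its lead */
        }
        units += (*p >= 0xF0) ? 2 : 1;   /* 4-byte sequences need a surrogate pair */
    }
    return units;
}

/* Hypothetical helper: bytes needed to copy the string, terminating NUL included. */
static size_t utf8_buffer_size(const char *s)
{
    return strlen(s) + 1;
}
```

For a string such as "日本" the first function gives 2 while the buffer needs 7 bytes, which is why sizing the allocation from length cuts the copy short.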

score 0 · Accepted Answer

UTF-8 is a very easy encoding to parse. It was designed so that incomplete sequences are easy to detect and so that, if you start in the middle of a sequence, you can find its beginning.

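Search backwards from the end for a byte that is either <= 0x7f or >= 0xc0. If it is <= 0x7f, you are done. If it is between 0xc0 and 0xdf, it must be followed by one complete continuation byte. If it is between 0xe0 and 0xef, it must be followed by two. If it is >= 0xf0, it must be followed by three.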
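As a rough illustration of that backward search, here is a plain C sketch that reports how many bytes are still missing from the final character of a buffer. The name `utf8_missing_bytes` is made up, and the function follows the byte ranges above literally, so it does not reject bytes that can never appear in valid UTF-8 (such as 0xf8 and above).

```c
#include <stddef.h>

/* Hypothetical helper: returns how many more bytes are needed to complete
 * the last character in buf[0..len), or 0 if the buffer ends on a
 * character boundary (or contains no lead byte at all). */
static size_t utf8_missing_bytes(const unsigned char *buf, size_t len)
{
    size_t trailing = 0;   /* continuation bytes (0x80..0xbf) seen so far */

    /* Walk backwards until a byte <= 0x7f or >= 0xc0 is found. */
    while (len > 0 && buf[len - 1] >= 0x80 && buf[len - 1] <= 0xBF) {
        trailing++;
        len--;
    }
    if (len == 0) {
        return 0;          /* empty input, or nothing but continuation bytes */
    }

    unsigned char lead = buf[len - 1];
    size_t needed;         /* continuation bytes that must follow `lead` */
    if (lead <= 0x7F)      needed = 0;
    else if (lead <= 0xDF) needed = 1;   /* 0xc0..0xdf */
    else if (lead <= 0xEF) needed = 2;   /* 0xe0..0xef */
    else                   needed = 3;   /* 0xf0 and above */

    return (trailing < needed) ? needed - trailing : 0;
}
```

A nonzero result means the tail starting at the lead byte should be held back until that many further bytes arrive.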
