ios - How to read NSInputStream with UTF-8?

Question

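I am trying to read a large file on iOS using NSInputStream and split it into lines at newlines (I don't want to use componentsSeparatedByCharactersInSet because it uses too much memory).

However, since not all lines seem to be UTF-8 encoded (they can appear as plain ASCII, which is the same bytes), I often get the warning: Incorrect NSStringEncoding value 0x0000 detected. Assuming NSASCIIStringEncoding. Will stop this compatiblity mapping behavior in the near future.

My question is: is there a way to suppress this warning, for example by setting a compiler flag?

Furthermore: is it safe to append/concatenate two buffer reads? Reading from the byte stream, converting each buffer to a string, and then appending the strings could corrupt the result.

The example method below demonstrates that the byte-to-string conversion discards both the first half and the second half of a UTF-8 character as invalid when they are converted separately.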

- (void)NSInputStreamTest {
  uint8_t testString[] = {0xd0, 0x91}; // @"Б"

  // Test 1: Read max 1 byte at a time of UTF-8 string
  uint8_t buf1[1], buf2[1];
  NSString *s1, *s2, *s3;
  NSInteger c1, c2;
  NSInputStream *inStream = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]];

  [inStream open];
  c1 = [inStream read:buf1 maxLength:1];
  s1 = [[NSString alloc] initWithBytes:buf1 length:1 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 1: Read %ld byte(s): %@", (long)c1, s1);
  c2 = [inStream read:buf2 maxLength:1];
  s2 = [[NSString alloc] initWithBytes:buf2 length:1 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 1: Read %ld byte(s): %@", (long)c2, s2);
  s3 = [s1 stringByAppendingString:s2];
  NSLog(@"Test 1: Concatenated: %@", s3);
  [inStream close];

  // Test 2: Read max 2 bytes at a time of UTF-8 string
  uint8_t buf4[2];
  NSString *s4;
  NSInteger c4;
  NSInputStream *inStream2 = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]];

  [inStream2 open];
  c4 = [inStream2 read:buf4 maxLength:2];
  s4 = [[NSString alloc] initWithBytes:buf4 length:2 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 2: Read %ld byte(s): %@", (long)c4, s4);
  [inStream2 close];
}

Output:

2013-02-10 21:16:23.412 Test[11144:c07] Test 1: Read 1 byte(s): (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Read 1 byte(s): (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Concatenated: (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 2: Read 2 byte(s): Б

score 1 · Accepted Answer

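First of all, in the line s3 = [s1 stringByAppendingString:s2]; you are concatenating nil values (both s1 and s2 are nil because half a character cannot be decoded), so the result is nil as well. You may want to concatenate the bytes instead of the strings: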

uint8_t buf3[2];
buf3[0] = buf1[0];
buf3[1] = buf2[0];
s3 = [[NSString alloc] initWithBytes:buf3 length:2 encoding:NSUTF8StringEncoding];

Output:

2015-11-06 12:57:40.304 Test[10803:883182] Test 1: Read 1 byte(s): (null)
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Read 1 byte(s): (null)
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Concatenated: Б

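Secondly, a UTF-8 character may be 1 to 6 bytes long in the original UTF-8 design. Since RFC 3629, valid UTF-8 uses at most 4 bytes per character; the 5- and 6-byte forms are obsolete.

(1 byte)   0aaa aaaa         // code point in 0x00 .. 0x7F (ASCII)
(2 bytes)  110x xxxx 10xx xxxx
(3 bytes)  1110 xxxx 10xx xxxx 10xx xxxx
(4 bytes)  1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
(5 bytes)  1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx   // obsolete since RFC 3629
(6 bytes)  1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx   // obsolete since RFC 3629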
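
The lead-byte patterns in the table above can be checked with a few bit masks. The following is a minimal sketch in plain C (which can be called from Objective-C), not code from the answer: the function name utf8_sequence_length is hypothetical, and it only inspects the bit pattern of the first byte, so it does not reject overlong or out-of-range encodings. It assumes the 4-byte limit of RFC 3629 and treats the 5- and 6-byte lead bytes as invalid.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the total length in bytes (1..4) of the UTF-8
 * sequence that starts with `lead`, or 0 if `lead` cannot start a sequence
 * (a continuation byte 10xx xxxx, or a lead byte of the obsolete 5- and
 * 6-byte forms). Only the bit pattern of the lead byte is examined. */
static size_t utf8_sequence_length(uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1; /* 0xxx xxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110x xxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110 xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 1111 0xxx */
    return 0;
}
```

For the test string in the question, the first byte 0xd0 matches 110x xxxx, so one more byte is needed before the character can be decoded; the second byte 0x91 is a continuation byte and cannot start a character on its own.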

So, if you intend to read raw bytes from an NSInputStream and convert them into a UTF-8 NSString, you probably want to read from the NSInputStream byte by byte until you get a valid string:

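#define MAX_UTF8_BYTES 6 // longest form in the table above; RFC 3629 UTF-8 needs at most 4
NSString *utf8String = nil;
NSMutableData *_data = [[NSMutableData alloc] init]; // accumulates the bytes read so far

int bytes_read = 0;
while (!utf8String) {
    if (bytes_read >= MAX_UTF8_BYTES) {
        NSLog(@"Can't decode input byte array into UTF8.");
        return;
    }
    uint8_t byte[1];
    // read:maxLength: returns 0 at end of stream and a negative value on error
    if ([_inputStream read:byte maxLength:1] != 1) {
        NSLog(@"Stream ended or failed before a complete character was read.");
        return;
    }
    [_data appendBytes:byte length:1];
    // _data is not NUL-terminated, so decode it by length rather than as a C string
    utf8String = [[NSString alloc] initWithData:_data encoding:NSUTF8StringEncoding];
    bytes_read++;
}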
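
Reading one byte per call works, but it costs a method call for every byte. As an alternative that is not part of this answer, one could read larger chunks and decode only the prefix that ends on a character boundary, carrying any incomplete trailing bytes over to the next read. The sketch below is plain C with a hypothetical function name, utf8_complete_prefix_length; it assumes at most 4 bytes per character and leaves malformed input for the decoder to reject.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns how many leading bytes of buf[0..len) can be
 * decoded now. Any bytes after that belong to a multi-byte character whose
 * remaining bytes have not arrived yet and should be kept for the next read. */
static size_t utf8_complete_prefix_length(const uint8_t *buf, size_t len) {
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 3 trailing continuation bytes (10xx xxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0) return len; /* no lead byte found: let the decoder reject it */

    uint8_t lead = buf[i - 1];
    size_t need;
    if ((lead & 0x80) == 0x00) need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return len; /* not a lead byte: malformed rather than incomplete */

    size_t have = back + 1; /* the lead byte plus the continuation bytes after it */
    return (have < need) ? (i - 1) : len;
}
```

Applied to Test 1 in the question, the function returns 0 after the first read (only 0xd0 is available), so nothing would be converted yet; once 0x91 has been appended it returns 2 and both bytes can be passed to initWithBytes:length:encoding: together.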

score 0 · Accepted Answer

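ASCII (and hence the newline character) is a subset of UTF-8, so there should not be any conflict.

It should be possible to split your stream at the newline characters, just as you would with a plain ASCII stream. You can then convert each chunk ("line") into an NSString using UTF-8.

Are you sure the encoding errors are not real, i.e. that your stream does not actually contain bytes that are invalid in UTF-8?

Edited to add from the comments:

This presumes that each line consists of few enough characters that a whole line can be kept in memory before it is converted from UTF-8.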
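
The line-splitting approach described in this answer could be sketched as follows. This is an illustrative plain C sketch, not code from the answer: LineBuffer, LineHandler, LineStats, line_buffer_append, line_buffer_feed and count_line are all hypothetical names. It relies on the fact that the byte 0x0A never occurs inside a multi-byte UTF-8 sequence, so splitting at newline bytes cannot cut a character in half, and it assumes each line fits in memory.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: holds the bytes of the line still being read. */
typedef struct {
    uint8_t *bytes;
    size_t length;
    size_t capacity;
} LineBuffer;

/* Called once per complete line, without the trailing '\n'.
 * `line` may be NULL when `length` is 0 (an empty line). */
typedef void (*LineHandler)(const uint8_t *line, size_t length, void *context);

/* Appends `count` bytes to the pending line, growing the buffer as needed.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int line_buffer_append(LineBuffer *lb, const uint8_t *src, size_t count) {
    if (count == 0) return 0;
    if (lb->length + count > lb->capacity) {
        size_t new_capacity = lb->capacity ? lb->capacity : 64;
        while (new_capacity < lb->length + count) new_capacity *= 2;
        uint8_t *grown = (uint8_t *)realloc(lb->bytes, new_capacity);
        if (!grown) return -1;
        lb->bytes = grown;
        lb->capacity = new_capacity;
    }
    memcpy(lb->bytes + lb->length, src, count);
    lb->length += count;
    return 0;
}

/* Feeds one chunk (for example, the result of one stream read) into the
 * buffer and calls `handler` for every line that this chunk completes. */
static int line_buffer_feed(LineBuffer *lb, const uint8_t *chunk, size_t count,
                            LineHandler handler, void *context) {
    while (count > 0) {
        const uint8_t *newline = (const uint8_t *)memchr(chunk, '\n', count);
        if (!newline) {
            /* No newline in the rest of the chunk: keep it for the next read. */
            return line_buffer_append(lb, chunk, count);
        }
        size_t head = (size_t)(newline - chunk);
        if (line_buffer_append(lb, chunk, head) != 0) return -1;
        handler(lb->bytes, lb->length, context);
        lb->length = 0; /* start a new line, reusing the allocation */
        chunk = newline + 1;
        count -= head + 1;
    }
    return 0;
}

/* Example handler: counts lines and remembers the byte length of the last one. */
typedef struct {
    size_t lines;
    size_t last_length;
} LineStats;

static void count_line(const uint8_t *line, size_t length, void *context) {
    (void)line;
    LineStats *stats = (LineStats *)context;
    stats->lines++;
    stats->last_length = length;
}
```

In an Objective-C caller, the handler would typically pass each complete line to initWithBytes:length:encoding: with NSUTF8StringEncoding. Any bytes left in the buffer when the stream ends form a final line without a trailing newline, and the caller is responsible for handling them and for freeing the buffer.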
