php - PHP mb_strlen returns strange values

Question

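gb2312 is a double-byte character set. When I use mb_strlen() to check a single Chinese character, it returns 2, but with two or more characters the result is sometimes strange. How can I get the correct length?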

<?php
header('Content-type: text/html;charset=utf-8');//
$a="大";
echo mb_strlen($a,'gb2312'); // output 2
echo mb_strlen($a.$a,'gb2312'); // output 3 , it should be 4
echo mb_strlen($a.'a','gb2312'); // output 2, it should be 3
echo mb_strlen('a'.$a,'gb2312'); // output 3, 
?>

Thanks to deceze. Your article is very helpful. People like me who know little about encodings should read it: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

score 4 · Accepted Answer

Your string is probably stored as UTF-8.

The UTF-8 encoding of "大" is E5 A4 A7 (according to this web page), so:

$a       // 3 bytes, gb2312 -> 2 char (1 + 0.5)
$a . $a  // 6 bytes, gb2312 -> 3 char
$a . 'a' // 4 bytes, gb2312 -> 2 char
'a' . $a // 4 bytes, first byte is <128 so will be interpreted as one
         // single character, gb2312 -> 3 char

This is only a guess, but it makes perfect sense when you look at it this way. You could perhaps refer to this Wikipedia page.
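To make the counting above concrete, here is a rough sketch of that guess. The function countAsDoubleByte() is hypothetical and only models the idea "a byte of 0x80 or above starts a two-byte character"; it is not mbstring's actual implementation. The string is written with byte escapes so the result does not depend on how the file is saved.

```php
<?php
// Simplified model (an assumption, not mbstring's real code): a byte
// >= 0x80 is treated as the first half of a two-byte character, and a
// dangling lead byte at the end still counts as one character.
function countAsDoubleByte(string $bytes): int
{
    $count = 0;
    $i = 0;
    $n = strlen($bytes);
    while ($i < $n) {
        $i += (ord($bytes[$i]) >= 0x80) ? 2 : 1;
        $count++;
    }
    return $count;
}

$a = "\xE5\xA4\xA7"; // "大" as UTF-8 bytes

echo bin2hex($a), "\n";                  // e5a4a7
echo countAsDoubleByte($a), "\n";        // 2
echo countAsDoubleByte($a . $a), "\n";   // 3
echo countAsDoubleByte($a . 'a'), "\n";  // 2
echo countAsDoubleByte('a' . $a), "\n";  // 3
```

These four counts match the numbers reported in the question, which supports the idea that UTF-8 bytes are being read as double-byte pairs.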

If you really want to test this, I suggest creating a separate file saved in gb2312 encoding and reading it with fopen or something similar. That way you can be sure the data is in the encoding you intend.
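One way to try this without depending on an editor's save settings is to generate the GB2312 bytes in code, write them to a file, and read them back. This is a sketch under a few assumptions: the mbstring extension is loaded, 'EUC-CN' is used as mbstring's name for the GB2312 byte encoding, and file_get_contents() stands in for fopen.

```php
<?php
$utf8 = "\xE5\xA4\xA7";                                // "大" as UTF-8 bytes
$gb   = mb_convert_encoding($utf8, 'EUC-CN', 'UTF-8'); // same character as GB2312 bytes

// Write two characters to a temporary file, then read them back.
$path = tempnam(sys_get_temp_dir(), 'gb');
file_put_contents($path, $gb . $gb);
$fromFile = file_get_contents($path);
unlink($path);

echo strlen($fromFile), "\n";              // 4 (bytes)
echo mb_strlen($fromFile, 'EUC-CN'), "\n"; // 2 (characters)
```

With data that really is GB2312, mb_strlen reports one per character, not two.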

score 4 · Accepted Answer

Try setting the mbstring internal encoding to UTF-8.

/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

http://www.php.net/manual/en/function.mb-internal-encoding.php
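A short sketch of what that changes, assuming the string really is UTF-8 (written here with byte escapes): once the internal encoding is set, mb_strlen() can be called without its second argument.

```php
<?php
mb_internal_encoding('UTF-8');

$a = "\xE5\xA4\xA7"; // "大" as UTF-8 bytes

echo mb_internal_encoding(), "\n"; // UTF-8
echo mb_strlen($a), "\n";          // 1
echo mb_strlen($a . 'a'), "\n";    // 2
```

This only helps when the data is UTF-8. It does not convert anything; it just changes the default that the mb_* functions assume.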

score 0 · Accepted Answer

I think you need to use utf-8 instead of gb2312.

Try this:

<?php
header('Content-type: text/html;charset=utf-8');//
$a="大";
echo mb_strlen($a,'utf8'); // output 1
echo mb_strlen($a.$a,'utf8'); // output 2 
echo mb_strlen($a.'a','utf8'); // output 2
echo mb_strlen('a'.$a,'utf8'); // output 2, 
?>

score 0 · Accepted Answer

By writing $a = "大"; into a PHP file, the variable $a contains a byte sequence of whatever was between the quotes in your source code file. If that source code file was saved in UTF-8, the string is a UTF-8 byte sequence representing the character "大". If the source code file was saved in GB2312, it is the GB2312 byte sequence representing "大". But a PHP file saved in GB2312 won't actually parse as valid PHP, since PHP needs an ASCII compatible encoding.

mb_strlen is supposed to give you the number of characters in the given string in the specified encoding. I.e. mb_strlen('大', 'gb2312') expects the string to be a GB2312 byte sequence representation and is supposed to return 1. You're wrong in expecting it to return 2, even if GB2312 is a double byte encoding. mb_strlen returns the number of characters.

strlen('大') would give you the number of bytes, because it's a naïve old-style function which doesn't know anything about encodings and only counts bytes.
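A small side-by-side sketch of that difference, assuming mbstring is available and using 'EUC-CN' as mbstring's name for the GB2312 byte encoding: the byte count depends on the encoding, while the character count is 1 in both cases as long as the declared encoding matches the data.

```php
<?php
$utf8 = "\xE5\xA4\xA7";                                // "大" as UTF-8 bytes
$gb   = mb_convert_encoding($utf8, 'EUC-CN', 'UTF-8'); // "大" as GB2312 bytes

foreach (['UTF-8' => $utf8, 'EUC-CN' => $gb] as $encoding => $bytes) {
    printf(
        "%s: %d bytes, %d character(s)\n",
        $encoding,
        strlen($bytes),               // counts bytes
        mb_strlen($bytes, $encoding)  // counts characters
    );
}
// UTF-8: 3 bytes, 1 character(s)
// EUC-CN: 2 bytes, 1 character(s)
```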

The bottom line being: your expectation was wrong, and you have a mismatch between what the "大" is actually encoded in (whatever you saved your source code as) and what you tell mb_strlen it is encoded in (gb2312). Therefore mb_strlen cannot do its job correctly and gives you inconsistent, seemingly random results.
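One way to catch this kind of mismatch early is to validate the bytes against the encoding you are about to claim. A minimal sketch, assuming mbstring is available and using 'EUC-CN' as its name for the GB2312 byte encoding:

```php
<?php
$utf8 = "\xE5\xA4\xA7";                                // "大" as UTF-8 bytes
$gb   = mb_convert_encoding($utf8, 'EUC-CN', 'UTF-8'); // "大" as GB2312 bytes

var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($gb, 'UTF-8'));   // bool(false)
```

This is only a sanity check. Some byte sequences are valid in more than one encoding, so a passing check does not prove the declared encoding is the right one.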
