php - Detecting the right character encoding in PHP?

Question

I'm trying to detect the character encoding of a string, but I can't get the right result.
For example:

$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;

That code outputs ISO-8859-1, but it should be Windows-1252.

What's wrong with this?

EDIT:
In response to @raina77ow, I've updated the example.

$str = "&euro;&sbquo;&fnof;&bdquo;&hellip;" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;

I get the wrong result again.

score 2 · Accepted Answer

The problem with Windows-1252 in PHP is that it will almost never be detected: as soon as the text contains any character outside the range 0x80 to 0x9f, it is no longer detected as Windows-1252.

This means that if your string contains a normal ASCII letter such as "A", or even a space character, PHP decides that it is not valid Windows-1252 and, in your case, falls back to the next possible encoding, which is ISO-8859-1. This is a PHP bug; see https://bugs.php.net/bug.php?id=64667.
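Because every byte sequence is valid ISO-8859-1, and Windows-1252 leaves only five bytes undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), validity alone cannot tell the two apart. A common workaround is a byte-level heuristic. The sketch below is one possible approach, not something mbstring provides: `guess_legacy_encoding` is a hypothetical helper name, and it assumes that C1 control codes (0x80 to 0x9F) never appear in genuine ISO-8859-1 text.

```php
<?php
// Hypothetical helper, not part of mbstring: a byte-level heuristic.
function guess_legacy_encoding(string $bytes): string
{
    // Pure 7-bit input reads the same in all three candidates.
    if (preg_match('/[\x80-\xFF]/', $bytes) !== 1) {
        return 'ASCII';
    }
    // UTF-8 has strict multi-byte rules, so a positive check is a strong signal.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Assumption: 0x80-0x9F are C1 control codes in ISO-8859-1 and are
    // rare in real text, while Windows-1252 maps most of them to
    // printable characters such as the euro sign.
    if (preg_match('/[\x80-\x9F]/', $bytes) === 1) {
        return 'Windows-1252';
    }
    return 'ISO-8859-1';
}

// The bytes from the question's second example.
echo guess_legacy_encoding("Hello \x80\x82\x83\x84\x85"), PHP_EOL;
```

This is still a guess: a string with only bytes in 0xA0 to 0xFF decodes identically under both encodings, so reporting ISO-8859-1 there is a convention rather than a detection.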

score 0 · Answer

Although strings encoded in ISO-8859-1 and CP-1252 have different byte representations:

<?php
$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
    $new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
    printf('%15s: %s detected: %10s explicitly: %10s',
        $encoding,
        bin2hex($new), // hex dump of the raw bytes; bin2hex keeps leading zeros
        mb_detect_encoding($new),
        mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
    );
    echo PHP_EOL;
}

Results:

Windows-1252: 802082208320842085 detected:            explicitly: ISO-8859-1
  ISO-8859-1: 3f203f203f203f203f detected:      ASCII explicitly: ISO-8859-1

...from what we can see here, it looks like there is a problem with the second parameter of mb_detect_encoding. Using mb_detect_order instead of the parameter yields very similar results.
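When the source encoding is already known, detection can be skipped by passing it to mb_convert_encoding explicitly. The sketch below assumes the mbstring extension is loaded and writes the Windows-1252 bytes as escapes rather than decoding entities. It shows that the same five bytes become different UTF-8 output depending on which source encoding is declared.

```php
<?php
// Windows-1252 bytes for &euro; &sbquo; &fnof; &bdquo; &hellip; (no spaces).
$bytes = "\x80\x82\x83\x84\x85";

foreach (['Windows-1252', 'ISO-8859-1'] as $source) {
    // State the source encoding explicitly instead of detecting it.
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $source);
    printf("%12s -> UTF-8 bytes: %s%s", $source, bin2hex($utf8), PHP_EOL);
}
```

Under Windows-1252 the byte 0x80 becomes the euro sign (U+20AC), while under ISO-8859-1 it becomes the C1 control code U+0080, so a wrong guess silently corrupts the text instead of raising an error.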
