php - Strange bug with multibyte strings and lookaround

Question

Why does the following code behave differently for different multibyte strings?

echo preg_replace('@(?=\pL)@u', '*', 'م');  // prints: '*م'     ✓ 
echo preg_replace('@(?=\pL)@u', '*', 'ض');  // prints: '*ض'     ✓ 
echo preg_replace('@(?=\pL)@u', '*', 'غ');  // prints: '*�*�'   ✗ 
echo preg_replace('@(?=\pL)@u', '*', 'ص');  // prints: '*�*�'   ✗

See: http://3v4l.org/fvab1
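To make the broken output easier to read, here is a minimal sketch that dumps the UTF-8 bytes of the four characters from the question. The `*�*�` result is consistent with an asterisk being inserted before each byte of a two-byte sequence rather than before the character as a whole. The helper `utf8_hex` is a hypothetical name introduced for illustration and is not part of the original post. The code assumes PHP 7 or later for the `\u{...}` escapes.

```php
<?php
// Hypothetical helper (not from the original post): hex dump of a string's bytes.
function utf8_hex(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// The four characters from the question, written as code point escapes.
$samples = [
    'U+0645' => "\u{0645}", // م
    'U+0636' => "\u{0636}", // ض
    'U+063A' => "\u{063A}", // غ
    'U+0635' => "\u{0635}", // ص
];

foreach ($samples as $label => $char) {
    // Each of these characters occupies two bytes in UTF-8.
    printf("%s: %s\n", $label, utf8_hex($char));
}

// Rebuild the shape of the failing output by hand: an asterisk before each byte.
$char = "\u{063A}";
$broken = '*' . $char[0] . '*' . $char[1];

// With the u modifier, preg_match() returns false for a subject that is not valid UTF-8.
var_dump(preg_match('//u', $broken));
```

The last line prints `bool(false)`: once the two bytes are separated, the string is no longer valid UTF-8, which is why each byte is displayed as `�`.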

score 2 · Accepted Answer

You also need to include modifier letters (Lm). See the following script, which iterates over the entire Arabic Unicode block.

<?php
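// Encode a code point in the two-byte UTF-8 range (U+0080..U+07FF).
function uchar_2($dec)
{
    // Lead byte: 110xxxxx, carrying the upper five bits.
    $utf = chr(192 + (($dec - ($dec % 64)) / 64));
    // Continuation byte: 10xxxxxx, carrying the lower six bits.
    $utf .= chr(128 + ($dec % 64));

    return $utf;
}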

$issues = 0;
$count = 0;
// 1536..1791 is U+0600..U+06FF, the Arabic block.
for ($dec = 1536; $dec <= 1791; $dec++) {
    $char = uchar_2($dec);
    if (preg_replace('@^(?=\pLm)$@u', '*', $char) !== $char) {
        printf("Issue with %s (%s)\n", $dec, $char);
        $issues++;
    }
    $count++;
}

printf("Found %d issues in %d rows\n", $issues, $count);

Without the Lm, this fails for about half of the characters.
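A note on syntax that may matter when adapting the script: in PCRE, the shorthand without braces takes a single letter, so `\pLm` is read as `\pL` followed by a literal `m`, while the two-letter category Lm is written `\p{Lm}`. The sketch below illustrates the difference using U+0640 (ARABIC TATWEEL), which has general category Lm. It is an illustration only, not a fix taken from the original answer, and the variable names are made up for the example.

```php
<?php
// U+0640 ARABIC TATWEEL is a modifier letter (general category Lm).
$tatweel = "\u{0640}";

// Braced form: matches one character whose category is Lm.
$braced = preg_match('/^\p{Lm}$/u', $tatweel);

// Unbraced form: \pL consumes one letter, then a literal "m" is required.
$unbraced = preg_match('/^\pLm$/u', $tatweel);

// The unbraced form does match any letter followed by "m".
$letterThenM = preg_match('/^\pLm$/u', 'am');

var_dump($braced);      // int(1)
var_dump($unbraced);    // int(0)
var_dump($letterThenM); // int(1)
```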
