php - preg_match and UTF-8 in PHP

Question

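I am trying to search a UTF-8 encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

"H" is at index 1 in the string "¡Hola!", so this should print 1. Instead, it prints 2. It does not seem to treat the subject as a UTF-8 encoded string, even though I am passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF-8 functions are working: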

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

score 47 · Accepted Answer

The u modifier makes both the pattern and the subject be interpreted as UTF-8, but the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
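
The same conversion can be wrapped in a small helper. This is only a sketch: charOffset is a hypothetical name (not a PHP built-in), and it assumes the subject is valid UTF-8 and that the mbstring extension is loaded. The encoding is passed explicitly so the result does not depend on mbstring.internal_encoding or on mbstring.func_overload, which was removed in PHP 8.

```php
<?php
// Hypothetical helper: convert a byte offset reported by preg_match()
// into a UTF-8 character offset. Assumes $subject is valid UTF-8.
function charOffset(string $subject, int $byteOffset): int
{
    // substr() cuts on bytes, which is the unit PREG_OFFSET_CAPTURE uses.
    $prefix = substr($subject, 0, $byteOffset);

    // Counting the characters in the prefix gives the character offset.
    return mb_strlen($prefix, 'UTF-8');
}

$str = "\xC2\xA1Hola!";
if (preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE)) {
    echo $m[0][1], "\n";                    // byte offset: 2
    echo charOffset($str, $m[0][1]), "\n";  // character offset: 1
}
```

For plain ASCII subjects the two offsets coincide, so the helper is safe to apply unconditionally.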

score 26 · Accepted Answer

Try adding (*UTF8) at the start of the regular expression, inside the delimiters:

preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment at https://www.php.net/manual/function.preg-match.php#95828

score 24 · Accepted Answer

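This looks like a "feature"; see http://bugs.php.net/bug.php?id=37391.

The 'u' switch only means something to pcre; PHP itself is not aware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I am not saying "correct").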
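
A quick way to see the byte-sequence view for yourself is to compare the byte-oriented string functions with their mbstring counterparts. This is a minimal sketch that assumes the mbstring extension is available; the encoding is passed explicitly instead of relying on php.ini.

```php
<?php
$str = "\xC2\xA1Hola!"; // "¡Hola!": the "¡" occupies two bytes in UTF-8

echo strlen($str), "\n";                      // 7 (bytes)
echo mb_strlen($str, 'UTF-8'), "\n";          // 6 (characters)
echo strpos($str, 'H'), "\n";                 // 2 (byte offset, same unit preg_match reports)
echo mb_strpos($str, 'H', 0, 'UTF-8'), "\n";  // 1 (character offset)
```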

score 8 · Accepted Answer

Pardon the necroposting, but someone may find this useful: the code below works as a replacement for both preg_match and preg_match_all, and returns correct matches with correct character offsets for UTF-8 encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText = $match[0];
                    $byteOffset  = $match[1]; // PREG_OFFSET_CAPTURE reports bytes
                    $positions[] = array(
                        $matchedText,
                        // Cut the subject at the byte offset, then count characters
                        mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

score 1 · Accepted Answer

If all you need is to find the multibyte-safe position of H, try mb_strpos():

mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";

Output:

¡Hola!
1
H
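
If you need every occurrence rather than just the first, mb_strpos() can be called in a loop with its offset argument. The following is a rough sketch under a few assumptions: mbStrposAll is a hypothetical helper name, the input is valid UTF-8, and overlapping matches are meant to be reported.

```php
<?php
// Hypothetical helper: character offsets of every occurrence of $needle
// in $haystack, both assumed to be valid UTF-8.
function mbStrposAll(string $haystack, string $needle): array
{
    if ($needle === '') {
        return []; // an empty needle has no meaningful positions
    }

    $positions = [];
    $offset = 0;
    // mb_strpos() takes and returns offsets in characters, not bytes.
    while (($pos = mb_strpos($haystack, $needle, $offset, 'UTF-8')) !== false) {
        $positions[] = $pos;
        $offset = $pos + 1; // resume one character after this match
    }
    return $positions;
}

print_r(mbStrposAll("\xC2\xA1Hola, \xC2\xA1Hola!", 'H')); // positions 1 and 8
```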
