php - preg_match_all strips Latin characters

Question

I have a problem with Latin characters. Here is the code:

$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www', 'on', 'ona', 'ja');

$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string

$string = preg_replace('/[^a-zA-Z0-9žšđčćŽŠĐČĆ -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…

$string = mb_strtolower($string); // make it lowercase

preg_match_all('/\b.*?\b/i', $string, $matchWords);

$matchWords = $matchWords[0];

foreach ( $matchWords as $key=>$item ) {
    if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
        unset($matchWords[$key]);
    }
}

$wordCountArr = array();
if ( is_array($matchWords) ) {
    foreach ( $matchWords as $key => $val ) {
        $val = strtolower($val);
        if ( isset($wordCountArr[$val]) ) {
            $wordCountArr[$val]++;
        } else {
            $wordCountArr[$val] = 1;
        }
    }
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;

When $matchWords[0] comes back from this call:

preg_match_all('/\b.*?\b/i', $string, $matchWords);

I get this string, and one of the array entries contains a space:

ti si mi znaj na srcu kvar znajznajznajsrcužurka

There is a space in: ž urka

score 2 · Accepted Answer

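From the documentation: A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.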

ž (together with the space before it) matches \W, while u matches \w, which is why you get: ž urka

These characters at the end will not match your pattern:

 žšđčć ŽŠĐČĆ :)

...they are all \W characters, and they would need to be followed by a \w character for the pattern to match (the second \b).
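To see the split happen, here is a minimal sketch. It assumes the string is UTF-8 and that the default "C" locale is in effect; the variable names are only for illustration. "ž" is written as its two UTF-8 bytes so the result does not depend on how the file is saved.

```php
<?php
// "ž" as its two UTF-8 bytes (0xC5 0xBE), followed by plain ASCII letters.
$subject = "srcu \xC5\xBEurka"; // "srcu žurka"

// Same pattern as in the question: no u modifier, so PCRE works byte by byte.
preg_match_all('/\b.*?\b/i', $subject, $m);

// Drop the empty matches that the lazy pattern also produces.
$byteTokens = array_values(array_filter($m[0], function ($s) {
    return $s !== '';
}));

// Neither byte of "ž" counts as \w here, so the boundary falls between "ž" and "u".
var_dump($byteTokens); // → ['srcu', ' ž', 'urka']
```

The space and the "ž" end up in one token because both are \W in this mode, and "urka" becomes a separate word.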

I think you are looking for the u modifier. Try:

preg_match_all('/\b.*?\b/iu', $string, $matchWords);
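As a rough check of what the u modifier changes, the sketch below runs the same pattern on a short sample. It assumes the input is valid UTF-8 (with the u modifier, preg_match_all() returns false on invalid UTF-8); the sample string and variable names are only for illustration.

```php
<?php
// "ž" as its two UTF-8 bytes (0xC5 0xBE), so the file encoding does not matter.
$subject = "srcu \xC5\xBEurka"; // "srcu žurka"

// With the u modifier the subject is read as UTF-8 and "ž" is treated as a letter.
$ok = preg_match_all('/\b.*?\b/iu', $subject, $m);
if ($ok === false) {
    throw new RuntimeException('preg_match_all failed, possibly invalid UTF-8');
}

// The pattern still yields empty and whitespace-only matches, so filter them out.
$utf8Tokens = array_values(array_filter($m[0], function ($s) {
    return trim($s) !== '';
}));

var_dump($utf8Tokens); // → ['srcu', 'žurka']
```

The word is no longer cut after "ž", because the boundary now sits between the space and "ž" rather than between "ž" and "u".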
