php - Searching a Chinese text corpus for sentences that contain only certain characters

Question

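Goal: search an array of tens of thousands of Chinese sentences to find the sentences that contain only characters from an array of "known characters".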

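Example: let's say my corpus consists of the following sentences: 1) 我去中国。 2) 妳爱他。 3) 你在哪里？ I only "know", or only want, sentences made up exclusively of these characters: 1) 我 2) 中 3) 国 4) 你 5) 在 6) 去 7) 爱 8) 哪 9) 里. The first sentence is returned as a result because all four of its characters are in my second array. The second sentence is rejected because I did not ask for 妳 or 他. The third sentence is returned as a result. Punctuation is ignored (and so are alphanumeric characters).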

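I have a working script that does this (below). I'm wondering whether it is an efficient way of doing it. If you are interested, please take a look and suggest changes, write your own version, or offer advice. I pieced some of it together from this script and checked several Stack Overflow questions, but none of them addressed this scenario.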

<?php
$known_characters = parse_file("FILENAME"); // retrieves target characters
$sentences = parse_csv("FILENAME"); // retrieves the text corpus

$number_wanted = 30; // number of sentences to attempt to retrieve

$found = array(); // stores results
$number_found = 0; // number of results
$character_known = false; // assume character is not known
$sentence_known = true; // assume sentence matches target characters

foreach ($sentences as $s) {

    // retrieves an array of the sentence
    $sentence_characters = mb_str_split($s->ttext);

    foreach ($sentence_characters as $sc) {
        // check to see if the character is alpha-numeric or punctuation
        // if so, then ignore.
        $pattern = '/[a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}]/u';
        if (!preg_match($pattern, $sc)) {
            foreach ($known_characters as $kc) {
                if ($sc==$kc) {
                    // if character is known, move to next character
                    $character_known = true;
                    break;
                }
            }
        } else {
            // character is known if it is alpha-numeric or punctuation
            $character_known = true;
        }
        if (!$character_known) {
            // if character is unknown, move to next sentence
            $sentence_known = false;
            break;
        }
        $character_known = false; // reset for next iteration
    }
    if ($sentence_known) {
        // if sentence is known, add it to results array
        $found[] = $s->ttext;
        $number_found = $number_found+1;
    }
    if ($number_found==$number_wanted)
        break; // if required number of results are found, break

    $sentence_known = true; // reset for next iteration 
}
?>

score 0 · Accepted Answer

It seems to me that this should do it:

$pattern = '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}我中国你在去爱哪里]/u';
if (preg_match($pattern, $sentence)) {
    // the sentence contains characters besides a-zA-Z0-9, punctuation
    // and the selected characters
} else {
    // the sentence contains only the allowed characters
}
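
If the known characters live in an array rather than being typed into the pattern by hand, the same negated character class could be assembled at run time. The following is only a sketch under that assumption: the function names `build_unknown_char_pattern` and `find_known_sentences` are made up for illustration, and the corpus is taken to be a plain array of strings rather than the objects with a `ttext` property used in the question.

```php
<?php
// Hypothetical helper (not part of the original answer): builds the negated
// character class from an array of known characters.
function build_unknown_char_pattern(array $known_characters): string
{
    // preg_quote() escapes regex metacharacters and the "/" delimiter;
    // it leaves ordinary CJK characters untouched.
    $escaped = array_map(function ($c) {
        return preg_quote($c, '/');
    }, $known_characters);

    return '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}'
        . implode('', $escaped) . ']/u';
}

// Hypothetical helper: collects up to $limit sentences that contain no
// character outside the allowed set, stopping as soon as $limit is reached.
function find_known_sentences(array $sentences, array $known_characters, int $limit): array
{
    $pattern = build_unknown_char_pattern($known_characters);
    $found = array();

    foreach ($sentences as $sentence) {
        if (count($found) >= $limit) {
            break;
        }
        // 0 means "no disallowed character was found"
        if (preg_match($pattern, $sentence) === 0) {
            $found[] = $sentence;
        }
    }
    return $found;
}

$known_characters = array('我', '中', '国', '你', '在', '去', '爱', '哪', '里');
$corpus = array('我去中国。', '妳爱他。', '你在哪里？');
$found = find_known_sentences($corpus, $known_characters, 30);
```

With the example data from the question, `$found` should hold the first and third sentences. The pattern is built once and each sentence needs a single `preg_match` call, which replaces the nested per-character loops.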

Make sure to save the source code file as UTF-8.
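
Encoding matters on the input side as well. With the `/u` modifier, `preg_match` returns `false` (not `0`) when the subject is not valid UTF-8, so a plain `if (preg_match(...)) { ... } else { ... }` would treat a malformed sentence as if it contained only allowed characters. A minimal sketch of a stricter check follows; `sentence_is_known` is a hypothetical name chosen for this example.

```php
<?php
// Hypothetical helper: true only when the sentence could be scanned and
// contains no character outside the allowed set.
function sentence_is_known(string $pattern, string $sentence): bool
{
    $result = preg_match($pattern, $sentence);
    if ($result === false) {
        // preg_match failed, for instance on malformed UTF-8 input;
        // preg_last_error() can be inspected for the exact reason.
        return false;
    }
    return $result === 0; // 0 = no disallowed character found
}

$pattern = '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}我中国你在去爱哪里]/u';

$ok        = sentence_is_known($pattern, '我去中国。'); // only allowed characters
$unknown   = sentence_is_known($pattern, '妳爱他。');   // contains 妳 and 他
$malformed = sentence_is_known($pattern, "\xE6\x88");   // truncated UTF-8 sequence
```

Here the malformed input is rejected instead of slipping through the `else` branch.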
