php - 部分文字列を知らなくても、単語リスト内の部分文字列の一意の出現を数えますか?

Question

*単語のリスト内の部分文字列のユニークな出現を数えようとしています * 単語 のリストをチェックして、複数回出現する最小文字に基づく部分文字列が単語にあるかどうかを検出し、それらを数えます。部分文字列がわかりません。

これは、部分文字列がわかっている場合の実用的なソリューションですが、わからない場合はどうなりますか? 単語が基づいている最小文字数があります。

「Book」が単語の部分文字列であるすべての単語を検索します。以下のphp関数で。

欲しい結果の代わり:

book count (5)
stor count (2)

score 1 · Accepted Answer

これは私の最初の概算です。未完成、未テスト、少なくとも1つのバグがあり、eiffelで記述されています。さて、私はあなたのためにすべての仕事をするつもりはありません。

deferred class
    SUBSTRING_COUNT
feature
    threshold : INTEGER_32 =5

    biggest_starting_substring_length(a,b:STRING):INTEGER_32
        deferred
    end

    biggest_starting_substring(a,b:STRING):STRING
    do
        Result := a.substring(0,biggest_starting_substring_length(a,b))
    end

    make_list_of_substrings(a,b:STRING)
    local
        index:INTEGER_32
        this_one: STRING
    do
        from
            a_index := b_index + 1
        invariant
            a_index >=0 and a_index <= a.count
        until
            a_index >= a.count
        loop
            this_one := biggest_starting_substring(a.substring (a_index, a.count-1),b)
            if this_one.count > threshold then
                list.extend (this_one)
            end
        variant
            a.count - a_index
        end
    end -- biggest_substring

    list : ARRAYED_LIST[STRING]

end

score 1 · Accepted Answer

長さ 100 の文字列が与えられた場合

book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789...                                                     ... 100

アルゴリズムは次のようになります。

異なる開始点と部分文字列の長さから部分文字列を調査します。0 から始まり、長さが 1 ～ 100 のすべての部分文字列を取得します。つまり、0-1、0-2、0-3、... であり、これらの部分文字列のいずれかが文字列全体で複数回発生するかどうかを確認します。1 から始まるすべての部分文字列 (1-2、1-3、1-4... など) を 99-100 に達するまで検索しながら、位置を増やしながら文字列を進めていきます。

すべての部分文字列とその出現回数のテーブルを保持し、それらを並べ替えることができます。

最小長と最大長を指定することで最適化できます。これにより、検索数とヒット精度が大幅に減少します。さらに、部分文字列が見つかったら、検索された部分文字列の配列に保存します。部分文字列が再び見つかった場合は、スキップしてください。(つまり、bookすでにカウントしたヒットは、次の部分文字列をヒットしたときに再度カウントする必要はありませんbook)。さらに、文字列全体の半分よりも長い文字列を検索する必要はありません。

文字列の例では、文字列の一意性について追加のテストを実行できます。あなたが持っているだろう

o              x ..
oo             x  7
bo             x  7
ok             x  6 
book           x  5
booking        x  2
bookingservice x  1

3 より短い文字列 (およびテキスト文字列全体の半分より長い文字列) を無視すると、次のようになります。

book           x  5
booking        x  2
bookingservice x  1

これはすでにかなり妥当な結果です。

[編集] これは明らかに、自然な単語だけでなく、すべての文字列を調べます。

[編集] 通常、私は OP のコードを書くのが好きではありませんが、この場合は少し興味がありました:

$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";

function search($string, $min = 4, $max = 16, $threshhold = 2) {
    echo "<pre><br/>";
    echo "searching <em>'$string'</em> for string occurances ";
    echo "of length $min - $max: <br/>";

    $hits = array();
    $foundStrings = array();

    // no string longer than half of the total string will be found twice
    if ($max > strlen($string) / 2) {
        $max = strlen($string);
    }

    // examin substrings:
    // start from 0, 1, 2...
    for ($start = 0; $start < $max; $start++) {

        // and string length 1, 2, 3, ... $max
        for ($length = $min; $length < strlen($string); $length++) {

            // get the substring in question, 
            // but search for natural words (trim)
            $substring = trim(substr($string, $start, $length));

            // if substring was not counted yet, 
            // add the found count to the hits
            if (!in_array($substring, $foundStrings)) {
                preg_match_all("/$substring/i", $string, $matches);
                $hits[$substring] = count($matches[0]);
            }
        }
    }

    // sort the hits array desc by number of hits
    arsort($hits);

    // remove substring hits with hits less that threshhold
    foreach ($hits as $substring => $count) {
        if ($count < $threshhold) {
            unset($hits[$substring]);
        }
    }

    print_r($hits);
}

search($string);

?>

コメントと変数名は、コード自体を説明するものでなければなりません。あなたの場合、 $string はファイルの読み取りに使用されます。この例では、次のように出力されます。

searching 'book bookshelf booking foobar bar booking selfservice 
bookingservice cooking' for string occurances of length 4 - 16: 
Array
(
    [ook] => 6
    [book] => 5
    [boo] => 5
    [bookin] => 3
    [booking] => 3
    [booki] => 3
    [elf] => 2
)

実装方法を教えてください:)

php - 部分文字列を知らなくても、単語リスト内の部分文字列の一意の出現を数えますか?

2 に答える 2

Related

Reference