php - How to find the most unique strings in an array?

Question

I have an array with a lot of strings, several thousand. I need to compare every string in that array with all the others and find the most unique strings among them.

You can look at my code and test it, but as you will see, comparing just 100 items takes a long time (about 160 seconds on localhost, Intel Core i7). Can this code be optimized?

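The first part of the code (generating the data) does not need optimizing, because in practice I pull the data from elsewhere. Only the second part (the comparison) needs it. As someone pointed out, the script can be optimized by not doing duplicate comparisons (a -> b, then b -> a). I know about this, but I am hoping to save more than half of the time. There may also be a better function for comparing strings than similar_text, but I have no experience with anything else, which is why I am asking here.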

Code:

    <?php

    //set how many strings to generate for the test
    $number_of_test_strings = 100;


    $strings = array();
    $chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    $size_chars_array = strlen( $chars );


    /*
     * Creating some random strings - data for test
     */

    //just for testing performance
    $creating_test_data_time_start =  microtime();

    //create some random strings in the array ($i starts at 0 so that exactly $number_of_test_strings are generated)
    for ( $i = 0; $i < $number_of_test_strings; $i++ ) {

        //set random string to empty string
        $random_string = '';

        //pick the random length once - between 1800 and 2500 chars (calling rand() in the loop condition would re-roll it on every iteration)
        $random_length = rand( 1800, 2500 );

        //choose characters at random from the characters string
        for( $j = 0; $j < $random_length; $j++ ) {
                $random_string .= $chars[ rand( 0, $size_chars_array - 1 ) ];
        }

        //insert random string in to strings array
        $strings[] = $random_string;

    }

    //just for testing performance
    $creating_test_data_time_end =  microtime();




    /*
     * Comparison itself
     */


    //just for testing performance
    $uniqueness_time_start =  microtime();

    //foreach for all strings in array
    foreach ($strings as $key_first_element => $first_element) {

        //reset of matched value
        $matched = 0;

        //foreach with each first element
        foreach ($strings as $key_second_element => $second_element) {

            // dont compare the same string
            if ($key_first_element != $key_second_element) {

                //compare those two strings
                similar_text($first_element, $second_element, $match);

                //add match value to matched
                $matched = ($matched + $match);

            }

        }

        // create average uniqueness for that string
        $uniqueness = ($matched / (count($strings) - 1));

        //store it in array
        $uniqueness_array[$key_first_element] = $uniqueness;

    }

    //sort the array by uniqueness (less match the better)- the best on the beginning
    asort($uniqueness_array);

    //just for testing performance
    $uniqueness_time_end =  microtime();


    //just output performance info
    echo 'Creating of test data: '. (array_sum( explode( ' ' , $creating_test_data_time_end ) ) - array_sum( explode( ' ' , $creating_test_data_time_start ) )) .' s, comparing strings: '. (array_sum( explode( ' ' , $uniqueness_time_end ) ) - array_sum( explode( ' ' , $uniqueness_time_start ) )) .' s<br />';

    $i = 0;
    foreach ($uniqueness_array as $key_string => $uniquness_of_string)
    {

        // output just 10 best results
        if ($i < 10) {
            echo 'Uniqueness of a string with key '.$key_string.' is '.$uniquness_of_string.'<br />';    
            $i++;
        }
        else break;

    }

    ?>

Expected input and output:

    //Expected input array
    $input = array(
        'Today is a great day for skiing and I dont have enough time',
        'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
        'Today is a superior day for skiing and I dont have enough time',
        'Completly different string about nothing'
    );


    //Expected output array - the order is important - the most different strings at the beginning of the array
    $output = array(
        'Completly different string about nothing',
        'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
        'Today is a superior day for skiing and I dont have enough time',
        'Today is a great day for skiing and I dont have enough time'
    );
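
The saving mentioned in the question (skipping b -> a once a -> b has been compared) can be sketched as follows. This is a minimal sketch, not a drop-in replacement: the function name averageSimilarity is hypothetical, and it assumes that the similarity of a pair can be treated as symmetric. similar_text() may return a different value when its arguments are swapped, so the averages can differ slightly from those of the original double foreach.

```php
<?php
// Hypothetical helper (not part of the original post): average similarity of
// each string to all others, calling similar_text() once per unordered pair.
function averageSimilarity(array $strings): array
{
    $strings = array_values($strings);   // re-index so $i and $j are positions
    $n = count($strings);
    $totals = array_fill(0, $n, 0.0);

    for ($i = 0; $i < $n; $i++) {
        // Start at $i + 1: pairs with a lower $j were handled already.
        for ($j = $i + 1; $j < $n; $j++) {
            similar_text($strings[$i], $strings[$j], $percent);
            // Assumption: the pair's score counts for both strings.
            $totals[$i] += $percent;
            $totals[$j] += $percent;
        }
    }

    $averages = [];
    foreach ($totals as $i => $total) {
        $averages[$i] = $n > 1 ? $total / ($n - 1) : 0.0;
    }

    asort($averages);   // lowest average similarity (most unique) first
    return $averages;
}

$input = [
    'Today is a great day for skiing and I dont have enough time',
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
    'Today is a superior day for skiing and I dont have enough time',
    'Completly different string about nothing',
];

$ranking = averageSimilarity($input);
foreach ($ranking as $key => $average) {
    printf("%d: %.2f\n", $key, $average);
}
```

For n strings this makes n(n-1)/2 calls to similar_text() instead of n(n-1), so 4950 instead of 9900 for 100 strings. The work still grows quadratically with the number of strings.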

score 1 · Accepted Answer

I don't think similar_text on its own is really sufficient. You can combine it with levenshtein to get the desired result.

    $words = array(
        'Today is a great day for skiing and I dont have enough time',
        'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
        'Today is a superior day for skiing and I dont have enough time',
        'Completly different string about nothing'
    );

    $unique = array_map(function ($v) use($words) {
        return new Word($words, $v);
    }, $words);

Using similar_text

    echo "Uniqness By similar_text\n\n";
    usort($unique, function ($a, $b) {
        $a = $a->getSimilar();
        $b = $b->getSimilar();
        return ($a == $b) ? 0 : (($a < $b) ? - 1 : 1);
    });


    foreach ( $unique as $var ) {
        printf("%s (%s) \n",$var->getWord(),$var->getSimilar());
    }

similar_text output

    Uniqness By similar_text

    Completly different string about nothing (36.363636363636)
    Wednesday is a very good day for skiing and snowboarding and I dont have enough time (75.342465753425)
    Today is a great day for skiing and I dont have enough time (90.909090909091)
    Today is a superior day for skiing and I dont have enough time (90.909090909091)

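As you can see, Today is a great and Today is a superior are not in the proper positions: both score 90.909090909091, so similar_text alone cannot separate them.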

Using levenshtein

    echo "\n\nUniqness By levenshtein\n\n";
    usort($unique, function ($a, $b) {
        $a = $a->getLev();
        $b = $b->getLev();
        return ($a == $b) ? 0 : (($a < $b) ? 1 : - 1);
    });

    foreach ( $unique as $var ) {
        printf("%s (%s) \n", $var->getWord(), $var->getLev());
    }

levenshtein output

    Uniqness By levenshtein

    Completly different string about nothing (63)
    Wednesday is a very good day for skiing and snowboarding and I dont have enough time (63)
    Today is a superior day for skiing and I dont have enough time (45)
    Today is a great day for skiing and I dont have enough time (43)

As you can see, the levenshtein distances of Today is a superior and Today is a great are very close (45 and 43). If they ended up equal, the resulting order would not be reliable.
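
Ties like these are the reason a second measure is useful. As a minimal sketch of the idea (an alternative illustration, not the approach this answer takes next), the rows below hard-code the similar_text and levenshtein values printed above and sort by similarity first, using the distance only to break ties. The array layout and the compareRows name are assumptions made for the example.

```php
<?php
// Values copied from the two outputs above: [string, similar_text %, levenshtein].
$rows = [
    ['Today is a great day for skiing and I dont have enough time', 90.909090909091, 43],
    ['Wednesday is a very good day for skiing and snowboarding and I dont have enough time', 75.342465753425, 63],
    ['Today is a superior day for skiing and I dont have enough time', 90.909090909091, 45],
    ['Completly different string about nothing', 36.363636363636, 63],
];

// Hypothetical comparator: lower similarity first; on a tie, larger distance first.
// Arrays of equal size are compared element by element, so the second element
// only matters when the first ones are equal.
function compareRows(array $a, array $b): int
{
    return [$a[1], -$a[2]] <=> [$b[1], -$b[2]];
}

usort($rows, 'compareRows');

foreach ($rows as [$string, $similar, $lev]) {
    printf("%s (%s, %d)\n", $string, $similar, $lev);
}
```

With these values the two Today strings, which tie on similarity, are ordered by distance, and the result matches the order the question expects.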

Combining both to create a simple index

    echo "\n\nUniqness By Simple Index \n\n";
    usort($unique, function ($a, $b) {
        $a = $a->getIndex();
        $b = $b->getIndex();
        return ($a == $b) ? 0 : (($a < $b) ? - 1 : 1);
    });

    foreach ( $unique as $var ) {
        printf("%s (%s) \n", $var->getWord(), $var->getIndex());
    }

Simple index output

    Uniqness By Simple Index

    Completly different string about nothing (0.57720057720058)
    Wednesday is a very good day for skiing and snowboarding and I dont have enough time (1.1959121548163)
    Today is a superior day for skiing and I dont have enough time (2.020202020202)
    Today is a great day for skiing and I dont have enough time (2.1141649048626)

Combining both makes it more likely that such ties get resolved.
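
To make the index concrete: as the Word class below computes it, the index is the highest similar_text percentage against any other string divided by the largest levenshtein distance to any other string. The sketch below only redoes that division on the values printed above; the $measures layout is an assumption for illustration.

```php
<?php
// Values copied from the similar_text and levenshtein outputs above:
// string => [similar_text %, levenshtein distance].
$measures = [
    'Completly different string about nothing' => [36.363636363636, 63],
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time' => [75.342465753425, 63],
    'Today is a superior day for skiing and I dont have enough time' => [90.909090909091, 45],
    'Today is a great day for skiing and I dont have enough time' => [90.909090909091, 43],
];

$indexes = [];
foreach ($measures as $string => [$similar, $lev]) {
    // Same division the Word class performs: similarity over distance.
    $indexes[$string] = $similar / $lev;
    printf("%s (%s)\n", $string, $indexes[$string]);
}
```

The two Today strings share the same similarity, but dividing by 45 and 43 gives different indexes, which is what separates them in the output above.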

Class used

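    class Word {
        private $lev = 0;      // largest levenshtein distance to any other string
        private $similar = 0;  // highest similar_text percentage against any other string
        private $index = 0;    // $similar / $lev
        private $word;

        function __construct($words, $word) {
            $this->word = $word;
            foreach ( $words as $selected ) {

                // skip the string itself (strict comparison, so numeric-looking strings are not conflated)
                if ($selected === $word)
                    continue;

                $lev = levenshtein($word, $selected);
                if ($lev > $this->lev)
                    $this->lev = $lev;
                similar_text($word, $selected, $match);

                if ($match > $this->similar)
                    $this->similar = $match;
            }

            // guard against division by zero when there is no other, different string
            $this->index = $this->lev > 0 ? $this->similar / $this->lev : 0;
        }

        function getLev() {
            return $this->lev;
        }

        function getSimilar() {
            return $this->similar;
        }

        function getIndex() {
            return $this->index;
        }

        function getWord() {
            return $this->word;
        }
    }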
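
One caveat when applying this class to the question's test data: before PHP 8.0, levenshtein() returns -1 if either argument is longer than 255 characters, and the generated test strings are 1800 to 2500 characters long. A minimal sketch of a guard follows; the name levenshteinOrNull and the choice of returning null are assumptions for illustration.

```php
<?php
// Hypothetical wrapper: returns null instead of -1 when levenshtein() cannot
// handle the input (PHP < 8.0 with a string longer than 255 characters).
function levenshteinOrNull(string $a, string $b): ?int
{
    $tooLong = strlen($a) > 255 || strlen($b) > 255;
    if (PHP_VERSION_ID < 80000 && $tooLong) {
        return null;   // caller has to pick another measure or shorten the input
    }
    return levenshtein($a, $b);
}

// Short strings work on any version.
var_dump(levenshteinOrNull('kitten', 'sitting'));
// A 300-character string yields null before PHP 8.0 and a distance from 8.0 on.
var_dump(levenshteinOrNull(str_repeat('a', 300), 'abc'));
```

On older PHP versions a -1 would otherwise never exceed the initial $lev of 0 in the class, leaving the distance at 0 for every long string.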

See live demo
