algorithm - オートコンプリートのアルゴリズム?

Question

ユーザーが Google で検索語を入力したときにクエリの候補を表示するために使用されるアルゴリズムについて言及しています。

私が主に興味を持っているのは: 1. 最も重要な結果 (一致するものよりもクエリの可能性が最も高い) 2. 部分文字列の一致 3. あいまい一致

Trie または一般化された trie を使用して一致を見つけることができることは知っていますが、上記の要件を満たしていません...

ここで以前に尋ねられた同様の質問

score 67 · Accepted Answer

（heh）素晴らしいファジー/部分的な文字列照合アルゴリズムについては、DamnCoolAlgorithmsをチェックしてください。

これらは試行に取って代わるものではなく、試行でのブルートフォースルックアップを防ぎます。これは依然として大きなメリットです。次に、トライのサイズを制限する方法が必要になる可能性があります。

グローバルに使用されている最近の/上位N語のトライを保持します。
ユーザーごとに、そのユーザーの最近の/上位N語のトライを保持します。

最後に、可能な限りルックアップを防止したい...

キャッシュルックアップ結果：ユーザーが検索結果をクリックすると、それらを非常に迅速に提供し、完全な部分的/ファジールックアップを非同期的にフェッチできます。
事前計算ルックアップ結果：ユーザーが「appl」と入力した場合、「apple」、「apply」を続行する可能性があります。
データのプリフェッチ：たとえば、Webアプリは、JSでのブルートフォース検索を実行可能にするのに十分なほど小さい、より小さな結果セットをブラウザーに送信できます。

score 10 · Accepted Answer

言いたいのは... この問題に対する良い解決策は、三分探索木以上のものを組み込むことです。Ngram、およびShingles（フレーズ）が必要です。単語境界エラーも検出する必要があります。"hello o" は "hello" にする必要があります ... そして "whitesocks" は "white socks" にする必要があります。これらは前処理のステップです。データを適切に前処理しないと、価値のある検索結果が得られません。三分探索木は、単語が何であるかを把握したり、入力された単語がインデックス内の有効な単語ではない場合に関連単語の推測を実装したりするのに役立つコンポーネントです。

Google アルゴリズムは、フレーズの提案と修正を実行します。Google アルゴリズムにはコンテキストの概念もあります... 検索する最初の単語が天気に関連していて、それらを組み合わせた場合、「weatherforcst」vs「monsoonfrcst」vs「deskfrcst」 - 私の推測では、舞台裏でランキングが変更されています。遭遇した最初の単語に基づく提案 - 予測と天気は関連する単語であるため、予測は意味の推測で上位にランクされます。

word-partials (ngrams)、phrase-terms (シングル)、word-proximity (word-clustering-index)、ternary-search-tree (word lookup)。

score 9 · Accepted Answer

Google の正確なアルゴリズムは不明ですが、ユーザー入力の統計分析によって機能すると言われています。ほとんどの場合に適していないアプローチ。より一般的には、オートコンプリートは次のいずれかを使用して実装されます。

木。ツリー構造 (プレフィックスツリー、サフィックスツリー、dawg など) で検索可能なテキストをインデックス化することにより、メモリストレージを犠牲にして非常に高速な検索を実行できます。ツリートラバーサルは、近似マッチングに適合させることができます。
パターン分割。テキストをトークン (ngram) に分割することにより、単純なハッシュスキームを使用してパターンの出現を検索できます。
フィルタリング。一連の潜在的な一致を見つけ、順次アルゴリズムを適用して各候補をチェックします。

後者の概念のいくつかを実装する Java オートコンプリートライブラリである fully を見てください。

score 4 · Accepted Answer

特定の範囲内にあるあいまいな一致を見つけるために使用できる、 soundexやレーベンシュタイン距離などのツールがあります。

Soundex は似ている単語を検索し、レーベンシュタイン距離は別の単語から特定の編集距離内にある単語を検索します。

score 4 · Accepted Answer

Firefox の Awesome bar アルゴリズムを見てみましょう

何百万もの人気のあるクエリ + 過去の関連クエリを考慮に入れるため、Google の提案は便利です。

ただし、適切な補完アルゴリズム/ UI はありません。

部分文字列を実行しません
比較的単純な単語境界プレフィックスアルゴリズムのようです。
例: 試してみてくださいtomcat tut--> 「Tomcat チュートリアル」を正しく提案してください。今試してみてくださいtomcat rial--> 提案はありません )-:
「もしかして？」をサポートしていません。- Google 検索結果のように。

score 2 · Accepted Answer

部分文字列とあいまい一致については、レーベンシュタイン距離アルゴリズムがかなりうまく機能しました。認めますが、オートコンプリート/サジェストの業界実装ほど完璧ではないようです。Google と Microsoft の Intellisense はどちらも優れた仕事をしています。おそらく、この基本的なアルゴリズムを改良して、異なる文字列を照合するために必要な編集操作の種類を比較検討したからだと思います。たとえば、2 文字の転置は、おそらく 2 回 (挿入と削除) ではなく、1 回の操作としてカウントする必要があります。

それでも、これは十分に近いと思います。これがC#での実装です...

// This is the traditional Levenshtein Distance algorithem, though I've tweaked it to make
// it more like Google's autocomplete/suggest.  It returns the number of operations 
// (insert/delete/substitute) required to change one string into another, with the 
// expectation that userTyped is only a partial version of fullEntry.
// Gives us a measurement of how similar the two strings are.
public static int EditDistance(string userTyped, string fullEntry)
{
    if (userTyped.Length == 0) // all entries are assumed to be fully legit possibilities 
        return 0; // at this point, because the user hasn't typed anything.

    var inx = fullEntry.IndexOf(userTyped[0]);
    if (inx < 0) // If the 1st character doesn't exist anywhere in the entry, it's not
        return Int32.MaxValue; // a possible match.

    var lastInx = inx;
    var lastMatchCount = 0;
TryAgain:
    // Is there a better starting point?
    var len = fullEntry.Length - inx;
    var matchCount = 1;
    var k = 1;
    for (; k < len; k++)
    {
        if (k == userTyped.Length || userTyped[k] != fullEntry[k + inx])
        {
            if (matchCount > lastMatchCount)
            {
                lastMatchCount = matchCount;
                lastInx = inx;
            }
            inx = fullEntry.IndexOf(userTyped[0], inx + 1);
            matchCount = 0;
            if (inx > 0)
                goto TryAgain;
            else
                break;
        }
        else
            matchCount++;
    }
    if (k == len && matchCount > lastMatchCount)
        lastInx = inx;

    if (lastInx > 0)
        fullEntry = fullEntry.Substring(lastInx); // Jump to 1st character match, ignoring previous values 

    // The start of the Levenshtein Distance algorithem.
    var m = userTyped.Length;
    var n = Math.Min(m, fullEntry.Length);

    int[,] d = new int[m + 1, n + 1]; // "distance" - meaning number of operations.

    for (var i = 0; i <= m; i++)
        d[i, 0] = i; // the distance of any first string to an empty second string
    for (var j = 0; j <= n; j++)
        d[0, j] = j; // the distance of any second string to an empty first string

    for (var j = 1; j <= n; j++)
        for (var i = 1; i <= m; i++)
            if (userTyped[i - 1] == fullEntry[j - 1])
                d[i, j] = d[i - 1, j - 1];       // no operation required
            else
                d[i, j] = Math.Min
                           (
                             d[i - 1, j] + 1,  // a deletion
                             Math.Min(
                             d[i, j - 1] + 1,  // an insertion
                             d[i - 1, j - 1] + 1 // a substitution
                             )
                           );

    return d[m, n];
}

score 1 · Accepted Answer

問題の全体的な設計を探している場合は、https://www.interviewbit.com/problems/search-typeahead/のコンテンツを読んでみてください。

彼らは、トライを使用するという素朴なアプローチでオートコンプリートを構築することから始め、次にそれを構築します。また、特定のユースケースに対応するためのサンプリングやオフライン更新などの最適化手法についても説明します。

ソリューションのスケーラビリティを維持するには、トライデータをインテリジェントに分割する必要があります。

score 0 · Accepted Answer

まったく異なるデータ構造を追求するよりも、特殊なトライを構築したほうがよいのではないかと思います。

その機能は、対応する単語の検索頻度を反映するフィールドが各リーフにあるトライで明らかになったことがわかりました。

検索クエリメソッドは、各子孫リーフノードまでの距離に各子孫リーフノードに関連付けられた検索頻度を乗算して計算された最大値を持つ子孫リーフノードを表示します。

Google が使用するデータ構造 (およびその結果としてのアルゴリズム) はおそらく非常に複雑であり、特定のアカウントからの検索頻度 (および時間帯... 天気... 季節など) など、他の多くの要因が考慮される可能性があります。 ...そして月相...そして...)。ただし、基本的なトライデータ構造は、各ノードに追加のフィールドを含め、検索クエリメソッドでそれらのフィールドを使用することにより、あらゆる種類の特殊な検索設定に拡張できると考えています。

score 0 · Accepted Answer

これがあなたの質問に答えるかどうかはわかりませんが、当時 C 言語を使用して非常に単純な入力オートコンプリートコードを作成しました。これには機械学習とニューラルネットワークを実装していないため、確率計算などは行いません。部分文字列チェックアルゴリズムを使用して、入力に一致する最初のインデックスをチェックします。

一致データを「dict.txt」ファイルに提供できます。

/* Auto-complete input function in c
    @authors: James Vausch
    @date: 2018-5-23

    - This is a bona-fide self-created program which aims to
      stimulate an input auto-suggest or auto-complete function
      in C language. This is open source so you can use the code
      freely. However if you will use this, just acknowledge the
      creator as a sign of respect.
    - I'd also like to acknowledge Code with C team whom where I
      I got an answer how to have a colored output instead of 
      using system("color #"). Link down below

      https://www.codewithc.com/change-text-color-in-codeblocks-console-window/


    - THE GENERAL IDEA IS; WE READ A FILE WITH DICTIONARY WORDS
      OR SHALL WE SAY ALL WORDS. WE RUN A WHILE LOOP THAT WILL
      GET CHARACTER FROM THE USER USING "getch()" FUNCTION THEN
      STORE IT IN A CHARACTER ARRAY THEN IS PASSED ON A FUNCTION
      THAT CHECKS IF THE ANY DICTIONARY WORDS HAS A SUBSTRING
      THAT IS THE USER INPUT. IF YES(0), THE FUNCTION WILL COPY
      THE FOLLOWING STRING FROM THE DICTIONARY ARRAY THEN STORED
      IN A TEMP CHAR ARRAY AND PROCESSED. THE PROCESSING SHOULD
      BE SIMPLE. WE RUN A LOOP IN WHICH WILL CHECK THE AMOUNT OF
      CHARACTERS IN THE MATCHED STRING, THEN WE'LL RUN A LOOP
      THAT WILL SORT THE WORDS DECREMENTALLY BASED ON THE AMOUNT
      OF CHARACTERS OF THE INPUT SUBSTRING. THEN PRINT THE
      PROCESSED STRING ON THE FRONT OF THE INPUT STRING THEN RUN
      A LOOP BASED ON THE AMOUNT OF CHARACTERS PRESENT OR STRING
      LENGTH OF THE PROCESSED STRING PLUS 10 EXTRA CHARACTERS 
      ALONG WITH PRINTING SOME BACKWARD TRAVERSE CARET FUNCTION 
      TO MAKE THE CARET STAY WHERE IT SHOULD BE ALONG WITH 
      INPUTTING. SIMPLE.

    - <EXAMPLE>
        INPUT: COM
        AFTER LOOP RUN: MATCHED WITH WORD "COMMAND"
        AFTER LOOP RUN: INPUT HAS 3 CHARACTERS
        LOOP SEQUENCE:
            LOOP 0: OMMAND
            LOOP 1: MMAND
            LOOP 2: MAND
        AFTER LOOP: MAND
        PRINT: "MAND" AFTER INPUT BUT KEEP CARET ON THE INPUT "COM"

    NOTE:
    - You'll need the "dict.txt" file or you can create one and
      put some stuff there
    - Since C Programs run on.. say a terminal, I have not much of a way to
      efficiently make a way to use arrow keys for the job.
    - you should type your INPUT in LOWERCASE since pressing "Shift_Key + M"
      is equivalent to pressing the VK_Right(right arrow key) as well as
      the other arrow keys
    - the right arrow key has an ascii equivalent of <-32><77>, 77 = M
    - to complete the input, you'll need to press right arrow key
    - the left arrow key has an ascii equivalent of <-32><75>, 75 = K
    - to remove auto-complete suggestion, press left arrow key

    TO ADD:
    - UP arrow key and DOWN arrow key to cycle through suggestions
*/

//#include <headers.h> //My personal header file
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <windows.h>

void main(){
    SetColor(6);
    start();
}

void start(){
    int rep = 0;
    char dictFile[] = "dict.txt";
    loadDictionaryEntries(dictFile);

    char inp[50];
    printf("\nAuto Complete Program : C");
    while(rep == 0){
        printf("\nInput: ");
        autoCompleteInput(inp);
        if(strcasecmp(inp, "exit") == 0){
            break;
        }
        printf("\nOutput: %s", inp);
    }
    printf("\n");
    system("pause");
}

int dictEntryCount = 0;
struct allWords{
    char entry[100];
}dictionary[60000];

//============================================================================//
void loadDictionaryEntries(char directory[]){
    FILE *file;
    int dex = 0;
    char str[100];
    if(file = fopen(directory, "r")){
        printf("File accessed.\n");
        while(!feof(file)){
            fscanf(file, "%s", str);
            //UN-COMMENT line 109 to check if the program is reading from file
            //printf("Adding entry %d: \"%s\" to dictionary\n",dex + 1, str);
            strcpy(dictionary[dex].entry, str);
            dex++;
            dictEntryCount++;
        }
        fclose(file);
        printf("[ADDED %d WORDS TO DICTIONARY]\n", dictEntryCount);
    }else{
        printf(" File cannot be accessed.");
        fclose(file);
    }
}

void printArray(){
    for(int i = 0; i < dictEntryCount; i++){
        printf("Index %d: %s\n", i + 1, dictionary[i].entry);
    }
}

//============================================================================//
void autoCompleteInput(char input[]){
    char matchedWord[100]; //STORAGE FOR THE WORD THAT MATCHES INPUT
    char ch; //STORAGE FOR EACH CHARACTER THAT THE USER INPUTS
    int i = 0; //COUNTER
    int words;
    while(i != 200){ //LOOP TO GET EACH CHARACTER FROM KEYBOARD PRESS
        SetColor(6);
        ch = getch();
        clsx(strlen(matchedWord));
        if(ch == 13){ //CONDITION TO CHECK IF INPUT IS "ENTER" KEY
            break; //BREAKS LOOP IF "ENTER IS PRESSED"
        }else if(ch == 8){ //CONDITION TO CHECK IF INPUT IS "BACKSPACE"
            if(i == 0){ //IF INPUT IS NULL, DO NOTHING, DONT ERASE ANYTHING
                //DO NOTHING
            }else{ //IF INPUT IS NOT NULL, ENABLE ERASING
                clsx(strlen(matchedWord));
                bksp();
                i--;
                input[i] = '\0';
                if(i > 2){
                    if(matchToDictionary(input, matchedWord) == 0){
                        words = 0;
                        processMatchedWord(i, matchedWord);
                        SetColor(8);
                        printf("%s", matchedWord);
                        words = getArrSizeChar(matchedWord);
                        for(int x = 0; x < words; x++){
                            printf("\b");
                        }
                    }
                }
            }
        }else if(ch == 77){ //CONDITION TO CHECK IF INPUT IS RIGHT ARROW KEY
            printf("%s", matchedWord); //PRINT SUGESTED WORD WITH CARET AT FRONT
            strcat(input, matchedWord); //CONCATENATE SUGGESTION TO INPUT
            i = i + words - 1; //SETS INDEX AT THE END OF INPUT
            words = 0; //
        }else if(ch == 75){ //CONDITION TO CHECK IS INPUT IS LEFT ARROW KEY
            clsx(strlen(matchedWord)); //ERASE SUGGESTION
            i--; //DECREMENT INDEX
        }else{ //IF CONDITIONS ABOVE ARE NOT MET, DO THIS
            input[i] = ch; //INSERT CH AT THE INDEX OF INPUT
            printf("%c", ch); //PRINT CHARACTER
            input[i + 1] = '\0'; //SET END OF CURRENT INPUT TO NULL
            i++;
            if(i >= 2){
                if(matchToDictionary(input, matchedWord) == 0){
                    words = 0;
                    processMatchedWord(i, matchedWord);
                    SetColor(8);
                    printf("%s", matchedWord);
                    words = getArrSizeChar(matchedWord);
                    for(int x = 0; x < words; x++){
                        printf("\b");
                    }
                }else{
                    clsx(strlen(matchedWord));
                }
            }
        }

    }
    input[i] = '\0'; //NULL ENDING VALUE TO PREVENT UNNECESSARY CHARACTERS
}

int getArrSizeChar(char array[]){
    int size = 0;
    while(array[size] != '\0'){size++;}
    return size;
}

void clsx(int maxVal){
    for(int i = 0; i < maxVal + 10; i++){
        printf(" ");
    }
    for(int i = 0; i < maxVal + 10; i++){
        printf("\b");
    }
}

int matchToDictionary(char input[], char matchedWord[]){
    int found = 0;
    int dex = dictEntryCount; //LIMIT OF ARRAY / ARRAY BOUND/S
    //while(dictionary[dex] != '\0'){ //LOOP TO DETERMINE ARRAY BOUND
        //printf("%d", dex);
        //dex++; //INCREMENT IF INDEX OF ARRAY IS NOT NULL
    //}
    //printf("%d", dex);
    for(int i = 0; i < dex; i++){ //LOOP TROUGH ALL INDEXES OF DICTIONARY
        //CHECKS IF THE INDEX OF DICTIONARY HAS A SUBSTRING INPUT
        //printf(" Matching %s and %s\n", dictionary[i], input);
        if(containsIgnoreCase(dictionary[i].entry, input) == 0){
            //CHECKS IF THE INDEX OF DICTIONARY TOTALLY MATCHES THE INPUT
            //IT IS TO PREVENT ERRORS IN AUTO-COMPLETING PROCESS
            if(strcasecmp(dictionary[i].entry, input) == 1){
                //IF NOT, STORE INDEX OF DICTIONARY TO MATCHED WORD
                strcpy(matchedWord, dictionary[i].entry);
                found++;
                break; //BREAK LOOP
            }
        }
    }

    if(found == 1){
        return 0;
    }else{
        return 1;
    }
}

void processMatchedWord(int rep, char str[]){
    int lim = 0;
    int i;
    char temp[50];
    while(str[lim] != '\0'){
        lim++;
    }

    while(rep != 0){
        for(i = 0; i < lim; i++){
            str[i] = str[i + 1];
        }
        rep--;
    }
}

//===================================================================//
void bksp(){
    printf("\b "); //first backsapce to print an emtpy character
    printf("\b"); //second backspace to erase printed character
}

int containsIgnoreCase(char str1[], char str2[]){
    char tmp1[100];
    char tmp2[100];
    toLowerCase(tmp1, str1);
    toLowerCase(tmp2, str2);
    int i, j = 0, k;

    for(i = 0; tmp1[i]; i++){
        if(tmp1[i] == tmp2[j]){
            for(k = i, j = 0; tmp1[k] && tmp2[j]; j++, k++){
                if(tmp1[k] != tmp2[j]){
                    break;
                }
            }
            if(!tmp2[j]){
                return 0;
            }
        }
    }

    return 1;
}

void toLowerCase(char destination[], char source[]){
    int lim = 0;
    int i;

    while(source[lim] != '\0'){
        lim++;
    }

    for(i = 0; i < lim; i++){
        destination[i] = tolower(source[i]);
    }
    destination[i] = '\0';
}


/*Console Colors: Windows

    0 = Black        8 = Gray
    1 = Blue         9 = LBlue
    2 = Green       10 = LGreen
    3 = Aqua        11 = LAqua
    4 = Red         12 = LRed
    5 = Purple      13 = LPurple
    6 = Yellow      14 = LYellow
    7 = White       15 = Bright White
}*/

void SetColor(int ForgC){ //CODE SNIPPET FROM WWW.CODEWITHC.COM
    WORD wColor;
    //This handle is needed to get the current background attribute

    HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
    CONSOLE_SCREEN_BUFFER_INFO csbi;
    //csbi is used for wAttributes word

    if(GetConsoleScreenBufferInfo(hStdOut, &csbi)){
        //To mask out all but the background attribute, and to add the color
        wColor = (csbi.wAttributes & 0xF0) + (ForgC & 0x0F);
        SetConsoleTextAttribute(hStdOut, wColor);
    }
    return;
}

null の初期化された一致を持ち、ユーザーからの入力を配列またはファイルに保存するプログラムを参照している場合、ユーザーが同じ単語を入力すると、プログラムはそれを前の入力と一致させます。

algorithm - オートコンプリートのアルゴリズム?

9 に答える 9

Related

Reference