regex - 文字列一致アルゴリズムが必要

Question

データ：

x.txt,簡易テキストファイル(約1MB) y.txt辞書ファイル(約1Lakh単語)。

y.txt の単語が x.txt に存在するかどうかを調べる必要があります。

実行にかかる時間を短縮するアルゴリズムと、それに適した言語が必要です。

PS: BRUTE FORCE METHOD 以外のアルゴリズムを提案してください。

正確な文字列の一致ではなく、パターンの一致が必要です。

例えば：

x.txt : 「The Old Buzzards は4 月 27 日に解散しました」

Y.txt : "確立"

出力は次のようになります: X.txt で確立されます: 行 1

ありがとうございました。

score 2 · Accepted Answer

仕事を終わらせるためにこれが必要なのか、それとも家での仕事なのか、私にははっきりしません。仕事を完了するためにそれが必要な場合は、次のようにします。

#!/usr/bin/bash
Y=`cat y.txt | tr '\n' '|'`
echo "${Y%|}"
grep -E "${Y%|}" x.txt
if [ "$?" -eq 0 ]
then
    echo "found"
else
    echo "no luck"
fi

ファイルからすべてのパターンを丸呑みし、正規表現を構築し (echo は正規表現を示します)、それを渡してgrep、有限状態オートマトンを構築します。テキスト内のすべての文字を最大 1 回比較するため、これはうまくいきます。宿題である場合は、Cormen et al 'Introduction to Algorithms' を参照するか、Dragon Book の最初の数章を参照することをお勧めします。

追加するのを忘れました: y.txt には 1 行に 1 つずつパターンを含める必要がありますが、良い副作用として、パターンは単一の単語である必要はありません。

score 0 · Accepted Answer

標準ライブラリにSet実装があるとします。ここにいくつかの擬似コードがあります。

dictionary = empty set

def populate_dict():
    for word in dict_file:
        add(dictionary, word)

def validate_text(text_file):
    for word in text_file:      ### O(|text_file|)
        if word in dictionary:  ### O(log |dictonary|)
            report(word)

populate_dict()
every_now_and_then(populate_dict)

これO(t * log d) により、ブルートフォースの代わりに、入力テキストファイルと辞書の長さであるO(t * d)場所tとがそれぞれ得られます。dより速くファイルを読み取ることができずO(t)、O(log d).

score 0 · Accepted Answer

これは、私がしばらく考えていた検索アルゴリズムです。基本的に、アルゴリズムは 2 つのステップで構成されます。

最初のステップでは、y.txt のすべての単語がツリーに挿入されます。ルートからリーフまでのツリー内のすべてのパスは単語です。葉は空です。

たとえば、単語「dog」と「day」のツリーは次のとおりです。

<root>--<d>-<a>-<y>-<>
          \-<o>-<g>-<>

アルゴリズムの 2 番目の部分は、ツリーの下方向の検索です。空の葉にたどり着いたとき、単語を見つけたことになります。

Groovy での実装。さらにコメントが必要な場合は、お問い合わせください

//create a tree to store the words in a compact and fast to search way
//each path of the tree from root to an empty leaf is a word
def tree = [:]
new File('y.txt').eachLine{ word->
    def t=tree
    word.each{ c ->
        if(!t[c]){
            t[c]=[:]
        }
        t=t[c]
    }
    t[0]=0//word terminator (the leaf)
}
println tree//for debug purpose
//search for the words in x.txt
new File('x.txt').eachLine{ str, line->
    for(int i=0; i<str.length(); i++){
        if(tree[str[i]]){
            def t=tree[str[i]]
            def res=str[i]
            def found=false
            for(int j=i+1; j<str.length(); j++){
                if(t[str[j]]==null){
                    if(found){
                        println "Found $res at line $line, col $i"
                        res=str[j]
                        found=false
                    }
                    break
                }else if(t[str[j]][0]==0){
                    found=true
                    res+=str[j]
                    t=t[str[j]]
                    continue
                }else{
                    t=t[str[j]]
                    res+=str[j]
                }
                found=false
            }
            if(found) println "Found $res at line $line, col $i"//I know, an ugly repetition, it's for words at the end of a line. I will fix this later
        }
    }
}

これは私の y.txt です

dog
day
apple
daydream

および x.txt

This is a beautiful day and I'm walking with my dog while eating an apple.
Today it's sunny.
It's a daydream

出力は次のとおりです。

$ groovy search.groovy
[d:[o:[g:[0:0]], a:[y:[0:0, d:[r:[e:[a:[m:[0:0]]]]]]]], a:[p:[p:[l:[e:[0:0]]]]]]
Found day at line 1, col 20
Found dog at line 1, col 48
Found apple at line 1, col 68
Found day at line 2, col 2
Found daydream at line 3, col 7

ツリーの深さは y.txt の単語数に依存しないため、このアルゴリズムは高速です。深さは、y.txt 内の最長の単語の長さと同じです。

regex - 文字列一致アルゴリズムが必要

データ：

3 に答える 3

Related

Reference