algorithm - 非常に長いテキスト内のリストまたは単語リストの許可をすべて検索

Question

与えられた単語リスト = { w1,w2,w3,w1,w2 }

長いテキストで上記の単語リストの順列をすべて検索します。

long text list = {これは長いテキスト w1 w2 w3 w4 およびw1 w2 w1 w2 w3です。これは、すべての単語 w1,w2,w2,w2,w2 が含まれているわけではないため、順列を持たないさらに別の長いテキストですが、これはスペースで区切られた順列w2 w2 w3 w1 w1 } です

この問題を解決する最も効率的なアルゴリズムは何ですか?

最初にリスト内の各一意の単語にタプル (一意の #、一意の素数 #) を割り当てて {w1 = [101, 5], w2 = [103, 7], w3 = [205, 11] }、合計を計算することを考えました割り当てられたタプルを使用したリスト全体の of : w1 [101 *5] + w2 [ 103 * 7] + w3 [ 205 * 11] + w1 [101 *5] + + w2 [ 103 * 7] = 4707

pudo コードは次のとおりです。

targetSum = 4707;
long sum = 0;
for (int i = 0;  i < Text.size(); i++){
     look up (unique #, unique prime #) 
     sum  + = ((unique # * unique prime) ;
     if(  i >  list.size() ){
         sum = sum – (look up (unique #, unique prime # for index 
                ( i – list.size()) and subtract tuple sum)
     }

    if(targetSum = = sum ){
        // this is possible match so hashMap lookup verify  again  that this reagion is actual match.
}

}

これにはより良いロジックやアルゴリズムがありますか?

アップデート：

パターンマッチング Z アルゴリズム (Z ボックス) についてさらに読んでいましたが、すべての順列が事前にわかっていない限り、Z ボックスまたは Z 配列がどのように改善されるかわかりません。もっと良い方法があるかどうかわかりませんか？

知識を共有していただきありがとうございます。

ありがとう、

バベシュ

score 1 · Accepted Answer

パターンを素数で識別するという考えは良いですが、異なる素数の積は一意であり、それらの合計ではありません。

その後、移動ウィンドウアプローチを使用できます。パターンの積と最初の 5 つの単語の積を計算します。次に、製品を左から分割し、右に乗算して、ウィンドウを移動します。パターンにないすべての単語は、中立値 1 を持ちます。

def permindex(text, pattern, start = 0):
    """Index of first permutation of the pattern in text"""

    if len(text) - start < len(pattern):
        return -1

    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

    value = {}
    goal = 1
    for p in pattern:
        if not p in value:
            value[p] = primes.pop(0)

        goal *= value[p]

    prod = 1
    for t in text[start:start + len(pattern)]:
        prod *= value.get(t, 1)

    i = start

    for j in range(start + len(pattern), len(text)):

        if goal == prod:
            return i

        prod /= value.get(text[i], 1)
        prod *= value.get(text[j], 1)

        i += 1

    if goal == prod:
        return len(text) - len(pattern)

    return -1

これをあなたの問題に適用する：

import re

patt = "w1 w2 w3 w1 w2".split()

text = re.split("\W+", 
        """This is long text w1 w2 w3 w4 and w1 w2 w1 w2 w3. This 
        yet another long text which does not have permutation because 
        it does not contain all words w1,w2,w2,w2,w2 , but this is 
        permutation w2 w2 w3 w1 w1""")

p = permindex(text, patt)
while p >= 0:
    print p, text[p: p + len(patt)]
    p = permindex(text, patt, p + 1)

収量:

9 ['w1', 'w2', 'w1', 'w2', 'w3']
40 ['w2', 'w2', 'w3', 'w1', 'w1']

score 1 · Accepted Answer

単語が連続している必要がある場合は、探している単語とその数の辞書を作成することから始めます。の例で[w1, w2, w3, w1, w2]は、辞書には次のものが含まれます。

{w1, 2}
{w2, 2}
{w3, 1}

それを着信辞書と呼びます。

また、同じタイプ (つまり、単語、カウント) の空の辞書を作成します。それを発信辞書と呼びます。

次に、探している単語数のサイズのキューを作成します。キューは最初は空です。

次に、テキストを単語ごとに読み始め、次のようにします。

for each text_word in text
    if queue.count == number of words
        queue_word = remove word from queue
        if queue_word is in outgoing dictionary
            remove from outgoing
            add to incoming
        end if
    end if

    add text_word to queue
    if text_word is in incoming dictionary
        remove text_word from incoming dictionary
        add text_word to outgoing dictionary
        if incoming dictionary is empty
            you found a permutation
        end if
    end if

ここでの唯一の複雑な点は、「辞書への単語の追加」と「辞書への単語の削除」では、同じ単語が複数回出現する可能性を考慮に入れる必要があることです。したがって、削除は実際には次のようなものです。

count = dictionary[word].Count = 1
if (count == 0)
    dictionary.Remove(word)
else
    dictionary[word].Count = count

追加も同様です。

algorithm - 非常に長いテキスト内のリストまたは単語リストの許可をすべて検索

2 に答える 2

Related

Reference