java - 指定された単語を含む最短の部分文字列を見つけるメソッド:最適化が必要

Question

単語のリストを含む、特定の文字列の最短のサブセグメントを見つける必要があるプログラムがあります。私のプログラムは正しいのですが、実行時間枠 (5 秒) 内に配信できません。問題は、使用している複雑な（自明な）アルゴリズムが原因であると考えました。ネストされたループで構成され、list_of_words 配列の複数回のスキャンが必要です。検索関数のコードは次のとおりです。a[]単語ごとに格納された元の文字列をb[]含み、サブセグメントを形成するために検出される単語のリストを含みます。String gリスト内の単語を含む元の文字列からの単語によって形成された一時的なサブセグメントを格納します。

private static void search() // Function to find the subsegment with a required list of words
{
   int tail,x;//counters 
   String c[]=new String[b.length]; //initializing a temporary array to copy the list of words array.

   for(int i =0; i<a.length;i++)// looping throw original string array
    {
       System.arraycopy(b, 0, c, 0, b.length);//copying the list of words array to the temporary array

        for (int j=0;j<b.length;j++)//looping throw the temporary array
        { 
            x=0; //counter for temporary array

            if(a[i].equalsIgnoreCase(b[j]))//checking for a match with list of words
            {
                tail=i;
//adds the words starting from the position of the first found word(tail) till all the words from the list are found
                while((x<b.length)&&(tail<a.length))

                {
                    g=g+" "+a[tail];//adds the words in the string g

                    for(int k=0;k<c.length;k++) //checks for the found word from the temporary array and replaces it with ""    
                    {
                        if(c[k].equalsIgnoreCase(a[tail]))
                        {
                            c[k]=""; x++;
                        }
                    }
                    tail++;

                }
                if((tail==a.length)&&(x!=b.length))//checks if the string g does not contains all the listed word
                {
                    g="";
                }
                else
                    {
                    count(g);g="";//check for the shortest string.
                    }
            }
        }

    }print();
}

サンプル：

元の文字列 :これはテストです。これはプログラミングのテストです。これはプログラミングテストです。

見つけられる言葉：これ、テスト、プログラミング。

サブセグメント :

これはテストですこれはプログラミングです

プログラミングのテストです

プログラミングテストプログラミングテストこれ

プログラミングをテストするこれをテストする

プログラミングテストこれ

最短サブセグメント: プログラミングテストこれ

データ構造またはループ構造の変更、さらにはそれらを最適化するアルゴリズムの変更に関する提案は役に立ちます。

score 7 · Accepted Answer

動的計画法ソリューション:

探している単語ごとに最後の位置変数を用意します。

探している個別の単語の合計数を取得します (減少することはありません。最大 = 探している単語の数)。

入力内の単語位置ごとに:

探している単語のリストに単語が存在する場合は、その単語の最後の位置を更新します。
更新された最後の位置が初期化されていない場合は、合計カウントを増やします。
合計数が最大数に等しい場合は、最後の位置をループして最小の位置を見つけます。現在の位置とその値の間の距離は、部分文字列の長さになります。これらの値を記録し、すべての位置から最適なものを見つけます。

最適化は、最小の位置を見つけるのにかかる時間を短縮するために、最後の位置にヒープを用意することです (単語を指定してヒープへのポインターの高速検索を可能にする何らかの構造 (おそらくハッシュまたはツリーマップ) と一緒に使用する必要があります) ）。

例：

入力：This is a test. This is a programming test. a programming test this is

探している：this, test, a, programming

                1    2  3  4     5    6  7  8           9     10 11          12   13   14
                This is a  test. This is a  programming test. a  programming test this is
this         -1 1    1  1  1     5    5  5  5           5     5  5           5    13   13
test         -1 -1   -1 -1 4     4    4  4  4           9     9  9           12   12   12
a            -1 -1   -1 3  3     3    3  7  7           7     10 10          10   10   10
programming  -1 -1   -1 -1 -1    -1   -1 -1 8           8     8  11          11   11   11
Count        0  1    1  2  3     3    3  3  4           4     4  4           4    4    4
Substr len   NA NA   NA NA NA    NA   NA NA 5           5     6  7           8    4    5
Shortest len NA NA   NA NA NA    NA   NA NA 5           5     5  5           5    4    4

最良の結果: a programming test this、長さ = 4。

複雑さの分析:

n入力内の単語数kと探している単語数をとします。

このアルゴリズムは入力を 1 回だけ通過し、各ステップで(ヒープ最適化を使用して) 操作を実行しますO(log k)。getMin

したがって、合計所要時間はO(n log k)です。

重複の処理:

探している単語で重複が許可されている場合 (およびターゲットシーケンスがすべての出現に一致する必要がある場合)、上記のアルゴリズムはそのままでは機能しませんが、簡単な修正は、それぞれの単語に独自のポインターのヒープを持たせることです。元のヒープ (このヒープの値は元のヒープの値と同じです)。このヒープの最大サイズは、探している単語内のその単語の出現回数です。

score 4 · Accepted Answer

これが私に起こる実装です。

//Implementing here with two List<String>
//Should be easy enough to use arrays, or streams, or whatever.
public static int getShortestSubseqWith(List<String> text, List<String> words) {
    int minDistance = Integer.MAX_VALUE;
    //Create a map of the last known position of each word
    Map<String, Integer> map = new HashMap();
    for (String word : words) {
        map.put(word, -1);
    }
    String word;
    //One loop through the main search string
    for (int position = 0; position < text.size(); position++){
        word = text.get(position);
        //If the current word found is in the list we're looking for
        if (map.containsKey(word)) {
            //Update the map
            map.put(word, position);
            //And if the current positions are the closest seen so far, update the min value.
            int curDistance = getCurDistance(map);
            if (curDistance < minDistance)
                minDistance = curDistance;
        }
    }
    return minDistance;
}

//Get the current distance between the last known position of each value in the map
private static int getCurDistance(Map<String, Integer> map) {
    int min = Integer.MAX_VALUE;
    int max = 0;
    for (Integer value : map.values()) {
        if (value == -1)
            return Integer.MAX_VALUE;
        else {
            max = Math.max(max,value);
            min = Math.min(min,value);
        }
    }
    return max - min;
}

ここでの主なパフォーマンスへの影響は、ヒットが比較的まばらで、検索する用語のリストが比較的小さい場合、検索対象のループだけtextです。ヒットが非常に頻繁にある場合は、より頻繁に実行されるため、パフォーマンスが低下する可能性がありますgetCurDistance。

score 2 · Accepted Answer

もう 1 つのアプローチは、b[] 内の各単語を a[] 内の出現インデックスにマップすることです。

Map<Integer, List<Integer>> occurrence = new HashMap<Integer, List<Integer>>();
for(int idx = 0; idx < a.length; idx++)
{
  int bIdx = ... retrieve the index of the word a[idx] in b or -1 if it doesn't exist;

  if(bIdx >= 0)
  {
    List<Integer> bIdxOccurs = occurrence.get(bIdx);
    //some code to initially create the lists
    bIdxOccurs.add(idx);
  }
}

次に、インデックスが互いに最も近いマップ内の各単語から出現の組み合わせを見つけます。単純な方法は、すべての組み合わせを生成し、最小のインデックスと最大のインデックスの間の距離を比較することですが、より高速な方法があるかもしれません。そこを考えなきゃいけない…

最後に、最短シーケンスの見つかった最小インデックスと最大インデックスの間にあるすべての単語を a[] から取得します。

score 1 · Accepted Answer

String[] a; // larger string
String[] b; // list of words to search

int index = -1;

for (int i = 0; i < a.length - b.length; i++)
{
    HashSet<String> set = new HashSet<String>(b.length);
    for (String s : b)
        set.add(s);

    boolean found = true;

    for (int j = 0; j < b.length; j++)
    {
        if (set.contains(a[i+j]))
            set.remove(a[i+j]);
        else
        {
            found = false;
            break;
        }
    }
    if (found)
    {
        index = i;
        break;
    }
}

特定の単語の複数のインスタンスを保持できる場合は、簡単になります。これは、b の各単語が一意であることを前提としています。

score 1 · Accepted Answer

この問題は、最小ウィンドウ幅の問題の代替として見ることができます。文字ではなく、ここでは言葉です。

デューケリングの解法とほぼ同じです。唯一のアドオンは、注文で見つかった単語を追跡するための LinkedHashMap の使用です。Java ソリューションは、ここにあります。

これが私のpython実装です


import collections
def minsubstring(sentence, words):
    sentence = sentence.split(' ')
    mintillnow = sentence
    words = set(words.split(' '))
    found = collections.defaultdict(lambda : [-1,-1])#position of word in the sentence and order of the word
    linked = [] # this together with 'found' provides the functionality of LinkedHashMap
    for i, word in enumerate(sentence):
        if word in words:
            found[word][0] = i
            if found[word][1] != -1:#if this word is already seen, remove it from linked list
                del(linked[found[word][1]])
            linked.append(word)#append the newly found word to the tail of the linked list
            # probably the worst part in this code, updating the indexes back to the map
            for i, wword in enumerate(linked):
                found[wword][1] = i
            # if found all the words, then check if the substring is smaller than the one till now and update
            if len(linked) == len(words):
                startPos = found[linked[0]][0]
                endPos = found[linked[-1]][0]
                if (endPos - startPos + 1) < len(mintillnow):
                    mintillnow = sentence[startPos:endPos + 1]
    return ' '.join(mintillnow)

テスト結果


>>> minsubstring('This is a test. This is a programming test. a programming test this is. ','this test a programming')
'a programming test this'

score 0 · Accepted Answer

public final class MaxStringWindow {

    private MaxStringWindow() {}

    private static void addStringCount(Map<String, Integer> map, String str) {
        if (!map.containsKey(str)) {
            map.put(str, 1);
        } else {
            int val = map.get(str);
            map.put(str, val + 1);
        }
    }

    private static Map<String, Integer> toFindMap(List<String> strList) {
        final Map<String, Integer> toFind  = new HashMap<String, Integer>();
        for (String stri : strList) {
            addStringCount(toFind, stri);
        }
        return toFind;
    }


    public static int minWindowSize(String sentence, List<String> strList) {
        final Map<String, Integer> toFind = toFindMap(strList);
        final Map<String, Integer> hasFound  = new HashMap<String, Integer>();

        int matchCtr = 0;
        boolean matchFound = false;
        String currLeftMostString = null;

        int j = 0; // the trailing position of the sliding window
        int i = 0; // the leading position of the sliding window.

        int min = Integer.MAX_VALUE;

        String[] words = sentence.split(" "); 

        for (i = 0; i < words.length; i++) {

            if (!toFind.containsKey(words[i])) {
                continue;
            }

            if (!matchFound) {
                currLeftMostString = words[i];
                matchFound = true;
                j = i;  
            }

            addStringCount(hasFound, words[i]);

            matchCtr++;

            // check if match has been completed.
            if (matchCtr >= strList.size()) {
                if ((i - j + 1) < min) {
                    min = i - j + 1;
                }
            }

            // does the first element exceed value ?
            if (hasFound.get(currLeftMostString) > toFind.get(currLeftMostString)) {
                // advance the left pointer, such the window (i-j) is as small as possible.    
                while (!toFind.containsKey(words[j]) || hasFound.get(words[j]) > toFind.get(words[j])) {
                    if (hasFound.containsKey(words[j])) {
                        int val = hasFound.get(words[j]);
                        hasFound.put(words[j], val - 1);
                    } 
                    j++;
                }
                currLeftMostString = words[j];
            }   
        }


        if (matchCtr < strList.size()) {
            throw new IllegalArgumentException("The subset is not found in the input string.");
        }

        // note: here we dont do (i-j+1) since i has been incremented additionally in a for loop.
        return min > (i - j) ? i - j : min;
    }

}

score 0 · Accepted Answer

一致しなくなるまで頭と尾のポインターを内側に動かし続け、もう一方も同じようにして、内側に動かなくなるまでプロセス全体を繰り返すことで、それを行うことができると思います。後でコーディングしてみるかもしれません。

score 0 · Accepted Answer

より効率的なアルゴリズムの概要を説明します。

文字列を連結しないでください。代わりに、単語ごとに length() + 1 というように、追加する文字を数えます。

サブリストの場合、開始単語、終了単語、文字数を保存します。

短いリストが見つかったら、上記の値を置き換えます。

特定の要素で始まる最初のサブリストを検索し、サブリストの上記の定義 (開始、終了、文字数) を返すメソッドを記述します。

最初の単語を使用して上記のメソッドを呼び出します。値を保存します。開始単語 + 1 を使用してメソッドを呼び出します。短い値が見つかったら、すすぎ、保存を繰り返します。

サブリストの最初の単語が検索語の 1 つでなければならないという事実を利用して、それを改善することもできます。start + 1 から始めて、欠落している要素が 1 つだけであるため、すべての要素ではなく単純にその要素を探すことができます (最初に一致する単語を見つけるには all を使用する必要があります)。サブリストの最後の単語の前にある場合は、サブリストが小さくなります。エンディングの後にある場合は、新しいエンディングです。

これははるかに複雑ですが、潜在的にはるかに高速です。一般的なトレードオフ。

java - 指定された単語を含む最短の部分文字列を見つけるメソッド:最適化が必要

8 に答える 8

Related

Reference