java - Efficient way to search for a set of strings in a string in Java

Question

I have a set of elements of size about 100-200. Let a sample element be X.

Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}

For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.

X is present in P iff for all s belong to X, s is a substring of P.

The set of elements is available for pre-processing.

I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:

Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring

Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.

Please note that the set of elements is available for pre-processing and I want the solution in java.

score 2 · Accepted Answer

You can use regex directly:

Pattern regex = Pattern.compile(
    "^               # Anchor search to start of string\n" +
    "(?=.*s1)        # Check if string contains s1\n" +
    "(?=.*s2)        # Check if string contains s2\n" +
    "(?=.*s3)        # Check if string contains s3", 
    Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();

foundMatch is true if all three substrings are present in the string.

Note that you might need to escape your "needle strings" if they could contain regex metacharacters.

score 1 · Accepted Answer

特定のアプローチが実際には遅すぎることに気付く前に、コードを時期尚早に最適化しているようです。

文字列のセットの優れた特性は、文字列にのすべての要素がサブ文字列として含まれている必要があることです。つまり、その要素の1つがに含まれていないX場合、すぐに失敗する可能性があります。これは、他の方法よりも時間を節約するアプローチになる可能性があります。特に、の要素が通常数文字より長く、繰り返し文字が含まれていないか、数文字しか含まれていない場合はなおさらです。たとえば、正規表現エンジンは、繰り返しのない文字（たとえば、海岸）を含む5つの長さの文字列の存在をチェックするときに、100の長さの文字列で20文字をチェックするだけで済みます。そして、あなたは本当に100-200の要素を持っているので、できれば本当に早く失敗したいと思っています。XPXX

私の提案は、文字列を長さの順に並べ替え、各文字列を順番にチェックし、1つの文字列が見つからない場合は早期に停止することです。

score 1 · Accepted Answer

Looks like a perfect case for the Rabin–Karp algorithm:

Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.

score 0 · Accepted Answer

1つの方法は、可能なすべての部分文字列を生成し、これをセットに追加することです。これはかなり非効率的です。

代わりに、任意のポイントから最後まですべての文字列をNavigableSetに作成し、最も近いものを検索できます。最も近い一致が探している文字列で始まる場合、部分文字列の一致があります。

static class SubstringMatcher {
    final NavigableSet<String> set = new TreeSet<String>();

    SubstringMatcher(Set<String> strings) {
        for (String string : strings) {
            for (int i = 0; i < string.length(); i++)
                set.add(string.substring(i));
        }
        // remove duplicates.
        String last = "";
        for (String string : set.toArray(new String[set.size()])) {
            if (string.startsWith(last))
                set.remove(last);
            last = string;
        }
    }

    public boolean findIn(String s) {
        String s1 = set.ceiling(s);
        return s1 != null && s1.startsWith(s);
    }
}

public static void main(String... args) {
    Set<String> strings = new HashSet<String>();
    strings.add("hello");
    strings.add("there");
    strings.add("old");
    strings.add("world");
    SubstringMatcher sm = new SubstringMatcher(strings);
    System.out.println(sm.set);
    for (String s : "ell,he,ow,lol".split(","))
        System.out.println(s + ": " + sm.findIn(s));
}

プリント

[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false

score 0 · Accepted Answer

おそらく、文字列のセット（辞書）からオートマトン（トライのような）を構築し、このオートマトンを使用して入力文字列を辞書に一致させようとするAho-Corasickアルゴリズムを探しています。

score 0 · Accepted Answer

When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.

The algorithm to index a string would look like that (untested):

HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();

for (int pos = 0; pos < string.length(); pos++) {
    for (int sublen=0; sublen < string.length-pos; sublen++) {
         String substring = string.substr(pos, sublen);
         Set<String> stringsForThisKey = indexes.get(substring);
         if (stringsForThisKey == null) {
             stringsForThisKey = new HashSet<String>();
             indexes.put(substring, stringsForThisKey);
         }
         stringsForThisKey.add(string);
    }
}

Indexing each string that way would be quadratic to the length of the string, but it only needs to be done once for each string.

But the result would be constant-speed access to the list of strings in which a specific string occurs.

score 0 · Accepted Answer

You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here

I have used proprietary implementations (that I no longer even have access to) and they are very fast.

java - Efficient way to search for a set of strings in a string in Java

7 に答える 7

Related

Reference