python - 文字列のリストから特定の文字列に一致する最適なサブセットを見つける

Question

私は文字列を持っています

s = "mouse"

および文字列のリスト

sub_strings = ["m", "o", "se", "e"]

s に一致するリストの sub_strings の最適かつ最短の一致サブセットは何かを見つける必要があります。これを行う最善の方法は何ですか？理想的な結果は ["m", "o", "se"].

score 2 · Accepted Answer

import difflib
print difflib.get_close_matches(target_word,list_of_possibles)

残念ながら、上記の例では機能しません。代わりにレーベンスタイン距離を使用できます...

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

sub_strings = ["mo", "m,", "o", "se", "e"]
s="mouse"
print sorted(sub_strings,key = lambda x:levenshtein_distance(x,s))[0]

これにより、常にターゲットに「最も近い」単語が得られます（最も近い定義の場合）

http://www.korokithakis.net/posts/finding-the-levenshtein-distance-in-python/ から盗まれたレーベンシュタイン関数

score 2 · Accepted Answer

正規表現を使用できます。

import re

def matches(s, sub_strings):
    sub_strings = sorted(sub_strings, key=len, reverse=True)
    pattern = '|'.join(re.escape(substr) for substr in sub_strings)
    return re.findall(pattern, s)

これは少なくとも短くて速いですが、必ずしも最適な組み合わせが見つかるとは限りません。それは貪欲すぎる。例えば、

matches("bears", ["bea", "be", "ars"])

返す["bea"]必要があるときに返します["be", "ars"]。

コードの説明:

最初の行は部分文字列を長さで並べ替え、最も長い文字列がリストの先頭に表示されるようにします。これにより、正規表現が短いものよりも長い一致を優先するようになります。
|2 行目は、「または」を意味する記号で区切られたすべての部分文字列からなる正規表現パターンを作成します。
3 行目は、re.findall関数を使用して、指定された string 内のパターンのすべての一致を検索しますs。

score 2 · Accepted Answer

このソリューションは、ユーザーRunning Wildによるこの回答に基づいています。Stefan Behnel によるパッケージを使用して、 Aho-Corasick アルゴリズムを使用してターゲット内の部分文字列のすべての一致を効率的に見つけ、動的計画法を使用して答えを見つけます。acora

import acora
import collections

def best_match(target, substrings):
    """
    Find the best way to cover the string `target` by non-overlapping
    matches with strings taken from `substrings`. Return the best
    match as a list of substrings in order. (The best match is one
    that covers the largest number of characters in `target`, and
    among all such matches, the one using the fewest substrings.)

    >>> best_match('mouse', ['mo', 'ou', 'us', 'se'])
    ['mo', 'us']
    >>> best_match('aaaaaaa', ['aa', 'aaa'])
    ['aaa', 'aa', 'aa']
    >>> best_match('abracadabra', ['bra', 'cad', 'dab'])
    ['bra', 'cad', 'bra']
    """
    # Find all occurrences of the substrings in target and store them
    # in a dictionary by their position.
    ac = acora.AcoraBuilder(*substrings).build()
    matches = collections.defaultdict(set)
    for match, pos in ac.finditer(target):
        matches[pos].add(match)

    n = len(target)
    # Array giving the best (score, list of matches) found so far, for
    # each initial substring of the target.
    best = [(0, []) for _ in xrange(n + 1)]
    for i in xrange(n):
        bi = best[i]
        bj = best[i + 1]
        if bi[0] > bj[0] or bi[0] == bj[0] and len(bi[1]) < bj[1]:
            best[i + 1] = bi
        for m in matches[i]:
            j = i + len(m)
            bj = best[j]
            score = bi[0] + len(m)
            if score > bj[0] or score == bj[0] and len(bi[1]) < len(bj[1]):
                best[j] = (score, bi[1] + [m])
    return best[n][1]

python - 文字列のリストから特定の文字列に一致する最適なサブセットを見つける

3 に答える 3

Related

Reference