python - n 個の間違いがある B の A の部分文字列を見つける

Question

特定の数のエラー (n) を許容しながら、A が B の部分文字列であるかどうかを判断する必要があります。A と B はどちらも文字列で、n は整数です。最大の問題は、内側の while ループです。必要なものを出力するようにコーディングする方法がわかりません。B をループして、n 個の間違いがある A の一致を探します。A が「abcd」で B が「bdefabddghl」の場合、出力は、A が位置 4 で 1 つの間違いで見つかったということになります。これは私が今持っているコードです。内側の while ループをコーディングする方法を知る必要があるだけです。どんな助けでも大歓迎です。

def subsequence(A,B,n):
answer = -1             # assume worst case for there-exists type of loop
j = 0                   # index b
while j<=len(B)-len(A) and answer==-1:  # there exists j in b loop
    bx = j              # assume a is found in b starting at b[j]
    i = 0               # assume best case for a for-all type of loop
    axy = n             # accumulator for n
        while i<len(A) and bx==j and axy > 0:   # for all i in a
        if A[i] == B[j-i] and axy > 0:
            bx = j
            axy = n         # accumulator for n
        if A[i] != B[j-i] and axy > 0:
            axy -= 1
        i+=1
        # while i
    j+=1
# while j
end = "A best match with " + str(n-axy) + " errors was found starting at position " + str(bx)."
return end
print subsequence("abcd","bcjabddec",3)

ありがとうございました

score 0 · Accepted Answer

私はあなたがこの演習、またはそれに非常に似たものに取り組んでいると仮定しています: http://www.cs.hofstra.edu/~cscccl/csc15p/dnalab.txt

最初の例が特に役立つかどうかはわかりません。理解するのは非常に難しいため、指定された方法で適応するのは非常に困難です。このような問題に取り組む方法について、いくつかの推奨事項を示します。

問題を解決しやすい小さな問題に分割します。
各関数をテストして期待する結果が得られるコードを記述します。
変数名と関数名には、それぞれの意味が明確になるような名前を付けてください。これにより、自分自身とコードを見せる相手の両方がコードを理解しやすくなります。

あなたがしようとしているのは、文字列 A を文字列 B に沿ってスライドさせ、各位置でどれだけ一致するかをテストし、見つかった最良の一致を覚えておくことです。問題を分離する簡単な部分は、2 つの文字列の特定の配置でどの程度一致するかを測定することです。

ソリューションの一般的な形についての私の提案は次のとおりです。

def string_distance(A, B, index_b):
    '''Count how many substitutions are required to
    change string A into the string beginning at
    index_b in string B.'''
    return 999 # FIXME

def subsequence(A, B, n):
    '''Search for the substring of B that is equal to
    A except for the minimal number of substitutions,
    and return a tuple of the index of that substring
    in B and the number of substitutions required,
    unless there is no such substring with n or fewer
    substitutions, in which case return the tuple of
    (-1, -1).'''
    answer_distance = -1
    answer_index = -1
    index = 0
    while index <= len(B) - len(A):
        substitutions_required = string_distance(A, B, index)
        # Does a match at this location need n or fewer substitutions?
        is_close_enough = False # FIXME
        # Is a match at this location better than the best match so far?
        is_better = False # FIXME
        if is_close_enough and is_better:
            answer_distance = substitutions_required
            answer_index = index
        index += 1
    return answer_index, answer_distance

基本的なテストは次のようになります。

def test_string_distance():
    test_data = [
        # a         b          index_b   expected
        ("ABC",    "ABCDEF",   0,        0),
        ("ABX",    "ABCDEF",   0,        1),
        ("XYZ",    "ABCDEF",   0,        3),
        ("XBC",    "ABCDEF",   0,        1),
        ("CEE",    "ABCDEF",   2,        1),
        ("DEF",    "ABCDEF",   3,        0),
        ("AAAAA",  "BBBBBBBB", 3,        5),
        ("BAAAA",  "BBBBBBBB", 3,        4),
        ("ABAAB",  "BBBBBBBB", 3,        3),
    ]
    for a, b, index_b, expected in test_data:
        result = string_distance(a, b, index_b)
        if result != expected:
            print "string_distance({}, {}, {}) was {} but should be {}".format(
                    a, b, index_b, result, expected)

def test_subsequence():
    test_data = [
        # A         B               n   expected
        ("AXY",    "AYAXXXAAYYAX",  3,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  2,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  1,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  0,  (-1,-1)),
        ("XAAY",   "AYAXXXAAYYAX",  2,  (5,0)),
        ("XAAY",   "XXAXAAXAAY",    2,  (6,0)),
        ("ABCDEF", "ZZAABAXCDEEEF", 3,  (5,2)),
        ("ABCDEF", "ZZAABAXCDEEEF", 2,  (5,2)),
        ("ABCDEF", "ZZAABAXCDEEEF", 1,  (-1,-1)),
        ("ABCDEF", "ZZAABXBCDEEEF", 3,  (5,2)),
    ]
    for a, b, n, expected in test_data:
        result = subsequence(a, b, n)
        if result != expected:
            print "test_subsequence({}, {}, {}) was {} but should be {}".format(
                    a, b, n, result, expected)

if __name__ == '__main__':
    test_string_distance()
    test_subsequence()

最初に、これらのテストに合格する string_distance の実装を見つけます。次に、サブシーケンスを機能させます。

score 0 · Accepted Answer

あなたはおそらくdifflib.get_close_matches機能を見てみるべきです。それはほとんど同じことをします。本当に実装したい場合は、同じ長さのすべてのサブシーケンスのリストを作成し、それらを一致数でソートするだけで、はるかに簡単に実装できます。

def subsequence(a, b, n):

    sb = [(i, b[i:i+len(a)]) for i in xrange(len(b) - len(a) + 1)]

    def count_matches(s):
        return sum(int(x == y) for (x, y) in zip(a, s[1]))

    sb.sort(key=count_matches, reverse=True)

    best = sb[0]

    e = len(a) - count_matches(best)
    i = best[0]
    w = best[1]

    print 'A best match with %d errors was found starting at position %d' % (e, i)

python - n 個の間違いがある B の A の部分文字列を見つける

2 に答える 2

Related

Reference