python - 2 つの文字列の間で共通の部分文字列を見つける

Question

2 つの文字列を比較して、一致したものを保持し、比較が失敗した場所で分割したいと思います。

だから私は2つの文字列を持っている場合 -

string1 = apples
string2 = appleses

answer = apples

別の例として、文字列には複数の単語が含まれている可能性があります。

string1 = apple pie available
string2 = apple pies

answer = apple pie

これを行う簡単なPythonの方法があると確信していますが、解決できません。助けと説明をいただければ幸いです。

score 40 · Accepted Answer

def common_start(sa, sb):
    """ returns the longest common substring from the beginning of sa and sb """
    def _iter():
        for a, b in zip(sa, sb):
            if a == b:
                yield a
            else:
                return

    return ''.join(_iter())

>>> common_start("apple pie available", "apple pies")
'apple pie'

または少し奇妙な方法：

def stop_iter():
    """An easy way to break out of a generator"""
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

どちらがより読みやすいかもしれません

def terminating(cond):
    """An easy way to break out of a generator"""
    if cond:
        return True
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))

score 40 · Accepted Answer

os.path.commonprefixまた、文字で機能するため、任意の文字列に使用できると考えるかもしれません。

import os
common = os.path.commonprefix(['apple pie available', 'apple pies'])
assert common == 'apple pie'

関数名が示すように、これは 2 つの文字列の共通のプレフィックスのみを考慮します。

score 17 · Accepted Answer

その最長共通部分文字列問題と呼ばれます。ここでは、単純で理解しやすいが非効率的な解決策を紹介します。このアルゴリズムの複雑さは O(N^2) であるため、大きな文字列に対して正しい出力を生成するには長い時間がかかります。

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

print longestSubstringFinder("apple pie available", "apple pies")
print longestSubstringFinder("apples", "appleses")
print longestSubstringFinder("bapples", "cappleses")

出力

apple pie
apples
apples

score 6 · Accepted Answer

Evo のと同じですが、比較する文字列の数は任意です。

def common_start(*strings):
    """ Returns the longest common substring
        from the beginning of the `strings`
    """
    def _iter():
        for z in zip(*strings):
            if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                yield z[0]
            else:
                return

    return ''.join(_iter())

score 2 · Accepted Answer

試す：

import itertools as it
''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

両方の文字列の先頭から比較します。

score -1 · Accepted Answer

これは最も効率的な方法ではありませんが、私が思いついた方法であり、機能します。改善できる人がいればお願いします。行列を作成し、文字が一致する場所に 1 を置きます。次に、行列をスキャンして 1 の最長の対角線を見つけ、開始位置と終了位置を追跡します。次に、開始位置と終了位置を引数として、入力文字列の部分文字列を返します。

注: これは、最も長い共通部分文字列を 1 つだけ検索します。複数ある場合は、配列を作成して結果を格納し、それを返すことができます。また、大文字と小文字が区別されるため、(アップルパイ、アップルパイ) は pple パイを返します。

def longestSubstringFinder(str1, str2):
answer = ""

if len(str1) == len(str2):
    if str1==str2:
        return str1
    else:
        longer=str1
        shorter=str2
elif (len(str1) == 0 or len(str2) == 0):
    return ""
elif len(str1)>len(str2):
    longer=str1
    shorter=str2
else:
    longer=str2
    shorter=str1

matrix = numpy.zeros((len(shorter), len(longer)))

for i in range(len(shorter)):
    for j in range(len(longer)):               
        if shorter[i]== longer[j]:
            matrix[i][j]=1

longest=0

start=[-1,-1]
end=[-1,-1]    
for i in range(len(shorter)-1, -1, -1):
    for j in range(len(longer)):
        count=0
        begin = [i,j]
        while matrix[i][j]==1:

            finish=[i,j]
            count=count+1 
            if j==len(longer)-1 or i==len(shorter)-1:
                break
            else:
                j=j+1
                i=i+1

        i = i-count
        if count>longest:
            longest=count
            start=begin
            end=finish
            break

answer=shorter[int(start[0]): int(end[0])+1]
return answer

score -1 · Accepted Answer

最初に、部分文字列を生成するためにitertools ペアワイズレシピから適合されたヘルパー関数。

import itertools
def n_wise(iterable, n = 2):
    '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

    n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
    a = itertools.tee(iterable, n)
    for x, thing in enumerate(a[1:]):
        for _ in range(x+1):
            next(thing, None)
    return zip(*a)

次に、部分文字列を反復処理する関数が、最初に最長のものを繰り返し、メンバーシップをテストします。（効率は考慮していません）

def foo(s1, s2):
    '''Finds the longest matching substring
    '''
    # the longest matching substring can only be as long as the shortest string
    #which string is shortest?
    shortest, longest = sorted([s1, s2], key = len)
    #iterate over substrings, longest substrings first
    for n in range(len(shortest)+1, 2, -1):
        for sub in n_wise(shortest, n):
            sub = ''.join(sub)
            if sub in longest:
                #return the first one found, it should be the longest
                return sub

s = "fdomainster"
t = "exdomainid"
print(foo(s,t))

>>> 
domain
>>>

python - 2 つの文字列の間で共通の部分文字列を見つける

20 に答える 20

Related

Reference