python - 単語を切断しない最長共通部分文字列 - python

Question

以下を考えると、最長の共通部分文字列を見つけることができます：

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)

[アウト]：

foo bar

しかし、最長の共通部分文字列が英語の単語の境界を尊重し、単語を切り詰めないようにするにはどうすればよいでしょうか? たとえば、次の文:

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)

s2 から単語を分割するため、望ましくないフォローを出力します。kappa

a foo bar

目的の出力は次のとおりです。

foo bar

単語境界に関して最長の共通部分文字列を取得する ngram の方法も試しましたが、ngrams を計算せずに文字列を処理する他の方法はありますか? （答えを見てください）

score 1 · Accepted Answer

必要なことは、単語の開始と終了のチェックを追加することだけです。

mその後、有効な試合終了についてのみ更新します。

そのようです：

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    # current character in s1
    x_char = s1[x - 1]
    # we are at the beginning of a word in s1 if
    #   (we are at the beginning of s1) or 
    #   (previous character is a space)
    x_word_begin = (x == 1) or (s1[x - 2] == " ")
    # we are at the end of a word in s1 if
    #   (we are at the end of s1) or 
    #   (next character is a space)
    x_word_end = (x == len(s1)) or (s1[x] == " ")
    for y in xrange(1, 1 + len(s2)):
      # current character in s2
      y_char = s2[y - 1]
      # we are at the beginning of a word in s2 if
      #   (we are at the beginning of s2) or 
      #   (previous character is a space)
      y_word_begin = (y == 1) or (s2[y - 2] == " ")
      # we are at the end of a word in s2 if
      #   (we are at the end of s2) or 
      #   (next character is a space)
      y_word_end = (y == len(s2)) or (s2[y] == " ")
      if x_char == y_char:
        # no match starting with x_char
        if m[x - 1][y - 1] == 0:
          # a match can start only with a space
          #   or at the beginning of a word
          if x_char == " " or (x_word_begin and y_word_begin):
              m[x][y] = m[x - 1][y - 1] + 1
        else:
          m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          # the match can end only with a space
          #   or at the end of a word
          if x_char == " " or (x_word_end and y_word_end):
            longest = m[x][y]
            x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

score 1 · Accepted Answer

コードに受け入れ条件を追加するだけです。

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    ... same as above ...
    # check end of match in s1
    ... your code is a bit hard for me follow, what is end of match? ...
    # check end of match in s2
    ... same as above ...
    return True

print longest_common_substring(s1, s2)

score 1 · Accepted Answer

これは、私が最初に信用したよりも興味深い問題です。考えてみると、4 つの結果が考えられます。

些細なケース、文字列全体が境界なしで一致します（最初の例）
先頭で単語の境界を越える (2 番目の例)
最後に単語の境界を越える
両端に単語境界を設定する

これで、コードが些細なケースを処理するので、それを活用できます。残っているのは、他のケースのいくつかのチェックで結果をラップすることだけです。では、これらのチェックはどのように見えるべきでしょうか? あなたの失敗例を見てみましょう：

string 1 = "this is a foo bar sentence ."
string 2 = "what a kappa foo bar black sheep ?"
output string = "a foo bar"

したがって、文字列のfind観点からは、との両方でこれらすべての文字をその順序で見つけることができますが、スペースの周りのすべてをリストに分割し、リストを順番に検索すると、一致するもののみが一致します。string1string2string1

今、私は主にCの人なので、それを関数に書きたいと思います:

def full_string(str1, str2, chkstr):
  l1 = str1.split()
  l2 = str2.split()
  chkl = chkstr.split()
  return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
          any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))

この関数を使用すると、2 つの文字列のいずれかに、結果のすべての単語がlongest_common_substring(s1, s2)順番に含まれていないかどうかを確認できます。完全。したがって、最後のステップは、これら 2 つの関数を組み合わせて、上記の 4 つのケースのそれぞれをチェックすることです。

def longest_whole_substring(s1, s2):
  subs = longest_common_substring(s1, s2)
  if not full_string(s1, s2, subs):
    if full_string(s1, s2, ' '.join(subs.split()[1:])):
      subs = ' '.join(subs.split()[1:])
    elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
      subs = ' '.join(subs.split()[:-1])
    else:
      subs = ' '.join(subs.split()[1:-1])
  return subs

これで、関数longest_whole_substring(s1, s2)は単語を切り捨てずに最長の部分文字列全体を提供します。それぞれのケースでテストしてみましょう。

些細なこと:

>>> a = 'this is a foo bar bar foo string'
>>> b = 'foo bar'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

先頭の単語境界:

>>> b = 's a foo bar'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar '

最後の単語境界:

>>> b = 'foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

そして、両端に Word 境界があります:

>>> b = 's a foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

いい感じ！

score 1 · Accepted Answer

私は再帰的にそれをしました：

def common_phrase(self, longer, shorter):
""" recursively find longest common substring, consists of whole words only and in the same order """
if shorter in longer:
    return shorter
elif len(shorter.split()) > 1:
    common_phrase_without_last_word = common_phrase(shorter.rsplit(' ', 1)[0], longer)
    common_phrase_without_first_word = common_phrase(shorter.split(' ', 1)[1], longer)
    without_first_is_longer = len(common_phrase_without_last_word) < len(common_phrase_without_first_word)

    return ((not without_first_is_longer) * common_phrase_without_last_word +
            without_first_is_longer * common_phrase_without_first_word)
else:
    return ''

適用する前に、2 つの文字列を「短い」と「長い」に分類するだけです。

if len(str1) > len(str2):
    longer, shorter = str1, str2 
else:
    longer, shorter = str2, str1

score 0 · Accepted Answer

ngram の方法は次のとおりです。

def ngrams(text, n):
  return [text[i:i+n] for i in xrange(len(text)-n)]

def longest_common_ngram(s1, s2):
  s1ngrams = list(chain(*[[" ".join(j) for j in ngrams(s1.split(), i)] 
                          for i in range(1, len(s1.split()))]))
  s2ngrams = list(chain(*[[" ".join(j) for j in ngrams(s2.split(), i)]
                          for i in range(1, len(s2.split()))]))

  return set(s1ngrams).intersection(set(s2ngrams))

score 0 · Accepted Answer

最長の共通部分文字列を見つける効率的な方法の 1 つは、サフィックスツリーです ( http://en.wikipedia.org/wiki/Suffix_treeおよびhttp://en.wikipedia.org/wiki/Longest_common_substring_problemを参照)。文字の代わりに単語を使用して接尾辞ツリーを作成できない理由はわかりません。その場合、ツリーから抽出された最長共通部分列はトークン境界を尊重します。このアプローチは、1 つの固定文字列と多数の他の文字列の間で共通の部分文字列を見つけたい場合に特に効率的です。

Python サフィックスツリーの実装のリストについては、python: library for generalized suffix treesに対する受け入れられた回答を参照してください。

python - 単語を切断しない最長共通部分文字列 - python

9 に答える 9

Related

Reference