python - Find the same parts of a string

Question

I have a string like this: abcgdfabc

I want to do the following. Input: a string, for example:

abcgdfabc

Output: a dict (the key is the "word", the value is the number of times it appears):

abc:2
gdf:1

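The "words" should be as long as possible, so the match needs to be greedy.

I have spent a lot of time on this and can't figure it out. The string is longer than 5000 characters; it is a genome. We want to study the relationships within it, and the first step is to build a dict like this to make the data clearer. Please help.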

score 2 · Accepted Answer

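This regex looks for a group of alphanumeric characters, followed by any number of other alphanumeric characters, and then the group itself again. The code then iterates over the matches with duplicates removed and prints each group of characters with its number of occurrences.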

import re

s = "eg,abcgdfabc"
# (?:...) is non-capturing, so findall returns plain strings, not tuples
for word in set(re.findall(r'(\w+)(?:\w*?\1)+', s)):
    print(word, s.count(word))

prints

abc 2

However, since we don't know exactly what a word is, there is another catch: only one repeated word is found in the following string.

abcdeabcecd
abc  abc    <- this will be found
  cd     cd <- this won't be found
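
To see every repeated candidate in a string like the one above, one option is a brute-force count of all substrings within a bounded length range. This is only a sketch: the function name `repeated_substrings` and the `min_len`/`max_len` bounds are assumptions made for illustration, not part of this answer, and overlapping occurrences are counted.

```python
from collections import Counter

def repeated_substrings(s, min_len=2, max_len=10):
    """Return {substring: count} for substrings that occur more than once.

    Hypothetical helper: min_len and max_len are illustrative bounds that
    keep the number of candidates manageable on long inputs.
    """
    counts = Counter(
        s[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(s) - n + 1)
    )
    return {sub: c for sub, c in counts.items() if c > 1}

print(repeated_substrings("abcdeabcecd"))
# {'ab': 2, 'bc': 2, 'cd': 2, 'abc': 2}
```

On this example both `abc` and `cd` show up, along with the shorter pieces `ab` and `bc`, so some filtering is still needed to choose the words you care about.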

score 1 · Accepted Answer

Here's an ugly solution:

def parse(s,L=None):
    do_return=L is None
    if(not s):
        return
    if(do_return):
        L=[]
    substr=s[0]
    for i in range(1,len(s)-1):
        if s[:i] in s[i:]:
            substr=s[:i]
        else:
            L.append(substr)
            parse(s.replace(substr,''),L=L)
            break
    else:
        L.append(s)

    if(do_return):
        LL=[(ss,s.count(ss)) for ss in L] #Count the number of times each substring appears
        LLL=[]
        #Now some of our (unmatched) substrings will be adjacent to each other.
        #We should merge all adjacent unmatched strings together.
        while LL:
            LLL.append(LL.pop(0))
            while LLL[-1][1] == 1 and LL: #check if next is unmatched
                if(LL[0][1]==1): #unmatched, merge and remove
                    LLL[-1]=(LLL[-1][0]+LL[0][0],1)
                    LL.pop(0)
                else: #matched, keep on going.
                    break
        d={}
        for k,v in LLL:
            d[k]=v

        return d


S='eg,abcgdfabc'
print(parse(S))  # {'e': 1, 'g': 2, ',': 1, 'abc': 2, 'df': 1}

Of course, this doesn't work the way you expect, since g matches twice (because the matching is greedy) ...
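
One possible way around the single-character match on g is to require a minimum word length and to scan left to right, taking the longest repeated substring at each position. The following is a rough sketch under those assumptions, not a fix of the function above: the name `greedy_words` and the parameters `min_len` and `max_len` are hypothetical, and `max_len` is only there to keep the run time reasonable on a string of several thousand characters.

```python
from collections import Counter

def greedy_words(s, min_len=2, max_len=20):
    """Greedily split s into repeated words and unmatched runs, then count them.

    At each position, take the longest substring (min_len..max_len characters)
    that occurs more than once in s; otherwise add the character to the
    current run of unmatched text.
    """
    words = []
    unmatched = ""
    i = 0
    while i < len(s):
        best = ""
        # Try the longest candidate first so the match is greedy.
        for j in range(min(len(s), i + max_len), i + min_len - 1, -1):
            candidate = s[i:j]
            if s.count(candidate) > 1:
                best = candidate
                break
        if best:
            if unmatched:
                words.append(unmatched)
                unmatched = ""
            words.append(best)
            i += len(best)
        else:
            unmatched += s[i]
            i += 1
    if unmatched:
        words.append(unmatched)
    return Counter(words)

print(dict(greedy_words("abcgdfabc")))
# {'abc': 2, 'gdf': 1}
```

With `min_len=2` the input 'eg,abcgdfabc' gives 'eg,' once, 'abc' twice and 'gdf' once. Note that the counts are the number of pieces produced by the split, not the result of `s.count`.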

If you always want to iterate in groups of 3, this becomes much easier (and prettier):

from collections import defaultdict
def parse(s,stride=3):
    d=defaultdict(lambda:0)
    while s:
        key=s[:stride]
        d[key]+=1
        s=s[stride:]

    # if you need a regular dictionary: return dict(d)
    return d

score 0 · Accepted Answer

If you are using Python 2.7 or later:

>>> from itertools import islice
>>> from collections import Counter
>>> def split_steps(step, sequence):
...    it = iter(sequence)
...    bits = ''.join(islice(it,step))
...    while bits:
...        yield bits
...        bits = ''.join(islice(it,step))
...
>>> Counter(split_steps(3,'abcdgfabc')).most_common()
[('abc', 2), ('dgf', 1)]
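
When the input is a plain string rather than an arbitrary iterable, slicing should give the same chunks with less machinery. A minimal sketch, where `chunk_counts` is a made-up name used only for this example:

```python
from collections import Counter

def chunk_counts(s, step=3):
    """Count the non-overlapping chunks of length `step` in s.

    The final chunk may be shorter than `step` if len(s) is not a multiple.
    """
    return Counter(s[i:i + step] for i in range(0, len(s), step))

print(chunk_counts("abcdgfabc").most_common())
# [('abc', 2), ('dgf', 1)]
```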
