python - 文字列内の最初の非繰り返し文字を見つける最良の方法

Question

次のような文字列の最初の非繰り返し文字を見つけるための最良のスペースと時間効率の良い解決策は何でしょうaabccbdcbeか？

ここでの答えはdです。ですから、私を驚かせるのは、2つの方法でそれを行うことができるということです。

すべてのインデックスについて、iはi-1回ループし、その文字が再び出現するかどうかを確認します。ただし、これは効率的ではありません。このメソッドの拡張はO（N ^ 2）です。ここで、Nは文字列の長さです。
もう1つの考えられる良い方法は、重み（出現回数）に基づいて文字を並べ替えることができるように、ツリーまたはその他のdsを形成できる場合です。これにより、構造を形成するために、文字列を通る長さNのループが1つだけ必要になる可能性があります。これは、O（N）+ O（ツリーまたは任意のdsを構築する時間）です。

score 18 · Accepted Answer

これが非常に簡単なO(n)解決策です：

def fn(s):
  order = []
  counts = {}
  for x in s:
    if x in counts:
      counts[x] += 1
    else:
      counts[x] = 1 
      order.append(x)
  for x in order:
    if counts[x] == 1:
      return x
  return None

文字列を1回ループします。新しい文字に出くわしたときはcounts、値をで格納し1、に追加しますorder。以前に見た文字に出くわしたとき、その値を。でインクリメントしますcounts。最後に、値がinorderの文字が見つかるまでループして、それを返します。1counts

score 8 · Accepted Answer

リスト内包表記では、文字が1回だけ表示される場合、表示される順序で文字が表示されます。

In [61]: s = 'aabccbdcbe'

In [62]: [a for a in s if s.count(a) == 1]
Out[62]: ['d', 'e']

次に、これの最初のエントリを返します。

In [63]: [a for a in s if s.count(a) == 1][0]
Out[63]: 'd'

最初のエントリだけが必要な場合は、ジェネレータも機能します。

In [69]: (a for a in s if s.count(a) == 1).next()
Out[69]: 'd'

score 7 · Accepted Answer

文字列から繰り返し文字を削除すると、操作回数が大幅に減ると思います。例えば：

s = "aabccbdcbe"
while s != "":
    slen0 = len(s)
    ch = s[0]
    s = s.replace(ch, "")
    slen1 = len(s)
    if slen1 == slen0-1:
        print ch
        break;
else:
    print "No answer"

score 5 · Accepted Answer

検索の速度は、いくつかの要因によって異なります。

文字列の長さ
1回限りのキャラクターが存在しない位置
この位置の後の文字列のサイズ
文字列に出現するさまざまな文字の数

。

次のコードでは、最初に、2つの文字列から、という名前の1回限り発生する文字のグループを使用して文字列s
を定義し、連結します。ここで： random.choice()unik
s1s2s1 + s2

s1nwo一度だけ出現する文字がない長さの文字列です
s2nwi1回限りの文字が含まれる長さの文字列です

。

#### creation of s from s1 and s2 #########

from random import choice

def without(u,n):
    letters = list('abcdefghijklmnopqrstuvwxyz')
    for i in xrange(n):
        c = choice(letters)
        if c not in unik:
            yield c

def with_un(u,n):
    letters = list('abcdefghijklmnopqrstuvwxyz')
    ecr = []
    for i in xrange(n):
        c = choice(letters)
        #ecr.append('%d %s  len(letters) == %d' % (i,c,len(letters)))
        yield c
        if c in unik:
            letters.remove(c)
    #print '\n'.join(ecr)

unik = 'ekprw'
nwo,nwi = 0,500
s1 = ''.join(c for c in without(unik,nwo))
s2 = ''.join(c for c in with_un(unik,nwi))
s = s1 + s2

if s1:
    print '%-27ss2 : %d chars' % ('s1 : %d chars' % len(s1),len(s2))
    for el in 'ekprw':
        print ('s1.count(%s) == %-12ds2.count(%s) == %d'
               % (el,s1.count(el),el,s2.count(el)))
    others = [c for c in 'abcdefghijklmnopqrstuvwxyz' if c not in unik]
    print 's1.count(others)>1 %s' % all(s1.count(c)>1 for c in others)
else:
    print "s1 == ''     len(s2) == %d" % len(s2)
    for el in 'ekprw':
        print ('   -         s2.count(%s) == %d'
               % (el,s2.count(el)))
print 'len of s  == %d\n' % len(s)

次に、ベンチマークがあります。数値を
変えると、速度への影響がわかります。nwonwi

### benchmark of three solutions #################

from time import clock


# Janne Karila
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
    pass
te = clock()
c = OrderedCounter(s)
rjk = (item for item, count in c.iteritems() if count == 1).next()
tf = clock()-te
print 'Janne Karila  %.5f    found: %s' % (tf,rjk)

# eyquem
te = clock()
candidates = set(s)
li = []
for x in s:
    if x in candidates:
        li.append(x)
        candidates.remove(x)
    elif x in li:
        li.remove(x)
rey = li[0]
tf = clock()-te
print 'eyquem        %.5f    found: %s' % (tf,rey)

# TyrantWave
te = clock()
rty = (a for a in s if s.count(a) == 1).next()
tf = clock()-te
print 'TyrantWave    %.5f    found: %s' % (tf,rty)

。

いくつかの結果

nullの長さの場合、 nwo s1=0およびnwi=50：

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 1
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila  0.00077    found: e
eyquem        0.00013    found: e
TyrantWave    0.00005    found: e

TyrantWaveのソリューションは、最初の1つの文字が文字列の最初の位置ですばやく検出されるため、より高速です。

。

nwo=300およびnwi=50の場合（ 1回限り発生する文字の出現はの構築中に保持されなかったため、
以降は401文字。関数without（）を参照）s1s1

s1 : 245 chars             s2 : 50 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 295

Janne Karila  0.00167    found: e
eyquem        0.00030    found: e
TyrantWave    0.00042    found: e

今回のTyrantWaveのソリューションは、最初の部分のすべての文字の出現をカウントする必要があるため、つまり、1回限りの文字がない（2番目の部分にある）ため、私のソリューションよりもs長くs1なりs2ます
。私のソリューションでより短い時間を取得nwoします。nwi

。

nwo=300およびnwi=5000の場合

s1 : 240 chars             s2 : 5000 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 5240

Janne Karila  0.01510    found: p
eyquem        0.00534    found: p
TyrantWave    0.00294    found: p

の長さが長くなるとs2、TyrantWaveのソリューションが再び優れたものになります。

。

あなたが望むものを結論付ける

。

編集

ローマの素晴らしいアイデア！
ベンチマークにRomanのソリューションを追加したところ、勝ちました。

私はまた、彼の解決策を改善するためにいくつかの小さな変更を加えました。

# Roman Fursenko
srf = s[:]
te = clock()
while srf != "":
    slen0 = len(srf)
    ch = srf[0]
    srf = srf.replace(ch, "")
    slen1 = len(srf)
    if slen1 == slen0-1:
        rrf = ch
        break
else:
    rrf = "No answer"
tf = clock()-te
print 'Roman Fursenko %.6f    found: %s' % (tf,rrf)

# Roman Fursenko improved
srf = s[:]
te = clock()
while not(srf is ""):
    slen0 = len(srf)
    srf = srf.replace(srf[0], "")
    if len(srf) == slen0-1:
        rrf = ch
        break
else:
    rrf = "No answer"
tf = clock()-te
print 'Roman improved %.6f    found: %s' % (tf,rrf)

print '\nindex of %s in the string :  %d' % (rty,s.index(rrf))

。

結果は次のとおりです。

。

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 1
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila   0.0032538    found: r
eyquem         0.0001249    found: r
TyrantWave     0.0000534    found: r
Roman Fursenko 0.0000299    found: r
Roman improved 0.0000263    found: r

index of r in the string :  1

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 0
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila   0.0008183    found: a
eyquem         0.0001285    found: a
TyrantWave     0.0000550    found: a
Roman Fursenko 0.0000433    found: a
Roman improved 0.0000391    found: a

index of a in the string :  4

>>

s1 : 240 chars             s2 : 50 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 0
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 290

Janne Karila   0.0016390    found: e
eyquem         0.0002956    found: e
TyrantWave     0.0004112    found: e
Roman Fursenko 0.0001428    found: e
Roman improved 0.0001277    found: e

index of e in the string :  242

s1 : 241 chars             s2 : 5000 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 5241

Janne Karila   0.0148231    found: r
eyquem         0.0053283    found: r
TyrantWave     0.0030166    found: r
Roman Fursenko 0.0007414    found: r
Roman improved 0.0007230    found: r

index of r in the string :  250

私はRomanのコードのおかげで何かを学びました：
s.replace()新しい文字列を作成し、そのため、それは遅い方法だと思いました。
しかし、どのような理由で、それは本当に速い方法かわかりません。

。

編集2

Oinの解決策は最悪です：

# Oin
from operator import itemgetter
seen = set()
only_appear_once = dict()
te = clock()
for i, x in enumerate(s):
  if x in seen and x in only_appear_once:
    only_appear_once.pop(x)
  else:
    seen.add(x)
    only_appear_once[x] = i
  fco = min(only_appear_once.items(),key=itemgetter(1))[0]
tf = clock()-te
print 'Oin            %.7f    found: %s' % (tf,fco)

結果

s1 == ''     len(s2) == 50
Oin            0.0007124    found: e
Janne Karila   0.0008057    found: e
eyquem         0.0001252    found: e
TyrantWave     0.0000712    found: e
Roman Fursenko 0.0000335    found: e
Roman improved 0.0000335    found: e

index of e in the string :  2


s1 : 237 chars             s2 : 50 chars
Oin            0.0029783    found: k
Janne Karila   0.0014714    found: k
eyquem         0.0002889    found: k
TyrantWave     0.0005598    found: k
Roman Fursenko 0.0001458    found: k
Roman improved 0.0001372    found: k

index of k in the string :  246


s1 : 236 chars             s2 : 5000 chars
Oin            0.0801739    found: e
Janne Karila   0.0155715    found: e
eyquem         0.0044623    found: e
TyrantWave     0.0027548    found: e
Roman Fursenko 0.0007255    found: e
Roman improved 0.0007199    found: e

index of e in the string :  244

score 3 · Accepted Answer

collections.Counter効率的にカウント（*）し、collections.OrderedDictアイテムが最初に表示された順序を記憶します。多重継承を使用して、利点を組み合わせてみましょう。

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    pass

def first_unique(iterable):
    c = OrderedCounter(iterable)
    for item, count in c.iteritems():
        if count == 1:
            return item

print first_unique('aabccbdcbe')
#d            
print first_unique('abccbdcbe')            
#a

Counterスーパークラスを使用しdictてカウントを保存します。メソッド解決順序の間およびメソッド解決順序でのclass OrderedCounter(Counter, OrderedDict)挿入を定義し、挿入順序を記憶する機能を追加します。OrderedDictCounterdict

（*）これはO（n）であり、その意味で効率的ですが、ベンチマークが示すように、最速のソリューションではありません。

score 0 · Accepted Answer

goodこれは、文字のbadセットと文字のセット（複数回表示される）を使用したアプローチです。

import timeit
import collections
import operator
import random

s = [chr(i) for i in range(ord('a'), ord('z')) for j in range(100)] + ['z']

random.shuffle(s)
s = ''.join(s)

def good_bad_sets(s):
    setbad = set()
    setgood = set()
    for char in s:
        if(char not in setbad):
            if(char in setgood):
                setgood.remove(char)
                setbad.add(char)
            else:
                setgood.add(char)
    return s[min([s.index(char) for char in setgood])] if len(s) > 0 else None

def app_once(s):
    seen = set()
    only_appear_once = set()
    for i in s:
      if i in seen:
        only_appear_once.discard(i)
      else:
        seen.add(i)
        only_appear_once.add(i)
    return s[min([s.index(char) for char in only_appear_once])] if len(only_appear_once) > 0 else None

print('Good bad sets: %ss' % timeit.Timer(lambda : good_bad_sets(s)).timeit(100))
print('Oin\'s approach: %ss' % timeit.Timer(lambda : app_once(s)).timeit(100))
print('LC: %ss' % timeit.Timer(lambda : [a for a in s if s.count(a) == 1][0]).timeit(100))

私はそれをLCアプローチと比較しました、そしてどこかでおよそ50文字goodとbadセットアプローチがより速くなります。このアプローチとOinおよびLCの比較：

Good bad sets: 0.0419239997864s
Oin's approach: 0.0803039073944s
LC: 0.647999048233s

score -1 · Accepted Answer

したがって、問題の定義から、O（n）ソリューションが必要であることは明らかです。つまり、リストを1回だけ確認する必要があります。カウントの形式を使用するすべてのソリューションは、その操作でリストを再度通過するため、間違っています。したがって、カウントを自分で追跡する必要があります。

その文字列に文字しか含まれていない場合は、ストレージについて心配する必要はなく、文字をdictのキーとして使用できます。そのdictの値は、文字列sの文字のインデックスになります。最後に、辞書の値の最小値を計算して、どれが最初であるかを確認する必要があります。これは、最初のリストよりも（おそらく）短いリストでのO（n）操作です。

合計は引き続きO（c * n）であるため、O（n）になります。

from operator import itemgetter

seen = set()
only_appear_once = dict()

for i, x in enumerate(s):
  if x in seen and x in only_appear_once:
    only_appear_once.pop(x)
  else:
    seen.add(x)
    only_appear_once[x] = i

first_count_of_one = only_appear_once[min(only_appear_once.values(), key=itemgetter(1))]

python - 文字列内の最初の非繰り返し文字を見つける最良の方法

7 に答える 7

編集

編集2

Related

Reference