python - Python: Find first non-matching character

Question

In Python, when you want to obtain the index of the first occurrence of a substring or character within a string, you use something like this:

s.find("f")

However, I'd like to find the index of the first character within the string that does not match. Currently, I'm using the following:

iNum = 0
for i, c in enumerate(line):
  if(c != mark):
    iNum = i
    break

Is there a more efficient way to do this, such as a built-in function I don't know about?

score 9 · Accepted Answer

You could use a regular expression, for example:

>>> import re
>>> re.search(r'[^f]', 'ffffooooooooo').start()
4

[^f] matches any character except f, and the start() method of the Match object (returned by re.search()) gives the index at which the match occurred.

To make sure you can also handle empty strings, or strings that contain only f, check that the result of re.search() is not None, which is what it returns when the regex does not match. For example:

first_index = -1
match = re.search(r'[^f]', line)
if match:
    first_index = match.start()

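If you would rather not use a regex, you won't do much better than your current method. You could use something like next(i for i, c in enumerate(line) if c != mark), but you would need to wrap it in a try and except StopIteration block to handle empty lines or lines that consist only of mark characters.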
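The next() approach with its StopIteration guard can be sketched as below. The helper name first_mismatch and the choice of -1 as the "not found" value (mirroring str.find) are assumptions for illustration, not part of the original answer.

```python
def first_mismatch(line, mark):
    """Return the index of the first character of line that differs from mark, or -1."""
    try:
        # The generator is lazy, so scanning stops at the first mismatch.
        return next(i for i, c in enumerate(line) if c != mark)
    except StopIteration:
        # Raised when line is empty or consists only of mark characters.
        return -1


print(first_mismatch("ffffooooooooo", "f"))  # 4
print(first_mismatch("ffff", "f"))           # -1
print(first_mismatch("", "f"))               # -1
```

Passing a default as the second argument to next() avoids the try/except entirely, at the cost of an extra pair of parentheses around the generator expression.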

score 1 · Accepted Answer

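I had this same problem and timed the solutions posted here (except the map/list-comp ones from @wwii, which are significantly slower than any of the other options). I also added a Cython version of the original loop.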

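I wrote and tested all of these in Python v2.7. I was using byte strings (instead of Unicode strings). I am not sure whether the regular expression methods need anything different to work with byte strings in Python v3. The "mark" is hard-coded to be the null byte; this could easily be changed.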
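On the Python 3 point: the re module accepts a bytes pattern when the subject is also bytes, so the regex approach should carry over unchanged. What does differ is that iterating over a bytes object in Python 3 yields integers, so a plain loop has to compare against 0 rather than b'\0'. A minimal sketch, assuming Python 3; the function names are illustrative and not taken from the answer:

```python
import re

# Bytes pattern: matches any single byte other than the null byte.
_NON_NULL = re.compile(br'[^\0]')


def first_non_null_regex(s):
    match = _NON_NULL.search(s)
    return match.start() if match else -1


def first_non_null_loop(s):
    for i, c in enumerate(s):
        # In Python 3, c is an int (0-255), not a length-1 bytes object.
        if c != 0:
            return i
    return -1


s = (b'\x00' * 32) + (b'\x01' * 32)
print(first_non_null_regex(s))  # 32
print(first_non_null_loop(s))   # 32
```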

All of the methods return -1 if the entire byte string consists of null bytes. Everything was tested in IPython (lines starting with % are special IPython commands).

import re

def f1(s): # original version
    for i, c in enumerate(s):
        if c != b'\0': return i
    return -1

def f2(s): # @ChristopherMahan's version
    i = 0
    for c in s:
        if c != b'\0': return i
        i += 1
    return -1

def f3(s): # @AndrewClark's alternate version
    # modified to use optional default argument instead of catching StopIteration
    return next((i for i, c in enumerate(s) if c != b'\0'), -1)

def f4(s): # @AndrewClark's version
    match = re.search(br'[^\0]', s)
    return match.start() if match else -1

_re = re.compile(br'[^\0]')
def f5(s): # @AndrewClark's version w/ precompiled regular expression
    match = _re.search(s)
    return match.start() if match else -1

%load_ext cythonmagic
%%cython
# original version optimized in Cython
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def f6(bytes s):
    cdef Py_ssize_t i
    for i in xrange(len(s)):
        if s[i] != b'\0': return i
    return -1

Timing results:

s = (b'\x00' * 32) + (b'\x01' * 32) # test string

In [11]: %timeit f1(s) # original version
100000 loops, best of 3: 2.48 µs per loop

In [12]: %timeit f2(s) # @ChristopherMahan's version
100000 loops, best of 3: 2.35 µs per loop

In [13]: %timeit f3(s) # @AndrewClark's alternate version
100000 loops, best of 3: 3.07 µs per loop

In [14]: %timeit f4(s) # @AndrewClark's version
1000000 loops, best of 3: 1.91 µs per loop

In [15]: %timeit f5(s) # @AndrewClark's version w/ precompiled regular expression
1000000 loops, best of 3: 845 ns per loop

In [16]: %timeit f6(s) # original version optimized in Cython
1000000 loops, best of 3: 305 ns per loop

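Overall, @ChristopherMahan's version is slightly faster than the original (apparently enumerate is slower than keeping your own counter). The next method (@AndrewClark's alternate version) is slower than the original, even though it is essentially the same thing in one-line form.

Using a regular expression (@AndrewClark's version) is significantly faster than a loop, especially if you precompile the regex!

Finally, if you can use Cython, it is by far the fastest. The OP's concern that using a regex is slow is validated, but a loop in Python is even slower. The loop in Cython is quite fast.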
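For anyone without IPython, a measurement in the same spirit can be approximated with the standard timeit module. This is only a sketch of the harness: it reuses f5 and the test string from above, the loop count of 100000 is an arbitrary choice, and the numbers it prints will depend on the machine and Python version.

```python
import re
import timeit

_re = re.compile(br'[^\0]')


def f5(s):  # precompiled regular expression version from above
    match = _re.search(s)
    return match.start() if match else -1


s = (b'\x00' * 32) + (b'\x01' * 32)  # same test string as above

number = 100000  # calls per timing run (arbitrary)
# Take the best of 3 runs, as %timeit reports it.
best = min(timeit.repeat(lambda: f5(s), number=number, repeat=3))
print("f5: %.0f ns per loop" % (best / number * 1e9))
```

The lambda adds a small constant call overhead to every measurement, so this is better suited to comparing the functions with each other than to reading off absolute times.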

score 0 · Accepted Answer

Now I'm curious how these two would fare.

>>> # map with a partial function
>>> import functools
>>> import operator
>>> f = functools.partial(operator.eq, 'f')
>>> map(f, 'fffffooooo').index(False)
5
>>> # list comprehension
>>> [c == 'f' for c in 'ffffoooo'].index(False)
4
>>>
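Note that the map version above relies on Python 2, where map returns a list. In Python 3 map returns an iterator, which has no index method, so the result has to be materialized first. A sketch of a Python 3 equivalent, plus a lazy variant (an addition here, not part of the answer) that stops at the first mismatch instead of building the whole list:

```python
import functools
import operator

f = functools.partial(operator.eq, 'f')

# Python 3: wrap map() in list() before calling .index().
# .index(False) raises ValueError if every character matches.
print(list(map(f, 'fffffooooo')).index(False))  # 5

# Lazy variant: stops at the first mismatch and returns -1 if there is none.
print(next((i for i, same in enumerate(map(f, 'fffffooooo')) if not same), -1))  # 5
```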

score 0 · Accepted Answer

Here is a one-liner:

> print([a == b for (a_i, a) in enumerate("compare_me") for
(b_i, b) in enumerate("compar me") if a_i == b_i].index(False))
> 6
> "compare_me"[6]
> 'e'
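The one-liner above pairs every character of the first string with every character of the second and then filters on equal indices, so its cost grows with the product of the two lengths. The same comparison can be sketched with zip, which walks both strings in step. The name first_difference and the -1 return value are assumptions for illustration:

```python
def first_difference(a, b):
    """Return the first index at which a and b differ, or -1 if none is found."""
    # zip stops at the end of the shorter string, so a pure length
    # difference (one string being a prefix of the other) yields -1.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1


print(first_difference("compare_me", "compar me"))  # 6
```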
