python - In Python, how to check if a string only contains certain characters?

Question

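In Python, how to check if a string only contains certain characters?

I need to check that a string contains only a..z, 0..9, and . (period), and no other characters.

I could iterate over each character and check that it is a..z, 0..9, or a period, but that would be slow.

I am not sure how to do this with a regular expression.

Is this correct? Can you suggest a simpler regular expression or a more efficient approach?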

# Valid chars: . a-z 0-9
def check(test_str):
    import re
    # http://docs.python.org/library/re.html
    # re.search returns None if no position in the string matches the pattern
    # pattern to search for any character other than . a-z 0-9
    pattern = r'[^\.a-z0-9]'
    if re.search(pattern, test_str):
        # Character other than . a-z 0-9 was found
        print('Invalid : %r' % (test_str,))
    else:
        # No character other than . a-z 0-9 was found
        print('Valid   : %r' % (test_str,))

check(test_str='abcde.1')
check(test_str='abcde.1#')
check(test_str='ABCDE.12')
check(test_str='_-/>"!@#12345abcde<')

'''
Output:
>>> 
Valid   : 'abcde.1'
Invalid : 'abcde.1#'
Invalid : 'ABCDE.12'
Invalid : '_-/>"!@#12345abcde<'
'''

score 82 · Accepted Answer

Here is a simple, pure-Python implementation. Use it when performance is not critical (included for future Googlers).

import string
allowed = set(string.ascii_lowercase + string.digits + '.')

def check(test_str):
    # True when every character of test_str is in the allowed set
    return set(test_str) <= allowed

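Regarding performance, iteration is probably the fastest method. Regexes have to iterate through a state machine, and the set comparison solution has to build a temporary set. However, the difference is unlikely to matter much. If the performance of this function is very important, write it as a C extension module with a switch statement (which will be compiled to a jump table).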
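As a rough illustration of the three pure-Python approaches compared above (explicit iteration, a compiled regex, and a set comparison), here is a minimal sketch for Python 3. The function names are illustrative and do not come from the original answer. Timings depend on the machine and the input, so none are claimed here.

```python
import re
import string
import timeit

ALLOWED = frozenset(string.ascii_lowercase + string.digits + '.')
_INVALID = re.compile(r'[^a-z0-9.]')

def check_iter(s):
    """Explicit iteration: stop at the first disallowed character."""
    for ch in s:
        if ch not in ALLOWED:
            return False
    return True

def check_regex(s):
    """Regex: search for any character outside the allowed class."""
    return _INVALID.search(s) is None

def check_set(s):
    """Set comparison: builds a temporary set of the characters in s."""
    return set(s) <= ALLOWED

if __name__ == '__main__':
    sample = 'jsdlfjdsf12324..3432jsdflsdf'
    for fn in (check_iter, check_regex, check_set):
        seconds = timeit.timeit(lambda: fn(sample), number=100000)
        print('%-12s %.3f s' % (fn.__name__, seconds))
```

All three functions return the same answer for any input, including True for the empty string. Only the cost differs.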

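Here is a C implementation, which uses if statements due to space constraints. If you absolutely need that tiny bit of extra speed, write out the switch-case. In my tests it performs very well (2 seconds vs. 9 seconds in benchmarks against the regex).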

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *check(PyObject *self, PyObject *args)
{
        const char *s;
        Py_ssize_t count, ii;
        char c;
        if (0 == PyArg_ParseTuple (args, "s#", &s, &count)) {
                return NULL;
        }
        for (ii = 0; ii < count; ii++) {
                c = s[ii];
                if ((c < '0' && c != '.') || c > 'z') {
                        Py_RETURN_FALSE;
                }
                if (c > '9' && c < 'a') {
                        Py_RETURN_FALSE;
                }
        }

        Py_RETURN_TRUE;
}

PyDoc_STRVAR (DOC, "Fast stringcheck");
static PyMethodDef PROCEDURES[] = {
        {"check", (PyCFunction) (check), METH_VARARGS, NULL},
        {NULL, NULL}
};
PyMODINIT_FUNC
initstringcheck (void) {
        Py_InitModule3 ("stringcheck", PROCEDURES, DOC);
}

Include it in your setup.py:

from distutils.core import setup, Extension

setup(
    name='stringcheck',
    ext_modules=[
        Extension('stringcheck', ['stringcheck.c']),
    ],
)

Usage:

>>> from stringcheck import check
>>> check("abc")
True
>>> check("ABC")
False

score 53 · Accepted Answer

Final(?) edit

Answer, wrapped up in a function, with an annotated interactive session:

>>> import re
>>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
...     return not bool(search(strg))
...
>>> special_match("")
True
>>> special_match("az09.")
True
>>> special_match("az09.\n")
False
# The above test case is to catch out any attempt to use re.match()
# with a `$` instead of `\Z` -- see point (6) below.
>>> special_match("az09.#")
False
>>> special_match("az09.X")
False
>>>

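Note: There is a comparison with re.match() further down in this answer. Further timings show that match() wins with much longer strings. When the final answer is True, match() seems to have a much larger overhead than search(). This is puzzling (perhaps it is the cost of returning a MatchObject instead of None).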

==== Earlier text ====

The [previously] accepted answer could use a few improvements:

(1) The presentation gives the appearance of being the result of an interactive Python session:

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True

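but match() doesn't return True

(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^

(3) It should foster the use of raw strings automatically and unthinkingly for any re pattern

(4) The backslash in front of the dot/period is redundant

(5) It is slower than the OP's code!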

prompt>rem OP's version -- NOTE: OP used raw string!

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.43 usec per loop

prompt>rem OP's version w/o backslash

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.44 usec per loop

prompt>rem cleaned-up version of accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
100000 loops, best of 3: 2.07 usec per loop

prompt>rem accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
100000 loops, best of 3: 2.08 usec per loop

(6) It can produce the wrong answer!!

>>> import re
>>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
True # uh-oh
>>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
False
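Point (6) can be reproduced with a short sketch. It assumes Python 3.4 or later, where re.fullmatch is available. Note that re.fullmatch does not appear in the original answer; it is shown here only as one possible alternative to anchoring with \Z.

```python
import re

dollar = re.compile(r'[a-z0-9.]+$')
slash_z = re.compile(r'[a-z0-9.]+\Z')
plain = re.compile(r'[a-z0-9.]+')

s = '1234\n'

# `$` also matches just before a trailing newline, so this input is accepted.
print(bool(dollar.match(s)))          # True
# `\Z` matches only at the very end of the string.
print(bool(slash_z.match(s)))         # False
# fullmatch requires the whole string to match the pattern.
print(bool(plain.fullmatch(s)))       # False
print(bool(plain.fullmatch('1234')))  # True
```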

score 42 · Accepted Answer

A simpler approach? A little more Pythonic?

>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False

It certainly isn't the most efficient, but it is definitely readable.
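A minimal sketch of the same idea wrapped in a reusable function, adapted to the character set from the question (a-z, 0-9 and period). The names only_allowed and ALLOWED are illustrative. Using a frozenset instead of a string keeps each membership test at roughly constant cost.

```python
import string

ALLOWED = frozenset(string.ascii_lowercase + string.digits + '.')

def only_allowed(text, allowed=ALLOWED):
    """Return True if every character of text is in allowed."""
    return all(ch in allowed for ch in text)

print(only_allowed('abcde.1'))    # True
print(only_allowed('ABCDE.12'))   # False
print(only_allowed(''))           # True, because all() of an empty iterable is True
```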

score 16 · Accepted Answer

EDIT: Changed the regular expression to exclude A-Z

The regular expression solution is the fastest pure-Python solution so far

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902

Compared to other solutions:

>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447

If you want to allow empty strings, change it to:

reg=re.compile('^[a-z0-9\.]*$')
>>> bool(reg.match(''))
True

As requested, I am restoring the other part of the answer. Note, however, that the following accepts the A-Z range.

You can use isalnum

test_str.replace('.', '').isalnum()

>>> 'test123.3'.replace('.', '').isalnum()
True
>>> 'test123-3'.replace('.', '').isalnum()
False

EDIT: Using isalnum is much more efficient than the set solution

>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766

EDIT2: John gave an example where the above doesn't work. I changed the solution to overcome this special case by using encode

test_str.replace('.', '').encode('ascii', 'replace').isalnum()

It is still almost 3 times faster than the set solution.

timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318

In my opinion, using regular expressions is the best way to solve this problem
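The isalnum caveats mentioned above (uppercase and non-ASCII characters are accepted) can be illustrated with a small sketch. It assumes Python 3.7 or later, where str.isascii exists. The helper name check_isalnum is hypothetical, and the extra checks are one possible way to tighten the test, not the original author's method.

```python
def check_isalnum(text):
    """isalnum-based check restricted to a-z, 0-9 and '.'.

    Returns False when nothing is left after removing the dots
    (for example '' or '...'), because ''.isalnum() is False.
    """
    stripped = text.replace('.', '')
    return (
        stripped.isascii()                # reject non-ASCII letters and digits
        and stripped.isalnum()            # only letters and digits remain
        and stripped == stripped.lower()  # reject A-Z
    )

print('ABC'.isalnum())             # True: uppercase passes plain isalnum
print('\u0131\u0661'.isalnum())    # True: a non-ASCII letter and digit pass too
print(check_isalnum('test123.3'))  # True
print(check_isalnum('ABC'))        # False
print(check_isalnum('abc\u0131'))  # False
```

Unlike the regex with `*`, this variant rejects the empty string, so the two are not drop-in replacements for each other.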

score 5 · Accepted Answer

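This has already been answered satisfactorily, but for people coming across this after the fact, I have profiled several different ways of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

Here are my test implementations: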

import re

hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')

def test_set(input):
    return set(input) <= hex_digits

def test_not_any(input):
    return not any(c not in hex_digits for c in input)

def test_re_match1(input):
    return bool(re.compile(r'^[A-F0-9]+\Z').match(input))

def test_re_match2(input):
    return bool(hex_match.match(input))

def test_re_match3(input):
    return bool(re.match(r'^[A-F0-9]+\Z', input))

def test_re_search1(input):
    return not bool(re.compile(r'[^A-F0-9]').search(input))

def test_re_search2(input):
    return not bool(hex_search.search(input))

def test_re_search3(input):
    # module-level search (not match), so the whole string is scanned
    return not bool(re.search(r'[^A-F0-9]', input))

And the tests, in Python 3.4.0 on Mac OS X:

import cProfile
import pstats
import random

# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]

# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()

# run the test functions against each item in tests. 
# this takes a little time; be patient
for t in tests:
    for tf in [test_set, test_not_any, 
               test_re_match1, test_re_match2, test_re_match3,
               test_re_search1, test_re_search2, test_re_search3]:
        _ = tf(t)

# stop collecting stats
test_pr.disable()

# we create our own pstats.Stats object to filter 
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)

# normally, stats are printed with the format %8.3f, 
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
    return "%11.6f" % x

def _print_title(self):
    print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
    print('filename:lineno(function)', file=self.stream)

pstats.f8 = _f8
pstats.Stats.print_title = _print_title

# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")

Which gave the following results:

         50335004 function calls in 13.428 seconds

   Ordered by: cumulative time, function name
   List reduced from 20 to 8 due to restriction <'test_*'>

   ncalls     tottime     percall     cumtime     percall filename:lineno(function)
    10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
    10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
    10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
    10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
    10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
    10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
    10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
    10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)

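where:

ncalls: the number of times that function was called
tottime: the total time spent in the given function, excluding time spent in sub-functions
percall: the quotient of tottime divided by ncalls
cumtime: the cumulative time spent in this function and all sub-functions
percall: the quotient of cumtime divided by primitive calls

The columns we actually care about are cumtime and the second percall, as they show the actual time taken from function entry to exit. As we can see, regex match and search are not massively different.

If you would otherwise compile the regex on every call, it is faster not to bother compiling it at all. Compiling once is about 7.5% faster than compiling every time, but only about 2.5% faster than not compiling.

test_set was roughly 1.8 times slower than re_search and roughly 2.3 times slower than re_match (by cumtime)

test_not_any was a full order of magnitude slower than test_set

TL;DR: Use re.match or re.search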
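The three compile strategies compared above can also be timed in isolation with timeit. The following is a minimal sketch, not the original profiling setup. The function names and the sample string are illustrative. One likely reason the module-level call stays close to the precompiled one is that the re module keeps an internal cache of recently compiled patterns.

```python
import re
import timeit

HEX_SEARCH = re.compile(r'[^A-F0-9]')

def precompiled(s):
    """Pattern compiled once at import time."""
    return HEX_SEARCH.search(s) is None

def compile_each_call(s):
    """re.compile called on every invocation (hits the re cache)."""
    return re.compile(r'[^A-F0-9]').search(s) is None

def module_level(s):
    """Module-level re.search with a pattern string."""
    return re.search(r'[^A-F0-9]', s) is None

if __name__ == '__main__':
    sample = 'DEADBEEF0123456789' * 50
    for fn in (precompiled, compile_each_call, module_level):
        seconds = timeit.timeit(lambda: fn(sample), number=20000)
        print('%-18s %.4f s' % (fn.__name__, seconds))
```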

score 2 · Accepted Answer

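Hmm... If you need to compare sets of data, use Python sets. A string can be turned into a set of characters very quickly. Here I test whether a string consists only of characters allowed in a phone number. The first string is allowed, the second is not. It is fast and simple.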

In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()

Out[17]: 0.8106249139964348

In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 фыв');p.issubset(allowed)").timeit()

Out[18]: 0.9240323599951807

Never use regexps if you can avoid them.
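The set comparison above can be packaged as a small function. This is a sketch with hypothetical names (is_phone_like, ALLOWED_PHONE_CHARS). It builds the allowed set once instead of on every call, and it only checks which characters occur, not whether the string is a well-formed phone number.

```python
ALLOWED_PHONE_CHARS = frozenset('0123456789+-() ')

def is_phone_like(text):
    """Return True if text uses only characters allowed in a phone number."""
    return set(text).issubset(ALLOWED_PHONE_CHARS)

print(is_phone_like('+7(898) 64-901-63 '))     # True
print(is_phone_like('+7(950) 64-901-63 фыв'))  # False
```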

score 0 · Accepted Answer

In my case I also needed to check whether the string contained certain words (such as 'test' in this example), not just characters, so here is a different approach:

input_string = 'abc test'
input_string_test = input_string
allowed_list = ['a', 'b', 'c', 'test', ' ']

for allowed_list_item in allowed_list:
    input_string_test = input_string_test.replace(allowed_list_item, '')

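if not input_string_test:
    # test passed
    print('test passed')

So the allowed strings (characters or words) are cut from the input string. If the input string contains only allowed strings, an empty string is left over, so the check "if not input_string_test" passes.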
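One caveat of the replace loop is that removing one token can join the text around it into another allowed token. For example, 'teast' becomes 'test' once 'a' is removed, and is then accepted. A possible alternative, which is a different technique from the one in this answer, is a regex alternation over the allowed tokens. The sketch below assumes Python 3.4 or later for fullmatch, and its helper names are hypothetical.

```python
import re

def build_token_pattern(tokens):
    """Build a pattern matching any sequence of the given tokens."""
    # Try longer tokens first so that whole words are preferred.
    ordered = sorted(tokens, key=len, reverse=True)
    alternation = '|'.join(re.escape(tok) for tok in ordered)
    return re.compile('(?:%s)*' % alternation)

ALLOWED_TOKENS = ['a', 'b', 'c', 'test', ' ']
TOKEN_PATTERN = build_token_pattern(ALLOWED_TOKENS)

def only_allowed_tokens(text):
    """Return True if text is a concatenation of allowed tokens."""
    return TOKEN_PATTERN.fullmatch(text) is not None

print(only_allowed_tokens('abc test'))  # True
print(only_allowed_tokens('abc tesx'))  # False
print(only_allowed_tokens('teast'))     # False, although the replace loop accepts it
```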
