python - How to check whether a byte in a UTF-8 bytearray is an ASCII letter (a-zA-Z) in Python

Question

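Given a UTF-8 bytearray, how can I check whether any individual byte falls in the a-zA-Z character range, knowing that these characters are represented by a single byte? Checking the integer value of the byte seems to be the fastest and safest approach: these characters correspond to the integer values of the ASCII alphabetic characters, they occupy one byte in UTF-8, and an individual byte of a multi-byte character never matches the integer value of any of them.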

This works for me, but is it the most efficient way?

def isAsciiAlphaByte(c):
    return ((c>96 and c<123) or (c> 64 and c<91))

isAsciiAlphaByte(b"abc"[0])
>>> True

score 2 · Accepted Answer

You can call .isalpha() on the whole bytearray and on its slices (including 1-byte slices).

>>> a = b"azAZ 123"
>>> b = bytearray(a)
>>> b.isalpha() # not all bytes in the array are ascii letters
False
>>> b[:4].isalpha() # but the first 4 bytes are alphabetic ([a-zA-Z])
True
>>> b[0:1].isalpha() # you need to use the slice notation even for a single byte
True

The above exploits the fact that, although utf-8 is a variable-width character encoding, the individual bytes of a multi-byte character never fall in the ascii letter range.
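That property can be checked directly. The following is a minimal sketch, assuming Python 3; the helper name `multibyte_utf8_bytes` and the sample string are made up for illustration and are not part of the original answer.

```python
def multibyte_utf8_bytes(text):
    """Collect the UTF-8 bytes of every character that needs more than one byte."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            out.extend(encoded)
    return out

# Hypothetical sample mixing ASCII letters with 2-, 3- and 4-byte characters.
sample = "aZ \u00e9 \u044f \u65e5\u672c \U0001F600"
multibyte = multibyte_utf8_bytes(sample)

# Lead and continuation bytes of multi-byte sequences all have the high bit set,
# so none of them can equal the value of an ASCII letter (65-90, 97-122).
assert all(byte >= 0x80 for byte in multibyte)

# Consequently no 1-byte slice taken from those bytes passes isalpha().
assert not any(multibyte[i:i + 1].isalpha() for i in range(len(multibyte)))
```

Both assertions pass silently, which is why a per-byte check is safe even on text that contains non-ASCII characters.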

It also assumes that the .isalpha() method of bytearray is not locale-dependent. For example, b"abа".isalpha() is locale-dependent on Python 2.

To test an individual byte:

>>> from curses.ascii import isalpha
>>> b[0]
97
>>> isalpha(b[0]) # it accepts either integer (byte) or a string
True
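The `curses` package may not be importable on every platform. As an alternative sketch (not from the original answer), the set of ASCII letter values can be built once from `string.ascii_letters` and each integer byte tested by membership. The names `ASCII_LETTER_BYTES` and `is_ascii_alpha_byte` are hypothetical, and Python 3 is assumed.

```python
import string

# Integer values of the 52 ASCII letters, computed once up front.
# Iterating over a bytes object in Python 3 yields integers.
ASCII_LETTER_BYTES = frozenset(string.ascii_letters.encode("ascii"))

def is_ascii_alpha_byte(c):
    """Return True if the integer byte value c is in a-z or A-Z."""
    return c in ASCII_LETTER_BYTES

b = bytearray(b"azAZ 123")
print(is_ascii_alpha_byte(b[0]))  # b[0] is 97, the letter "a"
print(is_ascii_alpha_byte(b[4]))  # b[4] is 32, a space
```

Indexing a bytearray yields an integer, so no slicing or conversion is needed before the lookup.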

score 1 · Accepted Answer

You can use reduce to reduce a sequence to a single value. Here we apply a binary and after calling str.isalpha on every byte of the bytearray.

ba = bytearray('test data')
reduce(lambda x,y: x and y, (chr(b).isalpha() for b in ba))

But really,

str(ba).isalpha()

would do the job.
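These snippets depend on Python 2 behavior, where `bytearray('test data')` accepts a plain string and `str(ba)` returns the raw bytes. A rough Python 3 equivalent, offered as a sketch under that assumption and with a made-up sample value, might look like this:

```python
from functools import reduce  # reduce is not a builtin in Python 3

ba = bytearray(b"testdata")  # hypothetical sample containing only letters

# Fold the per-byte results together with "and", as in the reduce approach.
via_reduce = reduce(lambda x, y: x and y, (chr(b).isalpha() for b in ba))

# all() expresses the same check and stops at the first failing byte.
via_all = all(chr(b).isalpha() for b in ba)

# bytearray has its own isalpha(), so no conversion to str is needed.
via_method = ba.isalpha()

print(via_reduce, via_all, via_method)
```

One difference is worth keeping in mind: in Python 3, `chr(b).isalpha()` is Unicode-aware, so a byte such as 0xC3 maps to "Ã" and counts as alphabetic, whereas `ba.isalpha()` accepts only ASCII letters. For UTF-8 data, the bytearray method is therefore the closer match to the a-zA-Z check asked about.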

score 0 · Accepted Answer

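This function looks like the faster solution: about 50% faster according to the timeit benchmark, and about 40% faster according to the cProfile benchmark. Either way, chr(b).isalpha() is very fast and saves you the trouble of writing a separate function, so both work fine.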

def isalphabyte(c):
    return ((c>96 and c<123) or (c> 64 and c<91))

>>> a = bytearray(b"azAZ 123")
>>> isalphabyte(a[0])
True
>>> isalphabyte(a[4])
False

>>> timeit.timeit('for i in range(1000000): chr(b"abc"[0]).isalpha()',number=1)
0.31040439769414263
>>> timeit.timeit('for i in range(1000000): isalphabyte(b"abc"[0])',"from __main__ import isalphabyte",number=1)
0.22895044913212814

>>> cProfile.run('for i in range(1000000): chr(b"abc"[0]).isalpha()')
         2000003 function calls in 0.571 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.364    0.364    0.571    0.571 <string>:1(<module>)
  1000000    0.156    0.000    0.156    0.000 {built-in method chr}
        1    0.000    0.000    0.571    0.571 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.051    0.000    0.051    0.000 {method 'isalpha' of 'str' objects}


>>> cProfile.run('for i in range(1000000): isalphabyte(b"abc"[0])')
         1000003 function calls in 0.335 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.133    0.000    0.133    0.000 <pyshell#74>:1(isalphabyte)
        1    0.202    0.202    0.335    0.335 <string>:1(<module>)
        1    0.000    0.000    0.335    0.335 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
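
The timings above also include the cost of the `for` loop and of indexing `b"abc"[0]` on every iteration. As a sketch of how the call alone could be measured (assuming Python 3.5+ for the `globals` argument of `timeit.repeat`; the helper `time_call` and the third statement are illustrative additions, not part of the original benchmark):

```python
import timeit

def isalphabyte(c):
    """Range check on the integer byte value, as in the answer above."""
    return (96 < c < 123) or (64 < c < 91)

def time_call(stmt, number=100_000, repeat=5):
    """Return the best of several timeit runs for stmt, in seconds."""
    # The byte value is looked up once here, so it is not part of the timing.
    namespace = {"isalphabyte": isalphabyte, "c": b"abc"[0]}
    return min(timeit.repeat(stmt, globals=namespace, number=number, repeat=repeat))

for stmt in ("chr(c).isalpha()", "isalphabyte(c)", "bytes((c,)).isalpha()"):
    print(f"{stmt:24s} {time_call(stmt):.4f} s")
```

Taking the minimum over several repeats reduces noise from other processes. The actual numbers depend on the interpreter version and the machine, so none are quoted here.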
