python - シンプルな Python チャレンジ: データバッファーでの最速のビット単位 XOR

Question

チャレンジ：

2 つの同じサイズのバッファーに対してビットごとの XOR を実行します。strこれは伝統的に python のデータバッファーの型であるため、バッファーはpython 型である必要があります。結果の値をとして返しますstr。これをできるだけ速く行います。

入力は 2 つの 1 メガバイト (2**20 バイト) の文字列です。

課題は、python または既存のサードパーティの python モジュールを使用して、私の非効率的なアルゴリズムを大幅に打ち負かすことです (ルールの緩和: または独自のモジュールを作成します)。わずかな増加は役に立ちません。

from os import urandom
from numpy import frombuffer,bitwise_xor,byte

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

aa=urandom(2**20)
bb=urandom(2**20)

def test_it():
    for x in xrange(1000):
        slow_xor(aa,bb)

score 37 · Accepted Answer

パフォーマンスの比較: numpy と Cython と C と Fortran と Boost.Python (pyublas)

| function               | time, usec | ratio | type         |
|------------------------+------------+-------+--------------|
| slow_xor               |       2020 |   1.0 | numpy        |
| xorf_int16             |       1570 |   1.3 | fortran      |
| xorf_int32             |       1530 |   1.3 | fortran      |
| xorf_int64             |       1420 |   1.4 | fortran      |
| faster_slow_xor        |       1360 |   1.5 | numpy        |
| inline_xor             |       1280 |   1.6 | C            |
| cython_xor             |       1290 |   1.6 | cython       |
| xorcpp_inplace (int32) |        440 |   4.6 | pyublas      |
| cython_xor_vectorised  |        325 |   6.2 | cython       |
| inline_xor_nocopy      |        172 |  11.7 | C            |
| xorcpp                 |        144 |  14.0 | boost.python |
| xorcpp_inplace         |        122 |  16.6 | boost.python |
#+TBLFM: $3=@2$2/$2;%.1f

結果を再現するには、 http: //gist.github.com/353005をダウンロードして入力しますmake(依存関係をインストールするには、次のように入力します: sudo apt-get install build-essential python-numpy python-scipy cython gfortran、依存関係 forBoost.Pythonは、pyublas手動で操作する必要があるため、含まれていません) 。

どこ：

slow_xor()OPの質問からです
faster_slow_xor()、、inline_xor()は@Torsten Marekの回答inline_xor_nocopy()からのものです
cython_xor()@gnibblerの回答cython_vectorised()からのものです

そしてxor_$type_sig()、次のとおりです。

! xorf.f90.template
subroutine xor_$type_sig(a, b, n, out)
  implicit none
  integer, intent(in)             :: n
  $type, intent(in), dimension(n) :: a
  $type, intent(in), dimension(n) :: b
  $type, intent(out), dimension(n) :: out

  integer i
  forall(i=1:n) out(i) = ieor(a(i), b(i))

end subroutine xor_$type_sig

次のように Python から使用されます。

import xorf # extension module generated from xorf.f90.template
import numpy as np

def xor_strings(a, b, type_sig='int64'):
    assert len(a) == len(b)
    a = np.frombuffer(a, dtype=np.dtype(type_sig))
    b = np.frombuffer(b, dtype=np.dtype(type_sig))
    return getattr(xorf, 'xor_'+type_sig)(a, b).tostring()

`xorcpp_inplace()`(Boost.Python、pyublas):

xor.cpp :

#include <inttypes.h>
#include <algorithm>
#include <boost/lambda/lambda.hpp>
#include <boost/python.hpp>
#include <pyublas/numpy.hpp>

namespace { 
  namespace py = boost::python;

  template<class InputIterator, class InputIterator2, class OutputIterator>
  void
  xor_(InputIterator first, InputIterator last, 
       InputIterator2 first2, OutputIterator result) {
    // `result` migth `first` but not any of the input iterators
    namespace ll = boost::lambda;
    (void)std::transform(first, last, first2, result, ll::_1 ^ ll::_2);
  }

  template<class T>
  py::str 
  xorcpp_str_inplace(const py::str& a, py::str& b) {
    const size_t alignment = std::max(sizeof(T), 16ul);
    const size_t n         = py::len(b);
    const char* ai         = py::extract<const char*>(a);
    char* bi         = py::extract<char*>(b);
    char* end        = bi + n;

    if (n < 2*alignment) 
      xor_(bi, end, ai, bi);
    else {
      assert(n >= 2*alignment);

      // applying Marek's algorithm to align
      const ptrdiff_t head = (alignment - ((size_t)bi % alignment))% alignment;
      const ptrdiff_t tail = (size_t) end % alignment;
      xor_(bi, bi + head, ai, bi);
      xor_((const T*)(bi + head), (const T*)(end - tail), 
           (const T*)(ai + head),
           (T*)(bi + head));
      if (tail > 0) xor_(end - tail, end, ai + (n - tail), end - tail);
    }
    return b;
  }

  template<class Int>
  pyublas::numpy_vector<Int> 
  xorcpp_pyublas_inplace(pyublas::numpy_vector<Int> a, 
                         pyublas::numpy_vector<Int> b) {
    xor_(b.begin(), b.end(), a.begin(), b.begin());
    return b;
  }
}

BOOST_PYTHON_MODULE(xorcpp)
{
  py::def("xorcpp_inplace", xorcpp_str_inplace<int64_t>);     // for strings
  py::def("xorcpp_inplace", xorcpp_pyublas_inplace<int32_t>); // for numpy
}

次のように Python から使用されます。

import os
import xorcpp

a = os.urandom(2**20)
b = os.urandom(2**20)
c = xorcpp.xorcpp_inplace(a, b) # it calls xorcpp_str_inplace()

score 37 · Accepted Answer

初挑戦

SSE2組み込み関数を使用するscipy.weaveと、わずかな改善が得られます。コードをディスクからロードしてキャッシュする必要があるため、最初の呼び出しは少し遅くなりますが、その後の呼び出しは高速です。

import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa,bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    numpy.bitwise_xor(numpy.frombuffer(aa,dtype=numpy.uint64), b, b)
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
  xmm1 = _mm_loadu_si128(pa); // must use unaligned access 
  xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries
  _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
  ++pa;
  ++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers = ['"emmintrin.h"'])
    return b.tostring()

2 回目の試行

コメントを考慮して、コードを再検討して、コピーを回避できるかどうかを確認しました。文字列オブジェクトのドキュメントを間違って読んだことが判明したため、2 回目の試行を行います。

support = """
#define ALIGNMENT 16
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
    const char* end = in1 + n;
    while (in1 < end) {
       *out = *in1 ^ *in2;
       ++in1; 
       ++in2;
       ++out;
    }
}
"""

code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);

const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;

memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);

const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa);
    xmm2 = _mm_loadu_si128(pb);
    _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
    ++pc;
}
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""

def inline_xor_nocopy(aa, bb):
    real_size = len(aa)
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.frombuffer(bb, dtype=numpy.uint64)
    return weave.inline(code2, ["a", "b", "real_size"], 
                        headers = ['"emmintrin.h"'], 
                        support_code = support)

違いは、文字列が C コード内で割り当てられることです。SSE2 命令で要求されるように 16 バイト境界でアラインすることは不可能であるため、最初と最後のアラインされていないメモリ領域は、バイト単位のアクセスを使用してコピーされます。

Pythonオブジェクトをweaves . はコピーしないのでこれで問題ありませんが、メモリが 16 バイトにアラインされていないため、より高速なの代わりに使用する必要があります。strstd::stringfrombuffer_mm_loadu_si128_mm_load_si128

を使用する代わりに_mm_store_si128、を使用します_mm_stream_si128。これにより、すべての書き込みができるだけ早くメインメモリにストリーミングされます。この方法では、出力配列が貴重なキャッシュラインを使い果たすことはありません。

タイミング

タイミングに関してはslow_xor、最初の編集のエントリは私の改善されたバージョン (インラインビット単位 xor、uint64) を参照していたので、その混乱を取り除きました。slow_xor元の質問のコードを参照します。すべてのタイミングは、1000 回の実行に対して行われます。

slow_xor: 1.85秒 (1x)
faster_slow_xor：1.25秒（1.48倍）
inline_xor：0.95秒（1.95倍）
inline_xor_nocopy：0.32秒（5.78倍）

コードは gcc 4.4.3 を使用してコンパイルされており、コンパイラが実際に SSE 命令を使用していることを確認しました。

score 17 · Accepted Answer

これがcythonの私の結果です

slow_xor   0.456888198853
faster_xor 0.400228977203
cython_xor 0.232881069183
cython_xor_vectorised 0.171468019485

Cythonでのベクトル化により、コンピューターのforループが約25％削減されますが、Python文字列（returnステートメント）の作成に半分以上の時間が費やされています-配列のように（合法的に）余分なコピーを回避できるとは思いませんnullバイトが含まれています。

違法な方法は、Python文字列を渡してその場で変更し、関数の速度を2倍にすることです。

xor.py

from time import time
from os import urandom
from numpy import frombuffer,bitwise_xor,byte,uint64
import pyximport; pyximport.install()
import xor_

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

def faster_xor(aa,bb):
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

aa=urandom(2**20)
bb=urandom(2**20)

def test_it():
    t=time()
    for x in xrange(100):
        slow_xor(aa,bb)
    print "slow_xor  ",time()-t
    t=time()
    for x in xrange(100):
        faster_xor(aa,bb)
    print "faster_xor",time()-t
    t=time()
    for x in xrange(100):
        xor_.cython_xor(aa,bb)
    print "cython_xor",time()-t
    t=time()
    for x in xrange(100):
        xor_.cython_xor_vectorised(aa,bb)
    print "cython_xor_vectorised",time()-t

if __name__=="__main__":
    test_it()

xor_.pyx

cdef char c[1048576]
def cython_xor(char *a,char *b):
    cdef int i
    for i in range(1048576):
        c[i]=a[i]^b[i]
    return c[:1048576]

def cython_xor_vectorised(char *a,char *b):
    cdef int i
    for i in range(131094):
        (<unsigned long long *>c)[i]=(<unsigned long long *>a)[i]^(<unsigned long long *>b)[i]
    return c[:1048576]

score 10 · Accepted Answer

An easy speedup is to use a larger 'chunk':

def faster_xor(aa,bb):
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

with uint64 also imported from numpy of course. I timeit this at 4 milliseconds, vs 6 milliseconds for the byte version.

score 5 · Accepted Answer

最も高速なビット単位の XOR は "^" です。「bitwise_xor」よりもはるかに速く入力できます;-)

score 1 · Accepted Answer

配列データ型で高速な操作を行いたい場合は、Cython (cython.org) を試す必要があります。正しい宣言を行うと、純粋な C コードにコンパイルできるはずです。

score 1 · Accepted Answer

文字列としての答えがどれほど必要ですか? Python 文字列は不変 (かつ可変) であるため、このc.tostring()メソッドはデータを新しい文字列にコピーする必要があることに注意してください。Python 2.6 と 3.1 には型があり、変更可能であることを除いて( Python 3.x では) のように機能します。ccbytearraystrbytes

もう 1 つの最適化は、outパラメータ toを使用しbitwise_xorて、結果を格納する場所を指定することです。

私のマシンで私は得る

slow_xor (int8): 5.293521 (100.0%)
outparam_xor (int8): 4.378633 (82.7%)
slow_xor (uint64): 2.192234 (41.4%)
outparam_xor (uint64): 1.087392 (20.5%)

この投稿の最後にあるコードを使用してください。特に、事前に割り当てられたバッファーを使用する方法は、新しいオブジェクトを作成するよりも 2 倍高速であることに注意してください (4 バイト ( uint64) チャンクで操作する場合)。これは、チャンクごとに 2 つの操作 (xor + コピー) を実行する低速の方法と、高速の 1 (xor のみ) の方法と一致しています。

また、FWIWa ^ bはと同等bitwise_xor(a,b)であり、とa ^= b同等bitwise_xor(a, b, a)です。

したがって、外部モジュールを作成せずに5倍のスピードアップ:)

from time import time
from os import urandom
from numpy import frombuffer,bitwise_xor,byte,uint64

def slow_xor(aa, bb, ignore, dtype=byte):
    a=frombuffer(aa, dtype=dtype)
    b=frombuffer(bb, dtype=dtype)
    c=bitwise_xor(a, b)
    r=c.tostring()
    return r

def outparam_xor(aa, bb, out, dtype=byte):
    a=frombuffer(aa, dtype=dtype)
    b=frombuffer(bb, dtype=dtype)
    c=frombuffer(out, dtype=dtype)
    assert c.flags.writeable
    return bitwise_xor(a, b, c)

aa=urandom(2**20)
bb=urandom(2**20)
cc=bytearray(2**20)

def time_routine(routine, dtype, base=None, ntimes = 1000):
    t = time()
    for x in xrange(ntimes):
        routine(aa, bb, cc, dtype=dtype)
    et = time() - t
    if base is None:
        base = et
    print "%s (%s): %f (%.1f%%)" % (routine.__name__, dtype.__name__, et,
        (et/base)*100)
    return et

def test_it(ntimes = 1000):
    base = time_routine(slow_xor, byte, ntimes=ntimes)
    time_routine(outparam_xor, byte, base, ntimes=ntimes)
    time_routine(slow_xor, uint64, base, ntimes=ntimes)
    time_routine(outparam_xor, uint64, base, ntimes=ntimes)

score 0 · Accepted Answer

最速の方法（スピードワイズ）は、Max. S推奨。Cで実装します。

このタスクをサポートするコードは、かなり簡単に記述できるはずです。新しい文字列を作成し、xor を実行するモジュール内の 1 つの関数にすぎません。それで全部です。そのようなモジュールを 1 つ実装すると、コードをテンプレートとして簡単に使用できます。または、Python 用の単純な拡張モジュールを実装する他の誰かから実装されたモジュールを使用して、タスクに必要のないものをすべて破棄することさえあります。

本当に複雑な部分は、RefCounter-Stuff を正しく行うことです。しかし、それがどのように機能するかを理解すると、それは扱いやすくなります-また、手元のタスクは非常に単純であるためです（メモリを割り当てて、それを返します-パラメータには触れないでください（Ref-wise））。

score 0 · Accepted Answer

セージのビットセットの対称的な違いを試すことができます。

http://www.sagemath.org/doc/reference/sage/misc/bitset.html

python - シンプルな Python チャレンジ: データ バッファーでの最速のビット単位 XOR

11 に答える 11

パフォーマンスの比較: numpy と Cython と C と Fortran と Boost.Python (pyublas)

xorcpp_inplace()(Boost.Python、pyublas):

初挑戦

2 回目の試行

タイミング

Related

Reference

python - シンプルな Python チャレンジ: データバッファーでの最速のビット単位 XOR

`xorcpp_inplace()`(Boost.Python、pyublas):