python - Boolean operations on scipy.sparse matrices

Question

I have a set of sparse matrices filled with boolean values that I need to perform logical operations on (mostly element-wise OR).

as in numpy, summing matrices with dtype='bool' gives the element-wise OR, however there's a nasty side-effect:

>>> from scipy import sparse
>>> [a,b] = [sparse.rand(5,5,density=0.1,format='lil').astype('bool')
...  for x in range(2)]
>>> b
<5x5 sparse matrix of type '<class 'numpy.bool_'>'
    with 2 stored elements in LInked List format>
>>> a+b
<5x5 sparse matrix of type '<class 'numpy.int8'>'
    with 4 stored elements in Compressed Sparse Row format>

The data type gets changed to 'int8', which causes problems for future operations. This could be gotten around with by saying:

(a+b).astype('bool')

But I get the impression that all this type changing would cause a performance hit.

Why is the dtype of the result different from the operands?
And is there a better way to do logical operations on sparse matrices in python?

score 5 · Accepted Answer

スパース行列では論理演算はサポートされていませんが、「ブール値」に戻すことはそれほど費用がかかりません。実際、LIL形式の行列を使用している場合、パフォーマンスの変動により、変換にマイナスの時間がかかるように見える場合があります。

a = scipy.sparse.rand(10000, 10000, density=0.001, format='lil').astype('bool')
b = scipy.sparse.rand(10000, 10000, density=0.001, format='lil').astype('bool')

In [2]: %timeit a+b
10 loops, best of 3: 61.2 ms per loop

In [3]: %timeit (a+b).astype('bool')
10 loops, best of 3: 60.4 ms per loop

LIL行列を追加する前にCSR形式に変換されていることに気付いたかもしれませんが、戻り形式を見てください。そもそもCSR形式をすでに使用している場合は、変換のオーバーヘッドがより顕著になります。

In [14]: %timeit a+b
100 loops, best of 3: 2.28 ms per loop

In [15]: %timeit (a+b).astype(bool)
100 loops, best of 3: 2.96 ms per loop

CSR（およびCSC）行列にはdata、スパース行列の実際の非ゼロエントリを保持する1D配列である属性があるため、スパース行列を再キャストするコストは、行列の非ゼロエントリの数に依存します。その大きさ：

a = scipy.sparse.rand(10000, 10000, density=0.0005, format='csr').astype('int8')
b = scipy.sparse.rand(1000, 1000, density=0.5, format='csr').astype('int8')

In [4]: %timeit a.astype('bool') # a is 10,000x10,000 with 50,000 non-zero entries
10000 loops, best of 3: 93.3 us per loop

In [5]: %timeit b.astype('bool') # b is 1,000x1,000 with 500,000 non-zero entries
1000 loops, best of 3: 1.7 ms per loop

score 4 · Accepted Answer

You can easily express Boolean operations by the following means. Then it works with sparse matrices.

a.multiply(b) #AND
a+b           #OR
(a>b)+(a<b)   #XOR
a>b           #NOT

So Boolean operations are supported.

python - Boolean operations on scipy.sparse matrices

2 に答える 2

Related

Reference