python - What is the fastest way to select a column from a 2D array in Python?

Question

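I'm trying to improve the performance of a function that counts the occurrences of a particular element in one column of a 2D array in Python. The timings are from cProfile, which also shows that count() itself takes only 0.08 seconds across 357595 calls.

A for loop is the fastest (.375 seconds for 357595 calls):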

def count_column(grid, j, element):
    count = 0
    for x in range(0, len(grid)):
        if grid[x][j] == element:
            count += 1
    return count

A list comprehension is negligibly slower (.400 seconds for 357595 calls):

def count_column(grid, j, element):
    return [x[j] for x in grid].count(element)

zip is much slower (.741 seconds for 357595 calls):

def validate_column(grid, j, element):
    return zip(*grid)[j].count(element)

Is there a faster way to do this, or is flattening the array with chain.from_iterable the best approach?

score 1 · Accepted Answer

Here are the timings I got for a bunch of different variations:

cc_explicit 5000 0.00290012359619
cc_explicit_xrange 5000 0.00145506858826
cc_filter 5000 0.00117516517639
cc_genexp 5000 0.00100994110107
cc_ifilter 5000 0.00170707702637
cc_izip 1 3.21103000641
cc_listcomp 5000 0.000788927078247
cc_zip 5000 12.1080589294

The code and test driver are at http://pastebin.com/WSAUqTyv

Since zip and izip were so slow, I took them out of the equation and ran a 500000x10 test with the rest:

cc_explicit 500000 0.105982065201
cc_explicit_xrange 500000 0.103507995605
cc_filter 500000 0.0856108665466
cc_genexp 500000 0.0679898262024
cc_ifilter 500000 0.144966125488
cc_listcomp 500000 0.0396680831909

So the fastest solution was listcomp. But when I threw random data and larger rows at it, genexp and explicit_xrange both sometimes beat listcomp, and all three were pretty close in most of the tests. Since genexp also uses much less memory, I'd go with that:

def cc_genexp(grid, j, element):
    return sum(x[j] == element for x in grid)
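
For anyone who wants to rerun a comparison like this locally, here is a minimal timeit sketch. It is not the pastebin driver: the bodies of cc_explicit and cc_listcomp are guesses at what those names in the tables refer to, and make_grid and time_variants are hypothetical helpers added for illustration. It is written for Python 3, so the xrange, izip and ifilter variants are left out.

```python
import random
import timeit


def cc_explicit(grid, j, element):
    # Assumed shape of the explicit-loop variant (iterates rows directly).
    count = 0
    for row in grid:
        if row[j] == element:
            count += 1
    return count


def cc_listcomp(grid, j, element):
    # Builds the whole column as a list, then counts matches in it.
    return [row[j] for row in grid].count(element)


def cc_genexp(grid, j, element):
    # True counts as 1 and False as 0, so sum() yields the number of matches
    # without materializing the column.
    return sum(row[j] == element for row in grid)


def make_grid(rows, cols, seed=0):
    # Hypothetical helper: a rows x cols grid of random digits.
    rng = random.Random(seed)
    return [[rng.randint(0, 9) for _ in range(cols)] for _ in range(rows)]


def time_variants(grid, j, element, repeat=5):
    # Hypothetical helper: best-of-`repeat` wall time for one call per variant.
    results = {}
    for func in (cc_explicit, cc_listcomp, cc_genexp):
        timer = timeit.Timer(lambda f=func: f(grid, j, element))
        results[func.__name__] = min(timer.repeat(repeat=repeat, number=1))
    return results


if __name__ == "__main__":
    grid = make_grid(50000, 10)
    for name, seconds in sorted(time_variants(grid, 3, 7).items()):
        print(name, len(grid), seconds)
```

The ranking can easily differ between interpreter versions and data shapes, so treat any numbers it prints as specific to your machine rather than as a confirmation of the tables above.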

score 0 · Accepted Answer

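If I may offer my two cents here: as already suggested, you should also look into NumPy. Or, if you don't want to deal with non-standard libraries, check out collections.Counter(). Yes, it has a hefty upfront cost, but if you find yourself counting many different values in the same data set, that initial investment can pay off.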
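
A rough sketch of the Counter idea, assuming the grid is rectangular and does not change between lookups. build_column_counters is a hypothetical helper name, not something from the standard library.

```python
from collections import Counter


def build_column_counters(grid):
    # zip(*grid) transposes rows into columns once; this is the upfront cost.
    # Note that zip stops at the shortest row, so ragged grids lose data.
    return [Counter(column) for column in zip(*grid)]


grid = [
    [1, 2, 3],
    [1, 5, 3],
    [4, 2, 3],
]

column_counts = build_column_counters(grid)

# After the one-time build, each lookup is a dictionary access.
print(column_counts[0][1])  # 2
print(column_counts[2][3])  # 3
print(column_counts[1][9])  # 0, since Counter returns 0 for missing keys
```

The counters have to be rebuilt (or updated by hand) whenever the grid changes, so this only helps when many counts are taken against the same data.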
