python - このカウントプログラムのパフォーマンスを改善するにはどうすればよいですか?

Question

ファイルは次のようになります。

最初のフィールドは ID ですin range(0, 200000000)。2 番目のフィールドはタイプを表し、これはin range(1, 5)です。また、タイプ 1 とタイプ 2 は共通のカテゴリS1に属し、タイプ 3 とタイプ 4 は共通のカテゴリに属しS2ます。1 つの ID に、タイプの異なる複数のレコードが含まれる場合があります。ファイルのサイズは約 200MB です。

問題は、タイプ 1 または 2 のレコードを持つ ID の数と、タイプ 3 または 4 のレコードを持つ ID の数を数えることです。

私のコード:

def gen(path):
    line_count = 0
    for line in open(path):
        tmp = line.split()
        id = int(tmp[0])
        yield id, int(tmp[1])

max_id = 200000000
S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
for id, type in gen(path):
    if type != 3 and type != 4:
        S1[id] = True
    else:
        S2[id] = True

print S1.count(), S2.count()

答えは出ますが、動作が少し遅いと思います。より速く実行するにはどうすればよいですか?

編集: ファイルに重複したレコードがあります。そして、S1 (タイプ 1 とタイプ 2) と S2 (タイプ 3 とタイプ 4) を区別する必要があるだけです。たとえば、1440927 1and1440927 2は 1 回だけカウントされますが、S1 に属しているため 2 回ではありません。そのため、ID を保存する必要があります。

score 3 · Accepted Answer

ファイルに対して反復子を使用しています。これは、一度に数行しかバッファリングしないことを意味します。バッファーが空になるたびに、ディスクをシークする必要があり、プログラムは待機する必要があります。

200MB はメモリに簡単に収まるので、すべての行を取得すると速度が向上します。

def gen(path):
    # load all the lines, 
    lines = open(path).readlines() 
    split = (line.split() for line in lines)
    return ((int(x), int(y)) for x,y in split)

score 2 · Accepted Answer

十分なメモリがある場合は、dict代わりに使用できますbitarray.bitarray。それはより速いかもしれません：

S1, S2 = {}, {} # dicts are slightly faster than `set()`
with open(path) as f:
     for i, line in enumerate(f, 1):
         id, sep, type = line.partition(" ")
         if type == "1" or type == "2":
            S1[id] = True
         elif type == "3" or type == "4":
            S2[id] = True
         else:
            print "WARNING: unknown type: %r in line %d: %r" % (type, i, line)
print len(S1), len(S2)

または、最初に行を並べ替えることができます。

def gettype(line):
    return line[-1]

S1, S2 = 0, 0
with open(path) as f:
     lines = f.read().splitlines()

lines.sort(key=gettype)
for type, group in itertools.groupby(lines, gettype):
    ids = (line.partition(" ")[0] for line in group)
    if type == "1" or type == "2":
       S1 += len(set(ids))
    elif type == "3" or type == "4":
       S2 += len(set(ids))
    else:
       assert 0, (type, list(ids))

print S1, S2

2 番目のアプローチの漸近的な複雑さはさらに悪化します。

line_profilerを使用して、ボトルネックがどこにあるかを調べることができます。

score 2 · Accepted Answer

あなたはPythonに縛られていますか？

egrep -e "[12]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l

egrep -e "[34]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l

これらの 2 つのコマンドは、filename.txt の各行の末尾にある ("1" または "2") と ("3" または "4") の出現回数をカウントしますが、最初のフィールドの重複は無視します。

おそらく Python よりも高速です…</p>

python - このカウント プログラムのパフォーマンスを改善するにはどうすればよいですか?

3 に答える 3

Related

Reference

python - このカウントプログラムのパフォーマンスを改善するにはどうすればよいですか?