python - このPythonログ解析コードを最適化する

Question

私のラップトップでの4.2GB入力ファイルのこのコードの実行時間は48秒です。入力ファイルはタブで区切られ、各値は引用符で囲まれています。各レコードは改行で終わります。例：'"val1"\t"val2"\t"val3"\t..."valn"\n'

10スレッドでマルチプロセッシングを使用してみました。1つは行をキューに入れ、8つは個々の行を解析して出力キューを埋め、もう1つは出力キューを以下に示すdefaultdictに減らしますが、コードの実行には300秒かかりました。次の6倍以上長い：

from collections import defaultdict
def get_users(log):
    users = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.
    for (i, line) in enumerate(f): 
        if i % 1000000 == 0: print "Line %d" % i # progress notification

        l = line.split('\t')
        if l[ix_profile] != '"7"': # "7" indicates a bad value
            # use list slicing to remove quotes
            users[l[ix_user][1:-1]] += 1 

    f.close()
    return users

forループからprintステートメント以外のすべてを削除して、I/Oバウンドではないことを確認しました。そのコードは9秒で実行されました。これは、このコードの実行速度の下限を検討します。

私はこれらの5GBのファイルを処理する必要があるので、実行時のわずかな改善（印刷物を削除できます！）でも役立ちます。私が実行しているマシンには4つのコアがあるので、マルチスレッド/マルチプロセスコードを上記のコードよりも高速に実行する方法があるかどうか疑問に思います。

アップデート：

マルチプロセッシングコードを次のように書き直しました。

from multiprocessing import Pool, cpu_count
from collections import defaultdict

def parse(line, ix_profile=10, ix_user=9):
    """ix_profile and ix_user predetermined; hard-coding for expedience."""
    l = line.split('\t')
    if l[ix_profile] != '"7"':
        return l[ix_user][1:-1]

def get_users_mp():
    f = open('20110201.txt')
    h = f.readline() # remove header line
    pool = Pool(processes=cpu_count())
    result_iter = pool.imap_unordered(parse, f, 100)
    users = defaultdict(int)
    for r in result_iter:
        if r is not None:
            users[r] += 1
    return users

それは26秒で実行され、1.85倍のスピードアップです。悪くはありませんが、4コアで、私が望んでいたほどではありません。

score 4 · Accepted Answer

正規表現を使用します。

テストでは、プロセスのコストのかかる部分がstr.split（）の呼び出しであると判断されます。おそらく、行ごとにリストと文字列オブジェクトの束を作成する必要があるのはコストがかかります。

まず、行と一致する正規表現を作成する必要があります。何かのようなもの：

expression = re.compile(r'("[^"]")\t("[^"]")\t')

expression.match（line）.groups（）を呼び出すと、最初の2つの列が2つの文字列オブジェクトとして抽出され、それらを直接使用してロジックを実行できます。

これは、対象の2つの列が最初の2つの列であると想定しています。そうでない場合は、正しい列に一致するように正規表現を微調整する必要があります。コードはヘッダーをチェックして、列が配置されている場所を確認します。これに基づいて正規表現を生成することもできますが、実際には列は常に同じ場所にあると思います。それらがまだ存在することを確認し、行に正規表現を使用するだけです。

編集

コレクションからimportdefaultdictimport re

def get_users(log):
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('\'', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')

    assert ix_user < ix_profile

このコードは、ユーザーがプロファイルの前にいることを前提としています

    keep_field = r'"([^"]*)"'

この正規表現は単一の列をキャプチャします

    skip_field = r'"[^"]*"'

この正規表現は列と一致しますが、結果はキャプチャされません。（括弧がないことに注意してください）

    fields = [skip_field] * len(h)
    fields[ix_profile] = keep_field
    fields[ix_user] = keep_field

すべてのフィールドのリストを作成し、関心のあるフィールドのみを保持します

    del fields[max(ix_profile, ix_user)+1:]

気になるフィールドの後にすべてのフィールドを削除します（一致するのに時間がかかり、気にしない）

    regex = re.compile(r"\t".join(fields))

実際に正規表現を作成します。

    users = defaultdict(int)
    for line in f:
        user, profile = regex.match(line).groups()

2つの値を引き出し、ロジックを実行します

        if profile != "7": # "7" indicates a bad value
            # use list slicing to remove quotes
            users[user] += 1 

    f.close()
    return users

score 2 · Accepted Answer

unixまたはcygwinを実行している場合、次の小さなスクリプトは、ユーザーIDのwhere profile！=7の頻度を生成します。

ユーザーIDをカウントするようにawkで更新

#!/bin/bash

FILENAME="test.txt"

IX_PROFILE=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.type | cut -f1`
IX_USER=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.id | cut -f1`
# Just the userids
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2

# userids counted:
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | sort | uniq -c

# Count using awk..?
sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | awk '{ count[$1]++; } END { for (x in count) { print x "\t" count[x] } }'

score 1 · Accepted Answer

ログファイルがタブで区切られていることを確認すると、csvモジュールをdialect='excel-tab'引数付きで使用して、パフォーマンスと読みやすさを向上させることができます。つまり、もちろん、はるかに高速なコンソールコマンドの代わりにPythonを使用する必要がある場合です。

score 1 · Accepted Answer

正規表現を使用すると、分割する必要のない行の末尾を無視することで速度を大幅に上げることができる場合は、おそらくより簡単なアプローチが役立つ可能性があります。

[snip)
ix_profile = h.index('profile.type')
ix_user = h.index('profile.id')
maxsplits = max(ix_profile, ix_user) + 1 #### new statement ####
# If either ix_* is the last field in h, it will include a newline. 
# That's fine for now.
for (i, line) in enumerate(f): 
    if i % 1000000 == 0: print "Line %d" % i # progress notification
    l = line.split('\t', maxsplits) #### changed line ####
[snip]

あなたのデータにそれを渦巻かせてください。

score 0 · Accepted Answer

Winston Ewertとほぼ同じアイデア、つまり正規表現を作成することを考えていたことに気づきました。

しかし、私の正規表現：

の場合ix_profile < ix_userと同様にix_profile > ix_user
正規表現はユーザーの列のみをキャプチャします。プロファイルの列は、この列に「7」'"(?!7")[^\t\r\n"]*"'が存在する場合は一致しないサブパターンと一致します。したがって、唯一のグループが定義された正しいユーザーのみを取得します

。

さらに、マッチングと抽出のいくつかのアルゴリズムをテストしました。

1）re.finditer（）を使用

2）re.match（）と40フィールドに一致する正規表現を使用

3）withe re.match（）およびmax（ix_profile、ix_user）+1フィールドのみに一致する正規表現

4）3と同様ですが、defaultdictインスタンスの代わりに単純な辞書を使用します

時間を測定するために、私のコードは、その内容に関してあなたが提供した情報に基づいてファイルを作成します。

。

次の4つの関数を4つのコードでテストしました。

1

def get_users_short_1(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = 40*['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]+"'
    glo[ix_user] = '"([^\t"]*)"'
    glo[39] = '"[^\t\r\n]*"'
    regx = re.compile('^'+'\t'.join(glo),re.MULTILINE)

    content = f.read()
    for mat in regx.finditer(content):
        users_short[mat.group(1)] += 1

    f.close()
    return users_short

2

def get_users_short_2(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = 40*['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]*"'
    glo[ix_user] = '"([^\t"]*)"'
    regx = re.compile('\t'.join(glo))


    for line in f:
        gugu = regx.match(line)
        if gugu:
            users_short[gugu.group(1)] += 1
    f.close()
    return users_short

3

def get_users_short_3(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]*"'
    glo[ix_user] = '"([^\t"]*)"'
    regx = re.compile('\t'.join(glo))

    for line in f:
        gugu = regx.match(line)
        if gugu:
            users_short[gugu.group(1)] += 1

    f.close()
    return users_short

4

最速のように見える完全なコード4：

import re
from random import choice,randint,sample
import csv
import random
from time import clock

choi = 1
if choi:
    ntot = 1000
    chars = 'abcdefghijklmnopqrstuvwxyz0123456789'
    def ry(a=30,b=80,chars=chars,nom='abcdefghijklmnopqrstuvwxyz'):
        if a==30:
            return ''.join(choice(chars) for i in xrange(randint(30,80)))
        else:
            return ''.join(choice(nom) for i in xrange(randint(8,12)))

    num = sample(xrange(1000),200)
    num.sort()
    print 'num==',num
    several = [e//3 for e in xrange(0,800,7) if e//3 not in num]
    print
    print 'several==',several

    with open('biggy.txt','w') as f:
        head = ('aaa','bbb','ccc','ddd','profile.id','fff','ggg','hhhh','profile.type','iiii',
                'jjj','kkkk','lll','mmm','nnn','ooo','ppp','qq','rr','ss',
                'tt','uu','vv','ww','xx','yy','zz','razr','fgh','ty',
                'kfgh','zer','sdfs','fghf','dfdf','zerzre','jkljkl','vbcvb','kljlk','dhhdh')
        f.write('\t'.join(head)+'\n')
        for i in xrange(1000):
            li = [ ry(a=8).join('""') if n==4 else ry().join('""')
                   for n in xrange(40) ]
            if i in num:
                li[4] = '@#~&=*;'
                li[8] = '"7"'
            if i in several:
                li[4] = '"BRAD"'
            f.write('\t'.join(li)+'\n')



from collections import defaultdict
def get_users(log):
    users = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.
    for (i, line) in enumerate(f): 
        #if i % 1000000 == 0: print "Line %d" % i # progress notification

        l = line.split('\t')
        if l[ix_profile] != '"7"': # "7" indicates a bad value
            # use list slicing to remove quotes

            users[l[ix_user][1:-1]] += 1 
    f.close()
    return users




def get_users_short_4(log):
    users_short = {}
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]*"'
    glo[ix_user] = '"([^\t"]*)"'
    regx = re.compile('\t'.join(glo))

    for line in f:
        gugu = regx.match(line)
        if gugu:
            gugugroup = gugu.group(1)
            if gugugroup in users_short:
                users_short[gugugroup] += 1
            else:
                users_short[gugugroup] = 1

    f.close()
    return users_short




print '\n\n'

te = clock()
USERS = get_users('biggy.txt')
t1 = clock()-te

te = clock()
USERS_short_4 = get_users_short_4('biggy.txt')
t2 = clock()-te



if choi:
    print '\nlen(num)==',len(num),' : number of lines with ix_profile==\'"7"\''
    print "USERS['BRAD']==",USERS['BRAD']
    print 'then :'
    print str(ntot)+' lines - '+str(len(num))+' incorrect - '+str(len(several))+\
          ' identical + 1 user BRAD = '+str(ntot - len(num)-len(several)+1)    
print '\nlen(USERS)==',len(USERS)
print 'len(USERS_short_4)==',len(USERS_short_4)
print 'USERS == USERS_short_4 is',USERS == USERS_short_4

print '\n----------------------------------------'
print 'time of get_users() :\n', t1,'\n----------------------------------------'
print 'time of get_users_short_4 :\n', t2,'\n----------------------------------------'
print 'get_users_short_4() / get_users() = '+str(100*t2/t1)+ ' %'
print '----------------------------------------'

このコード4の結果の1つは、次の例です。

num== [2, 12, 16, 23, 26, 33, 38, 40, 43, 45, 51, 53, 84, 89, 93, 106, 116, 117, 123, 131, 132, 135, 136, 138, 146, 148, 152, 157, 164, 168, 173, 176, 179, 189, 191, 193, 195, 199, 200, 208, 216, 222, 224, 227, 233, 242, 244, 245, 247, 248, 251, 255, 256, 261, 262, 266, 276, 278, 291, 296, 298, 305, 307, 308, 310, 312, 314, 320, 324, 327, 335, 337, 340, 343, 350, 356, 362, 370, 375, 379, 382, 385, 387, 409, 413, 415, 419, 433, 441, 443, 444, 446, 459, 462, 474, 489, 492, 496, 505, 509, 511, 512, 518, 523, 541, 546, 548, 550, 552, 558, 565, 566, 572, 585, 586, 593, 595, 601, 609, 610, 615, 628, 632, 634, 638, 642, 645, 646, 651, 654, 657, 660, 662, 665, 670, 671, 680, 682, 687, 688, 690, 692, 695, 703, 708, 716, 717, 728, 729, 735, 739, 741, 742, 765, 769, 772, 778, 790, 792, 797, 801, 808, 815, 825, 828, 831, 839, 849, 858, 859, 862, 864, 872, 874, 890, 899, 904, 906, 913, 916, 920, 923, 928, 941, 946, 947, 953, 955, 958, 959, 961, 971, 975, 976, 979, 981, 985, 989, 990, 999]

several== [0, 4, 7, 9, 11, 14, 18, 21, 25, 28, 30, 32, 35, 37, 39, 42, 44, 46, 49, 56, 58, 60, 63, 65, 67, 70, 72, 74, 77, 79, 81, 86, 88, 91, 95, 98, 100, 102, 105, 107, 109, 112, 114, 119, 121, 126, 128, 130, 133, 137, 140, 142, 144, 147, 149, 151, 154, 156, 158, 161, 163, 165, 170, 172, 175, 177, 182, 184, 186, 196, 198, 203, 205, 207, 210, 212, 214, 217, 219, 221, 226, 228, 231, 235, 238, 240, 249, 252, 254, 259, 263]




len(num)== 200  : number of lines with ix_profile=='"7"'
USERS['BRAD']== 91
then :
1000 lines - 200 incorrect - 91 identical + 1 user BRAD = 710

len(USERS)== 710
len(USERS_short_4)== 710
USERS == USERS_short_4 is True

----------------------------------------
time of get_users() :
0.0788686830309 
----------------------------------------
time of get_users_short_4 :
0.0462885646081 
----------------------------------------
get_users_short_4() / get_users() = 58.690677756 %
----------------------------------------

しかし、結果は多かれ少なかれ変動します。私が得ました：

get_users_short_1() / get_users() = 82.957476637 %
get_users_short_1() / get_users() = 82.3987686867 %
get_users_short_1() / get_users() = 90.2949842932 %
get_users_short_1() / get_users() = 78.8063007461 %
get_users_short_1() / get_users() = 90.4743181768 %
get_users_short_1() / get_users() = 81.9635560003 %
get_users_short_1() / get_users() = 83.9418269406 %
get_users_short_1() / get_users() = 89.4344442255 %


get_users_short_2() / get_users() = 80.4891442088 %
get_users_short_2() / get_users() = 69.921943776 %
get_users_short_2() / get_users() = 81.8006709304 %
get_users_short_2() / get_users() = 83.6270772928 %
get_users_short_2() / get_users() = 97.9821084403 %
get_users_short_2() / get_users() = 84.9307558629 %
get_users_short_2() / get_users() = 75.9384820018 %
get_users_short_2() / get_users() = 86.2964748485 %


get_users_short_3() / get_users() = 69.4332754744 %
get_users_short_3() / get_users() = 58.5814726668 %
get_users_short_3() / get_users() = 61.8011476831 %
get_users_short_3() / get_users() = 67.6925083362 %
get_users_short_3() / get_users() = 65.1208124156 %
get_users_short_3() / get_users() = 72.2621727569 %
get_users_short_3() / get_users() = 70.6957501222 %
get_users_short_3() / get_users() = 68.5310031226 %
get_users_short_3() / get_users() = 71.6529128259 %
get_users_short_3() / get_users() = 71.6153554073 %
get_users_short_3() / get_users() = 64.7899044975 %
get_users_short_3() / get_users() = 72.947531363 %
get_users_short_3() / get_users() = 65.6691965629 %
get_users_short_3() / get_users() = 61.5194374401 %
get_users_short_3() / get_users() = 61.8396133666 %
get_users_short_3() / get_users() = 71.5447862466 %
get_users_short_3() / get_users() = 74.6710538858 %
get_users_short_3() / get_users() = 72.9651233485 %



get_users_short_4() / get_users() = 65.5224210767 %
get_users_short_4() / get_users() = 65.9023813161 %
get_users_short_4() / get_users() = 62.8055210129 %
get_users_short_4() / get_users() = 64.9690049062 %
get_users_short_4() / get_users() = 61.9050866134 %
get_users_short_4() / get_users() = 65.8127125992 %
get_users_short_4() / get_users() = 66.8112344201 %
get_users_short_4() / get_users() = 57.865635278 %
get_users_short_4() / get_users() = 62.7937713964 %
get_users_short_4() / get_users() = 66.3440149528 %
get_users_short_4() / get_users() = 66.4429530201 %
get_users_short_4() / get_users() = 66.8692388625 %
get_users_short_4() / get_users() = 66.5949137537 %
get_users_short_4() / get_users() = 69.1708488794 %
get_users_short_4() / get_users() = 59.7129743801 %
get_users_short_4() / get_users() = 59.755297387 %
get_users_short_4() / get_users() = 60.6436352185 %
get_users_short_4() / get_users() = 64.5023727945 %
get_users_short_4() / get_users() = 64.0153937511 %

。

私のコードよりも確かに強力なコンピューターを使用して、実際のファイルのコードでどのような結果が得られるのか知りたいです。ニュースをください。

。

編集1

と

def get_users_short_Machin(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    maxsplits = max(ix_profile, ix_user) + 1
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.
    for line in f: 
        #if i % 1000000 == 0: print "Line %d" % i # progress notification
        l = line.split('\t', maxsplits)
        if l[ix_profile] != '"7"': # "7" indicates a bad value
            # use list slicing to remove quotes
            users_short[l[ix_user][1:-1]] += 1 
    f.close()
    return users_short

私が持っている

get_users_short_Machin() / get_users() = 60.6771821308 %
get_users_short_Machin() / get_users() = 71.9300992989 %
get_users_short_Machin() / get_users() = 85.1695214715 %
get_users_short_Machin() / get_users() = 72.7722233685 %
get_users_short_Machin() / get_users() = 73.6311173237 %
get_users_short_Machin() / get_users() = 86.0848484053 %
get_users_short_Machin() / get_users() = 75.1661981729 %
get_users_short_Machin() / get_users() = 72.8888452474 %
get_users_short_Machin() / get_users() = 76.7185685993 %
get_users_short_Machin() / get_users() = 82.7007096958 %
get_users_short_Machin() / get_users() = 71.1678957888 %
get_users_short_Machin() / get_users() = 71.9845835126 %

簡単なdictの使用：

users_short = {}
.......
for line in f: 
    #if i % 1000000 == 0: print "Line %d" % i # progress notification
    l = line.split('\t', maxsplits)
    if l[ix_profile] != '"7"': # "7" indicates a bad value
        # use list slicing to remove quotes
        us = l[ix_user][1:-1]
        if us not in users_short:
            users_short[us] = 1
        else:
            users_short[us] += 1

実行時間は少し改善されますが、前回のコードよりも高いままです4

get_users_short_Machin2() / get_users() = 71.5959919389 %
get_users_short_Machin2() / get_users() = 71.6118864535 %
get_users_short_Machin2() / get_users() = 66.3832514274 %
get_users_short_Machin2() / get_users() = 68.0026407277 %
get_users_short_Machin2() / get_users() = 67.9853921552 %
get_users_short_Machin2() / get_users() = 69.8946203037 %
get_users_short_Machin2() / get_users() = 71.8260030248 %
get_users_short_Machin2() / get_users() = 78.4243267003 %
get_users_short_Machin2() / get_users() = 65.7223734428 %
get_users_short_Machin2() / get_users() = 69.5903935612 %

。

編集2

最速：

def get_users_short_CSV(log):
    users_short = {}
    f = open(log,'rb')
    rid = csv.reader(f,delimiter='\t')
    # Read header line
    h = rid.next()
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t\r\n"]*"'
    glo[ix_user] = '"([^\t\r\n"]*)"'
    regx = re.compile('\t'.join(glo))

    for line in f:
        gugu = regx.match(line)
        if gugu:
            gugugroup = gugu.group(1)
            if gugugroup in users_short:
                users_short[gugugroup] += 1
            else:
                users_short[gugugroup] = 1

    f.close()
    return users_short

結果

get_users_short_CSV() / get_users() = 31.6443901114 %
get_users_short_CSV() / get_users() = 44.3536176134 %
get_users_short_CSV() / get_users() = 47.2295100511 %
get_users_short_CSV() / get_users() = 45.4912200716 %
get_users_short_CSV() / get_users() = 63.7997241038 %
get_users_short_CSV() / get_users() = 43.5020255488 %
get_users_short_CSV() / get_users() = 40.9188320386 %
get_users_short_CSV() / get_users() = 43.3105062139 %
get_users_short_CSV() / get_users() = 59.9184895288 %
get_users_short_CSV() / get_users() = 40.22047881 %
get_users_short_CSV() / get_users() = 48.3615872543 %
get_users_short_CSV() / get_users() = 47.0374831251 %
get_users_short_CSV() / get_users() = 44.5268626789 %
get_users_short_CSV() / get_users() = 53.1690205938 %
get_users_short_CSV() / get_users() = 43.4022458372 %

。

編集3

get_users_short_CSV（）を、1000行ではなく10000行でテストしました。

len(num)== 2000  : number of lines with ix_profile=='"7"'
USERS['BRAD']== 95
then :
10000 lines - 2000 incorrect - 95 identical + 1 user BRAD = 7906

len(USERS)== 7906
len(USERS_short_CSV)== 7906
USERS == USERS_short_CSV is True

----------------------------------------
time of get_users() :
0.794919186656 
----------------------------------------
time of get_users_short_CSV :
0.358942826532 
----------------------------------------
get_users_short_CSV() / get_users() = 41.5618307521 %

get_users_short_CSV() / get_users() = 42.2769300584 %
get_users_short_CSV() / get_users() = 45.154631132 %
get_users_short_CSV() / get_users() = 44.1596819482 %
get_users_short_CSV() / get_users() = 30.3192350266 %
get_users_short_CSV() / get_users() = 34.4856637748 %
get_users_short_CSV() / get_users() = 43.7461535628 %
get_users_short_CSV() / get_users() = 41.7577246935 %
get_users_short_CSV() / get_users() = 41.9092878608 %
get_users_short_CSV() / get_users() = 44.6772360665 %
get_users_short_CSV() / get_users() = 42.6770989413 %

score 0 · Accepted Answer

多分あなたはすることができます

users[l[ix_user]] += 1

それ以外の

users[l[ix_user][1:-1]] += 1

最後にあるdictの引用符を削除します。時間を節約する必要があります。

マルチスレッドアプローチの場合：毎回ファイルから数千行を読み取り、それらの千行をスレッドに渡して処理してみてください。行ごとにそれを行うことは、あまりにも多くのオーバーヘッドのようです。

または、彼があなたがやろうとしていることと非常によく似た何かをしているように見えるので、この記事の解決策を読んでください。

score 0 · Accepted Answer

これは要点から少し外れているかもしれませんが、Pythonは複数のスレッドを処理するときに非常に奇妙な動作をします（特にスレッドがIOバウンドされていない場合は悪いです）。具体的には、シングルスレッドの場合よりも実行速度が大幅に低下する場合があります。これは、Pythonのグローバルインタープリターロック（GIL）を使用して、Pythonインタープリターで常に1つのスレッドしか実行できないようにする方法によるものです。

常に1つのスレッドのみが実際にインタープリターを使用できるという制約があるため、複数のコアがあるという事実は役に立ちません。実際、GILを取得しようとする2つのスレッド間の病理学的相互作用により、事態はさらに悪化する可能性があります。Pythonに固執したい場合は、次の2つのオプションのいずれかがあります。

Python 3.2を使用してみてください（またはそれ以降、3.0は機能しません）。これは、GILを処理する方法が大きく異なり、多くの場合にマルチスレッドの速度低下の問題を修正します。print古いステートメントを使用しているため、Python3シリーズを使用していないと想定しています。
スレッドの代わりにプロセスを使用します。プロセスは開いているファイル記述子を共有するため、実際にファイルを食べ始めたら、プロセス間で状態を渡す必要はありません（本当に必要な場合は、パイプまたはメッセージを使用できます）。プロセスはスレッドよりも作成に時間がかかるため、これにより初期起動時間がいくらか長くなりますが、GILの問題は回避できます。

この素晴らしく難解なPythonの詳細については、次のページでGILに関連する講演を参照してください：http ：//www.dabeaz.com/talks.html 。

python - このPythonログ解析コードを最適化する

7 に答える 7

1

2

3

4

編集1

編集2

編集3

Related

Reference