python - Pythonでのオブジェクトへの文字列のペアリングの高速化

Question

整数点を含むデータの行をペアにして、Pythonオブジェクトとして保存する効率的な方法を見つけようとしています。データは、コンマ区切りの文字列として表される座標点で構成さXれます。Yポイントは、などのようにペアにしてから(x_1, y_1), (x_2, y_2), ...、オブジェクトのリストとして保存する必要があります。各ポイントはオブジェクトです。以下の関数get_dataは、このサンプルデータを生成します。

def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 10)) for x in range(M)],
                [str(random.randint(1, 10)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data

私が今持っている解析コードは次のとおりです。

class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test():
    import time
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)

現在、これは100,000ポイントで7秒近くかかりますが、これは非常に非効率的です。first_points非効率性の一部は、、、second_pointsおよびpaired_points-の計算とこれらのオブジェクトへの変換に起因するようです。

非効率性のもう1つの部分は、の蓄積であるようですall_point_sets。行を削除するall_point_sets.append(...)と、コードが約7秒から2秒になるようです。

これをどのようにスピードアップできますか？ありがとう。

フォローアップみんなの素晴らしい提案に感謝します-それらはすべて役に立ちました。ただし、すべての改善を行っても、100,000エントリを処理するのに約3秒かかります。この場合、それが単なる瞬間ではない理由と、それを瞬間的にする代替表現があるかどうかはわかりません。これをCythonでコーディングすると、状況が変わりますか？誰かがその例を提供できますか？再度、感謝します。

score 20 · Accepted Answer

多数のオブジェクトの作成を処理する場合、多くの場合、使用できる最大のパフォーマンス向上は、ガベージコレクターをオフにすることです。オブジェクトの「世代」ごとに、ガベージコレクターはメモリ内のすべてのライブオブジェクトをトラバースし、サイクルの一部であるがライブオブジェクトによってポイントされていないオブジェクトを探します。したがって、メモリ再利用の対象となります。いくつかの情報については、 DougHelmannのPyMOTWGCの記事を参照してください（詳細については、Googleといくつかの決定で見つけることができます）。ガベージコレクターは、デフォルトでは、作成されたが再利用されていないオブジェクトが700程度ごとに実行され、後続の世代の実行頻度はやや少なくなります（正確な詳細は忘れています）。

Pointクラスの代わりに標準のタプルを使用すると時間を節約でき（名前付きタプルを使用するとその中間になります）、賢いアンパックを使用すると時間を節約できますが、ロットを作成する前にgcをオフにすることで最大の利益を得ることができます知っているオブジェクトのgcを実行する必要はなく、後でオンに戻す必要はありません。

いくつかのコード：

def orig_test_gc_off():
    import time
    data = get_data()
    all_point_sets = []
    import gc
    gc.disable()
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "gc off total time: ", (time_end - time_start)

def test1():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    gc.disable()
    for index, row in enumerate(data):
        first_points, second_points = row
        curr_points = map(
            Point,
            [int(i) for i in first_points.split(",")],
            [int(i) for i in second_points.split(",")])
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 1 total time: ", (time_end - time_start)

def test2():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    gc.disable()
    time_start = time.time()
    for index, row in enumerate(data):
        first_points, second_points = row
        first_points = [int(i) for i in first_points.split(",")]
        second_points = [int(i) for i in second_points.split(",")]
        curr_points = [(x, y) for x, y in zip(first_points, second_points)]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 2 total time: ", (time_end - time_start)

orig_test()
orig_test_gc_off()
test1()
test2()

いくつかの結果：

>>> %run /tmp/flup.py
total time:  6.90738511086
gc off total time:  4.94075202942
variant 1 total time:  4.41632509232
variant 2 total time:  3.23905301094

score 15 · Accepted Answer

pypyで実行するだけで大きな違いが生まれます

$ python pairing_strings.py 
total time:  2.09194397926
$ pypy pairing_strings.py 
total time:  0.764246940613

gcを無効にするとpypyには役立ちませんでした

$ pypy pairing_strings.py 
total time:  0.763386964798

ポイントのnamedtupleはそれを悪化させます

$ pypy pairing_strings.py 
total time:  0.888827085495

itertools.imap、およびitertools.izipを使用する

$ pypy pairing_strings.py 
total time:  0.615751981735

メモ化されたバージョンのintとイテレータを使用してzipを回避する

$ pypy pairing_strings.py 
total time:  0.423738002777

これが私が完成させたコードです。

def test():
    import time
    def m_int(s, memo={}):
        if s in memo:
            return memo[s]
        else:
            retval = memo[s] = int(s)
            return retval
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for xs, ys in data:
        point_set = []
        # Convert points from strings to integers
        y_iter = iter(ys.split(","))
        curr_points = [Point(m_int(i), m_int(next(y_iter))) for i in xs.split(",")]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)

score 9 · Accepted Answer

私は...するだろう

numpyこの問題には配列を使用します（Cythonこれでも十分な速度が得られない場合は、オプションになります）。
ポイントを単一のPointインスタンスとしてではなく、ベクトルとして保存します。
既存のパーサーに依存する
（可能であれば）データを一度解析してから、hdf5のようなバイナリ形式で保存してさらに計算します。これが最速のオプションになります（以下を参照）。

Numpyには、たとえばテキストファイルを読み取るための関数が組み込まれていますloadtxt。構造化配列にデータを格納している場合、必ずしも別のデータ型に変換する必要はありません。の上に構築されたライブラリであるPandasを使用しnumpyます。構造化データの処理と処理にはもう少し便利です。Pandas独自のファイルパーサーがありますread_csv。

それを計るために、私はあなたの元の問題のように（それはあなたに基づいていますget_data）ファイルにデータを書きました：

import numpy as np
import pandas as pd

def create_example_file(n=100000, m=20):
    ex1 = pd.DataFrame(np.random.randint(1, 10, size=(10e4, m)),
                       columns=(['x_%d' % x for x in range(10)] +
                                ['y_%d' % y for y in range(10)]))
    ex1.to_csv('example.csv', index=False, header=False)
    return

これは、pandas.DataFrame：のデータを読み取るために使用したコードです。

def with_read_csv(csv_file):
    df = pd.read_csv(csv_file, header=None,
                     names=(['x_%d' % x for x in range(10)] +
                            ['y_%d' % y for y in range(10)]))
    return df

（ファイルにヘッダーがないため、列名を作成する必要があると想定したことに注意してください。）

データの読み取りは高速で、メモリ効率が高くなり（この質問を参照）、データはデータ構造に格納され、高速でベクトル化された方法でさらに操作できます。

In [18]: %timeit string_to_object.with_read_csv('example.csv')
1 loops, best of 3: 553 ms per loop

私のシステムでは414ミリ秒かかる開発ブランチに新しいCベースのパーサーがあります。Point私のシステムではテストに2.29秒かかりますが、データがファイルから読み取られず、インスタンスを作成したため、実際には比較できません。

一度データを読み込んだことがある場合は、それをhdf5ファイルに保存できます。

In [19]: store = pd.HDFStore('example.h5')

In [20]: store['data'] = df

In [21]: store.close()

次回データが必要になったときは、このファイルからデータを読み取ることができます。これは非常に高速です。

In [1]: store = pd.HDFStore('example.h5')

In [2]: %timeit df = store['data']
100 loops, best of 3: 16.5 ms per loop

ただし、同じデータが複数回必要な場合にのみ適用されます。

大きなデータセットでベースの配列を使用numpyすると、さらに計算を行うときに利点があります。Cythonベクトル化された関数とインデックス付けを使用できる場合は必ずしも高速numpyではありませんが、本当に反復が必要な場合は高速になります（この回答も参照してください）。

score 8 · Accepted Answer

Numpyを使用したより高速な方法（約7倍のスピードアップ）：

import numpy as np
txt = ','.join(','.join(row) for row in data)
arr = np.fromstring(txt, dtype=int, sep=',')
return arr.reshape(100000, 2, 10).transpose((0,2,1))

パフォーマンスの比較：

def load_1(data):
    all_point_sets = []
    gc.disable()
    for xs, ys in data:
        all_point_sets.append(zip(map(int, xs.split(',')), map(int, ys.split(','))))
    gc.enable()
    return all_point_sets

def load_2(data):
    txt = ','.join(','.join(row) for row in data)
    arr = np.fromstring(txt, dtype=int, sep=',')
    return arr.reshape(100000, 2, 10).transpose((0,2,1))

load_1私のマシンでは1.52秒で実行されます。0.20load_2秒で実行され、7倍の改善。ここでの大きな注意点は、（1）すべての長さを事前に知っていること、および（2）すべての行にまったく同じ数のポイントが含まれていることです。これは出力には当てはまりますが、実際のデータセットには当てはまらない場合があります。get_data

score 7 · Accepted Answer

配列と、アクセス時にPointオブジェクトを遅延構築するホルダーオブジェクトを使用することで、50％の改善が得られました。また、ストレージ効率を高めるために、Pointオブジェクトを「スロット化」しました。ただし、タプルの方がおそらく良いでしょう。

可能であれば、データ構造を変更することも役立つ場合があります。しかし、これは決して瞬時ではありません。

from array import array

class Point(object):
    __slots__ = ["a", "b"]
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __repr__(self):
        return "Point(%d, %d)" % (self.a, self.b)

class Points(object):
    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def __getitem__(self, i):
        return Point(self.xs[i], self.ys[i])

def test3():
    xs = array("i")
    ys = array("i")
    time_start = time.time()
    for row in data:
        xs.extend([int(val) for val in row[0].split(",")])
        ys.extend([int(val) for val in row[1].split(",")])
    print ("total time: ", (time.time() - time_start))
    return Points(xs, ys)

しかし、大量のデータを処理するときは、通常、numpyのN次元配列（ndarray）を使用します。元のデータ構造を変更できるとしたら、それが最も速いでしょう。x、yペアを線形に読み取り、ndarrayの形状を変更するように構成できる場合。

score 6 · Accepted Answer

（〜10％のスピードアップ）を作成Pointします：namedtuple
```
from collections import namedtuple
Point = namedtuple('Point', 'a b')
```
反復中に解凍します（約2〜4％のスピードアップ）：
```
for xs, ys in data:
```
nzipを回避するために-argument形式を使用しますmap（約10％の高速化）：
```
curr_points = map(Point,
    map(int, xs.split(',')),
    map(int, ys.split(',')),
)
```

ポイントセットが短いことを考えると、ジェネレーターは固定オーバーヘッドが高いため、おそらくやり過ぎです。

score 6 · Accepted Answer

cythonは5.5倍スピードアップすることができます

$ python split.py
total time:  2.16252303123
total time:  0.393486022949

これが私が使用したコードです

split.py

import time
import pyximport; pyximport.install()
from split_ import test_


def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 100)) for x in range(M)],
                [str(random.randint(1, 100)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data

class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test(data):
    all_point_sets = []
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    return all_point_sets

data = get_data()
for func in test, test_:
    time_start = time.time()
    res = func(data)
    time_end = time.time()
    print "total time: ", (time_end - time_start)

split_.pyx

from libc.string cimport strsep
from libc.stdlib cimport atoi

cdef class Point:
    cdef public int a,b

    def __cinit__(self, a, b):
        self.a = a
        self.b = b

def test_(data):
    cdef char *xc, *yc, *xt, *yt
    cdef char **xcp, **ycp
    all_point_sets = []
    for xs, ys in data:
        xc = xs
        xcp = &xc
        yc = ys
        ycp = &yc
        point_set = []
        while True:
            xt = strsep(xcp, ',')
            if xt is NULL:
                break
            yt = strsep(ycp, ",")
            point_set.append(Point(atoi(xt), atoi(yt)))
        all_point_sets.append(point_set)
    return all_point_sets

さらに突っ込んで、CPUリソースの一部をほぼ分解できます

         5% strsep()
         9% atoi()
        23% creating Point instances
        35% all_point_sets.append(point_set)

CythonがPythonオブジェクトをトロールする代わりに、csv（またはその他）ファイルから直接読み取ることができれば、改善が見込めると思います。

score 2 · Accepted Answer

.x自分の座標に.y属性としてアクセスできるようにすることにどの程度執着していますか？驚いたことに、私のテストでは、最大のシングルタイムシンクはへの呼び出しではなくlist.append()、オブジェクトの構築であったことが示されていPointます。タプルの4倍の時間がかかり、たくさんあります。Point(int(x), int(y))コード内のタプルに置き換えるだけで、合計(int(x), int(y))実行時間が50％以上短縮されます（WinXPのPython2.6）。おそらく、現在のコードにはまだこれを最適化する余地がありますか？

.xとを使用して座標にアクセスすることに本当に慣れている場合は.y、を使用してみてくださいcollections.namedtuple。プレーンタプルほど高速ではありませんが、コードのPairクラスよりもはるかに高速であるようです（別のタイミングベンチマークで奇妙な結果が得られたため、ヘッジしています）。

Pair = namedtuple("Pair", "x y")  # instead of the Point class
...
curr_points = [ Pair(x, y) for x, y in paired_points ]

このルートを使用する必要がある場合は、タプルからクラスを派生させることも効果があります（プレーンタプルよりも最小限のコスト）。ご要望があれば詳細をお知らせします。

PS @MattAndersonがオブジェクトタプルの問題についてずっと前に言及したのを私は見ます。ただし、ガベージコレクションを無効にする前であっても、これは（少なくとも私のボックスでは）大きな影響です。

               Original code: total time:  15.79
      tuple instead of Point: total time:  7.328
 namedtuple instead of Point: total time:  9.140

score 2 · Accepted Answer

あなたは数秒を剃ることができます：

class Point2(object):
    __slots__ = ['a','b']
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test_new(data):
    all_point_sets = []
    for row in data:
        first_points, second_points = row
        r0 = map(int, first_points.split(","))
        r1 = map(int, second_points.split(","))
        cp = map(Point2, r0, r1)
        all_point_sets.append(cp)

それは私に与えた

In [24]: %timeit test(d)
1 loops, best of 3: 5.07 s per loop

In [25]: %timeit test_new(d)
1 loops, best of 3: 3.29 s per loop

スペースを事前に割り当てることで、断続的にさらに0.3秒短縮できましたall_point_setsが、それは単なるノイズである可能性があります。そしてもちろん、物事をより速くする昔ながらの方法があります：

localhost-2:coding $ pypy pointexam.py
1.58351397514

score 2 · Accepted Answer

データはタブ区切りのファイルで、コンマ区切りの整数のリストで構成されています。

サンプルを使用して、次のようなファイルget_data()を作成しました。.csv

1,6,2,8,2,3,5,9,6,6     10,4,10,5,7,9,6,1,9,5
6,2,2,5,2,2,1,7,7,9     7,6,7,1,3,7,6,2,10,5
8,8,9,2,6,10,10,7,8,9   4,2,10,3,4,4,1,2,2,9
...

次に、JSONを介してCに最適化された解析を悪用しました。

def test2():
    import json
    import time
    time_start = time.time()
    with open('data.csv', 'rb') as f:
        data = f.read()
    data = '[[[' + ']],[['.join(data.splitlines()).replace('\t', '],[') + ']]]'
    all_point_sets = [Point(*xy) for row in json.loads(data) for xy in zip(*row)]
    time_end = time.time()
    print "total time: ", (time_end - time_start)

私のボックスでの結果：元のtest()〜8秒、gcが無効になっている〜6秒、私のバージョン（I / Oを含む）ではそれぞれ〜6秒と〜4秒。つまり、約50％高速化されます。しかし、プロファイラーデータを見ると、最大のボトルネックはオブジェクトのインスタンス化自体にあることは明らかです。したがって、Matt Andersonの答えは、CPythonで最大の利益を得ることになります。

score 1 · Accepted Answer

できることがたくさんあるかどうかわかりません。

ジェネレータを使用して、余分なメモリ割り当てを回避できます。これにより、約5％のスピードアップが得られます。

first_points  = (int(p) for p in first_points .split(","))
second_points = (int(p) for p in second_points.split(","))
paired_points = itertools.izip(first_points, second_points)
curr_points   = [Point(x, y) for x,y in paired_points]

ループ全体を1つの大規模なリスト内包にまとめても、あまり効果はありません。

all_point_sets = [
    [Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
    for xs, ys in data
]

この大きなリストを繰り返し処理すると、ジェネレーターに変えることができます。これにより、CSVデータの解析コストが分散されるため、大きな先行ヒットが発生することはありません。

all_point_sets = (
    [Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
    for xs, ys in data
)

score 0 · Accepted Answer

長さ2000000の配列などの組み込み関数にかかる時間はごくわずかであるため、最も時間のかかる操作はzip(a,b)追加であると推測する必要があります。map(int, string.split(","))

したがって、問題に対処する適切な方法は、文字列を再帰的に連結すること
です。10要素の10文字列をより大きな文字
列に100要素の
10文字列1000要素の10文字列

そして最後にzip(map(int,huge_string_a.split(",")),map(int,huge_string_b.split(",")));

次に、追加と征服の方法に最適なベースNを見つけるために微調整します。

score 0 · Accepted Answer

ここには多くの良い答えがあります。ただし、これまで対処されていないこの問題の1つの側面は、Pythonのさまざまなイテレータ実装間のリストから文字列への時間コストの違いです。

Python.orgのエッセイ：list2strには、リストから文字列への変換に関して、さまざまなイテレータの効率をテストするエッセイがあります。同様の最適化問題に遭遇したが、データ構造とサイズが異なる場合、エッセイに示されている結果はすべて同じ速度でスケールアップするわけではないため、特定のユースケースでさまざまなイテレーターの実装をテストする価値があることを覚えておいてください。

python - Pythonでのオブジェクトへの文字列のペアリングの高速化

13 に答える 13

split.py

split_.pyx

Related

Reference