python - How can I speed up this readline loop in Python?

Question

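I'm importing several parts of a database dump in text format into MySQL. The problem is that a great deal of uninteresting material comes before the interesting data. I wrote this loop to get to the data I need: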

def readloop(DBFILE):
    txtdb=open(DBFILE, 'r')

    sline = ""

    # loop till 1st "customernum:" is found
    while sline.startswith("customernum:  ") is False:
        sline = txtdb.readline()

    while sline.startswith("customernum:  "):
        data = []
        data.append(sline)
        sline = txtdb.readline()
        while sline.startswith("customernum:  ") is False:
            data.append(sline)
            sline = txtdb.readline()
            if len(sline) == 0:
                break
        customernum = getitem(data, "customernum:  ")
        street = getitem(data, "street:  ")
        country = getitem(data, "country:  ")
        zip = getitem(data, "zip:  ")

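The text file is pretty huge, so just looping until the first wanted entry takes a very long time. Does anyone have an idea whether this could be done faster (or whether my whole approach is not the best idea)?

Many thanks in advance!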

score 5 · Accepted Answer

Please don't write this code:

while condition is False:

Boolean conditions are boolean, for crying out loud, so they can be tested (or negated and tested) directly:

while not condition:

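Your second while loop isn't written as "while condition is True:", so I'm curious why you felt the need to test "is False" in the first one.

Pulling out the dis module, I thought I'd dissect this a little further. In my pyparsing experience, function calls are total performance killers, so it is worth avoiding them where possible. Here is your original test: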

>>> test = lambda t : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_GLOBAL              1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

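Two expensive things happen here: CALL_FUNCTION and LOAD_GLOBAL. You could cut back on LOAD_GLOBAL by defining a local name for False (this trick only works in Python 2, where False is an ordinary name; in Python 3 it is a keyword and cannot be used as a parameter name):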

>>> test = lambda t,False=False : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_FAST                1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

But what if we just drop the 'is' test altogether?

>>> test = lambda t : not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 UNARY_NOT
             13 RETURN_VALUE

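We've collapsed a LOAD_xxx and a COMPARE_OP into a single UNARY_NOT. "is False" certainly isn't helping the performance cause.

Now what if we could do some gross elimination of lines without making any function call at all? If the first character of the line is not a 'c', there is no way it will startswith('customernum'). Let's try that: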

>>> test = lambda t : t[0] != 'c' or not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_CONST               0 (0)
              6 BINARY_SUBSCR
              7 LOAD_CONST               1 ('c')
             10 COMPARE_OP               3 (!=)
             13 JUMP_IF_TRUE            14 (to 30)
             16 POP_TOP
             17 LOAD_FAST                0 (t)
             20 LOAD_ATTR                0 (startswith)
             23 LOAD_CONST               2 ('customernum')
             26 CALL_FUNCTION            1
             29 UNARY_NOT
        >>   30 RETURN_VALUE

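(Note that using [0] to get the first character of a string does not create a slice. It is in fact very fast.)

Now, assuming there are not a large number of lines starting with 'c', the rough-cut filter can eliminate a line using only fairly fast instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'" I save myself an extraneous UNARY_NOT instruction.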
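To check whether the rough-cut filter pays off on a given interpreter, a small timing harness helps. The following is only a sketch in Python 3: the function names, the `SAMPLE` lines and the repeat count are made up for illustration, and the skip test is written with `or` so that it returns the same result as `not line.startswith(TAG)` for every non-empty line.

```python
import timeit

TAG = "customernum:  "


def plain_skip(line):
    # One method call per line.
    return not line.startswith(TAG)


def prefiltered_skip(line):
    # Cheap first-character check first; startswith only runs for lines
    # that begin with 'c'. Assumes the line is non-empty, which holds for
    # lines produced by iterating over a file.
    return line[0] != "c" or not line.startswith(TAG)


# Made-up sample lines standing in for the real dump.
SAMPLE = [
    "some header text\n",
    "country:  DE\n",
    "customernum:  42\n",
    "zip:  12345\n",
] * 1000


def time_predicate(predicate, repeat=20):
    # Total seconds needed to apply the predicate to every sample line.
    return timeit.timeit(lambda: [predicate(s) for s in SAMPLE], number=repeat)


if __name__ == "__main__":
    print("plain      :", time_predicate(plain_skip))
    print("prefiltered:", time_predicate(prefiltered_skip))
```

The numbers depend heavily on the Python version and on how many lines really start with 'c', so measure on a slice of the real file before committing to either form.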

So, using what we have learned about short-circuit optimization, I suggest changing this code:

while sline.startswith("customernum:  ") is False:
    sline = txtdb.readline()

while sline.startswith("customernum:  "):
    ... do the rest of the customer data stuff...

To this:

for sline in txtdb:
    if sline[0] == 'c' and sline.startswith("customernum:  "):
        ... do the rest of the customer data stuff...

Note that I have also removed the .readline() function call, and just iterate over the file using "for sline in txtdb".
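For completeness, here is one way the whole loop might look once it is driven by plain iteration. This is a sketch in Python 3, not the asker's actual code: `getitem` is a hypothetical stand-in for the helper used in the question (its real behaviour is not shown there), and `read_customers` and `parse_record` are names invented for this example.

```python
import io


def getitem(data, prefix):
    # Hypothetical stand-in for the question's helper: return the text
    # after the first line starting with prefix, or None if there is none.
    for line in data:
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def parse_record(data):
    # Collect the four fields the question extracts from one record.
    return {
        "customernum": getitem(data, "customernum:  "),
        "street": getitem(data, "street:  "),
        "country": getitem(data, "country:  "),
        "zip": getitem(data, "zip:  "),
    }


def read_customers(lines, tag="customernum:  "):
    """Yield one dict per customer record from an iterable of non-empty lines."""
    data = None  # stays None until the first tag line has been seen
    for sline in lines:
        if sline[0] == "c" and sline.startswith(tag):
            if data is not None:
                yield parse_record(data)  # previous record is complete
            data = [sline]
        elif data is not None:
            data.append(sline)
    if data is not None:
        yield parse_record(data)  # last record in the file


if __name__ == "__main__":
    demo = io.StringIO(
        "junk line\ncustomernum:  1\nstreet:  A St\ncountry:  DE\nzip:  111\n"
    )
    for record in read_customers(demo):
        print(record)
```

Passing an open file object as `lines` gives the same single pass over the file as the loop above, with no explicit readline() calls.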

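I realize Alex has provided an entirely different body of code for finding that first 'customernum' line, but I would try optimizing within the general bounds of your algorithm before pulling out the big but obscure block-reading guns.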

score 2 · Accepted Answer

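The general idea for optimization is to proceed "by big blocks" (mostly ignoring line structure) to locate the first line of interest, then move on to line-by-line processing for the rest. It is somewhat finicky and error-prone (off-by-one errors and the like), so it really needs testing, but the general idea is as follows: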

import itertools

def readloop(DBFILE):
  txtdb=open(DBFILE, 'r')
  tag = "customernum:  "
  BIGBLOCK = 1024 * 1024
  # locate first occurrence of tag at line-start
  # (assumes the VERY FIRST line doesn't start that way,
  # else you need a special-case and slight refactoring)
  blob = ''
  while True:
    chunk = txtdb.read(BIGBLOCK)
    if not chunk:
      # end of file: tag not present at all -- warn about that, then
      return
    blob = blob + chunk
    where = blob.find('\n' + tag)
    if where != -1:  # found it!
      blob = blob[where+1:] + txtdb.readline()
      break
    blob = blob[-len(tag):]
  # now make a by-line iterator over the part of interest
  thelines = itertools.chain(blob.splitlines(1), txtdb)
  sline = next(thelines, '')
  while sline.startswith(tag):
    data = []
    data.append(sline)
    sline = next(thelines, '')
    while not sline.startswith(tag):
      data.append(sline)
      sline = next(thelines, '')
      if not sline:
        break
    customernum = getitem(data, "customernum:  ")
    street = getitem(data, "street:  ")
    country = getitem(data, "country:  ")
    zip = getitem(data, "zip:  ")

Here, I've tried to keep as much of your structure intact as feasible, making only minor enhancements beyond the "big idea" of this refactoring.

score 1 · Accepted Answer

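I guess you are writing this import script and it gets boring to wait while testing it, since the data stays the same all the time.

You can run the script once to detect the actual positions in the file you want to jump to, using print txtdb.tell(). Write those down and replace the searching code with txtdb.seek( pos ). Basically, that is building an index for the file ;-)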
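A sketch of that record-once, seek-later idea in Python 3 might look like the following. The function names are invented for illustration. The file is opened in binary mode and read with readline(), which is an assumption made here so that tell() returns plain byte offsets that can be passed straight back to seek().

```python
TAG = b"customernum:  "


def find_first_record_offset(path, tag=TAG):
    """Scan the file once; return the byte offset of the first tag line, or None."""
    with open(path, "rb") as f:
        while True:
            pos = f.tell()  # offset of the line about to be read
            line = f.readline()
            if not line:
                return None  # reached end of file without seeing the tag
            if line.startswith(tag):
                return pos


def lines_from_offset(path, offset):
    """Yield the lines (as bytes) starting at a previously recorded offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            yield line
```

The slow scan only has to run once per dump file. Later test runs can pass the saved offset to `lines_from_offset` and start at the first record immediately, as long as the file has not changed in the meantime.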

Another, more conventional way would be to read the data in larger chunks, a few MB at a time, rather than just the few bytes of a single line.

score 0 · Accepted Answer

This might help: Python Performance Part 2: Parsing Large Strings for 'A Href' Hypertext

score 0 · Accepted Answer

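Tell us more about the file.

Can you use file.seek to do a binary search? Seek to the halfway mark, read a few lines, determine whether you are before or after the part you need, and recurse. That will turn your O(n) search into O(log n).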
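Whether this works depends on being able to tell, from a single line, which side of the boundary you are on. The sketch below (Python 3, all names invented for illustration) assumes something the question does not confirm: that every line of the customer section starts with one of four known field prefixes, and that no line before that section does. Under that assumption it binary-searches over byte offsets for the start of the section.

```python
import os

# Assumed field prefixes; adjust to whatever really marks the wanted section.
FIELD_PREFIXES = (b"customernum:  ", b"street:  ", b"country:  ", b"zip:  ")


def in_customer_section(line):
    # Assumption: only lines of the customer section start with these prefixes.
    return line.startswith(FIELD_PREFIXES)


def line_at_or_after(f, pos):
    """Return (offset, line) for the first line starting at or after byte pos."""
    if pos > 0:
        f.seek(pos - 1)
        f.readline()  # consume through the next newline
    else:
        f.seek(0)
    start = f.tell()
    return start, f.readline()


def find_section_start(path):
    """Return the byte offset where the customer section begins, or None."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lo, hi = 0, size
        # Find the smallest offset whose next full line is in the section
        # (or is end of file); everything before it is uninteresting.
        while lo < hi:
            mid = (lo + hi) // 2
            _, line = line_at_or_after(f, mid)
            if not line or in_customer_section(line):
                hi = mid
            else:
                lo = mid + 1
        start, line = line_at_or_after(f, lo)
        return start if line else None
```

Each probe reads at most two lines, so the search needs roughly log2(file size) seeks instead of a full scan. If the uninteresting part can contain lines that look like customer fields, the predicate is no longer monotonic and this approach does not apply.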
