python - Fastest way to process a regular expression

Question

I have a Python script that processes a log file: it parses the values and simply joins them with a tab.

import re
import sys

# Compile the pattern once, outside the loop.
p = re.compile(
    r"([0-9/]+) ([0-9]+):([0-9]+):([0-9]+) I.*"
    r"worker\(([0-9]+)\)(?:@([^]]*))?.*\[([0-9]+)\] "
    r"=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+) PRT:([0-9]+) IP:([^ ]*) "
    r"JOB:([^!]+)!([0-9]+) CS:([\.0-9]+) CONV:([^ ]*) URL:[^ ]+ KEY:([^/]+)([^ ]*)"
)

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    result = p.match(line)
    if result is not None:
        # Groups that did not take part in the match are printed as "."
        print("\t".join(x if x is not None else "." for x in result.groups()))

However, the script runs very slowly and takes a long time to process the data.

How can I achieve the same behavior faster? Perl / SED / PHP / Bash / ...?

Thanks

score 2 · Accepted Answer

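It is hard to tell without seeing your input, but it looks like your log file is made up of space-separated fields that contain no spaces internally. If so, you could split on whitespace first to put the individual log fields into an array, i.e.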

line.split()      #Split based on whitespace

or

line.split(' ')   #Split based on a single space character

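Then use a few small regexes, or even simple string operations, to extract the data from the fields you need.

This is likely to be much more efficient, because the bulk of the line processing is done by one simple rule. You avoid the pitfalls of potential backtracking, and you get more readable code that is less likely to contain mistakes.

I don't know Python, so I can't write out a full code example, but that is the approach I would take in Perl.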
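
A minimal Python sketch of this split-first approach might look like the following. It is only an illustration under assumptions: the sample line is made up to resemble the pattern in the question, the real log layout is unknown, and parse_line is a hypothetical helper name.

```python
# Hypothetical log line, shaped after the regex in the question.
SAMPLE = ("2011/05/03 10:15:42 I worker(7) [1234] =RES= PS:1 DW:2 RT:3 PRT:4 "
          "IP:10.0.0.1 JOB:build!5 CS:0.5 CONV:x URL:http://example.com/a KEY:abc/def")


def parse_line(line):
    """Split once on whitespace, then use plain string operations per field."""
    parts = line.split()
    date, time = parts[0], parts[1]

    # Collect the NAME:value tokens (PS:, DW:, IP:, ...) into a dict.
    fields = {}
    for token in parts[2:]:
        name, sep, value = token.partition(":")
        if sep and name.isupper():
            fields[name] = value

    # JOB is assumed to look like "name!number", as in the question's pattern.
    job, _, job_id = fields.get("JOB", "").partition("!")
    columns = [date, time, fields.get("PS", "."), fields.get("IP", "."),
               job or ".", job_id or "."]
    return "\t".join(columns)


print(parse_line(SAMPLE))
```

Only a few columns are emitted here to keep the sketch short; the remaining fields can be read out of the dict in the same way. This works only if no field contains a space, which is the assumption stated at the top of this answer.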

score 1 · Accepted Answer

I write Perl, not Python, but I recently used this technique to parse very large logs:

Divide the input file into chunks (for example, FileLen/NumProcessors bytes each).
Adjust the start and end of every chunk to a \n so that each worker gets whole lines.
fork() to create NumProcessors workers, each of which reads its own byte range from the file and writes its own output file.
Merge the output files if needed.
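
The steps above could be sketched in Python roughly as follows. This is not the Perl code the answer refers to: it swaps the raw fork() for multiprocessing.Pool, and chunk_ranges, process_range, run_parallel, parse_line and the ".partN" output names are all hypothetical names introduced for illustration. parse_line is only a placeholder for the real per-line parsing.

```python
import os
from multiprocessing import Pool


def parse_line(line):
    # Placeholder for the real per-line work (for example the regex match).
    return "\t".join(line.split())


def chunk_ranges(path, num_chunks):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            # Jump near the ideal split point, never before the last boundary.
            f.seek(max(size * i // num_chunks, bounds[-1]))
            f.readline()  # move forward to the end of the current line
            bounds.append(min(f.tell(), size))
    bounds.append(size)
    # Drop empty ranges, which can appear once the end of the file is reached.
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


def process_range(job):
    """Worker: parse every line in [start, end) and write its own output file."""
    path, start, end, out_path = job
    with open(path, "rb") as src, open(out_path, "w") as dst:
        src.seek(start)
        while src.tell() < end:
            line = src.readline().decode("utf-8", "replace").strip()
            if line:
                dst.write(parse_line(line) + "\n")
    return out_path


def run_parallel(path, num_workers):
    """Start one job per chunk and return the output file names, in order."""
    jobs = [(path, s, e, "%s.part%d" % (path, i))
            for i, (s, e) in enumerate(chunk_ranges(path, num_workers))]
    with Pool(num_workers) as pool:
        return pool.map(process_range, jobs)
```

The output files come back in chunk order, so concatenating them reproduces the order of the input. On platforms that spawn worker processes instead of forking, run_parallel should be called from under an `if __name__ == "__main__":` guard.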

Of course, you should also work on optimizing the regexp itself. For example, use .* less often, because it causes a lot of backtracking, which is slow. Either way, 99% of the time the bottleneck will be the CPU time spent in this regexp, so running on 8 CPUs should help.
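
To illustrate the point about .*, here is a small Python sketch on a simplified pattern. The sample line and both patterns are made up for the example, and the tighter version is only valid under the assumption that the skipped text never contains "(" or "[", which may not hold for the real logs.

```python
import re

# Hypothetical line, loosely shaped after the pattern in the question.
line = "2011/05/03 10:15:42 I main worker(7) [1234] =RES= PS:1 DW:2"

# Greedy: each .* first runs to the end of the line and then backtracks
# character by character until the rest of the pattern fits.
greedy = re.compile(
    r"([0-9/]+) ([0-9:]+) I.*worker\(([0-9]+)\).*\[([0-9]+)\] =RES=")

# Tighter: a negated character class cannot run past the next delimiter,
# so the amount of backtracking is bounded by the distance to it.
tight = re.compile(
    r"([0-9/]+) ([0-9:]+) I[^(]*worker\(([0-9]+)\)[^\[]*\[([0-9]+)\] =RES=")

print(greedy.match(line).groups())
print(tight.match(line).groups())
```

Both patterns extract the same groups from this sample line. Whether the tighter form is measurably faster on real data would need to be timed, for instance with the timeit module.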

score 1 · Accepted Answer

In Perl it is possible to use precompiled regexps, which are much faster if you use them many times.

http://perldoc.perl.org/perlretut.html#Compiling-and-saving-regular-expressions

"The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it."

If the data is large, it is worth processing it in parallel by splitting it into pieces. There are several modules on CPAN that make this easier.
