parsing - Splitting a log file on spaces, except inside square brackets - Python

Question

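Long-time reader, first-time asker (be gentle).

I've been doing this in Unix Bash with some pretty nasty WHILE READ loops, but I'm learning Python and would like to write a more effective parser routine.

I have a bunch of log files that are mostly space-delimited, but they use square brackets around fields that may contain spaces. How do I ignore the content inside the brackets when looking for delimiters?

(I assume the re library is needed for this.)

For example, sample input: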

[21/Sep/2014:13:51:12 +0000] serverx 192.0.0.1 identity 200 8.8.8.8 - 500 unavailable RESULT 546 888 GET htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium [somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0]

Desired output:

'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0'

You'll notice that the first and last fields (the ones that were in square brackets) still have their spaces intact.

Bonus points: the 14th field (the URL) is always in one of these formats:

htp://google.com/path-data-might-be-here-and-can-contain-special-characters
google.com/path-data-might-be-here-and-can-contain-special-characters
xyz.abc.www.google.com/path-data-might-be-here-and-can-contain-special-characters
google.com:443
google.com

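I'd like to add an extra column to the data that contains just the domain (i.e. xyz.abc.www.google.com or google.com).

Until now I have been taking the parsed output and using Unix AWK with an IF statement to split this field on '/' and check whether the third field is blank. If it is, I return the first field (up to the ':' if one is present); otherwise I return the third field. If there is a better way to do this, preferably in the same routine as above, I'd love to hear it. The final output would then look like this: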

'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0'; **'www.google.com'**

Footnote: I changed http to htp in the sample so it wouldn't create a bunch of distracting links.

score 1 · Accepted Answer

The regex pattern \[[^\]]*\]|\S+ will tokenize your data, but it doesn't strip the square brackets from the multi-word values. You'll need to do that in a separate step:

import re

def parse_line(line):
    values = re.findall(r'\[[^\]]*\]|\S+', line)
    values = [v.strip("[]") for v in values]
    return values
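
To see what the function returns, here is a small usage sketch. The log line is a shortened, made-up line in the same shape as the sample from the question, not real data. One caveat: strip("[]") removes bracket characters from both ends of every token, so an unbracketed field that happens to start or end with a bracket would be trimmed as well.

```python
import re

def parse_line(line):
    # The bracketed alternative comes first, so spaces inside [...] never split a field.
    values = re.findall(r'\[[^\]]*\]|\S+', line)
    # Remove the enclosing brackets from the multi-word values.
    return [v.strip("[]") for v in values]

# Hypothetical, shortened log line (assumption: same layout as the question's sample).
sample = ("[21/Sep/2014:13:51:12 +0000] serverx 192.0.0.1 GET "
          "htp://www.google.com/x [somestuff/1.0 (OSX v. 1.0) AOL/1.0]")

fields = parse_line(sample)
print("; ".join(repr(f) for f in fields))
```

The two bracketed fields come back as single values with their inner spaces preserved, and every other field is split on whitespace.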

Here is a more verbose version of the regex pattern:

pattern = r"""(?x)   # turn on verbose mode (ignores whitespace and comments)
    \[       # match a literal open bracket '['
    [^\]]*   # match zero or more characters, as long as they are not ']'
    \]       # match a literal close bracket ']'
    |        # alternation, match either the section above or the section below
    \S+      # match one or more non-space characters
    """

values = re.findall(pattern, line) # findall returns a list with all matches it finds
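
For the bonus part of the question (the extra domain column), one possible approach is plain string splitting rather than a second regex. This is only a sketch: extract_domain and parse_with_domain are hypothetical helper names, and it assumes the URL is the 14th field (index 13) and always has one of the five shapes listed in the question.

```python
import re

def parse_line(line):
    values = re.findall(r'\[[^\]]*\]|\S+', line)
    return [v.strip("[]") for v in values]

def extract_domain(url_field):
    # Hypothetical helper: reduce a URL-like field to its host name.
    scheme, sep, rest = url_field.partition("://")
    # Treat the prefix as a scheme only if it has no "/", so that a "://"
    # appearing later in a path or query string is not mistaken for one.
    host = rest if sep and "/" not in scheme else url_field
    host = host.split("/", 1)[0]   # drop the path, if any
    return host.split(":", 1)[0]   # drop the port, if any

def parse_with_domain(line, url_index=13):
    # Assumption: the URL sits at url_index (the 14th field in the question).
    fields = parse_line(line)
    if len(fields) > url_index:
        fields.append(extract_domain(fields[url_index]))
    return fields
```

Calling parse_with_domain(line) on each log line would give the parsed fields with the domain appended as the last column. Lines with fewer fields than expected are returned without the extra column.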
