python - Read an access log file into a DataFrame

Question

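I need to process an access log file. Is it possible to read a log file such as an access log into a DataFrame and work with it there? The fields I want to work with are the timestamp, the response time, and the request URL.

Example log line: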

128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 "POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" "dk.product.xml.client.transports.ServletBridge" "-"

Update: I am using regular expressions to extract the response time and the request. I am now trying to build the dataset by appending to a DataFrame:

df2 = pd.DataFrame({ 'time' : pd.Timestamp(timestamp),
                     'reponsetime' : responsetime,
                     'requesturl' : requesturl })

score 0 · Accepted Answer

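I'd suggest using a regular expression and loading the data into some kind of in-memory structure (I assume that is what you mean by a dataframe).

I love using Kodos to develop regular expressions: http://kodos.sourceforge.net/

For the log snippet above, the following regular expression isolates some of the important parts: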

^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"

Kodos also generates some helpful code snippets:

import re

# The strings use ''' delimiters because the pattern and the sample line both
# end with a double quote, which would collide with a closing """.
rawstr = r'''^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"'''
embedded_rawstr = r'''^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"'''
matchstr = '''128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 "POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" "dk.product.xml.client.transports.ServletBridge" "-"'''

# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)

# method 2: using search function (w/ external flags)
match_obj = re.search(rawstr, matchstr)

# method 3: using search function (w/ embedded flags)
match_obj = re.search(embedded_rawstr, matchstr)

# Retrieve group(s) from match_obj
all_groups = match_obj.groups()

# Retrieve group(s) by index
group_1 = match_obj.group(1)
group_2 = match_obj.group(2)
group_3 = match_obj.group(3)
group_4 = match_obj.group(4)
group_5 = match_obj.group(5)
group_6 = match_obj.group(6)
group_7 = match_obj.group(7)
group_8 = match_obj.group(8)
group_9 = match_obj.group(9)
group_10 = match_obj.group(10)

# Retrieve group(s) by name
host = match_obj.group('host')
day = match_obj.group('day')
month = match_obj.group('month')
timestamp = match_obj.group('timestamp')

From there, loading the log into memory and starting to work with it is fairly straightforward.
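As a rough sketch of that last step, the following parses each line into a dict and then hands the list of dicts to pandas. It is not the Kodos-generated pattern: it is a simplified pattern written for this example that also captures the response time, and it assumes every line follows the layout of the single sample line in the question. The names LOG_PATTERN, parse_line, parse_lines and to_dataframe, the field names, and the file name access.log are all made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout, inferred only from the one sample line in the question:
# ip vhost - - [timestamp] responsetime "METHOD url PROTOCOL" status size ...
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+)\s+(?P<vhost>\S+)\s+\S+\s+\S+\s+'
    r'\[(?P<time>[^\]]+)\]\s+'
    r'(?P<responsetime>\d+)\s+'
    r'"(?P<method>\S+)\s+(?P<requesturl>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)'
)

SAMPLE_LINE = (
    '128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 '
    '"POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" '
    '"dk.product.xml.client.transports.ServletBridge" "-"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    # %b (month abbreviation) is locale dependent; this assumes English names.
    row['time'] = datetime.strptime(row['time'], '%d/%b/%Y:%H:%M:%S %z')
    # The unit of the response time depends on how the server log is configured.
    row['responsetime'] = int(row['responsetime'])
    row['status'] = int(row['status'])
    return row


def parse_lines(lines):
    """Parse an iterable of lines, silently skipping lines that do not match."""
    rows = (parse_line(line) for line in lines)
    return [row for row in rows if row is not None]


def to_dataframe(rows):
    """Build a DataFrame with the three columns asked about, plus the status."""
    import pandas as pd  # imported here so the parsing helpers run without pandas
    return pd.DataFrame(rows, columns=['time', 'responsetime', 'requesturl', 'status'])


rows = parse_lines([SAMPLE_LINE])
print(rows[0]['time'], rows[0]['responsetime'], rows[0]['requesturl'])

# With a real file (hypothetical name):
# with open('access.log') as f:
#     df = to_dataframe(parse_lines(f))
```

Collecting all rows in a list and building the DataFrame once at the end tends to be simpler and faster than creating a one-row DataFrame per line and appending it, which is what the attempt in the question appears to be doing.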
