python - Parsing an Apache log with two IP addresses

Question

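I have an Apache log file that I am trying to parse. I found several different ways to do it, such as apachelog, the two answers here, and this one. With any of these methods I was able to parse most of the lines in the log. However, some lines contain two IP addresses: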

xxx.xx.xx.xxx, yy.yyy.yy.yyy - - [14/Feb/2013:03:55:21 +0000] "GET /alink HTTP/1.0" 200 90210 "http://www.google.com/search" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko; Google Web Preview) Chrome/22.0.1229 Safari/537.4"

None of the methods I mentioned parses this line correctly (I also tried the virtualhost option of apachelog). Any advice? I am using the last method I mentioned, but I am open to anything.

import re

parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # ident %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.+)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referer>.*)"',               # referer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

data = []
# `log` holds the path to the Apache log file
with open(log) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            data.append(m.groupdict())
        else:
            print(line)                 # line did not match the pattern

score 4 · Accepted Answer

You can change the first component of the regular expression in the list so that it accepts a comma-separated list of hosts. The following works for the example line:

import re
parts = [
    r'(?P<host>\S+(,\s*\S+)*)',         # comma-separated list of hosts                             
    r'\S+',                             # ident %l (unused)
    r'(?P<user>\S+)',                   # user %u                           
    r'\[(?P<time>.+)\]',                # time %t                           
    r'"(?P<request>.+)"',               # request "%r"                      
    r'(?P<status>[0-9]+)',              # status %>s                        
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')     
    r'"(?P<referer>.*)"',               # referer "%{Referer}i"             
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"       
]
pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')

test = 'xxx.xx.xx.xxx, yy.yyy.yy.yyy - - [14/Feb/2013:03:55:21 +0000] "GET /alink HTTP/1.0" 200 90210 "http://www.google.com/search" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko; Google Web Preview) Chrome/22.0.1229 Safari/537.4"'
m = pattern.match(test)
res = m.groupdict()

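After running the commands above, res['host'] contains 'xxx.xx.xx.xxx, yy.yyy.yy.yyy'. If you need the host addresses separately, res['host'].split(',') gives you a list of addresses. Note that the second entry keeps the space that follows the comma, so strip each item if that matters.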
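For illustration, here is a minimal sketch of how the split could be folded into a small helper. The name parse_line, the extra 'hosts' key, and the two sample log lines (using documentation-range IP addresses) are made up for this example and do not come from the question or from any library.

```python
import re

# Same pattern as in the answer: only the host component differs from the
# question's version, accepting a comma-separated list of hosts.
parts = [
    r'(?P<host>\S+(,\s*\S+)*)',         # one or more hosts, comma-separated
    r'\S+',                             # ident %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.+)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (can be '-')
    r'"(?P<referer>.*)"',               # referer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')


def parse_line(line):
    """Hypothetical helper: return a dict of fields, or None if no match."""
    m = pattern.match(line)
    if m is None:
        return None
    res = m.groupdict()
    # Split the raw host field and drop the whitespace left after each comma.
    res['hosts'] = [h.strip() for h in res['host'].split(',')]
    return res


# Made-up sample lines, one with two addresses and one with a single address.
two_hosts = ('192.0.2.10, 198.51.100.7 - - [14/Feb/2013:03:55:21 +0000] '
             '"GET /alink HTTP/1.0" 200 90210 '
             '"http://www.google.com/search" "Mozilla/5.0"')
one_host = ('192.0.2.10 - - [14/Feb/2013:03:55:21 +0000] '
            '"GET /alink HTTP/1.0" 200 90210 "-" "Mozilla/5.0"')

for sample in (two_hosts, one_host):
    print(parse_line(sample)['hosts'])
```

Keeping the raw 'host' string next to the derived list means lines with a single address and lines with several can be handled by the same code path.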
