linux - Can awk patterns match multiple lines?

Question

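I have some complex log files and need to write a tool to process them. I have been playing with awk, but I am not sure whether awk is the right tool for this.

My log files are the output of OSPF protocol decodes. They contain text logs of various protocol pkts and their contents, with the various protocol fields identified by their values. I want to process these files and print only certain lines of the log that pertain to specific pkts. Each pkt log can consist of a varying number of lines for that pkt's entry.

awk seems able to process a single line that matches a pattern. I can locate the desired pkt, but I then need to match patterns in the lines that follow to decide whether it is a pkt I want to print.

Another way to look at this: I want to isolate several lines in the log file and print those lines that are the details of a particular pkt, based on pattern matches across several lines.

Since awk appears to be line-based, I am not sure it is the best tool.

If awk can do this, how is it done? If not, any suggestions for a tool to use instead?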

score 25 · Accepted Answer

Awk can easily detect multi-line combinations of patterns, but you need to build what is called a state machine in your code to recognize the sequence.

Consider this input:

how
second half #1
now
first half
second half #2
brown
second half #3
cow

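As you have seen, recognizing a single pattern is easy. Now we can write an awk program that recognizes a second half line only when it is immediately preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)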

/second half/ {
  if(lastLine == "first half") {
    print
  }
}

{ lastLine = $0 }

Run this and you will see:

second half #2

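Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement, and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine, you would keep an explicit state variable and transition from state to state depending on both the existing state and the current input. But you may not need that much control mechanism.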
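For comparison, here is a minimal sketch of the explicit-state variant described above. The function name match_pairs and the state names "idle" and "armed" are made up for illustration; they are not part of the original answer.

```shell
# Hypothetical helper: print "second half" lines only when the previous
# line was exactly "first half", tracking that in an explicit state variable.
match_pairs() {
  awk '
    BEGIN { state = "idle" }
    # Seeing the first half arms the machine; skip the remaining rules.
    $0 == "first half" { state = "armed"; next }
    # A second half is printed only while the machine is armed.
    /second half/ && state == "armed" { print }
    # Any line other than a first half returns the machine to idle.
    { state = "idle" }
  '
}

# Feed it the sample input from above.
printf '%s\n' how 'second half #1' now 'first half' 'second half #2' \
  brown 'second half #3' cow | match_pairs
```

On the sample input this should print the same single line as the lastLine version. The advantage only shows up once you need more than two states, for example a header line followed by several optional detail lines.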

score 12 · Accepted Answer

Awk is really record-based. By default it treats each line as a record, but you can change that with the RS (record separator) variable.
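As a small illustration of RS, the sketch below uses awk's paragraph mode (RS set to the empty string), where blank lines separate records. Note the assumption: the sample input here has a blank line between entries, which the data later in this answer does not, and the function name cat_record_headers is invented for the example.

```shell
# Hypothetical helper: treat blank-line-separated blocks as records and
# print the first line of every record that contains "type: cat".
cat_record_headers() {
  awk '
    # RS = "" selects paragraph mode; FS = "\n" makes each line a field.
    BEGIN { RS = ""; FS = "\n" }
    /type: cat/ { print $1 }
  '
}

# Assumed sample input with a blank line between entries.
printf 'animal 0\nname: joe\ntype: dog\n\nanimal 1\nname: bill\ntype: cat\n' |
  cat_record_headers
```

Because the pattern is tested against the whole record, /type: cat/ can match on the third line while the action prints the first.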

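One way to approach this is to do a first pass with sed (you could do this with awk too, if you prefer) and separate the records with a different character, such as a form feed. Then you can write an awk script that treats each group of lines as a single record.

For example, if this is your data: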

animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat

To separate the records with form feeds:

$ cat data | sed $'s|^\(animal.*\)|\f\\1|'

Then pipe that to awk. Here is an example that prints records conditionally:

$ cat data | sed $'s|^\(animal.*\)|\f\\1|' | awk '
      BEGIN { RS="\f" }
      /type: cat/ { print }'

Output:

animal 1
name: bill
type: cat

animal 2
name: ed
type: cat
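Since the first pass could also be done in awk, here is one possible awk-only sketch that skips sed and the form feed entirely. It buffers lines and treats each line starting with "animal" as the start of a new record. The names print_cats, emit_record, and rec are hypothetical, and this is only one way to structure it.

```shell
# Hypothetical helper: group lines into records that start at each
# "animal" header, then print the records containing "type: cat".
print_cats() {
  awk '
    # Print the buffered record if it matches, then reset the buffer.
    function emit_record() {
      if (rec ~ /type: cat/) printf "%s", rec
      rec = ""
    }
    /^animal/ { emit_record() }   # a header line closes the previous record
    { rec = rec $0 "\n" }         # append the current line to the buffer
    END { emit_record() }         # the final record has no header after it
  '
}

printf '%s\n' 'animal 0' 'name: joe' 'type: dog' \
  'animal 1' 'name: bill' 'type: cat' \
  'animal 2' 'name: ed' 'type: cat' | print_cats
```

Unlike the RS version, this prints the matching records back to back, without the blank line that print adds after each form-feed-delimited record.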

Edit: As a bonus, here is how to do it with awk-ward ruby (-014 means use form feed (octal code 014) as the record separator):

$ cat data | sed $'s|^\(animal.*\)|\f\\1|' |
      ruby -014 -ne 'print if /type: cat/'

score 3 · Accepted Answer

I do this kind of thing with sendmail logs from time to time.

Given:

Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www@web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www@web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3@nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc@europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

I use a script like this:

#!/usr/bin/awk -f

BEGIN {
  search=ARGV[1];  # Grab the first command line option
  delete ARGV[1];  # Delete it so it won't be considered a file
}

# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
  line[$6]=sprintf("%s\n%s", line[$6], $0);
}

# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
  show[$6];
}

# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
  for(qid in show) {
    print line[qid];
  }
}

You get the following output:

$ mqsearch airtel /var/log/maillog

Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

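The idea here is to print every line whose Sendmail queue ID matches that of a line containing the string I am searching for. The structure of the code is of course a product of the structure of the log file, so you will need to customize the solution for the data you are trying to analyze and extract.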

score 1 · Accepted Answer

awk '/pattern-start/,/pattern-end/'
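To make the one-liner concrete, here is a small sketch of a range pattern applied to data shaped like the animal example from another answer. The patterns and the function name animal_one are chosen only for illustration.

```shell
# Hypothetical helper: print every line from the first line matching the
# start pattern through the next line matching the end pattern, inclusive.
animal_one() {
  awk '/^animal 1/,/^type:/'
}

printf '%s\n' 'animal 0' 'name: joe' 'type: dog' \
  'animal 1' 'name: bill' 'type: cat' | animal_one
```

A range pattern starts printing as soon as the start pattern matches, so it cannot decide whether to print a block based on a line that appears later in that block. For that kind of condition, buffering the lines or using a state variable is likely a better fit.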

answered 2013-01-16T03:28:37.553

score 1 · Accepted Answer

`pcregrep -M` works pretty well for this.

From pcregrep(1):

-M, --multiline

Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line.

When this option is set, the PCRE library is called in “multiline” mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)
