regex - Optimizing a log regex in PowerShell

Question

I have (2) SMTP gateways that each spit out about a week's worth of data as .log text files (usually about 10 to 30 MB each). All told, the two together usually come to about 1.2 GB.

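I have set up a read-only share on the (2) log directories and am trying to parse the log entries with Select-String (for example, to check whether an email from "bdole" came in). Performance-wise, that is not bad.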

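However, I want to retrieve the whole "log entry". From my initial research, that means reading the entire contents of a log at once and running a regex against it. That is what I am doing, across nearly 200 files.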

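However, I don't think the real problem is I/O. I spawn ~200 threads (one per file), capped at 20 running at a time. The first 20 threads take a long time to run. I inserted debug code and went back to a single thread: simply running the regex against the contents of one 10 to 20 MB file appears to take a long time.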

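I suspect the regex I wrote is somehow very inefficient (it does work, in the sense that letting it run overnight gets the job done). In addition, network I/O is quite low (peaking at about 0.6% of a 2 Gbps connection), while CPU/RAM usage is very high.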

An ideal log entry looks like this:

---- SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

The only reliable delimiter is the leading ---- (the line sometimes also ends with ----, and sometimes does not).

The content of a "log entry" can vary a great deal, for example when it is a notification of a blocked connection.

The regex I am using:

(?sm)----((?!----).*?)(log entry)((?!----).*?)(#USERINPUT#)((?!----).*?)----

where #USERINPUT# is replaced by whatever was passed to the script.

The parsing code, which runs after the list of file paths has been obtained with gci:

if ( !(Test-Path $path) ) {
    Write-Error "issue accessing $path"
} else {
    try {
        # Read the whole file into memory as a single string
        $buffer = [io.file]::ReadAllText($path)
    }
    catch {
        # Remember the failing path and surface the error record
        $errArray += $path
        $_
    }
    [string[]]$matchBuffer = @()
    $matchBuffer += $entrySeperator
    $matchBuffer += $_
    $matchBuffer += $entrySeperator
    # Run the regex over the whole buffer; emit each match followed by a separator
    $matchBuffer += $buffer | Select-String $regex -AllMatches |
        % {$_.Matches} |
        % {$_.Value; $entrySeperator}

    if ($errArray) {
        Write-Warning "There were errors, probably in accessing files. "
        $errArray
    }

    # Write the matches to a per-file result in the temp directory
    $fileName = (gi $path).Name
    sc -path $tmpDir\$fileName -value $matchBuffer
    $matchBuffer | Out-String
}

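I am wondering whether it would be faster and better to search for the "hits" (e.g. LINE 21 of XXXX.LOG) and then work backwards, reconstructing the log entry from the surrounding context.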
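The "find the hit, then work backwards" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the log lines, the $userInput value and all variable names are made up, and the lines are held in an in-memory array instead of being read from a real file. Note that .Contains is case-sensitive.

```powershell
# Hypothetical in-memory log; in practice the lines might come from Get-Content.
$lines = @(
    '---- SMTPRS log entry made at 01/02/2013 10:00:00',
    'Incoming SMTP call from 10.0.0.1 at 10:00:00.',
    '>>> QUIT',
    '---- SMTPRS log entry made at 01/02/2013 10:05:00',
    'Incoming SMTP call from 10.0.0.2 at 10:05:00.',
    '>>> MAIL FROM:<bdole@foo.com>',
    '>>> QUIT'
)
$userInput = 'bdole'   # assumed search term

$entries = @()
$lastEnd = -1          # last line index already emitted, to avoid duplicate entries
for ($i = 0; $i -lt $lines.Count; $i++) {
    if ($i -le $lastEnd) { continue }
    if (-not $lines[$i].Contains($userInput)) { continue }

    # Walk back to the header line that opens this entry.
    $start = $i
    while ($start -gt 0 -and -not $lines[$start].StartsWith('----')) { $start-- }

    # Walk forward to the line just before the next header (or the end of the log).
    $end = $i
    while ($end + 1 -lt $lines.Count -and -not $lines[$end + 1].StartsWith('----')) { $end++ }

    $entries += ($lines[$start..$end] -join "`n")
    $lastEnd = $end
}
$entries
```

With this approach only plain string comparisons run per line, and the entry boundaries are found by scanning outwards from each hit.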

score 0 · Accepted Answer

Description

There are several problems with your expression:

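- By including ---- at both the beginning and the end of the match, you are likely to miss the next entry in the log (its leading ---- has already been consumed), and you will miss the last entry in the log.
- In your construct ((?!----).*?) it looks like you are trying to limit how much .*? matches. However, that syntax checks only once that the next 4 characters are not ----, and then .*? carries on matching freely. You would be better off replacing the construct with ((?:(?!----).)*). Because this construct is self-terminating, you do not need ? to prevent greediness. The bad news is that it is slightly less efficient than matching the known items on the first line with ([^\r\n]*?) and simply using (.*?)(?=^----|\Z) to match the body of the entry.
- Assuming your reliable ---- text is always at the start of a line, you could also include the beginning-of-line anchor ^.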
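The difference between the two constructs in the second point can be seen in a small experiment. This is a minimal sketch: the two-entry sample text and the variable names are invented for illustration.

```powershell
# Two made-up entries; only the second one contains the search term.
$text = "---- A log entry one`n---- B log entry two bdole`n"

# Original construct: the lookahead is tested once, then .*? is free to cross '----'.
$loose = [regex]'(?s)----((?!----).*?)bdole'
# Tempered construct: the lookahead is tested before every single character.
$tempered = [regex]'(?s)----((?:(?!----).)*)bdole'

$looseMatch = $loose.Match($text)
$temperedMatch = $tempered.Match($text)

"loose    starts at index $($looseMatch.Index)"      # 0: swallows entry A as well
"tempered starts at index $($temperedMatch.Index)"   # 21: starts at entry B only
```

The loose pattern begins its match at the first ---- and runs across the second delimiter, while the tempered pattern cannot cross a ---- and so starts at the entry that actually contains the term.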

(?m)^----\s(.*?)\s(log\sentry)\s(.*?)\s(mm\/dd\/yyyy\sHH:mm:ss)(?sm).*?^(.*?)(?=^----|\Z)
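One possible way to combine this with the user's search term is to match whole entries first and then filter them, instead of embedding the term in the pattern. The sketch below makes several assumptions: the pattern is a simplified variant of the one above (no capture groups), the sample log lines and $userInput are invented, and [regex]::Escape is used so that the term is treated literally.

```powershell
# Made-up sample log with two entries, joined with LF line endings.
$sample = @(
    '---- SMTPRS log entry made at 01/02/2013 10:00:00',
    'Incoming SMTP call from 10.0.0.1 at 10:00:00.',
    '>>> MAIL FROM:<bdole@foo.com>',
    '---- SMTPRS log entry made at 01/02/2013 10:05:00 ----',
    'Incoming SMTP call from 10.0.0.2 at 10:05:00.',
    '>>> QUIT'
) -join "`n"
$userInput = 'bdole'   # assumed search term

# Header line anchored at line start, then a lazy body that stops
# right before the next line-start '----' or at the end of the input.
$entryPattern = [regex]'(?sm)^----[^\r\n]*log\sentry.*?(?=^----|\Z)'

# Keep only the entries whose text contains the (escaped) search term.
$hits = @($entryPattern.Matches($sample) |
    Where-Object { $_.Value -match [regex]::Escape($userInput) })

$hits | ForEach-Object { $_.Value }
```

Because the lookahead does not consume the next ----, every entry, including the last one, is available as its own match before the filter is applied.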


Example

PowerShell example

$String = '---- 1 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 2 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 3 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----
---- 4 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 5 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 
---- 6 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
'
clear

[regex]$Regex = '(?m)^----\s(.*?)\s(log\sentry)\s(.*?)\s(mm\/dd\/yyyy\sHH:mm:ss)(?sm).*?^(.*?)(?=^----|\Z)'
# [regex]$Regex = '(?sm)----((?!----).*?)(log\sentry)((?!----).*?)(mm\/dd\/yyyy\sHH:mm:ss)((?!----).*?)'

# cycle through all matches
$intCount = 0
Measure-Command {
    $Regex.matches($String) | foreach {
            $intCount += 1
            Write-Host "[$intCount][0]=" $_.Groups[0].Value
            Write-Host "[$intCount][1]=" $_.Groups[1].Value
            Write-Host "[$intCount][2]=" $_.Groups[2].Value
            Write-Host "[$intCount][3]=" $_.Groups[3].Value
            Write-Host "[$intCount][4]=" $_.Groups[4].Value
            Write-Host "[$intCount][5]=" $_.Groups[5].Value

        } # next match
    } | select Milliseconds

Output

[1][0]= ---- 1 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[1][1]= 1 SMTPRS
[1][2]= log entry
[1][3]= made at
[1][4]= mm/dd/yyyy HH:mm:ss
[1][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[2][0]= ---- 2 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[2][1]= 2 SMTPRS
[2][2]= log entry
[2][3]= made at
[2][4]= mm/dd/yyyy HH:mm:ss
[2][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[3][0]= ---- 3 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----

[3][1]= 3 SMTPRS
[3][2]= log entry
[3][3]= made at
[3][4]= mm/dd/yyyy HH:mm:ss
[3][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----

[4][0]= ---- 4 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[4][1]= 4 SMTPRS
[4][2]= log entry
[4][3]= made at
[4][4]= mm/dd/yyyy HH:mm:ss
[4][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[5][0]= ---- 5 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 

[5][1]= 5 SMTPRS
[5][2]= log entry
[5][3]= made at
[5][4]= mm/dd/yyyy HH:mm:ss
[5][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 

[6][0]= ---- 6 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
[6][1]= 6 SMTPRS
[6][2]= log entry
[6][3]= made at
[6][4]= mm/dd/yyyy HH:mm:ss
[6][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 


Milliseconds
------------
16

Sadly, on my system this expression runs a bit slower, but I am not using your real data, so I am curious whether you see an improvement with it.

score 0 · Accepted Answer

You don't necessarily need a regular expression to parse a log like that. Something like this should work as well:

$userInput = "..."

$logfile = 'C:\path\to\your.log'

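$entry = $null
# Wrap the pipeline in @() so $log is always an array, even with zero or one entry
$log = @(Get-Content $logfile | % {
  # A line starting with '----' opens a new entry: emit the previous one first
  if ($_.StartsWith('----') -and $entry -ne $null) {
    "$entry"
    $entry = $null
  }
  $entry += "$_`n"
})
# Append the final entry, which has no following '----' line to flush it
$log += $entry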

$log | ? { $_ -match [regex]::Escape($userInput) }
