powershell - Searching many large text files in PowerShell

Question

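I frequently have to search server log files in a directory that may contain 50 or more files of 200+ MB each. I've written a function in PowerShell to do this searching. It finds and extracts all the values for a given query parameter. It works great on individual large files or on a collection of small files, but it totally bites in the circumstance above: a directory of large files.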

The function takes one parameter, which is the query parameter to search for.

In pseudo-code:

Take parameter (e.g. someParam or someParam=([^& ]+))
Create a regex (if one is not supplied)
Collect a directory list of *.log, pipe to Select-String
For each pipeline object, add the matchers to a hash as keys
Increment a match counter
Call GC
At the end of the pipelining: 
if (hash has keys) 
    enumerate the hash keys, 
    sort and append to string array
    set-content the string array to a file 
    print summary to console
    exit
else
    print summary to console
    exit

Here is a simplified version of the file processing:

$wtmatches = @{};
gci -Filter *.log | Select-String -Pattern $searcher |       
%{ $wtmatches[$_.Matches[0].Groups[1].Value]++; $items++; [GC]::Collect(); }

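I'm using an old Perl trick of de-duplicating found items by making them the keys of a hash. Perhaps this is a mistake, but a typical output of the processing is around 30,000 items at most. More commonly, the found items number in the low thousands. From what I can see, the number of keys in the hash does not affect processing time; it is the size and number of the files that break it. I recently threw in the GC call out of desperation. It has some positive effect, but it is marginal.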
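As a minimal sketch of the de-duplication trick described above, taken in isolation: the sample values below are made up and stand in for regex captures, and the variable names are illustrative rather than taken from the module.

```powershell
# Hypothetical sample values standing in for captured query-parameter values.
$captured = 'alpha', 'beta', 'alpha', 'gamma', 'beta'

$seen  = @{}   # hash whose keys are the distinct values
$total = 0     # running count of all matches, duplicates included

foreach ($value in $captured) {
    # Writing to an existing key updates it instead of adding a new entry,
    # so the key set stays unique. [int]$null evaluates to 0 on first sight.
    $seen[$value] = 1 + [int]$seen[$value]
    $total++
}

# Sorted list of distinct values, ready to be written with Set-Content.
$unique = @($seen.Keys | Sort-Object)
```

Here `$unique` holds each distinct value once while `$total` counts every occurrence, which mirrors the roles of `$wtmatches` and `$items` in the snippet above.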

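The issue is that with a large collection of large files, the processing sucks the RAM pool dry in about 60 seconds. Interestingly, it doesn't actually use much CPU, but it uses a lot of volatile storage. Once RAM usage has gone over 90%, I can just punch out and go watch TV. It can take hours to complete the processing and produce a file with 15,000 or 20,000 unique values.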

I would like advice or suggestions for increasing the efficiency, even if that means using a different paradigm to accomplish the processing. I went with what I know. I use this tool almost daily.

Oh, and I'm committed to using PowerShell. ;-) This function is part of a complete module I've written for my job, so suggestions of Python, Perl, or other useful languages are not helpful in this case.

Thanks.

mp

Update: Using latkin's ProcessFile function, I used the following wrapper for testing. His function is orders of magnitude faster than my original.

function Find-WtQuery {

<#
 .Synopsis
  Takes a parameter with a capture regex and a wildcard for files list.

 .Description
  This function is intended to be used on large collections of large files that have
  the potential to take an unacceptably long time to process using other methods. It
  requires that a regex capture group be passed in as the value to search for.

 .Parameter Target
  The parameter with capture group to find, e.g. WT.z_custom=([^ &]+).

 .Parameter Files
  The file wildcard to search, e.g. '*.log'

 .Outputs
  An object with an array of unique values and a count of total matched lines.
#>

    param(
        [Parameter(Mandatory = $true)] [string] $target,
        [Parameter(Mandatory = $false)] [string] $files
    )

    begin{
        $stime = Get-Date
    }
    process{
        $results = gci -Filter $files | ProcessFile -Pattern $target  -Group 1;
    }
    end{
        $etime = Get-Date;
        $ptime = $etime - $stime;
        Write-Host ("Processing time for {0} files was {1}:{2}:{3}." -f (gci -Filter $files).Count, $ptime.Hours, $ptime.Minutes, $ptime.Seconds);
        return $results;
    }
}

Output:

clients:\test\logs\global
{powem} [4] --> Find-WtQuery -target "WT.ets=([^ &]+)" -files "*.log"
Processing time for 53 files was 0:1:35.

Thanks to everyone for the comments and help.

score 2 · Accepted Answer

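Here's a function that should speed up the file-processing part and reduce its memory impact. It returns an object with two properties: the total count of matched lines, and a sorted array of the unique strings from the specified match group. (From your description, it sounds like you don't really care about the per-string counts, just the string values themselves.)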

function ProcessFile
{
   param(
      [Parameter(ValueFromPipeline = $true, Mandatory = $true)]
      [System.IO.FileInfo] $File,

      [Parameter(Mandatory = $true)]
      [string] $Pattern,

      [Parameter(Mandatory = $true)]
      [int] $Group
   )

   begin
   {
      $regex = new-object Regex @($pattern, 'Compiled')
      $set = new-object 'System.Collections.Generic.SortedDictionary[string, int]'
      $totalCount = 0
   }

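   process
   {
      $reader = $null
      try
      {
        # Stream the file one line at a time instead of loading it whole
        $reader = new-object IO.StreamReader $File.FullName

        while( $null -ne ($line = $reader.ReadLine()) )
        {
           $m = $regex.Match($line)
           if($m.Success)
           {
              # Dictionary keys de-duplicate the captured values
              $set[$m.Groups[$group].Value] = 1
              $totalCount++
           }
        }
      }
      finally
      {
         # Guard against the reader never having been created
         if ($reader) { $reader.Close() }
      }
   }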

   end
   {
      new-object psobject -prop @{TotalCount = $totalCount; Unique = ([string[]]$set.Keys)}
   }
}

You can use it like this:

$results = dir *.log | ProcessFile -Pattern 'stuff (capturegroup)' -Group 1
"Total matches: $($results.TotalCount)"
$results.Unique | Out-File .\Results.txt

score 2 · Accepted Answer

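IMO @latkin's approach is the way to go if you want to do this within PowerShell and not use a dedicated tool. I made a few changes, though, to make the command play better with respect to accepting pipeline input. I also changed the matching so that it finds all matches on a given line rather than only the first. Neither approach searches across multiple lines, although that scenario would be fairly easy to handle as long as the pattern only ever spans a few lines. Here's my take on the command (put it in a file called Search-File.ps1):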

[CmdletBinding(DefaultParameterSetName="Path")]
param(
    [Parameter(Mandatory=$true, Position=0)]
    [ValidateNotNullOrEmpty()]
    [string]
    $Pattern,

    [Parameter(Mandatory=$true, Position=1, ParameterSetName="Path", 
               ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $Path,

    [Alias("PSPath")]
    [Parameter(Mandatory=$true, Position=1, ParameterSetName="LiteralPath", 
               ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $LiteralPath,

    [Parameter()]
    [ValidateRange(0, [int]::MaxValue)]
    [int]
    $Group = 0
)

Begin 
{ 
    Set-StrictMode -Version latest 
    $count = 0
    $matched = @{}
    $regex = New-Object System.Text.RegularExpressions.Regex $Pattern,'Compiled'
}

Process 
{
    if ($psCmdlet.ParameterSetName -eq "Path")
    {
        # In the -Path (non-literal) case we may need to resolve a wildcarded path
        $resolvedPaths = @($Path | Resolve-Path | Convert-Path)
    }
    else 
    {
        # Must be -LiteralPath
        $resolvedPaths = @($LiteralPath | Convert-Path)
    }

    foreach ($rpath in $resolvedPaths) 
    {
        Write-Verbose "Processing $rpath"

        $stream = new-object System.IO.FileStream $rpath,'Open','Read','Read',4096
        $reader = new-object System.IO.StreamReader $stream
        try
        {
            while (($line = $reader.ReadLine())-ne $null)
            {
                $matchColl = $regex.Matches($line)
                foreach ($match in $matchColl)
                {
                    $count++
                    $key = $match.Groups[$Group].Value
                    if ($matched.ContainsKey($key))
                    {
                        $matched[$key]++
                    }
                    else
                    {
                        $matched[$key] = 1;
                    }
                }
            }
        }
        finally
        {
            $reader.Close()
        }
    }
}

End
{
    new-object psobject -Property @{TotalCount = $count; Matched = $matched}
}
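The difference between matching once per line and matching every occurrence on a line can be seen in a small sketch. The sample line is made up for illustration and is not taken from the logs discussed here.

```powershell
# Hypothetical log line containing two IP-like tokens.
$line  = '10.0.0.1 GET /index.html 200 via 10.0.0.2'
$regex = New-Object System.Text.RegularExpressions.Regex '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', 'Compiled'

# Match() stops at the first occurrence on the line.
$first = $regex.Match($line).Value

# Matches() returns every non-overlapping occurrence on the line.
$all = @($regex.Matches($line) | ForEach-Object { $_.Value })
```

With this input `$first` holds only the leading address, while `$all` holds both. Group 0 is the whole match, which is why the script can default `-Group` to 0 when the pattern has no capture group.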

I ran this script against my IIS log directory (8.5 GB and about 1000 files) to find all the IP addresses in all the logs, for example:

$r = ls . -r *.log | C:\Users\hillr\Search-File.ps1 '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

This took 27 minutes on my system and found 54356330 matches:

$r.Matched.GetEnumerator() | sort Value -Descending | select -f 20


Name                           Value
----                           -----
xxx.140.113.47                 22459654
xxx.29.24.217                  13430575
xxx.29.24.216                  13321196
xxx.140.113.98                 4701131
xxx.40.30.254                  53724
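For the multi-line case mentioned in this answer, one possible approach is a small sliding window over the most recent lines. The following is only a sketch under stated assumptions: the input lines, the pattern, and the `$window` size are all hypothetical, and it assumes a match never spans more than `$window` lines. In the script, the `foreach` over `$lines` would be the `ReadLine()` loop.

```powershell
# Hypothetical input; in the script these lines would come from $reader.ReadLine().
$lines = @(
    'BEGIN id=42'
    'status=ok'
    'noise'
    'BEGIN id=7'
    'status=fail'
)

# Assumed pattern spanning two lines: an id followed by a status on the next line.
$regex  = New-Object System.Text.RegularExpressions.Regex 'id=(\d+)\nstatus=(\w+)', 'Compiled'
$window = 2   # assumption: no match spans more lines than this
$buffer = New-Object 'System.Collections.Generic.Queue[string]'
$found  = New-Object 'System.Collections.Generic.List[string]'

foreach ($line in $lines) {
    # Keep only the most recent $window lines in memory.
    $buffer.Enqueue($line)
    if ($buffer.Count -gt $window) { [void]$buffer.Dequeue() }

    # Terminate every buffered line with a newline so offsets are easy to compute.
    $text = ($buffer.ToArray() -join "`n") + "`n"
    $lastLineStart = $text.Length - ($line.Length + 1)

    foreach ($m in $regex.Matches($text)) {
        # Record a match only when it reaches into the newest line, so that
        # overlapping windows do not report the same match twice.
        if (($m.Index + $m.Length) -gt $lastLineStart) {
            $found.Add(('{0}:{1}' -f $m.Groups[1].Value, $m.Groups[2].Value))
        }
    }
}
```

Memory use stays bounded by the window size rather than the file size. The trade-off is that the regex runs over each line up to `$window` times, and greedy patterns can behave differently on a window than they would on the whole file.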
