performance - How to search for multiple keywords in a directory with Perl

Question

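I am facing a serious performance problem. I need to search for keywords in the files under a huge directory (30 GB) that has about 600 subdirectories, each of which contains many subdirectories of its own.

Currently I split the list of subdirectories into 50 text files, so each file holds 12 subdirectory names, and I run all 50 processes in parallel.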

my $pm = Parallel::ForkManager->new($lines);

# Forks and returns the pid for the child:
my $pid = $pm->start and next;

# we are now in the child process
# $data is one of the split file names (text1, text2 ... text50);
# input.txt holds the search keywords (hi, hello)
ucm5("-iinput.txt", "-f$data");

$pm->finish; # Terminates the child process

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Getopt::Std;

sub ucm5 {
local @ARGV = @_;
#getting the input parameters
getopts('i:f:');

our($opt_i, $opt_f);
my $searchKeyword = $opt_i;                               #Search keyword file.
my $intfSplit = $opt_f;                               #split file
my $path = "C:/";                           #source directory
my $searchString;                                   #search keyword

open FH, ">>", "log.txt" or die "Cannot open log.txt: $!";    #open the log file to append

print FH "$intfSplit ". "started at ".(localtime)."\n";       #write the log file

open (FILE, "<", $intfSplit) or die "Cannot open $intfSplit: $!";   #open the split file to read

while(<FILE>){

   my $intf= $_;                                              #setting the interface to intf
   chomp($intf);
   my $dir = $path.$intf;
   chomp($dir);
   print "$dir \n";                                              
   open(INP, "<", $searchKeyword) or die "Cannot open $searchKeyword: $!";   #open the search keyword file to read

   while (<INP>){      

   $searchString =$_;                                         #setting the search keyword to string
   chomp($searchString);
   print "$searchString \n";
   #open the vobintfSplit_* file in append mode; ">" would truncate it for every keyword and directory
   open my $out, ">>", "vob$intfSplit.txt" or die $!;
#calling subroutine printFile to find and print the path of element


#the subroutine will search for the keyword and print the path if keyword is exist in file.
my $printFile = sub {
   my $element = $_;

   if (-f $element) {

      open my $in, "<", $element or die $!;
      while(<$in>) {
         if (/\Q$searchString\E/) {
            my $last_update_time = (stat($element))[9];
            my $timestamp  = localtime($last_update_time);
            print $out "$File::Find::name". "     $element"."     $timestamp". "     $searchString\n";
            last;
          }
        }
      }
    };
find($printFile, $dir);
  }
}
print FH "$intfSplit ". "ended at ".(localtime)."\n";         #write the log file
}
1;

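The code may be a little confusing, so let me explain what it does. The first while loop opens the text file that lists the subdirectories. Inside it, a second while loop opens the text file that contains the search keywords (hi, hello). Inside that loop, File::Find is called to search the subdirectory for the current keyword.

So it goes into the first subdirectory and searches for the first keyword (hi), and once that is done it goes into the same directory again and searches for the next keyword (hello). That means the same directory is read twice.

What I want is to search for both keywords during the first read, which would save a lot of time. My output needs the path, the file name, and the search keyword.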

Example

C:/aims/if/sp/abcd.sql abcd.sql hello

Please help me with this problem. Apart from parallel processing and threads, is there a better way to search all 600 subdirectories for multiple keywords?

score 0 · Accepted Answer

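If you want to match all of the keywords at once, you only need to scan each file once.

The code below searches for "hello" or "test" and prints each matching line of the input. It is probably worth running quotemeta() over the keywords so that any regular expression metacharacters embedded in a word are not treated as pattern syntax.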

while (<>) {
    if (/\b(hello|test)\b/i) { print "$1: $_" }
}

\b matches a word boundary (so the pattern will not match "testing"), and /i makes the whole test case-insensitive. The matched word ends up in $1.
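If the keywords come from a list (for example, the lines of input.txt) rather than being hard-coded, the alternation can be built at run time. The following is a minimal sketch of that idea; the helper name build_keyword_regex is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of literal keywords into one
# case-insensitive regex with a single capture group.
sub build_keyword_regex {
    my @keywords = @_;
    # quotemeta escapes regex metacharacters, so "a.b" only matches "a.b"
    my $alternation = join '|', map { quotemeta } @keywords;
    return qr/\b($alternation)\b/i;
}

my $re = build_keyword_regex('hello', 'test');
for my $line ("Hello world\n", "just testing\n", "a test here\n") {
    # $1 holds the keyword exactly as it appeared in the line
    print "$1: $line" if $line =~ $re;
}
```

Note that \b only behaves as intended for keywords that begin and end with a word character; for a keyword such as "c++" the trailing \b would have to be dropped.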

Edit: A complete example showing how to handle multiple matches.

Assume a sample file /tmp/test.txt and a Perl script match.pl (this example handles just the one file).

Test data:

This is a test
Line two
And this is line three

Script:

#!perl
use v5.10;
use strict;
use warnings;
use File::Slurp;
my $contents = read_file('/tmp/test.txt');

my @raw_matches = $contents =~ /(test|line)/gi;

my %match_counts;
foreach (@raw_matches) {
    $match_counts{ lc($_) }++;
}

my @unique_matches = sort keys %match_counts;
foreach (@unique_matches) {
    say "$_ : count = $match_counts{$_}";
}

Sample output:

line : count = 2
test : count = 1

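Now, this relies on the script being able to read the whole file into memory, but you did not mention huge files, only that there are a lot of files to process. There may be a slightly shorter way to express the example above, but hopefully it is clear what each part does.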
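If some files did turn out to be too large to slurp, the same counts could be collected one line at a time. This is only a sketch under that assumption: count_matches is a hypothetical helper name, and it uses core Perl file handling in place of File::Slurp.

```perl
use strict;
use warnings;

# Hypothetical helper: count keyword hits in one file, line by line,
# so only a single line is held in memory at any time.
sub count_matches {
    my ($file, $re) = @_;
    my %counts;
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        # /g in list context returns every captured match on the line
        $counts{ lc $_ }++ for $line =~ /$re/g;
    }
    close $in;
    return \%counts;
}

# Usage: perl match.pl /tmp/test.txt
if (@ARGV) {
    my $counts = count_matches($ARGV[0], qr/(test|line)/i);
    print "$_ : count = $counts->{$_}\n" for sort keys %$counts;
}
```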

It should be easy enough to turn the above into a function that you can call once per file.
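One possible shape for such a function, combined with File::Find so that every file under a directory is read once and checked against all keywords, is sketched below. The name scan_tree and the tab-separated output are assumptions for illustration, not part of the original script. It checks each keyword with index() rather than a single alternation, so that overlapping keywords (such as "hi" and "high") are each reported, and it reports a keyword at most once per file, mirroring the `last` in the question's code.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $dir once and return one record per
# (file, keyword) pair: [full path, file name, mtime string, keyword].
sub scan_tree {
    my ($dir, @keywords) = @_;
    my @hits;

    find(sub {
        return unless -f $_;    # File::Find has chdir'ed; $_ is the base name
        my $name = $_;
        open my $in, '<', $name
            or do { warn "Skipping $File::Find::name: $!\n"; return };
        my %pending = map { $_ => 1 } @keywords;    # keywords not yet seen
        my @found;
        while (my $line = <$in>) {
            for my $kw (sort keys %pending) {
                next if index($line, $kw) < 0;      # literal, case-sensitive
                push @found, $kw;
                delete $pending{$kw};
            }
            last unless %pending;    # every keyword found: stop reading
        }
        close $in;
        return unless @found;
        my $timestamp = localtime((stat $name)[9]);
        push @hits, [$File::Find::name, $name, $timestamp, $_] for @found;
    }, $dir);

    return @hits;
}

# Usage (script name is hypothetical): perl scan.pl C:/aims hi hello
if (@ARGV >= 2) {
    my ($dir, @keywords) = @ARGV;
    print join("\t", @$_), "\n" for scan_tree($dir, @keywords);
}
```

With this structure, adding more keywords costs extra substring checks per line but no extra passes over the directory tree.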
