perl - Find all occurrences of a string in a file and print their line numbers in Perl

Question

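I have a large file of 400,000 lines, and each line contains many keywords separated by tabs.

I also have a file that contains a list of keywords to match. Think of this file as a lookup table.

So for each keyword in the lookup table, I need to find all of its occurrences in the given file and print the line numbers of those occurrences.

I have tried this: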

#!usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

print "Enter the file path of lookup table:";
my $filepath1 = <>;

print "Enter the file path that contains keywords :";
my $filepath2 = <>;

open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;

open OUT, ">", "SampleLineNum.txt";

while( $line = <FILE1> )
{
    while( <FILE2> ) 
    {
        $linenum = $., last if(/$line/);
    }
    print OUT "$linenum ";
}

close FILE1;

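This gives me only the first occurrence of the keyword. But I need all occurrences, and the keyword must also match exactly.

The problem I am facing with exact matching is that, for example, I have the keywords "hello" and "helloworld".

If I need to match "hello", it also returns the line numbers of lines containing "hello world". The script should match only "hello" and give its line number.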

score 7 · Accepted Answer

Here is a solution that matches every occurrence of all keywords:

#!/usr/bin/perl
use strict;
use warnings;

#Lexical variable for filehandle is preferred, and always error check opens.
open my $keywords,    '<', 'keywords.txt' or die "Can't open keywords: $!";
open my $search_file, '<', 'search.txt'   or die "Can't open search file: $!";

my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
my $regex = qr|\b($keyword_or)\b|;

while (<$search_file>)
{
    while (/$regex/g)
    {
        print "$.: $1\n";
    }
}

keywords.txt:

hello
foo
bar

search.txt:

plonk
food is good
this line doesn't match anything
bar bar bar
hello world
lalalala
hello everyone

Output:

4: bar
4: bar
4: bar
5: hello
7: hello

Explanation:

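This creates a single regex that matches all of the keywords in the keywords file.

<$keywords> - used in list context, this returns a list of all lines in the file.

map {chomp;qr/\Q$_\E/} - this removes the newline from each line and applies the \Q...\E quote-literal regex operator to each line (this ensures that if you have a keyword like "foo.bar", the dot is treated as a literal character rather than as a regex metacharacter).

join '|', - joins the resulting list into a single string, separated by pipe characters.

my $regex = qr|\b($keyword_or)\b|; - creates a regex that looks like this:

/\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/

This regex matches any of the keywords. \b is the word-boundary marker, which ensures that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.

I have updated the solution to match every keyword on a given line and to match only whole words.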
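For anyone who wants to try the regex-building steps without creating the two files, here is a minimal sketch that works on in-memory lists. The sub names build_keyword_regex and find_matches are made up for this illustration, and it uses quotemeta, the function form of \Q...\E, instead of interpolating qr// objects.

```perl
use strict;
use warnings;

# Hypothetical helper: build one whole-word regex from a list of keywords.
sub build_keyword_regex {
    my @keywords = @_;
    # quotemeta escapes regex metacharacters, like \Q...\E does
    my $alternation = join '|', map { quotemeta($_) } @keywords;
    return qr/\b($alternation)\b/;
}

# Hypothetical helper: return [line_number, keyword] pairs for every match.
sub find_matches {
    my ($regex, @lines) = @_;
    my @hits;
    my $line_number = 0;
    for my $line (@lines) {
        $line_number++;
        # /g in a while loop walks through every match on the line
        while ($line =~ /$regex/g) {
            push @hits, [ $line_number, $1 ];
        }
    }
    return @hits;
}

my $regex = build_keyword_regex(qw(hello foo bar));
my @lines = ('plonk', 'food is good', 'bar bar bar', 'hello world');
print "$_->[0]: $_->[1]\n" for find_matches($regex, @lines);
```

Keeping the regex construction in its own sub makes it easy to check which keywords it accepts before running it over a 400,000-line file.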

score 6 · Accepted Answer

Is this part of something bigger? Because this is a one-liner with grep:

grep -n hello filewithlotsalines.txt

grep -n "hello world" filewithlotsalines.txt

-n makes grep print the line number before each matching line. You can run man grep for more options.

I am assuming here that you are on a Linux or *nix system.

score 1 · Accepted Answer

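I have a different interpretation of your request. It may be that you want to maintain a list of the line numbers at which particular entries from the lookup table are found on lines of the "keywords" file. Here is a sample lookup table: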

hello world
hello
perl
hash
Test
script

In the tab-delimited "keywords" file, a single line may contain multiple keywords:

programming tests
hello   everyone
hello   hello world perl
scripting   scalar
test    perl    script
hello world perl    script  hash

Given the above, consider the following solution:

use strict;
use warnings;

my %lookupTable;

print "Enter the file path of lookup table: \n";
chomp( my $lookupTableFile = <> );

print "Enter the file path that contains keywords: \n";
chomp( my $keywordsFile = <> );

open my $ltFH, '<', $lookupTableFile or die $!;

while (<$ltFH>) {
    chomp;
    undef @{ $lookupTable{$_} };
}

close $ltFH;

open my $kfFH, '<', $keywordsFile or die $!;

while (<$kfFH>) {
    chomp;
    for my $keyword ( split /\t+/ ) {
        push @{ $lookupTable{$keyword} }, $. if defined $lookupTable{$keyword};
    }
}

close $kfFH;

open my $slFH, '>', 'SampleLineNum.txt' or die $!;

print $slFH "$_: @{ $lookupTable{$_} }\n"
  for sort { lc $a cmp lc $b } keys %lookupTable;

close $slFH;

print "Done!\n";

Output in SampleLineNum.txt:

hash: 6
hello: 2 3
hello world: 3 6
perl: 3 5 6
script: 5 6
Test:

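The script uses a hash of arrays (HoA), where each key is a lookup-table entry and the associated value is a reference to a list of the line numbers at which that entry was found on lines of the "keywords" file. The hash %lookupTable is initialized with references to empty lists.

Each line of the "keywords" file is split on the delimiting tabs, and if a corresponding entry is defined in %lookupTable, the line number is pushed onto the corresponding list. When done, the keys of %lookupTable are sorted case-insensitively and written out to SampleLineNum.txt along with the corresponding list of line numbers (if any) where the entry was found.

There is no sanity checking on the file names entered, so consider adding that.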
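One possible shape for that sanity check is sketched below, using Perl's built-in file test operators (-e, -f, -r). The sub name check_input_file and its messages are invented for this example; adapt them to whatever the script should do on bad input.

```perl
use strict;
use warnings;

# Hypothetical helper: return an error message if $path is unusable,
# or an empty string if it looks like a readable plain file.
sub check_input_file {
    my ($path) = @_;
    return 'no file name given'          unless defined $path && length $path;
    return "'$path' does not exist"      unless -e $path;
    return "'$path' is not a plain file" unless -f $path;
    return "'$path' is not readable"     unless -r $path;
    return '';
}

# Example: check this script's own path ($0) and a name that is
# assumed not to exist in the current directory.
for my $candidate ($0, 'no-such-file.txt') {
    my $problem = check_input_file($candidate);
    print $problem ? "Rejected: $problem\n" : "OK: $candidate\n";
}
```

In the script above, this would be called right after each chomp, for example to die with the returned message before attempting to open the file.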

Hope this helps!

score 0 · Accepted Answer

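To find all occurrences, you need to read in the keywords and then loop through them to look for matches on each line. Here is your code, modified to use an array to find keywords in a line. I also added a counter for the line numbers, and the line number is output when there is a match. The code also prints every keyword it read in, whether or not that keyword ever matches.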

#!/usr/bin/perl
use strict;
use warnings;

print "Enter the file path of lookup table:";
chomp( my $filepath1 = <STDIN> );

print "Enter the file path that contains keywords :";
chomp( my $filepath2 = <STDIN> );

# Lexical filehandles, three-argument open, and an error check on each open
open my $file1, '<', $filepath1 or die "Can't open $filepath1: $!";
open my $file2, '<', $filepath2 or die "Can't open $filepath2: $!";

# Read in all of the keywords
my @keywords = <$file2>;

# Close the file2
close $file2;

# Remove the line returns from the keywords
chomp @keywords;

# Sort and reverse the items to compare the maximum length items
# first (hello there before hello)
@keywords = reverse sort @keywords;

foreach my $k ( @keywords )
{
  print "$k\n";
}
open my $out, '>', 'SampleLineNum.txt' or die "Can't open SampleLineNum.txt: $!";
# Counter for the lines in the file
my $count = 0;
while ( my $line = <$file1> )
{
    # Increment the counter for the number of lines
    $count++;
    # loop through the keywords to find matches
    foreach my $k ( @keywords )
    {
        # If there is a match, write the line number to the output
        # file and use last to exit the loop and go to the
        # next line. \Q...\E treats the keyword as literal text.
        if ( $line =~ m/\Q$k\E/ )
        {
            print {$out} "$count\n";
            last;
        }
    }
}

close $file1;
close $out;

score 0 · Accepted Answer

I think there are other questions like this one that you can check out.

The File::Grep module is interesting.

score 0 · Accepted Answer

Since others have already provided some Perl solutions, I would suggest that you could possibly use awk here.

> cat temp
abc
bac
xyz

> cat temp2
abc     jbfwerf kfnm
jfjkwebfkjwe    bac     xyz
ndwjkfn abc kenmfkwe    bac     xyz

> awk 'FNR==NR{a[$1];next}{for(i=1;i<=NF;i++)if($i in a)print $i,FNR}' temp temp2
abc 1
bac 2
xyz 2
abc 3
bac 3
xyz 3
>
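For comparison, here is a rough Perl sketch of the same idea as the awk one-liner: load the keywords into a hash, split each line into fields, and report every field that is a known keyword together with its line number. The sub name keyword_line_numbers is made up for this example, and the sample data is copied from temp and temp2 above.

```perl
use strict;
use warnings;

# Hypothetical helper: for each whitespace-separated field that is in the
# keyword list, return a "KEYWORD LINENO" string, mirroring the awk output.
sub keyword_line_numbers {
    my ($keywords, $lines) = @_;    # two array references
    my %wanted;
    $wanted{$_} = 1 for @$keywords;
    my @out;
    for my $i (0 .. $#$lines) {
        # split ' ' splits on runs of whitespace, like awk's default
        for my $field (split ' ', $lines->[$i]) {
            push @out, "$field " . ($i + 1) if $wanted{$field};
        }
    }
    return @out;
}

my @keywords = qw(abc bac xyz);
my @lines = (
    "abc     jbfwerf kfnm",
    "jfjkwebfkjwe    bac     xyz",
    "ndwjkfn abc kenmfkwe    bac     xyz",
);
print "$_\n" for keyword_line_numbers(\@keywords, \@lines);
```

Like the awk version, this splits on any whitespace, so a multi-word keyword such as "hello world" would never match. For tab-delimited data with such keywords, splitting on /\t/ instead would probably be the better choice.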
