regex - Count the occurrences of every word in a file, or in all files in a directory, using Perl

Question

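I am trying to write a Perl script that takes three arguments.

The first argument is the input file or directory.
- If it is a file, count the number of occurrences of every word.
- If it is a directory, recursively go through each directory and count the occurrences of every word in the files within those directories.
The second argument is a number that says how many of the most frequent words to display.
- This prints only the number for each word to the console.
Print them to an output file, which is the third argument on the command line.

It does seem to work as far as recursively searching the directories, finding all occurrences of the words in the files, and printing them to the console.

How can I print these to the output file? And how do I take the second argument (say, 5) and print that many of the most frequent words to the console while printing the words to the output file?

Below is what I have so far: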

#!/usr/bin/perl -w

use strict;

search(shift);

my $input  = $ARGV[0];
my $output = $ARGV[1];
my %count;

my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";
if ( -f $filename ) {
    print("This is a file!\n");
    while ( my $line = <$filename> ) {
        chomp $line;
        foreach my $str ( $line =~ /\w+/g ) {
            $count{$str}++;
        }
    }
    foreach my $str ( sort keys %count ) {
        printf "%-20s %s\n", $str, $count{$str};
    }
}
close($filename);
if ( -d $input ) {

    sub search {
        my $path = shift;
        my @dirs = glob("$path/*");
        foreach my $filename (@dirs) {
            if ( -f $filename ) {
                open( FILE, $filename ) or die "ERROR: Can't open file";
                while ( my $line = <FILE> ) {
                    chomp $line;
                    foreach my $str ( $line =~ /\w+/g ) {
                        $count{$str}++;
                    }
                }
                foreach my $str ( sort keys %count ) {
                    printf "%-20s %s\n", $str, $count{$str};
                }
            }
            # Recursive search
            elsif ( -d $filename ) {
                search($filename);
            }
        }
    }
}

score 0 · Accepted Answer

I figured it out. Below is my solution. I am not sure it is the best way to do it, but it works.

    use strict;
    use warnings;

    # Check that there are three arguments on the command line
    if (@ARGV < 3) {
       die "ERROR: There must be three arguments!\n";
    }
    # INPUT (file or directory), number of top words to show, OUTPUT file
    my ($input, $num, $output) = @ARGV;
    my %count;

    # Check if the INPUT is a file
    if (-f $input) {
       print("This is a file!\n");
       # Open the file
       open my $fh, '<', $input or die "ERROR: Could not open $input: $!\n";
       # Go through each line
       while (my $line = <$fh>) {
          chomp $line;
          # Count the occurrences of each word (/g returns every match on the line)
          foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
             $count{$str}++;
          }
       }
       # Close the file
       close($fh);
    }
    # Check if the INPUT is a directory
    elsif (-d $input) {
       # Call subroutine to search directory recursively
       search_dir($input);
    }

    my $high_count = 0;
    # Open the output file
    open my $fileh, '>', $output or die "ERROR: Could not open $output: $!\n";
    # Sort the words by descending count; print the top $num to the
    # console and every word to the output file
    foreach my $str (sort { $count{$b} <=> $count{$a} } keys %count) {
       $high_count++;
       if ($high_count <= $num) {
          printf "%-31s %s\n", $str, $count{$str};
       }
       printf $fileh "%-31s %s\n", $str, $count{$str};
    }
    close($fileh);
    exit;

    # Subroutine to search through each directory recursively
    sub search_dir {
       my $path = shift;
       my @dirs = glob("$path/*");
       # Loop through filenames
       foreach my $filename (@dirs) {
          # Check if it is a file
          if (-f $filename) {
             # Open the file
             open(FILE, '<', $filename) or die "ERROR: Can't open $filename: $!\n";
             # Go through each line
             while (my $line = <FILE>) {
                chomp $line;
                # Count the occurrences of each word
                foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
                   $count{$str}++;
                }
             }
             # Close the file
             close(FILE);
          }
          elsif (-d $filename) {
             search_dir($filename);
          }
       }
    }

score 0 · Accepted Answer

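I would suggest restructuring your program/script. What you posted is hard to follow. A few comments would help in tracking what is going on. I will walk through how I would organize things, with some code snippets that will hopefully help explain the items. I will go over the three items you described in your question.

Since the first argument can be a file or a directory, use -f and -d to check which one the input is. I would use a list/array to hold the files to be processed. If the input is only a file, just push it onto the processing list. Otherwise, call a routine that returns the list of files to process (similar to your search subroutine). Something like: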

# List of files to process
my @fileList = ();
# if input is only a file
if ( -f $ARGV[0] )
{
  push @fileList,$ARGV[0];
}
# If it is a directory
elsif ( -d $ARGV[0] ) 
{
   @fileList = search($ARGV[0]);
}

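So your search subroutine needs a list/array onto which it pushes the items that are files, and it returns that array from the subroutine (after processing the list of files from the glob call). If an item is a directory, call search with its path (just as you currently do) and push the elements it returns onto the current array, like this: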

# If it is a file, save it to the list to be returned
if ( -f $filename ) 
{
  push @returnValue,$filename;
}
# else if a directory, get the files from the directory and 
# add them to the list to be returned
elsif ( -d $filename )
{
  push @returnValue, search($filename);
}
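
Put together, the recursive routine might look like the following minimal sketch. It keeps the `search` name and the `glob` approach from the question, and `@returnValue` follows the fragment above; the rest is an assumption made for illustration. Note that `glob("$path/*")` skips dot-files and treats whitespace in `$path` as a pattern separator, so such paths would need different handling.

```perl
use strict;
use warnings;

# Return a list of all regular files found under $path, recursing into
# subdirectories. A sketch only: glob() ignores dot-files and splits
# the pattern on whitespace.
sub search {
    my ($path) = @_;
    my @returnValue;
    foreach my $filename ( glob("$path/*") ) {
        if ( -f $filename ) {
            # Regular file: remember it
            push @returnValue, $filename;
        }
        elsif ( -d $filename ) {
            # Directory: append whatever the recursive call finds
            push @returnValue, search($filename);
        }
    }
    return @returnValue;
}

# Usage: list every file under the directory given as the first argument
if ( @ARGV && -d $ARGV[0] ) {
    print "$_\n" for search( $ARGV[0] );
}
```

Returning the list instead of counting inside the recursion keeps the directory walk separate from the word counting, so each part can be tested on its own.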

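Once you have the list of files, loop through it and process each file (open it, read the lines, process each line for words, close it). Your foreach loop for processing each line works correctly. However, if words have periods, commas, or other punctuation attached, you may want to strip those before counting the word in the hash.

For the next part, you asked about determining the words with the highest counts. For that, you want to build another hash keyed on the count (for each word), where the value is a list/array of the words associated with that count. Something like: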
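
The processing loop could be sketched as follows. `count_words` is a hypothetical helper name, and lower-casing each word is an assumption that the question does not ask for. Matching `[[:alpha:]]+` picks out runs of letters only, so attached punctuation never ends up in a hash key (it also means a word like "don't" is counted as "don" and "t").

```perl
use strict;
use warnings;

# Count every word in the given files and return a reference to a hash
# mapping word => number of occurrences.
sub count_words {
    my (@files) = @_;
    my %counts;
    foreach my $file (@files) {
        open my $fh, '<', $file or die "ERROR: Could not open $file: $!\n";
        while ( my $line = <$fh> ) {
            # Runs of letters only, so periods, commas and other
            # punctuation are never part of a word
            foreach my $word ( $line =~ /[[:alpha:]]+/g ) {
                $counts{ lc $word }++;    # assumption: "The" and "the" are one word
            }
        }
        close $fh;
    }
    return \%counts;
}

# Usage: count the words in the files named on the command line
my $counts = count_words(@ARGV);
printf "%-20s %d\n", $_, $counts->{$_} for sort keys %$counts;
```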

# Hash with key being a number and value a list of words for that number
my %totals= ();
# Temporary variable to store occurrences (counts) of the word
my $wordTotal;
# $w is the words in the counts hash
foreach my $w ( keys %counts ) 
{
  # Get the counts for the word
  $wordTotal = $counts{$w};
  # Each value of %totals is an array reference, so de-reference it with @{ }
  # and push the word onto the array of words that share this count
  push @{ $totals{$wordTotal} }, $w;
}

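To get the words with the highest counts, take the keys of the totals hash, sort them numerically, and reverse the sorted list so the highest counts come first. Because each value is an array of words, you have to count each item you output in order to stop after the N highest.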

# Number of items outputted
my $current = 0;
# sort the total (keys) and reverse the list so the highest values are first
# and go through the list
foreach my $t ( reverse sort { $a <=> $b} keys %totals) # Use the numeric 
                                                        # comparison in 
                                                        # the sort 
{
   # Since each value of total hash is an array of words,
   # loop through that array for the values and print out the number 
   foreach my $w ( sort @{ $totals{$t} } )
   {
     # Print the number for the count of words
     print "$t\n";
     # Increment the number output
     $current++;
     # if this is the number to be printed, we are done
     last if ( $current == $ARGV[1] );
   }
   # if this is the number to be printed, we are done
   last if ( $current == $ARGV[1] );
}
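
As a point of comparison, the same top-N selection can also be done without the intermediate hash by sorting the words directly by their counts. This is a different technique from the one above, shown only as a sketch: `top_words` is a hypothetical helper name, the counts are made up, and breaking ties alphabetically is an assumption.

```perl
use strict;
use warnings;

# Return the $n most frequent words from a word => count hash reference,
# highest count first; ties are broken alphabetically (an assumption).
sub top_words {
    my ( $counts, $n ) = @_;
    my @sorted =
      sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
    $n = @sorted if $n > @sorted;    # never ask for more words than exist
    return @sorted[ 0 .. $n - 1 ];
}

# Example with made-up counts: prints the, cat and dog with their counts
my %counts = ( the => 5, cat => 2, dog => 2, end => 1 );
printf "%-10s %d\n", $_, $counts{$_} for top_words( \%counts, 3 );
```

The inverted hash groups words that share a count, which is handy if ties should be reported together; the direct sort is shorter when only a flat top-N list is needed.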

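For the third part, output to a file, it is unclear from your question what "them" refers to (words, counts, or both; limited to the top words or all of the words). I will leave the work of opening the file, printing the information to it, and closing the file to you.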
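
For readers who want a starting point, one possible shape for that last step is sketched below. It assumes "them" means every word together with its count, most frequent first; `write_counts` is a hypothetical helper name, and the `%-31s` column width is borrowed from the other answer.

```perl
use strict;
use warnings;

# Write every word and its count to $output, most frequent first.
# Assumption: the file should list all words, not only the top N.
sub write_counts {
    my ( $output, $counts ) = @_;
    open my $out, '>', $output or die "ERROR: Could not open $output: $!\n";
    foreach my $word (
        sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts )
    {
        printf {$out} "%-31s %d\n", $word, $counts->{$word};
    }
    close $out or die "ERROR: Could not close $output: $!\n";
    return;
}
```

Called as `write_counts( $ARGV[2], \%counts )`, it would write one line per word to the file named by the third argument.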
