perl - Counting non-printable characters with Perl

Question

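I have 100,000 files that I would like to analyze. Specifically, I would like to calculate the percentage of printable characters in a sample of each file, whatever its size. Some of these files come from mainframes, Windows, Unix, etc., so they are likely to contain binary and control characters.

I started with the Linux "file" command, but it did not provide enough detail for my purposes. The following code conveys what I am trying to do, but it does not always work.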

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

Here is a test call that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

This is how I intend to call it, and it works for a single file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

Instead of running the script once for each file returned by find, it runs ONCE for ALL of the results.

Thanks in advance.

Research so far:

Pipes, xargs, and separators

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem

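Clarifications:
1.) Desired output: if a directory contains 932 files, the output would be a 932-line list giving, for each file, the file name, the total number of bytes read from it, and the % of those bytes that are printable characters.
2.) Many of the files are binary. The script needs to cope with embedded binary EOL or EOF sequences.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I have been trying head -c 256 and tail -c 128 to read the first 256 bytes or the last 128 bytes, respectively. The solution can either work in a pipeline or limit the bytes inside the Perl script.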

score 4 · Accepted Answer

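The -n option wraps your entire code in a while (defined($_ = <ARGV>)) { ... } block. This means that your my $cnt_print and other variable declarations are repeated for every line of input, essentially resetting all of your variable values.

The workaround is to use global variables (declare them with our if you want to keep using use strict), and not to initialize them to 0, since they would be reinitialized for every line of input. You could say something like

    our $cnt_print //= 0;

if you don't want $cnt_print and its friends to be undefined on the first line of input.

See this recent question, which has a similar problem.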
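To make the difference concrete, here is a minimal sketch that imitates what -n does by writing the loop out by hand. The helper names count_with_my and count_with_our are hypothetical and exist only for this illustration; the in-memory filehandle stands in for the real input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $cnt_total;    # package variable: survives from one loop iteration to the next

# Hypothetical helper: a loop body that declares its counter with "my",
# which is what the original script effectively does under -n.
sub count_with_my {
    my ($text) = @_;
    open(my $in, '<', \$text) or die "Can't open in-memory handle: $!";
    my $seen;
    while (defined(my $line = <$in>)) {
        my $cnt = 0;              # re-declared and reset on every line
        $cnt += length $line;
        $seen = $cnt;             # only ever holds the current line's length
    }
    return $seen;
}

# Hypothetical helper: the same loop using a package variable and //=.
sub count_with_our {
    my ($text) = @_;
    $cnt_total = undef;           # start fresh for this demonstration
    open(my $in, '<', \$text) or die "Can't open in-memory handle: $!";
    while (defined(my $line = <$in>)) {
        $cnt_total //= 0;         # assigns 0 only while still undefined
        $cnt_total += length $line;
    }
    return $cnt_total;
}

print count_with_my("ab\ncd\n") . "\n";     # prints 3 (last line only)
print count_with_our("ab\ncd\n") . "\n";    # prints 6 (both lines)
```

In a real -n script the final totals would typically be printed from an END block, since the body itself runs once per line.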

score 1 · Accepted Answer

You can have find pass one argument at a time.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} \;

But I would keep passing multiple files at a time, either with xargs or with GNU find's -exec +.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} +

The following code snippets support both.

You can keep reading one line at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $cnt_total   = 0;
    my $cnt_n_print = 0;

    while (<>) {
        $cnt_total += length;
        ++$cnt_n_print while /[^[:print:]]/g;
    } continue {
        if (eof) {
            my $cnt_print = $cnt_total - $cnt_n_print;
            my $prc_print = $cnt_print/$cnt_total;

            print "$ARGV: $cnt_total|$prc_print\n";

            $cnt_total   = 0;
            $cnt_n_print = 0;
        }
    }

Or you can read a whole file at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    local $/;
    while (<>) {
        my $cnt_n_print = 0;
        ++$cnt_n_print while /[^[:print:]]/g;

        my $cnt_total = length;
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print/$cnt_total;

        print "$ARGV: $cnt_total|$prc_print\n";
    }
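Since the question asks to look at only the first few bytes of each large file, a variation on the loop above may help. The sketch below sets $/ to a reference to an integer, which makes <> return fixed-size records instead of lines, and closes ARGV after the first record so that the next read moves on to the next file. The helper name sample_files and the row layout are hypothetical, and the sketch assumes no default I/O layers are in effect, so that the data is read as raw bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one [name, bytes_read, fraction_printable]
# row per file, looking only at the first $limit bytes of each file.
sub sample_files {
    my ($limit, @files) = @_;
    return () unless @files;       # never fall back to reading STDIN

    my @rows;
    local @ARGV = @files;          # <> iterates over @ARGV
    local $/    = \$limit;         # read records of $limit bytes, not lines

    while (defined(my $chunk = <>)) {
        # count the non-printable characters in the sampled bytes
        my $cnt_n_print = () = $chunk =~ /[^[:print:]]/g;
        my $cnt_total   = length $chunk;
        my $prc_print   = $cnt_total
                        ? ($cnt_total - $cnt_n_print) / $cnt_total
                        : 0;
        push @rows, [$ARGV, $cnt_total, $prc_print];

        close ARGV;                # skip the rest of this file
    }
    return @rows;
}

# Usage: perl this_sketch.pl file1 file2 ...
print join('|', @$_), "\n" for sample_files(2000, @ARGV);
```

It can be called the same way as the scripts above, for example with find ... -exec perl this_sketch.pl {} +.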

score 0 · Accepted Answer

Here is my working solution, based on the feedback provided.

I would appreciate any further feedback on form or on more efficient approaches.

    #!/usr/bin/perl

    use strict;
    use warnings;

    # This program receives one or more file paths on the command line.
    # For each file it attempts to read the first 2000 bytes.
    # The output is one line per file: the file name, the number of bytes
    # actually read, and the fraction of those bytes that are
    # ASCII "printable" aka [\x20-\x7E].

    die "Pass the file name on the command line.\n" unless @ARGV;

    # loop through each file (do not shift @ARGV while iterating over it,
    # or every other file is skipped)
    foreach my $file_name (@ARGV) {

       # open the file read only (three-argument open, mode "<")
       open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";

       # open each file in binary mode to handle non-printable characters
       binmode $fh;

       # try to read 2000 bytes, save the results in $data and the
       # actual number of bytes read in $n_bytes
       my $data = '';
       my $n_bytes = read $fh, $data, 2000;
       die "Can't read $file_name: $!" unless defined $n_bytes;

       # count the number of non-printable characters
       my $cnt_n_print = 0;
       ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

       my $cnt_print = $n_bytes - $cnt_n_print;

       # an empty file yields 0 bytes; avoid dividing by zero
       my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;

       print "$file_name|$n_bytes|$prc_print\n";
       close($fh);
    }

Here is a sample of how to call the script above:

    find /some/path/to/files/ -type f -exec perl this_script.pl {} +
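The question also mentions sampling the last bytes of a file (the equivalent of tail -c 128). A minimal sketch of that, using seek before read, might look like the following. The helper names read_tail and printable_fraction are hypothetical, and counting with tr over the range \x20-\x7E is an assumption that matches the "ASCII printable" definition in the script's comments rather than [[:print:]] under other settings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the last $n bytes of $path
# (or the whole file if it is shorter than $n bytes).
sub read_tail {
    my ($path, $n) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode $fh;

    my $size  = -s $fh;                        # file size in bytes
    my $start = $size > $n ? $size - $n : 0;   # where the sample begins
    seek($fh, $start, 0) or die "Can't seek in $path: $!";

    my $data = '';
    my $got  = read($fh, $data, $n);
    die "Can't read $path: $!" unless defined $got;
    close($fh);
    return $data;
}

# Hypothetical helper: fraction of bytes in the ASCII printable range.
sub printable_fraction {
    my ($data) = @_;
    my $total = length $data;
    return 0 unless $total;                    # empty sample: avoid dividing by zero
    my $cnt_print = ($data =~ tr/\x20-\x7E//); # tr returns the number of matches
    return $cnt_print / $total;
}

# Usage: perl this_sketch.pl file1 file2 ...
for my $path (@ARGV) {
    my $tail = read_tail($path, 128);
    print join('|', $path, length($tail), printable_fraction($tail)), "\n";
}
```

Because tr counts in a single pass without invoking the regex engine, it may also be somewhat faster than the match loop on large samples, though that has not been measured here.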

Here is a list of references I found helpful:

POSIX bracket expressions
 Opening files with binmode
 The read function
 Opening a file read-only
