regex - Grepping through a 20 GB file with bash

Question

A code performance question: I'm trying to run ~25 regex rules against a ~20 GB text file. The script should write the matches to text files, and each regex rule should produce its own file. See the pseudocode below.

regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
    while read line
    # Each $line in the looped-through file contains a regex rule, e.g.,
    # egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
    # $rname is a unique rule name generated by a separate bash function
    # exported to the current shell.
        do
        cmd="$line $tmp > ~/outputdir/$tmp.$rname.filter.piped &"
        eval $cmd
    done < $regex_rules
done

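A couple of thoughts:

Is there a way to loop through the text file just once, evaluating all the rules and splitting the matches into individual files in a single pass? Would that be faster?
Is there a different tool I should be using for this job?

Thanks.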

score 5 · Accepted Answer

This is why grep has the -f option. Reduce your regexrulefile.txt to just the regexes, one per line, and run:

egrep -f regexrulefile.txt the_big_file

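This produces all the matches in a single output stream, but you can run your loop over that output afterwards to separate them. Assuming the combined list of matches is not huge, this will be a performance win.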
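As a rough sketch of that two-pass idea (not necessarily the exact commands the answer had in mind): the first `grep -E -f` reads the big file once, and the per-rule loop then rereads only the combined matches. The names `rules.txt`, `big.txt`, `combined.txt` and `ruleN.out` are placeholders, and the tiny sample data stands in for the real 20 GB input. It assumes the rule file has already been reduced to bare regexes, and it adds `-i` because the original rules used `egrep -i`.

```shell
#!/usr/bin/env bash
# Sketch: one pass over the big file with every rule at once, then
# per-rule passes over the (hopefully much smaller) combined match file.
# All file names here are placeholders for illustration.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Tiny stand-ins for the rule file and the 20 GB input.
printf '%s\n' '(^| )justin ?bieber' '(^| )selena ?gomez' > rules.txt
printf '%s\n' 'Justin Bieber on tour' 'nothing here' 'hello selenagomez' > big.txt

# Pass 1: a single read of the big file, keeping lines that match any rule.
grep -E -i -f rules.txt big.txt > combined.txt

# Pass 2: split the combined matches by rule; this only rereads combined.txt.
n=0
while IFS= read -r rule; do
    n=$((n + 1))
    # grep exits with status 1 when nothing matches, which is not an error here
    grep -E -i -e "$rule" combined.txt > "rule$n.out" || true
done < rules.txt
```

The second pass still runs one grep per rule, so the gain depends on `combined.txt` being much smaller than the original file.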

score 2 · Accepted Answer

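I did something similar with lex. Of course, it runs every other day, so YMMV. It is very fast, even on files of several hundred megabytes on a remote Windows share; processing takes only a few seconds. I don't know how comfortable you are hacking up a quick C program, but I've found this to be the fastest and easiest solution for large-scale regex problems.

Parts redacted to protect the guilty: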

    /************************************************** 
        start of definitions section

    ***************************************************/


%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <getopt.h>
#include <errno.h>

char inputName[256];
// static insert variables

//other variables
char tempString[256];
char myHolder[256];
char fileName[256];
char unknownFileName[256];
char stuffFileName[256];
char buffer[5];

/* we are using pointers to hold the file locations, and allow us to dynamically open and close new files */
/* also, it allows us to obfuscate which file we are writing to, otherwise this couldn't be done */

FILE *yyTemp;
FILE *yyUnknown;
FILE *yyStuff;

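// flags for command line options
static int help_flag = 0;

// number of lines that matched no rule (incremented in the \n rule below)
static int unknownCount = 0;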

%}

%option 8bit 
%option nounput nomain noyywrap 
%option warn

%%
    /************************************************ 
        start of rules section
    *************************************************/


(\"A\",\"(1330|1005|1410|1170)\") { 
    strcat(myHolder, yytext);
    yyTemp = &(*yyStuff);
} //stuff files

. { strcat(myHolder, yytext); }

\n  {
    if (&(*yyTemp) == &(*yyUnknown))
        unknownCount += 1;
    strcat(myHolder, yytext); 
    //print to file we are pointing at, whatever it is
    fprintf(yyTemp, "%s", myHolder);
    strcpy(myHolder, "");
    yyTemp = &(*yyUnknown);
}

<<EOF>> {
    strcat(myHolder, yytext); 
    fprintf(yyTemp, "%s", myHolder);
    strcpy(myHolder, "");
    yyTemp = &(*yyUnknown);

    yyterminate();
}

%%
    /**************************************************** 
        start of code section


    *****************************************************/


int main(int argc, char **argv)
{
    /****************************************************
        The main method drives the program. It gets the filename from the
        command line, and opens the initial files to write to. Then it calls the lexer.
        After the lexer returns, the main method finishes out the report file,
        closes all of the open files, and prints out to the command line to let the
        user know it is finished.
    ****************************************************/

    int c;

    // the gnu getopt library is used to parse the command line for flags
    // afterwards, the final option is assumed to be the input file

    while (1) {
        static struct option long_options[] = {
            /* These options set a flag. */
            {"help",   no_argument,     &help_flag, 1},
            /* These options don't set a flag. We distinguish them by their indices. */
            {0, 0, 0, 0}
        };
           /* getopt_long stores the option index here. */
        int option_index = 0;
        c = getopt_long (argc, argv, "h",
            long_options, &option_index);

        /* Detect the end of the options. */
        if (c == -1)
            break;

        switch (c) {
            case 0:
               /* If this option set a flag, do nothing else now. */
               if (long_options[option_index].flag != 0)
                 break;
               printf ("option %s", long_options[option_index].name);
               if (optarg)
                 printf (" with arg %s", optarg);
               printf ("\n");
               break;

            case 'h':
                help_flag = 1;
                break;

            case '?':
               /* getopt_long already printed an error message. */
               break;

            default:
               abort ();
            }
    }

    if (help_flag == 1) {
        printf("proper syntax is: yourProgram.exe [OPTIONS]... INFILE\n");
        printf("splits csv file into multiple files\n");
        printf("Option list: \n");
        printf("--help                  print help to screen\n");
        printf("\n");
        return 0;
    }

    //get the filename off the command line and redirect it to input
    //if there is no filename then use stdin

    if (optind < argc) {
        FILE *file;

        file = fopen(argv[optind], "r");
        if (!file) {
            fprintf (stderr, "%s: Couldn't open file %s; %s\n", argv[0], argv[optind], strerror (errno));
            exit(errno);
        }
        yyin = file;
        strcpy(inputName, argv[optind]);
    }
    else {
        printf("no input file set, using stdin. Press ctrl-c to quit");
        yyin = stdin;
        strcpy(inputName, "\b\b\b\b\bagainst stdin");
    }

    //set up initial file names

    strcpy(fileName, inputName);
    strncpy(unknownFileName, fileName, strlen(fileName)-4);
    strncpy(stuffFileName, fileName, strlen(fileName)-4);

    strcat(unknownFileName, "_UNKNOWN_1.csv");
    strcat(stuffFileName, "_STUFF_1.csv");

    //open files for writing

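    yyout = stdout;
    yyUnknown = fopen(unknownFileName,"w");
    yyStuff = fopen(stuffFileName,"w");
    if (!yyUnknown || !yyStuff) {
        fprintf (stderr, "%s: Couldn't open output files; %s\n", argv[0], strerror (errno));
        exit(EXIT_FAILURE);
    }
    // lines go to the "unknown" file until a rule redirects them
    yyTemp = yyUnknown;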

    yylex();

    //close open files

    fclose(yyUnknown);
    fclose(yyStuff);

    printf("Lexer finished running %s\n",fileName);

    return 0;

}

To build this flex program, install flex and use this makefile (adjust the paths):

TARGET = project.exe
TESTBUILD = project
LEX = flex
LFLAGS = -Cf
CC = i586-mingw32msvc-gcc
CFLAGS = -O -Wall 
INSTALLDIR = /mnt/J/Systems/executables

.PHONY: default all clean install uninstall cleanall

default: $(TARGET)

all: default install

OBJECTS = $(patsubst %.l, %.c, $(wildcard *.l))

%.c: %.l
    $(LEX) $(LFLAGS) -o $@ $<

.PRECIOUS: $(TARGET) $(OBJECTS)

$(TARGET): $(OBJECTS)
    $(CC) $(OBJECTS) $(CFLAGS) -o $@

linux: $(OBJECTS)
    gcc $(OBJECTS) $(CFLAGS) -lm -g -o $(TESTBUILD)

cleanall: clean uninstall

clean:
    -rm -f *.c
    -rm -f $(TARGET)
    -rm -f $(TESTBUILD)

uninstall:
    -rm -f $(INSTALLDIR)/$(TARGET)

install:
    cp -f $(TARGET) $(INSTALLDIR)

score 2 · Accepted Answer

A fast (!= too fast) Perl solution:

#!/usr/bin/perl
use strict; use warnings;

We preload the regexes so that their file is read only once. They are stored in the array @regex. The regex file is the first file given as an argument.

open REGEXES, '<', shift(@ARGV) or die;
# chomp each rule; otherwise the trailing newline becomes part of the pattern
my @regex = map { chomp; qr/$_/ } <REGEXES>;
# use the following if the file still includes the egrep:
# my @regex = map {
#     s/^egrep \s+ -i \s+ '? (.*?) '? \s* $/$1/x;
#     qr{$_}
# } <REGEXES>;
close REGEXES or die;

Then we go through each of the remaining files given as arguments:

while (@ARGV) {
  my $filename = shift @ARGV;

We open the output files up front for efficiency:

  my @outfile = map {
     open my $fh, '>', "outdir/$filename.$_.filter.piped"
       or die "Couldn't open outfile for $filename, rule #$_";
     $fh;
  } (1 .. scalar(@regex));
  open BIGFILE, '<', $filename or die;

We print every line that matches a rule to that rule's file:

  while (not eof BIGFILE) {
    my $line = <BIGFILE>;
    for my $ruleNo (0..$#regex) {
      print {$outfile[$ruleNo]} $line if $line =~ $regex[$ruleNo];
      # if only the first match is interesting:
      # if ($line =~ $regex[$ruleNo]) {
      #     print {$outfile[$ruleNo]} $line;
      #     last;
      # }
    }
  }

We clean up before the next iteration:

  foreach (@outfile) {
    close $_ or die;
  }
  close BIGFILE or die;
}

print "Done";

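Invocation: $ perl ultragrepper.pl regexFile bigFile1 bigFile2 bigFile3 and so on. Anything faster would have to be written directly in C. The data transfer rate of your hard disk is the limit.

This should run faster than the bash counterpart, because I avoid reopening files and re-parsing the regexes. In addition, no new processes have to be spawned for external tools. But we can spawn several threads (child processes, via fork)! (At least NumOfProcessors * 2 threads may be sensible.)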

local $SIG{CHLD} = undef;
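while (@ARGV) {
    my $filename = shift @ARGV;  # shift in the parent, or the loop never ends
    next if fork();              # parent: go on to the next file
    ...;                         # child: process $filename as shown above
    last;
}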

score 1 · Accepted Answer

I also decided to come back here and write a Perl version, only to notice that amon had already done so. Since it's already written, here is mine:

#!/usr/bin/perl -W
use strict;

# The search spec file consists of lines arranged in pairs like this:
# file1
# [Ff]oo
# file2
# [Bb]ar
# The first line of each pair is an output file. The second line is a perl
# regular expression. Every line of the input file is tested against all of
# the regular expressions, so an input line can end up in more than one
# output file if it matches more than one of them.

sub usage
{
        die "Usage: $0 search_spec_file [inputfile...]\n";
}

@ARGV or usage();

my @spec;

my $specfile = shift();
open my $spec, '<', $specfile or die "$specfile: $!\n";
while(<$spec>) {
        chomp;
        my $outfile = $_;
        my $regexp = <$spec>;
        defined($regexp) or die "$specfile: Invalid: Odd number of lines\n";
        chomp $regexp;
        open my $out, '>', $outfile or die "$outfile: $!\n";
        push @spec, [$out, qr/$regexp/];
}
close $spec;

while(<>) {
        for my $spec (@spec) {
                my ($out, $regexp) = @$spec;
                print $out $_ if /$regexp/;
        }
}

score 1 · Accepted Answer

Reverse the structure: read the file, then loop over the rules so that matching is performed on individual lines only.

regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
    while read line; do
        while read rule
        # Each $rule in the looped-through rules file contains a regex rule, e.g.,
        # egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
        # $rname is a unique rule name generated by a separate bash function
        # exported to the current shell.
        do
            cmd=" echo $line  | $rule  >> ~/outputdir/$tmp.$rname.filter.piped &"
            eval $cmd
        done < $regex_rules
    done < $tmp
done

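However, at this point you can (and should) use bash's (or perl's) built-in regex matching rather than launching a separate egrep process for every match. You could also split the file and run parallel processes. (Note: I also corrected > to >>.)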
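A minimal sketch of the built-in matching route, assuming bash 4.1 or later (for `mapfile` and `{fd}` redirections) and a rule file that holds bare regexes, one per line. The names `rules.txt`, `big.txt` and `ruleN.out` are placeholders, and the sample data stands in for the real input. `shopt -s nocasematch` is used here to approximate `egrep -i`. Keep in mind that a bash `while read` loop is itself slow on very large inputs, so this shows the mechanics rather than a tuned solution.

```shell
#!/usr/bin/env bash
# Sketch: read the input once and test every line against every rule with
# bash's built-in [[ =~ ]], so no egrep process is started per match.
# File names and sample data are placeholders for illustration.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' '(^| )justin ?bieber' '(^| )selena ?gomez' > rules.txt
printf '%s\n' 'Justin Bieber on tour' 'nothing here' \
    'hello selenagomez' 'selena gomez and justin bieber' > big.txt

shopt -s nocasematch          # case-insensitive matching, like egrep -i

# Load the rules once and open one output file per rule, keeping each
# descriptor open so files are not reopened for every matching line.
mapfile -t rules < rules.txt
fds=()
for i in "${!rules[@]}"; do
    exec {fd}> "rule$((i + 1)).out"
    fds+=("$fd")
done

# Single pass over the input; a line may land in more than one output file.
while IFS= read -r line; do
    for i in "${!rules[@]}"; do
        if [[ $line =~ ${rules[i]} ]]; then
            printf '%s\n' "$line" >&"${fds[i]}"
        fi
    done
done < big.txt

# Close the output descriptors.
for fd in "${fds[@]}"; do
    exec {fd}>&-
done
```

Leaving `${rules[i]}` unquoted on the right-hand side of `=~` matters: quoting it would make bash treat the rule as a literal string instead of a regex.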
