bash - Using awk to identify multi-line records and filter them

Question

I need to process a big data file that contains multi-line records. Sample input:

1  Name      Dan
1  Title     Professor
1  Address   aaa street
1  City      xxx city
1  State     yyy
1  Phone     123-456-7890
2  Name      Luke
2  Title     Professor
2  Address   bbb street
2  City      xxx city
3  Name      Tom
3  Title     Associate Professor
3  Like      Golf
4  Name
4  Title     Trainer
4  Likes     Running

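Note that the first integer field is unique and identifies a whole record. So the input above actually contains 4 records, although I don't know how many attribute lines each record has. I need to:

- identify valid records (a record must have the "Name" and "Title" fields)
- output the available attributes for each valid record; say "Name", "Title", and "Address" are the needed fields.

Sample output: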

1  Name      Dan
1  Title     Professor
1  Address   aaa street
2  Name      Luke
2  Title     Professor
2  Address   bbb street
3  Name      Tom
3  Title     Associate Professor

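So in the output file, record 4 is removed because it doesn't have the "Name" field. Record 3 doesn't have the Address field, but it is still printed because it is a valid record that has "Name" and "Title".

Can I do this with awk? If so, how do I identify a whole record using the first "id" field on each line?

Thanks a lot to the Unix shell script experts for helping me out! :)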

score 6 · Accepted Answer

This seems to work. There are many ways to do this, even in awk.

I've spaced it out for easier reading.

Note that record 3 doesn't show up, because it lacks the "Address" field, which you identified as required.

#!/usr/bin/awk -f

BEGIN {
        # Set your required fields here...
        required["Name"]=1;
        required["Title"]=1;
        required["Address"]=1;

        # Count the required fields
        for (i in required) enough++;
}

# Note that this will run on the first record, but only to initialize variables
$1 != last1 {
        if (hits >= enough) {
                printf("%s",output);
        }
        last1=$1; output=""; hits=0;
}

# This appends the current line to a buffer, followed by the record separator (RS)
{ output=output $0 RS }

# Count the required fields; used to determine whether to print the buffer
required[$2] { hits++ }

END {
        # Print the final buffer, since we only print on the next record
        if (hits >= enough) {
                printf("%s",output);
        }
}
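
The script above treats Address as required and prints every line of a qualifying record, City and Phone included. Below is a minimal sketch of a variant that reproduces the question's sample output. It assumes that only Name and Title are required, that only Name, Title, and Address lines are printed, and that a key with no value (as in record 4) counts as missing. The names `filter_records`, `emit_record`, `wanted`, and `seen` are illustrative and not part of the original answer.

```shell
#!/bin/sh
# Sketch: keep records that have both Name and Title (each with a value),
# and print only their Name, Title, and Address lines.
# filter_records is a hypothetical helper; it reads the given files or stdin.
filter_records() {
    awk '
    BEGIN {
        # Fields a record must have to be valid (assumption: Name and Title)
        required["Name"]; required["Title"]
        # Fields to print for a valid record
        wanted["Name"]; wanted["Title"]; wanted["Address"]
        for (k in required) enough++
    }

    # Print the buffered record if it had every required field, then reset
    function emit_record() {
        if (hits >= enough) printf "%s", output
        output = ""; hits = 0
        split("", seen)        # clear the per-record bookkeeping
    }

    # A change in the id field marks the start of a new record
    NR > 1 && $1 != last { emit_record() }
    { last = $1 }

    # Count each required field once, and only if it carries a value
    ($2 in required) && NF >= 3 && !($2 in seen) { seen[$2]; hits++ }

    # Buffer only the lines that may be printed
    ($2 in wanted) { output = output $0 ORS }

    # The last record has no following id to trigger it
    END { emit_record() }
    ' "$@"
}
```

It could be called as `filter_records input.txt > output.txt`, where `input.txt` is a placeholder file name. Like the script above, it buffers one record at a time, so memory use stays small even for a large file, but it relies on all lines of a record being adjacent.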

score 3 · Accepted Answer

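I'm not good at awk, so I'd solve this in Perl. Here is a Perl solution: for each record, it remembers the important lines and whether Name and Title have been seen. At the end of a record, the record is printed if all conditions are met.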

#!/usr/bin/perl
use warnings;
use strict;

my ($last, $has_name, $has_title, @record);
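while (<DATA>) {
    my ($id, $key, $value) = split;
    next unless defined $key;    # skip blank lines
    # A new id means the previous record is complete.
    if (defined $last and $id != $last) {
        print @record if $has_name and $has_title;
        undef @record;
        undef $has_name;
        undef $has_title;
    }
    # A key with no value (e.g. "4  Name") does not count as present.
    $has_name  = 1 if $key eq 'Name'  and defined $value;
    $has_title = 1 if $key eq 'Title' and defined $value;
    # Remember only the lines that should appear in the output.
    push @record, $_ if grep $key eq $_, qw/Name Address Title/;
    $last = $id;
}
# Print the final record, since no following id triggers it.
print @record if $has_name and $has_title;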


__DATA__
1  Name      Dan
1  Title     Professor
1  Address   aaa street
1  City      xxx city
1  State     yyy
1  Phone     123-456-7890
2  Name      Luke
2  Title     Professor
2  Address   bbb street
2  City      xxx city
3  Name      Tom
3  Title     Associate Professor
3  Like      Golf
4  Name
4  Title     Trainer
4  Likes     Running
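
Both answers above assume that all lines of a record are adjacent in the file. If that cannot be guaranteed, one alternative is to read the file twice: the first pass notes which ids have a Name and a Title, and the second pass prints the wanted lines for those ids. This is a rough sketch under the same assumptions as the question's sample output (Name and Title required, Address optional, a key without a value counts as missing). `filter_two_pass` is a hypothetical name, and the input has to be a regular file because it is read twice.

```shell
#!/bin/sh
# Sketch: two-pass filter for input whose record lines may be interleaved.
# filter_two_pass is a hypothetical helper; $1 is the path of the input file.
filter_two_pass() {
    # The same file is passed twice; NR == FNR holds only during the first read.
    awk '
    # Pass 1: remember which ids carry a Name and a Title with a value
    NR == FNR {
        if ($2 == "Name"  && NF >= 3) has_name[$1]
        if ($2 == "Title" && NF >= 3) has_title[$1]
        next
    }

    # Pass 2: print the wanted lines of ids that passed both checks
    ($1 in has_name) && ($1 in has_title) &&
        ($2 == "Name" || $2 == "Title" || $2 == "Address")
    ' "$1" "$1"
}
```

The trade-off is that this keeps one array entry per id in memory and reads the file twice, whereas the single-pass approaches hold only the current record.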
