r - Overlay rows based on common values from consecutive rows

Question

I have input like the following:

A  200-400  213  253  295  350  0011
A  200-400  260  295  315  000
A  200-400  205  263  295  111
B  800-900  801  832  840  843  870  890  895  00110101
B  800-900  801  823  850  010
B  800-900  850  1
.
.
.

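Each 0 or 1 in the last column corresponds, in order, to one of the values between the third column and the last column.

I would like to generate a tab-delimited matrix like this: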

A 200-400  NA   213  253  NA   NA   295  NA   350 
A 200-400  NA   NA   NA   260  NA   295  315  NA
A 200-400  205  NA   NA   NA   263  295  NA   NA 
B 800-900  801  NA   832  840  843  NA   870  890 895 900
B 800-900  801  823  NA   NA   NA   850  NA   NA  NA  NA
B 800-900  NA   NA   NA   NA   NA   850  NA   NA  NA  NA

Finally, I want to replace the values with their corresponding 0 and 1 values:

A  200-400  NA  0    0    NA   NA   1    NA    1 
A  200-400  NA  NA   NA   0    NA   0    0     NA
A  200-400  1   NA   NA   NA   1    1    NA    NA 
B  800-900  0   NA   0    1    1    NA   0     1   0   1
B  800-900  0   1    NA   NA   NA   0    NA    NA  NA  NA
B  800-900  NA  NA   NA   NA   NA   1    NA    NA  NA  NA

Thanks in advance.

score 2 · Accepted Answer

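What a fun question. I'll answer it in Perl.

We need to read all lines of the same range at once. Each number within a range has to remember which lines it came from. Then we can sort the numbers of each range and reassemble the lines.

For the first range, we have a collection of values like this: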

[213 => 1], [253 => 1], [295 => 1], [350 => 1],
[260 => 2], [295 => 2], [315 => 2],
[205 => 3], [263 => 3], [295 => 3],

We need to de-duplicate the numbers that occur on more than one line:

[213 => 1], [253 => 1], [295 => 1, 2, 3], [350 => 1],
[260 => 2], [315 => 2],
[205 => 3], [263 => 3],

(The order doesn't matter.)
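As a minimal sketch of how such a collection could be built, the snippet below maps each number to the lines it occurs on and then flattens the hash into `[number, line, ...]` items. The names `@rows`, `%lines_for`, and `@items` are illustrative assumptions, and the numbers are hard-coded from the first range rather than parsed from input.

```perl
use strict; use warnings;

# Numbers from the first range, one array per input line (patterns omitted).
my @rows = (
  [213, 253, 295, 350],
  [260, 295, 315],
  [205, 263, 295],
);

# Map each number to the (1-based) line numbers it occurs on.
# Autovivification creates the inner array on first use.
my %lines_for;
for my $i (0 .. $#rows) {
  push @{ $lines_for{$_} }, $i + 1 for @{ $rows[$i] };
}

# Flatten into [number, line, line, ...] items; hash order is arbitrary.
my @items = map { [ $_, @{ $lines_for{$_} } ] } keys %lines_for;

for my $item (@items) {
  my ($number, @lines) = @$item;
  print "$number => @lines\n";
}
```

Because the hash is keyed by number, a number shared by several lines (here 295) ends up as a single item carrying all of its line numbers.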

These items can be sorted by their first field:

my @sorted = sort { $a->[0] <=> $b->[0] } @items;

For each line, we can iterate over the sorted items and decide, based on the line numbers, whether to print the number or NA:

for my $line (1 .. 3) {
  my @fields = map { decide_if_number_or_na($line, @$_) } @sorted;
  ...
}

sub decide_if_number_or_na {
  my ($line, $number, @lines) = @_;
  return $number if grep { $line == $_ } @lines;  # ... if any of the lines is our line
  return "NA";
}
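To make the fragments above concrete, here is a small runnable sketch that applies them to the first range only. The items are hard-coded for illustration, and the `@output` array is an assumption added so the result can be inspected.

```perl
use strict; use warnings;

# Items for the first range: [number, lines it occurs on ...].
my @items = (
  [213, 1], [253, 1], [295, 1, 2, 3], [350, 1],
  [260, 2], [315, 2],
  [205, 3], [263, 3],
);

# Order the columns numerically by the number itself.
my @sorted = sort { $a->[0] <=> $b->[0] } @items;

sub decide_if_number_or_na {
  my ($line, $number, @lines) = @_;
  return $number if grep { $line == $_ } @lines;  # number occurs on our line
  return "NA";
}

my @output;
for my $line (1 .. 3) {
  my @fields = map { decide_if_number_or_na($line, @$_) } @sorted;
  push @output, join "\t", @fields;
}
print "$_\n" for @output;
```

With these items, the three printed rows should match the number columns of the A rows in the question's intermediate matrix.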

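Of course, we want to emit the correct 0 or 1 value right away instead of the number.

Tying all of this together is a bit complicated. While parsing the input, we need to associate each line with its 01 pattern, remember the first two fields, and build the data structure for the items.

The resulting code follows the considerations above, but takes a few shortcuts: once the ordering has been established, the actual value of each number no longer matters for an item, so it can be discarded.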

use strict; use warnings; use feature 'say';

my @lines;   # an array of hashes, which hold information about each line
my %ranges;  # a hash mapping range identifiers to number-to-occurring-line-array hashes

while (<>) {
  chomp;
  my ($letter, $range, @nums) = split;  # split everything into field ...
  my @pattern = split //, pop @nums;    # but the last field is a pattern, which we split into chars.
  push @{ $ranges{$range}{$_} }, $. for @nums;  # $. is the line no
  push @lines, {
    letter  => $letter,
    range   => $range,
    pattern => \@pattern,
    line    => $.,
  };
}

# simplify and sort the ranges:
for my $key (keys %ranges) {
  my $nums2lines = $ranges{$key};  # get the number-to-occurring-lines-array hashes
  # read the next statement bottom to top:
  my @items =
    map { $nums2lines->{$_} }  # 3. get the line number arrayref only (forget actual number, now that they are ordered)
    sort { $a <=> $b }         # 2. sort them numerically
    keys %$nums2lines;         # 1. get all numbers
  $ranges{$key} = \@items; # Remember these items at the prior position
}

# Iterate through all lines
for my $line (@lines) {
  # Unpack some variables
  my @pattern = @{ $line->{pattern} };
  my $lineno  = $line->{line};
  my $items   = $ranges{$line->{range}};

  # For each item, emit the next part of the pattern, or NA.
  my @fields  = map { pattern_or_na($lineno, @$_) ? shift @pattern : "NA" } @$items;
  say join "\t", $line->{letter}, $line->{range}, @fields;
}

sub pattern_or_na {
  my ($line, @lines) = @_;  # the specific number was already discarded; only line numbers remain
  return scalar grep { $_ == $line } @lines;  # returns true if a number is on this line
}

This produces the desired output.

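This is fairly complicated code, especially for a beginner. It makes use of Perl references and autovivification. It also uses many list transformations such as sort, map, and grep. This solution does not take into account that lines with the same range are consecutive, in which case it would not be necessary to keep everything in memory. This solution is simpler (sic!), but uses more memory than necessary.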
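If the lines of each range really are consecutive, one group at a time could be processed and then forgotten. The following is only a sketch of that idea, not the code from this answer: `render_group` and `overlay` are hypothetical helpers, groups are keyed on letter plus range (an assumption), and each number is paired with its flag through a hash slice instead of shifting the pattern. It also assumes every line has exactly as many pattern characters as numbers.

```perl
use strict; use warnings;

# Hypothetical helper: render one group of lines that share a letter and range.
# Each element of @$group is [letter, range, \@nums, \@pattern].
sub render_group {
  my ($group) = @_;

  # The distinct numbers of this group, in numeric order, form the columns.
  my %seen;
  $seen{$_} = 1 for map { @{ $_->[2] } } @$group;
  my @columns = sort { $a <=> $b } keys %seen;

  my @out;
  for my $row (@$group) {
    my ($letter, $range, $nums, $pattern) = @$row;
    my %bit_for;
    @bit_for{@$nums} = @$pattern;  # pair each number with its 0/1 flag
    push @out, join "\t", $letter, $range,
      map { exists $bit_for{$_} ? $bit_for{$_} : "NA" } @columns;
  }
  return @out;
}

# Hypothetical driver: holds only the current group in memory.
sub overlay {
  my @input = @_;
  my (@group, @result);
  my $current = '';
  for my $text (@input) {
    my ($letter, $range, @nums) = split ' ', $text;
    next unless @nums;                  # skip blank or malformed lines
    my @pattern = split //, pop @nums;  # last field is the 0/1 pattern
    my $key = "$letter $range";
    if (@group && $key ne $current) {   # a new group starts: flush the old one
      push @result, render_group(\@group);
      @group = ();
    }
    $current = $key;
    push @group, [$letter, $range, \@nums, \@pattern];
  }
  push @result, render_group(\@group) if @group;
  return @result;
}

print "$_\n" for overlay(
  "A  200-400  213  253  295  350  0011",
  "A  200-400  260  295  315  000",
  "A  200-400  205  263  295  111",
  "B  800-900  801  823  850  010",
  "B  800-900  850  1",
);
```

When reading from a file, the same loop could print each group as soon as the key changes rather than collecting the results, so memory use would be bounded by the largest group.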

To understand all of this, I recommend reading the perlreftut, perlre, and perldsc manpages.
