arrays - 値のテーブルからの欠損データの線形補間

Question

以下は、私の出発点であるデータファイルからの抜粋です。

Marker      Distance_1  Distance_2  ID
.
.
.
30          13387412    34.80391242 seq-SN_FIRST
31          13387444    34.80391444 seq-SN_Second
31.1             
31.2             
32          13387555    39.80391    seq-SN_Third
.
.
.

これは、それぞれ 4 つの要素の複数行のタブ区切りファイルです。最初の行はヘッダーです。その後、多数のデータ行。縦の点は実際のファイルにはありませんが、ここでは、明示的に示されている行の例の前後に、示されている実際の行と同様のデータが発生していることを示しています。

一部のデータ行は「いっぱい」です。つまり、4 つのセルエントリすべてに何かが含まれています。他の行は「空白」で、最初の実際のエントリのみですが、その後に 3 つのタブ区切りの単一スペースが続きます。空白行の空白は「埋める」必要があります。入力は、直前および直後の行の対応するセルエントリを使用して、線形補間によって行われます。たとえば、列 2 の missingは、前の行の値と次の行Distance_1 valuesの値を使用して補間されます。列 3 の値についても同様です。ここでは、列 4 の値は無視されます。1338744413387555

スクリプトの最初の目標は、入力が必要なデータのブロックと、それらに隣接する「完全な」行を特定することです。空白行には 3 つのタブ付きシングルスペースが含まれ、そのように ID が付けられます。見つかったら、空白行と隣接する完全な行の連続したセットが、補間のためにサブルーチンに送信されます。

#!/usr/bin/perl
use strict;
use warnings;

die "usage: [ map positions file post SAS ]\n\n" unless @ARGV == 1;

my @file = ();

while (my $line = <$ARGV[0]>) {
  chomp $line;
  push(@file, $line);
}

my @consecutive_lines = (); # array collects a current set of consecutive lines requiring linear interpolation
my @filled = ();    # my final array, hopefully correctly filled in

#####
# search for consecutive set of lines in @file
#####

for (my $i = 0; $i < $#file; $i++) {          #  $#file returns the index of the last element in @file

  if ($file[$i] !~ /(?:\t\s){3}/) {           # if not a blank line
                                              # but a "full line"
    push(@filled, $file[$i]);                 # push the header and full lines, until...
  }

  elsif ($file[$i] =~ /(?:\t\s){3}/) {        # ...a blank line is found

    push (@consecutive_lines, $file[$i - 1]); # push preceding full line into array

    while ($file[$i] =~ /(?:\t\s){3}/ and $i < $#file) {  # keep pushing lines, so long as they are blank
                                                          # or end of file
      push(@consecutive_lines, $file[$i++]);
    }

    push(@consecutive_lines, $file[$i]) ;     # else we reach next full line, so push it into array

    my @fillme = linearInterpolation(@consecutive_lines); # send set of lines out for filling

    push(@filled, @fillme);                   # push filled in set of lines into the final array

    @consecutive_lines = ();                  # reset or undef array @consecutive_lines for next cycle

  }    # end of elsif

}    # end of for loop

ユーザー @Kenosis のおかげで、上記の多くのヘルプを得ることができました。

次は線形補間です。ここで、スクリプトの第 1 フェーズを第 2 フェーズにリンクしようとしています。そして、今のところうまく機能していません。

@incoming私の目標は、配列をサブルーチンに渡すことです。次に、この配列が分割されるため、実際のセルエントリが「表示」され、配列によってインデックス付けされ、呼び出されるようになります。最初に列2の値に対してこれを行う方法を理解しようとしていますDistance_1。このスクリプトは、補間値が計算された直後に近づいて迷走し始めているように感じます。

#####
# subroutine linear interpolation
#####

sub linearInterpolation {
  my @incoming = @_;    # array of consecutive set of lines

  my @splitup;                  # declare new array, will be a "split up" version of @incoming
  my ($A, $B, $C, $D, $E);      # variables for linear interpolation
  my @fillme;                   # declaring the "emtpy" array to be filled in
  my @dist_1_fills;             # array of interpolated values for dist_1

  for (my $i = 0;
    $i < scalar @incoming; $i++)     # loop to split up lines of @incoming
  {                                  # into indexed cell entries
    chomp $incoming[$i];             # and make new array of them
    my @entries = split('\t', $incoming[$i]);
    push(@splitup, @entries);
  }

  $A = $splitup[1];                   # cell entry in column 2 of preceding full line
  $B = $splitup[-3];                  # cell entry in column 2 of succeeding full line

  $C = $splitup[2];                   # cell entry in column 3 of preceding full line
  $D = $splitup[-2];                  # cell entry in column 3 of succeeding full line
  $E = scalar @incoming - 1;          # equals number of lines in the set minus 1

  for (my $i = 1; $i < $E; $i++) {    # need to start finding appropriate
                                      # number interpolated values, given number of
    my @dist_1_fills =
        interpvalues($A, $B, $E, $i); # of lines in consecutive set of lines

    for ($i = 0; $i < scalar @splitup; $i += 4) {
      push(@fillme, $splitup[$i], $dist_1_fills[$i], "dist_2_fills", "--");
                                      # fourth column values will be ignored or filled with --.
                                      # "dist_2_fills" just occupying it's proper spot until I can figure out distance 1 fills
    }
  }
}

#########

sub interpvalues {                  # subroutine to find interpolated values
  my ($A, $B, $E, $i) = @_;
  my $dist_1_answers = (($B - $A) / ($E)) * $i + $A;
  return $dist_1_answers;
}

コードは、補間された値を見つけてコードの最初の部分に送り返し、最終的にデータセットに入力する処理を扱う 2 番目の部分で混乱します。特に私の最大の (おそらく私だけではない) 問題は、2 番目のサブルーチンで計算された後に空白行を適切な値で埋めようとすることだと思います。

ヒントや手がかりは大歓迎です！

score 1 · Accepted Answer

このプログラムはあなたが必要とすることをします。コマンドラインのパラメータとしてinoutファイル名が必要です。

use strict;
use warnings;

my @saved;
my @needed;

while (<>) {
  chomp;
  my @fields = split /\t/;

  # Pass hrough headers and junk
  unless ($fields[0] and $fields[0] =~ /\d/) {
    print "$_\n";
    next;
  }

  # Save x-value for records without a y-value
  if ($fields[1] !~ /\d/) {
    push @needed, $fields[0];
    next;
  }

  # We have a filled-out row. Calculate any intermediate missing ones
  if (@needed) {
    if ($saved[0] == $fields[0]) {
      die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
    }
    my ($a1, $b1) = solve_linear(@saved[0,1], @fields[0,1]);
    my ($a2, $b2) = solve_linear(@saved[0,2], @fields[0,2]);
    while (@needed) {
      my $x = shift @needed;
      my $y1 = $a1 * $x + $b1;
      my $y2 = $a2 * $x + $b2;
      print join("\t", $x, $y1, $y2), "\n";
    }
  }

  print "$_\n";
  @saved = @fields;
}

sub solve_linear {
  my ($x0, $y0, $x1, $y1) = @_;
  my ($dx, $dy) = ($x1 - $x0, $y1 - $y0);
  my $aa = $dy / $dx;
  my $bb = ($y0 * $dx - $x0 * $dy)  / $dx;
  return ($aa, $bb);
}

出力

Marker  Distance_1  Distance_2  ID
.
.
.
30  13387412  34.80391242 seq-SN_FIRST
31  13387444  34.80391444 seq-SN_Second
31.1  13387455.1  35.303913996  --
31.2  13387466.2  35.803913552  --
32  13387555  39.80391  seq-SN_Third
.
.
.
Tool completed successfully

score 0 · Accepted Answer

コードをこれに変更して、線形補間が最初の列の値ではなく、2 番目と 3 番目の列の値に基づくようにしました。特に @Kenosis と @Borodin のユーザーに感謝します。以前の質問に対する Kenosis の回答を受け入れ、ボロディンの回答をここで受け入れましたが、この改訂版は「自分の質問に答える」セクションに投稿しました。ここにリビジョンを投稿してもよろしいですか? これに関する FAQ をざっと見ましたが、関連するものはまだ見つかりませんでした。

#!/usr/bin/perl
use strict; use warnings;

my @saved;
my @needed;

while (<>) {
  chomp;
  my @fields = split /\t/;

    # Does the current line actually exist AND does it contain one or more digits.
    unless ($fields[0] and $fields[0] =~ /\d/) {
    # If no, this is the header, so print it. If yes, advance.
    print "$_\n";
    #after printing header, go back to <> and read in next line.
    next;
  }

  # Is the second cell of the current line devoid of digits?
  if ($fields[1] !~ /\d/) {                
  # If no, advance. If yes, remember $field[0], the Marker.
  push @needed, $fields[0];              
  # After pushing, go back to <> and read in next line.
  next;
 }

  # If we are here, we must have a filled-out row.
  # Does @needed have any values? If no, advance. If yes,
  if (@needed) {
    if ($saved[0] == $fields[0]) {
      die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
     }
    # Else send preceding dist_1 value, succeeding dist_1 value,
    # preceding dist_2 value, succeeding dist_2 value, 
    # and number of emtpy lines to subroutine.
    my ($dist_1_interval, $dist_2_interval) = interval_sizes($saved[1], $fields[1], $saved[2],   $fields[2], scalar @needed);     
    # Current size of @needed is saved as $size and is used to help with iteration.
# So long as @needed contains values...
my $size = scalar @needed;
while (@needed) {
  # ...remove left-most Marker value from array @needed.
  my $x = shift @needed;
  # Interpolated values for dist_1 and dist_2 are
  # (respective interval size x iteration of while loop) + preceding values.
  my $new_dist_1 = ($dist_1_interval * (1 + ($size - (scalar @needed + 1)))) + $saved[1];
  my $new_dist_2 = ($dist_2_interval * (1 + ($size - (scalar @needed + 1)))) + $saved[2];
  print join("\t", $x, $new_dist_1, $new_dist_2, "--"), "\n";
  }
}     
      # We are here since current line is already a filled-in row.
      print "$_\n";
      # Print this row and assign it to @saved. Return to <>.
      @saved = @fields;
} 

sub interval_sizes {
  # $A = preceding dist_1, $B = succeeding dist_1, 
  # $C = preceding dist_2, $D = succeeding dist_2,
  # $E = number of needed distances.
  my ($A, $B, $C, $D, $E) = @_; 
  # I need an interval size for dist_1 based on difference between $B and $A.
  my $dist_1_interval = ($B - $A)/($E + 1);
  # I need an interval size for dist_2 based on difference between $D and $C.
  my $dist_2_interval = ($D - $C)/($E + 1);
  return ($dist_1_interval, $dist_2_interval);
 }

arrays - 値のテーブルからの欠損データの線形補間

2 に答える 2

Related

Reference