bash - Replace values in a large table using a conversion table

Question

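I am trying to replace values in a large space-delimited text file and could not find a suitable answer for this specific problem:

Say I have a file "OLD_FILE" containing a header and approximately 2 million rows: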

COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A 
rs1000000 12 125456933 G A 
rs10000010 4 21227772 T C 
rs10000012 4 1347325 G C 
rs10000013 4 36901464 C A 
rs10000017 4 84997149 T C 
rs1000002 3 185118462 T C 
rs10000023 4 95952929 T G 
...

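I would like to replace the first value of each row with a corresponding value, using a large (2.8 million rows) conversion table. In this conversion table, the first column lists the value I want to have replaced, and the second column lists the corresponding new value: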

COL1_b36       COL2_b37
rs10    7_92383888
rs1000000       12_126890980
rs10000010      4_21618674
rs10000012      4_1357325
rs10000013      4_37225069
rs10000017      4_84778125
rs1000002       3_183635768
rs10000023      4_95733906
...

The desired output is a file in which every value in the first column has been changed according to the conversion table:

COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A 
12_126890980 12 125456933 G A 
4_21618674 4 21227772 T C 
4_1357325 4 1347325 G C 
4_37225069 4 36901464 C A 
4_84778125 4 84997149 T C 
3_183635768 3 185118462 T C 
4_95733906 4 95952929 T G 
...

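Additional info:

Performance is an issue (the following command would take approximately a year):

while read a b; do sed -i "s/\b$a\b/$b/g" OLD_FILE ; done < CONVERSION_TABLE

An exact match is required before replacing
Not every value in OLD_FILE can be found in the conversion table...
...but every value that could be replaced can be found in the conversion table.

Any help is very much appreciated.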

score 15 · Accepted Answer

Here's one way using awk:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE

Results:

COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G

Explanation, in order of appearance:

NR==1 { next }            # simply skip processing the first line (header) of
                          # the first file in the arguments list (TABLE)

FNR==NR { ... }           # This is a construct that only returns true for the
                          # first file in the arguments list (TABLE)

a[$1]=$2                  # So when we loop through the TABLE file, we add the
                          # column one to an associative array, and we assign
                          # this key the value of column two

next                      # This simply skips processing the remainder of the
                          # code by forcing awk to read the next line of input

$1 in a { ... }           # Now when awk has finished processing the TABLE file,
                          # it will begin reading the second file in the
                          # arguments list which is OLD_FILE. So this construct
                          # is a condition that returns true literally if column
                          # one exists in the array

$1=a[$1]                  # re-assign column one's value to be the value held
                          # in the array

1                         # The 1 on the end is an always-true pattern with no
                          # action, so awk applies its default action and prints
                          # every line, changed or not. It is like appending a
                          # separate rule: { print $0 }
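
To try the one-liner on a small scale, here is a minimal sketch. The sample rows are made up for illustration, and rs999 is a hypothetical ID that is deliberately missing from TABLE, since the question says not every value in OLD_FILE has an entry:

```shell
# Hypothetical sample data in a scratch directory; the file names mirror
# the TABLE and OLD_FILE arguments used in the answer.
workdir=$(mktemp -d)

cat > "$workdir/TABLE" <<'EOF'
COL1_b36 COL2_b37
rs10 7_92383888
rs1000000 12_126890980
EOF

cat > "$workdir/OLD_FILE" <<'EOF'
COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A
rs1000000 12 125456933 G A
rs999 1 12345 T C
EOF

# Same awk program as above. rs999 has no entry in TABLE, so its line
# is printed unchanged; the OLD_FILE header also passes through as-is.
result=$(awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' \
    "$workdir/TABLE" "$workdir/OLD_FILE")

printf '%s\n' "$result"
rm -r "$workdir"
```

Because the table is held in an associative array, each line of OLD_FILE costs one lookup instead of one pass over the whole file per table row, which is where the sed loop in the question loses its time.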

score 2 · Accepted Answer

This might work for you (GNU sed):

sed -r '1d;s|(\S+)\s*(\S+).*|/^\1\\>/s//\2/;t|' table | sed -f - file
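
The pipeline is easier to follow when split into its two stages. The sketch below assumes GNU sed (as the answer does) and uses made-up two-row sample files; the names table and file match the command above:

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'COL1_b36 COL2_b37' 'rs10 7_92383888' 'rs1000000 12_126890980' \
    > "$workdir/table"
printf '%s\n' 'COL1 COL2 COL3 COL4 COL5' 'rs10 7 92221824 C A' \
    'rs1000000 12 125456933 G A' > "$workdir/file"

# Stage 1: drop the header (1d) and rewrite every table row into one
# sed command of the form /^OLD\>/s//NEW/;t
script=$(sed -r '1d;s|(\S+)\s*(\S+).*|/^\1\\>/s//\2/;t|' "$workdir/table")
printf '%s\n' "$script"

# Stage 2: run the generated script against the data file. The empty
# regex in s// reuses the address that just matched, and t jumps to the
# end of the script once a substitution has been made on the line.
converted=$(printf '%s\n' "$script" | sed -f - "$workdir/file")
printf '%s\n' "$converted"
rm -r "$workdir"
```

For this sample, stage 1 generates the two commands /^rs10\>/s//7_92383888/;t and /^rs1000000\>/s//12_126890980/;t. The \> word boundary is what keeps the rs10 rule from firing on rs1000000.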

score 1 · Accepted Answer

You can use join:

join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort)

This drops the headers of both files; you can add the header back with head -n1 file1.

Output:

12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
7_92383888 7 92221824 C A
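
Putting the header back can be sketched as follows. This is a variant rather than the exact command above: it sorts into temporary files instead of using process substitution, and it pins LC_ALL=C so that sort and join agree on the ordering (both are assumptions of this sketch, as are the two-row sample files):

```shell
# Hypothetical sample data; file1 plays the role of OLD_FILE and file2
# the role of the conversion table, as in the answer.
workdir=$(mktemp -d)
printf '%s\n' 'COL1 COL2 COL3 COL4 COL5' 'rs1000000 12 125456933 G A' \
    'rs10 7 92221824 C A' > "$workdir/file1"
printf '%s\n' 'COL1_b36 COL2_b37' 'rs10 7_92383888' 'rs1000000 12_126890980' \
    > "$workdir/file2"

# Strip the headers and sort both files on the join field.
tail -n +2 "$workdir/file1" | LC_ALL=C sort -k1,1 > "$workdir/file1.sorted"
tail -n +2 "$workdir/file2" | LC_ALL=C sort -k1,1 > "$workdir/file2.sorted"

# Print the original header first, then the joined rows.
joined=$(
    head -n 1 "$workdir/file1"
    LC_ALL=C join -o '2.2 1.2 1.3 1.4 1.5' \
        "$workdir/file1.sorted" "$workdir/file2.sorted"
)
printf '%s\n' "$joined"
rm -r "$workdir"
```

Note that the rows come out in sorted key order, not in the original order of file1, and that join by default prints only rows whose key appears in both files.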

score 1 · Accepted Answer

Another way with join. Assuming the files are sorted on the 1st column:

head -1 OLD_FILE
join <(tail -n+2 CONVERSION_TABLE) <(tail -n+2 OLD_FILE) | cut -f 2-6 -d' '

However, with data of this size you should consider using a database engine.
