perl - Changing names in a file using another file, without using loops

Question

I have two files:

(one.txt) looks like this:

>ENST001 

(((....)))

(((...)))

>ENST002 

(((((((.......))))))

((((...)))

I have about 10,000 more ENSTs.

(two.txt) looks like this:

>ENST001   110

>ENST002  59

and so on for all the remaining ENSTs.

Basically, I want to replace each ENST in (one.txt) with the combination of the two fields from (two.txt), so the result looks like this:

>ENST001_110 

(((....)))

(((...)))

>ENST002_59 

(((((((.......))))))

((((...)))

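I wrote a MATLAB script to do this, but because it loops over every line of (two.txt) for each header in (one.txt), it takes about 6 hours to finish. I think awk, sed, grep, or even perl could produce the result in a few minutes. This is what I did in MATLAB: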

frf = fopen('one.txt', 'r');
frp = fopen('two.txt', 'r');
fw = fopen('result.txt', 'w');

while feof(frf) == 0

    line = fgetl(frf);
    first_char = line(1);

    if strcmp(first_char, '>') == 1 % if the line in one.txt starts with >, it is the ID

        id_fold = strrep(line, '>', ''); % Remove the > symbol

        frewind(frp) % Rewind two.txt before each scan

        while feof(frp) == 0

            raw = fgetl(frp);
            scan = textscan(raw, '%s%s');
            id_pos = scan{1}{1};
            pos = scan{2}{1};

            if strcmp(id_fold, id_pos) == 1 % if both IDs are the same

                id_new = ['>', id_fold, '_', pos];

                fprintf(fw, '%s\n', id_new);

            end

        end

    else

        fprintf(fw, '%s\n', line); % if the line doesn't start with >, print it to the result unchanged

    end

end

score 4 · Accepted Answer

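One way using awk. The FNR == NR condition is true only while the first file argument is being processed, and it saves each number keyed by its ID. The second condition handles the second file: when the first field matches a key in the array, the line is rewritten with the number appended.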

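awk '
    # First file (two.txt): store the number for each ID
    FNR == NR { 
        data[ $1 ] = $2; 
        next 
    } 
    # Second file (one.txt): append the number to lines whose first field is a known ID
    FNR < NR && ($1 in data) { 
        $0 = $1 "_" data[ $1 ] 
    } 
    { print }
' two.txt one.txt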

Output:

>ENST001_110

(((....)))

(((...)))

>ENST002_59

(((((((.......))))))

((((...)))
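The two-file lookup described above can be tried on a small sample before running it on the full data. The following is a minimal sketch, not the answerer's exact script: the temporary directory, the extra `>ENST003` header, and the `missing.log` file are assumptions added for illustration. It tests membership with `in`, so a number of 0 would still be appended, and it reports headers that have no entry in two.txt instead of silently leaving them unchanged.

```shell
#!/bin/sh
# Sketch only: the sample data and file names below are made up for illustration.
workdir=$(mktemp -d)
printf '>ENST001 \n(((....)))\n(((...)))\n>ENST002 \n(((((((.......))))))\n((((...)))\n>ENST003 \n((..))\n' > "$workdir/one.txt"
printf '>ENST001   110\n>ENST002  59\n' > "$workdir/two.txt"

result=$(awk '
    # First file (two.txt): remember the number for each header ID.
    FNR == NR { num[$1] = $2; next }
    # Second file (one.txt): rewrite headers that have an entry, report the rest.
    /^>/ {
        if ($1 in num) { $0 = $1 "_" num[$1] }
        else { print "no entry for " $1 > "/dev/stderr" }
    }
    { print }
' "$workdir/two.txt" "$workdir/one.txt" 2>"$workdir/missing.log")

printf '%s\n' "$result"
```

Each file is read once and every header is resolved with a single array lookup, which is why this scales to 10,000 IDs where rescanning two.txt for every header does not.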

score 3 · Accepted Answer

sed can first be run on two.txt alone to build the sed commands that perform the required substitutions, and those commands can then be run on one.txt.

First way

sed "$(sed -n '/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt)" one.txt

Second way

If the files are huge, the previous approach can fail with a too many arguments error. So here is another way that avoids this error. You need to run the three commands one at a time.

sed -n '1i#!/bin/sed -f
/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt > script.sed
chmod +x script.sed
./script.sed one.txt

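The first command builds a sed script that can modify one.txt as required. chmod makes this new script executable, and the last command runs it. So each file is read only once; there are no loops. Note that the first command spans two lines but is still a single command. Removing the newline breaks the script, because of how sed's i command works. See the sed man page for details.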
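It can help to look at the generated rules before applying them. The sketch below is a variant of the idea above, not the answerer's exact command: the file name `rules.sed`, the temporary directory, and the sample IDs are assumptions for illustration. It anchors each rule to the whole header line, because an unanchored rule such as `s/ENST001/ENST001_110/` would also rewrite a longer ID like `ENST0011`. It assumes the IDs contain no characters that are special in a regular expression.

```shell
#!/bin/sh
# Sketch only: file names and sample data are illustrative.
workdir=$(mktemp -d)
printf '>ENST001 \n(((...)))\n>ENST0011 \n((..))\n' > "$workdir/one.txt"
printf '>ENST001   110\n>ENST0011  7\n' > "$workdir/two.txt"

# Turn every ">ID number" line of two.txt into one anchored rule:
#   s|^>ID[[:space:]]*$|>ID_number|
sed -n 's/^\(>[^[:space:]]*\)[[:space:]][[:space:]]*\([0-9][0-9]*\).*/s|^\1[[:space:]]*$|\1_\2|/p' \
    "$workdir/two.txt" > "$workdir/rules.sed"

# Inspect the generated script.
cat "$workdir/rules.sed"

# Apply the generated rules to one.txt.
renamed=$(sed -f "$workdir/rules.sed" "$workdir/one.txt")
printf '%s\n' "$renamed"
```

Passing the rules with `-f` rather than on the command line also sidesteps the argument-length limit mentioned above. Note that sed still tries every rule against every line, so with many IDs this is likely slower than a hash lookup in awk or Perl.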

score 2 · Accepted Answer

This Perl solution sends the modified one.txt to STDOUT.

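use strict;
use warnings;

# Build a lookup table from two.txt: ID => "ID_number"
open my $f2, '<', 'two.txt' or die $!;

my %ids;

while (<$f2>) {
  $ids{$1} = "${1}_$2" if /^>(\S+)\s+(\d+)/;
}

open my $f1, '<', 'one.txt' or die $!;

while (<$f1>) {
  # Rewrite a header only if its ID is known; [ \t]* (unlike \s*) leaves the newline in place
  s/^>(\S+)[ \t]*$/>$ids{$1}/ if /^>(\S+)/ && exists $ids{$1};
  print;
}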

score 1 · Accepted Answer

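This might work for you (GNU sed). The first sed turns each non-empty line of two.txt into a substitution command, and the second sed reads those commands from standard input (-f -) and applies them to one.txt: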

sed -n '/^$/!s|^\(\S*\)\s*\(\S*\).*|s/^\1.*/\1_\2/|p' two.txt | sed -f - one.txt

score 1 · Accepted Answer

Try this MATLAB solution (no loops):

%# read files as cell array of lines
fid = fopen('one.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C1 = C{1};
fclose(fid);
fid = fopen('two.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C2 = C{1};
fclose(fid);

%# use regexp to extract ENST numbers from both files
num = regexp(C1, '>ENST(\d+)', 'tokens', 'once');
idx1 = find(~cellfun(@isempty, num));       %# location of >ENST line
val1 = str2double([num{:}]);                %# ENST numbers
num = regexp(C2, '>ENST(\d+)', 'tokens', 'once');
idx2 = find(~cellfun(@isempty, num));
val2 = str2double([num{:}]);

%# construct new header lines from file2
C2(idx2) = regexprep(C2(idx2), ' +','_');

%# replace headers lines in file1 with the new headers
[tf,loc] = ismember(val2,val1);
C1( idx1(loc(tf)) ) = C2( idx2(tf) );

%# write result
fid = fopen('three.txt','wt');
fprintf(fid, '%s\n',C1{:});
fclose(fid);
