regex - Change the length of single-line strings using a regular expression

Question

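I extracted a sequence from a GenBank file; it consists of single-line strings of 60 bases each (ending in \n). How can I use Perl to reformat the sequence so that each line holds 120 bases, using regular expressions rather than BioPerl? Original format: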

    1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
   61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
  121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
  181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
  241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
  301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
  361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
  421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
  481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat

So far I have only managed to turn them into strings 60 characters long. I am still trying to find a way to make them 120 characters long.

my @lines= <$FH_IN>;
foreach my $line (@lines) {
    if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
            $line=~ s/$1//;
            $line=~ s/ //g;
            print $line;
    }

}

Example input:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat

This has 60 bases per line.

Update (this still does not produce 120-base sequence lines):

my @seq_60;
foreach my $line (@lines) {
        if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
                $line=~ s/$1//;
                $line=~ s/ //g;
                push (@seq_60, $line);
        }
}

my @output;
for (my $pos= 0; $pos< @seq_60; $pos+= 2) {
        push (@output, $seq_60[$pos] . $seq_60[$pos+1]);
}

print @output;

score 0 · Accepted Answer

How about:

s/(^|\n)([^\n]{60})\n/$1$2/g

In action:

use strict;
use warnings;
use 5.014;

my $str = q/agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat/;

$str =~ s/(^|\n)([^\n]{60})\n/$1$2/g;
say $str;

Output:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggccatccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat

Explanation:

(^|\n)      : group 1, start of string or line break
(           : start group 2
  [^\n]{60} : anything that is not a line break 60 times
)           : end group 2
\n          : line break

Edit, following the comments:

Join the lines in pairs:

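my @out;
for (my $i = 0; $i < @in; $i += 2) {
    # drop the newline so the next line joins onto this one
    chomp($in[$i]);
    # "\n" stands in for the missing partner when the line count is odd
    push @out, $in[$i] . ($in[$i+1] // "\n");
}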
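The same pairing can also be written without index arithmetic. This is a minimal sketch rather than the answerer's code: `join_in_pairs` is a hypothetical name, and it uses `splice` to take two lines at a time, so an unpaired final line is simply emitted on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines and returns them joined
# two at a time, each result ending in a single newline.
sub join_in_pairs {
    my @in = @_;
    chomp @in;                                  # strip every trailing newline
    my @out;
    while (my @pair = splice(@in, 0, 2)) {      # 1 or 2 elements per pass
        push @out, join('', @pair) . "\n";
    }
    return @out;
}

# Toy input with an odd number of short lines.
print join_in_pairs("acgt\n", "ttga\n", "ccat\n");
# prints:
# acgtttga
# ccat
```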

score 0 · Accepted Answer

You can read and write the lines in a single pass, storing the previous line in a variable. See the comments in the code for an explanation of what is happening.

my $prev;
while (<$FH_IN>) {
    next unless /\w/; # make sure the lines have some content
    # remove the line endings
    chomp;
    # chop off the first 6 characters (the base numbers) - format is 4 chars that
    # can be numbers or spaces, a digit, and a space
    $_ =~ s/^[\s\d]{4}\d\s//g;
    # remove the spaces between bases
    $_ =~ s/\s//g;
    # have we got a saved line?
    if ($prev) {
        # print out saved line and this line
        print $prev . $_ . "\n";
        # delete the saved line $prev
        $prev = '';
    }
    else {
        # if we don't have a saved line, save this line
        $prev = $_;
    }
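}
# an odd number of lines leaves one saved line unprinted; print it now
print $prev . "\n" if defined $prev && length $prev;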
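If the target width might change later, one alternative is to stop thinking in pairs: collect all the bases into one string and cut it at whatever width is wanted. The sketch below is not taken from either answer. `rewrap_sequence` is a hypothetical name, and it assumes the sequence lines contain only position numbers, whitespace and bases.

```perl
use strict;
use warnings;

# Hypothetical helper: strips numbering and whitespace from GenBank-style
# sequence lines, then re-wraps the bases at $width characters per line.
sub rewrap_sequence {
    my ($width, @lines) = @_;
    my $seq = join '', @lines;
    $seq =~ s/[\s\d]+//g;    # drop position numbers, spaces and newlines
    # "(A$width)*" splits the string into chunks of at most $width characters
    return map { "$_\n" } unpack "(A$width)*", $seq;
}

# Shortened toy lines (20 bases each) standing in for real 60-base lines.
my @lines = (
    "    1 agatggcggc gctgaggggt\n",
    "   21 cttgggggct ctaggccggc\n",
    "   41 cacctactgg\n",
);
print rewrap_sequence(40, @lines);
# prints:
# agatggcggcgctgaggggtcttgggggctctaggccggc
# cacctactgg
```

For the data in the question, the call would be `rewrap_sequence(120, @lines)`. The whole sequence is held in memory, which is fine for a single record but may not suit very large files.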
