regex - Find imperfect and perfect patterns in a string

Question

I'm working on a Perl script that searches for patterns in a string of nucleotides. So far I've been able to use the following regular expressions:

my $regex1 = qr/( ([ACGT]{2}) \2{9,} )/x;
my $regex2 = qr/( ([ACGT]{3}) \2{6,} )/x;
my $regex3 = qr/( ([ACGT]{4}) \2{6,} )/x;
for my $regex ($regex1, $regex2, $regex3) {
    next unless $seq1 =~ $regex;
    printf "Matched %s exactly %d times\n", $2, length($1)/length($2);
    printf "Length of sequence: $number \n";
}

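How can I do the following?

- Find both perfect repeats (repeated without interruption) and imperfect repeats (repeated, but the run of repeats may be broken by a stray nucleotide), with a minimum of 10 repeats required.

- Print the entire sequence that was found

Sample input: GTCGTGTGTGTGTAGTGTGTGTGTGTGAACTGA

Full current script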

print "Di-, Tri-, Tetra-nucleotide Tandem Repeat Finder v1.0 \n\n";
print "Please specify the file location (DO NOT DRAG/DROP files!) then press ENTER:\n";
$seq = <STDIN>;

#Remove the newline from the filename
chomp $seq;

#open the file or exit
open (SEQFILE, $seq) or die "Can't open '$seq': $!";

#read the dna sequence from the file and store it into the array variable @seq1
@seq1 = <SEQFILE>;

#Close the file
close SEQFILE;

#Put the sequence into a single string as it is easier to search for the motif
$seq1 = join( '', @seq1);

#Remove whitespace
$seq1 =~s/\s//g;

#Count of number of nucleotides
#Initialize the variable
$number = 0;
$number = length $seq1;
#Use regex to say "Find 3 nucleotides and match at least 6 times"
# qr (quotes and compiles) /( ([nucs]{number of nucs in pattern}) \2{number of repeats,} )/x (permits whitespace within the pattern)

my $regex1 = qr/( ([ACGT]{2}) \2{9,} )/x;
my $regex2 = qr/( ([ACGT]{3}) \2{6,} )/x;
my $regex3 = qr/( ([ACGT]{4}) \2{6,} )/x;

#Tell program to use $regex on variable that holds the file
for my $regex ($regex1, $regex2, $regex3) {
    next unless $seq1 =~ $regex;
    printf "Matched %s exactly %d times\n", $2, length($1)/length($2);
    printf "Length of sequence: $number \n";
}

exit;

score 0 · Accepted Answer

I don't fully understand what you need, but perhaps this will give you an idea:

use strict;    # You should be using this,
use warnings;  # and this.

my $input = 'GTCGTGTGTGTGTAGTGTGTGTGTGTGAACTGA';

my $patt      = '[ACGT]{2}';   # Some pattern of interest.
my $intervene = '[ACGT]*';     # Some intervening pattern.
my $m         = 7 - 2;         # Minimum N of times to find pattern, less 2.

my $rgx = qr/( 
    ($patt) $intervene
    (\2     $intervene ){$m,}
    \2
)/x;

print $1, "\n" if $input =~ $rgx;
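
Note that the intervening pattern `[ACGT]*` above places no limit on how much sequence may sit between two copies. If "imperfect" is taken to mean that at most a few stray nucleotides may interrupt the run, one possible variation is to bound the gap. The following is only a sketch under that assumption: the helper name `find_imperfect_repeat` and its parameters (`$unit_len`, `$min_repeats`, `$max_gap`) are made up for illustration and do not come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): look for a run of
# $unit_len-nucleotide repeats in $seq, allowing at most $max_gap stray
# nucleotides between consecutive copies (an assumed reading of "imperfect").
sub find_imperfect_repeat {
    my ($seq, $unit_len, $min_repeats, $max_gap) = @_;
    my $more = $min_repeats - 1;    # copies required after the first one

    my $rgx = qr/(
        ([ACGT]{$unit_len})                    # the repeat unit
        (?: [ACGT]{0,$max_gap}? \2 ){$more,}   # optional short gap, then another copy
    )/x;

    return unless $seq =~ $rgx;
    my ($whole, $unit) = ($1, $2);

    # Count non-overlapping occurrences of the unit inside the matched span.
    my $copies = () = $whole =~ /\Q$unit\E/g;
    return ($whole, $unit, $copies);
}

my $input = 'GTCGTGTGTGTGTAGTGTGTGTGTGTGAACTGA';
if (my ($whole, $unit, $copies) = find_imperfect_repeat($input, 2, 10, 1)) {
    print "Found $unit x $copies in: $whole\n";
}
```

Passing `0` as `$max_gap` restricts the search to perfect repeats, so the same helper can cover both cases. The copy count is a plain occurrence count within the matched span, which is good enough for short gaps but is not a rigorous alignment.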

Also, for a better way to read an entire file into a string, see this question: What is the best way to slurp a file into a string in Perl?
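
As a rough illustration of that idea, here is one common way to slurp a file by locally undefining the input record separator `$/`. The helper name `slurp_sequence` is hypothetical, and stripping all whitespace simply mirrors what the script in the question does after joining the lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read a whole file into
# a single string and strip all whitespace, as the question's script does.
sub slurp_sequence {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    my $seq = do { local $/; <$fh> };   # $/ undefined: read everything at once
    close $fh;
    $seq = '' unless defined $seq;      # guard for an empty read
    $seq =~ s/\s+//g;                   # drop newlines and other whitespace
    return $seq;
}
```

Using a lexical filehandle and the three-argument form of `open` also avoids the bareword `SEQFILE` handle and the two-argument `open` used in the question.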
