regex - Matching different variants of a word using Perl regular expressions

Question

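I am splitting sentences on individual space characters and then matching the resulting terms against the keys of a hash. I only get a match when the terms are 100% identical, and I am struggling to find a regex that can match several variants of the same word. For example, say I have the term "antagon". It matches the term "antagon" exactly, but it fails to match antagonist, antagonistic, pre-antagonist, hydro-antagonist and so on. I also need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7, silencing the effect of special characters and the like.

This is the code I have so far. The commented-out part is where I am struggling.

(Note: the terms in the hash are converted to the root form of the word.)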

    use warnings;
    use strict;
    use Drug;
    use Stop;
    open IN,  "sample.txt"   or die "cannot find sample";
    open OUT, ">sample1.txt" or die "cannot find sample";

    while (<IN>) {
        chomp $_;
        my $flag = 0;
        my $line = lc $_;
        my @full = ();
        if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
            my $string = $1;
            chomp $string;
            $string =~ s/,/ , /g;
            $string =~ s/\./ \. /g;
            $string =~ s/;/ ; /g;
            $string =~ s/\(/ ( /g;
            $string =~ s/\)/ )/g;
            $string =~ s/\:/ : /g;
            $string =~ s/\::/ :: )/g;
            my @array = split / /, $string;

            foreach my $word (@array) {
                chomp $word;
                if ( $word =~ /\,|\;|\.|\(|\)/g ) {
                    push( @full, $word );
                }
                if ( $Stop_words{$word} ) {
                    push( @full, $word );
                }

                if ( $Values{$word} ) {
                    my $term = "<Drug>$word<\/Drug>";
                    push( @full, $term );
                }
                else {
                    push( @full, $word );
                }

                # if($word=~/.*\Q$Values{$word}\E/i)#Changed this
                # {
                # $term="<Drug>$word</$Drug>";
                # print $term,"\n";
                # push(@full,$term);
                # }
            }
        }
        my $mod_str = join( " ", @full );
        print OUT $mod_str, "\n";
    }

score 3 · Accepted Answer

"I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7"

The simplest approach is to strip out the hyphens:

my $ignore_these = "[-_']";
$word =~ s{$ignore_these}{}g;
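
For the lookup to succeed, the same stripping presumably has to be applied to the hash keys as well as to each word. A minimal sketch of that idea follows; the normalize helper and the sample contents of %values are hypothetical, since the real data lives in the asker's Drug module:

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase a term and drop the characters to ignore.
sub normalize {
    my ($term) = @_;
    my $ignore_these = qr/[-_']/;
    ( my $key = lc $term ) =~ s/$ignore_these//g;
    return $key;
}

# Assumed sample data; the real %Values hash comes from the Drug module.
my %values;
$values{ normalize($_) } = 1 for 'MCF-7', 'antagon';

# Every spelling that differs only in ignored characters hits the same key.
for my $word ( 'MCF7', 'MC-F7', 'mcf_7', 'MCF-8' ) {
    my $status = $values{ normalize($word) } ? 'match' : 'no match';
    print "$word: $status\n";
}
```

With this sample data the first three spellings are reported as matches and MCF-8 is not.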

I don't know what is stored in your %Values hash, so it is hard to tell what you expect to happen here:

if($word=~/.*\Q$Values{$word}\E/i)

But I think what you want is something like this (I have simplified your code somewhat):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;

while (<>) {
    chomp $_;
    my $flag = 0;
    my $line = lc $_;
    my @full = ();
    if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
        my $string = $1;
        chomp $string;
        $string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together 
        $string =~ s/\:\:/ :: )/g;          # typo in original
        my @array = split /\s+/, $string;   # split on one /or more/ spaces

        foreach my $word (@array) {
            chomp $word;
            my $term       = $word;
            my $word_chars = "[\\w\\-_']";
            my $word_part  = "antagon";
            if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
                $term = "<Drug>$word</Drug>";
            }
            push(@full, $term); # push

        }
    }
    my $mod_str = join( " ", @full );
    say "<Sentence>$mod_str</Sentence>";
}

This gives the following output, which is my best guess at what you are expecting:

$ cat tmp.txt 
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
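
The script above hard-codes the single root antagon. Since the question says the hash holds root forms, one way to generalise it might be to build an alternation from the hash keys. This is only a sketch under that assumption: the contents of %values and the tag_drugs helper are made up for illustration, and unlike the script above it also accepts the bare root (suffix `*` instead of `+`).

```perl
use strict;
use warnings;

# Assumed sample roots; the real ones live in the asker's %Values hash.
my %values = ( antagon => 1, inhibit => 1 );

# quotemeta keeps any regex metacharacters in a root literal.
my $roots = join '|', map { quotemeta } sort keys %values;

# A word may carry a prefix and a suffix made of word characters, - or '.
my $word_chars = qr/[\w'-]/;
my $drug_re    = qr/^$word_chars*?(?:$roots)$word_chars*$/i;

# Hypothetical helper: wrap every matching word of a sentence in <Drug> tags.
sub tag_drugs {
    my ($sentence) = @_;
    return join ' ',
        map { /$drug_re/ ? "<Drug>$_</Drug>" : $_ }
        split ' ', $sentence;
}

print tag_drugs("the pre-antagonistic effect of an inhibitor was weak"), "\n";
```

Short roots over-match with this approach (a root such as "agon" would also tag "dragon"), so a minimum root length or a stop-word check may be worth adding.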

score 2 · Accepted Answer

perl -ne '$things{$1}++ while s/([^\s;.,!?]*?antagon[^\s;.,!?]++)//; END { print "$_\n" for sort keys %things }' FILENAME

If the file contains the following:

he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist

it prints:

alpha-antagonist
antagonist
antagonize
antagonizing

Here is a regular (non-one-liner) version:

#!/usr/bin/perl
use warnings;
use strict;
open my $in,  "<", "sample.txt"  or die "could not open sample.txt for reading: $!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing: $!";

my %things;

while (<$in>) {
    # \s in the character classes keeps the trailing newline out of the match
    $things{$1}++ while s/([^\s;.,!?]*?antagon[^\s;.,!?]++)//;
}

print $out "$_\n" for sort keys %things;

score 1 · Accepted Answer

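You might want to take another look at the assumptions behind your approach. To me it sounds like you are looking for words that are within a certain distance of the words in a list. Have a look at the Levenshtein distance formula and see whether that is what you need. Be aware, though, that computing it naively by plain recursion can take exponential time; the standard dynamic-programming version runs in time proportional to the product of the two word lengths.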
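
To make the suggestion concrete, here is a minimal sketch of the dynamic-programming form of Levenshtein distance in Perl. The function name and the sample words are illustrative only, and choosing a distance threshold for "close enough" is left open:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein (edit) distance between two strings, using the usual
# dynamic-programming table kept one row at a time.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @s     = split //, $s;
    my @t     = split //, $t;
    my $len_s = scalar @s;
    my $len_t = scalar @t;

    # Distance from the empty prefix of $s to each prefix of $t.
    my @prev = ( 0 .. $len_t );

    for my $i ( 1 .. $len_s ) {
        my @curr = ($i);    # $i deletions turn the prefix into the empty string
        for my $j ( 1 .. $len_t ) {
            my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,              # deletion
                $curr[ $j - 1 ] + 1,        # insertion
                $prev[ $j - 1 ] + $cost,    # substitution (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print levenshtein( 'mcf-7',  'mcf7' ),    "\n";    # 1
print levenshtein( 'kitten', 'sitting' ), "\n";    # 3
```

A word could then be treated as a variant of a hash key when the distance falls below some small threshold, at the cost of comparing each word against every key.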
