regex - How do I filter out strings that do not start or end with specific substrings?

Question

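Unfortunately I'm not a regex expert, so I need a little help.

I'm looking for a way to grep an array of strings to get two lists of strings that do not start (1) or end (2) with a specific substring.

Suppose we have an array of strings that match the following rule:

[speakerId]-[phrase]-[id].txt

i.e.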

10-phraseone-10.txt 11-phraseone-3.txt 1-phraseone-2.txt 2-phraseone-1.txt 3-phraseone-1.txt 4-phraseone-1.txt 5-phraseone-3.txt 6-phraseone-2.txt 7-phraseone-2.txt 8-phraseone-10.txt 9-phraseone-2.txt
10-phrasetwo-1.txt 11-phrasetwo-1.txt 1-phrasetwo-1.txt 2-phrasetwo-1.txt 3-phrasetwo-1.txt 4-phrasetwo-1.txt 5-phrasetwo-1.txt 6-phrasetwo-3.txt 7-phrasetwo-10.txt 8-phrasetwo-1.txt 9-phrasetwo-1.txt
10-phrasethree-10.txt 11-phrasethree-3.txt 1-phrasethree-1.txt 2-phrasethree-11.txt 3-phrasethree-1.txt 4-phrasethree-3.txt 5-phrasethree-1.txt 6-phrasethree-3.txt 7-phrasethree-1.txt 8-phrasethree-1.txt 9-phrasethree-1.txt

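Let's introduce some variables:

$speakerId
$phrase
$id1, $id2

I would like to grep the list and obtain an array:

1. with elements that contain a particular $phrase, but excluding the strings that both start with a particular $speakerId and end with one of the specified IDs (for instance $id1 or $id2)
2. with elements that have a particular $speakerId and $phrase but do not end with one of the specified IDs (warning: take care not to exclude 10 or 11 when $id=1, and so on)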

Perhaps someone could build a solution on top of the following code:

@AllEntries = readdir(INPUTDIR);

@Result1 = grep(/blablablahere/, @AllEntries);

@Result2 = grep(/anotherblablabla/, @AllEntries);

closedir(INPUTDIR);

score 3 · Accepted Answer

Assuming a basic pattern that matches your examples:

(?:^|\b)(\d+)-(\w+)-(?!1|2)(\d+)\.txt(?:\b|$)

This breaks down as follows:

(?:^|\b)    # start of line or a word boundary
(\d+)-      # speakerId and a hyphen
(\w+)-      # phrase and a hyphen
(?!1|2)     # the id must not begin with 1 or 2 (this also rejects 10, 11, 20, ...)
(\d+)       # id
\.txt       # file extension
(?:\b|$)    # word boundary or end of line
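
The question warns about excluding 10 or 11 when the unwanted id is 1, and the `(?!1|2)` lookahead in the pattern above only inspects the first digit of the id. The following is a minimal sketch of that effect and of one possible remedy, assuming the goal is to reject only the exact ids 1 and 2. The names `$loose` and `$exact` and the extended lookahead `(?!(?:1|2)\.txt)` are illustrative assumptions, not part of the answer.

```perl
use strict;
use warnings;

# Sample names taken from the list in the question.
my @names = qw(
    10-phrasethree-10.txt 11-phrasethree-3.txt 1-phrasethree-1.txt
    2-phrasethree-11.txt  4-phrasethree-3.txt
);

# As written in the answer: rejects any id whose first digit is 1 or 2.
my $loose = qr/(?:^|\b)(\d+)-(\w+)-(?!1|2)(\d+)\.txt(?:\b|$)/;

# Assumed variant: the lookahead must also see ".txt" right after the digit,
# so only the exact ids 1 and 2 are rejected.
my $exact = qr/(?:^|\b)(\d+)-(\w+)-(?!(?:1|2)\.txt)(\d+)\.txt(?:\b|$)/;

my @loose_hits = grep { $_ =~ $loose } @names;
my @exact_hits = grep { $_ =~ $exact } @names;

print "loose: @loose_hits\n";
print "exact: @exact_hits\n";
```

With this sample, the first pattern drops the names ending in -10.txt and -11.txt as well, while the second keeps them and drops only the name ending in -1.txt.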

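You can assert exclusions with a negative lookahead. For example, to include all matches that do not have the phrase phrasetwo, modify the expression above to use a negative lookahead: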

(?:^|\b)(\d+)-(?!phrasetwo)(\w+)-(\d+)\.txt(?:\b|$)
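
As a rough illustration, the pattern above can be dropped into `grep` as follows. This is only a sketch: the array `@entries` holds a few names from the question's list, and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# A few names from the question's list.
my @entries = qw(
    1-phraseone-2.txt 2-phrasetwo-1.txt 3-phrasethree-1.txt 7-phrasetwo-10.txt
);

# The negative lookahead sits where the phrase starts and fails the match
# whenever the phrase begins with "phrasetwo".
my $not_phrasetwo = qr/(?:^|\b)(\d+)-(?!phrasetwo)(\w+)-(\d+)\.txt(?:\b|$)/;

my @kept = grep { $_ =~ $not_phrasetwo } @entries;

print "$_\n" for @kept;
```

Note that the lookahead rejects any phrase that merely begins with phrasetwo, so a longer phrase with that prefix would be excluded too.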

Note how (?!phrasetwo) is included. Alternatively, use a lookbehind instead of a lookahead to find all phrasethree entries that end in an even number:

(?:^|\b)(\d+)-phrasethree-(\d+)(?<![13579])\.txt(?:\b|$)

(?<![13579]) simply ensures that the last digit of the ID is even.
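
The lookbehind variant can be tried the same way. In this sketch the sample array and the variable names are assumptions; the pattern is the one shown above.

```perl
use strict;
use warnings;

# A few names from the question's list.
my @entries = qw(
    10-phrasethree-10.txt 11-phrasethree-3.txt 2-phrasethree-11.txt
    5-phrasethree-1.txt   6-phraseone-2.txt
);

# After the id digits are consumed, the lookbehind checks that the character
# just before ".txt" is not an odd digit.
my $even_id = qr/(?:^|\b)(\d+)-phrasethree-(\d+)(?<![13579])\.txt(?:\b|$)/;

my @even = grep { $_ =~ $even_id } @entries;

print "$_\n" for @even;
```

Only the last digit is inspected, so an id such as 11 is rejected even though the regex engine tries shorter digit runs while backtracking.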

score 1 · Accepted Answer

It sounds like you are describing a query function.

#!/usr/bin/perl -Tw

use strict;
use warnings;
use Data::Dumper;

my ( $set_a, $set_b ) = query( 2, 'phrasethree', [ 1, 3 ] );

print Dumper( { a => $set_a, b => $set_b } );

# a) fetch elements which
#    1. match $phrase
#    2. exclude $speakerId
#    3. match @ids
# b) fetch elements which
#    1. match $phrase
#    2. match $speakerId
#    3. exclude @ids
sub query {
    my ( $speakerId, $passPhrase, $id_ra ) = @_;

    my %has_id = map { ( $_ => 0 ) } @{$id_ra};

    my ( @a, @b );

    while ( my $filename = glob '*.txt' ) {

        if ( $filename =~ m{\A ( \d+ )-( .+? )-( \d+ ) [.] txt \z}xms ) {

            my ( $_speakerId, $_passPhrase, $_id ) = ( $1, $2, $3 );

            if ( $_passPhrase eq $passPhrase ) {

                if ( $_speakerId ne $speakerId
                    && exists $has_id{$_id} )
                {
                    push @a, $filename;
                }

                if ( $_speakerId eq $speakerId
                    && !exists $has_id{$_id} )
                {
                    push @b, $filename;
                }
            }
        }
    }

    return ( \@a, \@b );
}
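
To try the same logic without files on disk, here is a sketch of a variant that filters a list passed by reference. The name `query_list` and its fourth parameter are hypothetical and not part of the answer. Because the ids are compared as whole hash keys, an id of 1 does not accidentally match 10 or 11.

```perl
use strict;
use warnings;

# Hypothetical variant of query() that filters a list passed by reference
# instead of globbing the current directory.
sub query_list {
    my ( $speakerId, $passPhrase, $id_ra, $names_ra ) = @_;

    # Lookup table of the ids to match (set a) or exclude (set b).
    my %has_id = map { ( $_ => 1 ) } @{$id_ra};

    my ( @a, @b );
    for my $filename ( @{$names_ra} ) {
        next unless $filename =~ m{\A (\d+) - (.+?) - (\d+) [.] txt \z}xms;
        my ( $sid, $phrase, $id ) = ( $1, $2, $3 );
        next unless $phrase eq $passPhrase;

        # a) other speakers whose id is listed
        push @a, $filename if $sid ne $speakerId && $has_id{$id};

        # b) the given speaker with an id that is not listed
        push @b, $filename if $sid eq $speakerId && !$has_id{$id};
    }
    return ( \@a, \@b );
}

# Names taken from the question's list.
my @names = qw(
    10-phrasethree-10.txt 11-phrasethree-3.txt 1-phrasethree-1.txt
    2-phrasethree-11.txt  3-phrasethree-1.txt  6-phraseone-2.txt
);

my ( $set_a, $set_b ) = query_list( 2, 'phrasethree', [ 1, 3 ], \@names );

print "a: @{$set_a}\n";
print "b: @{$set_b}\n";
```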

score 1 · Accepted Answer

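I like the pure-regex approach using negative lookaheads and lookbehinds. However, it is a bit hard to read. Code like the following may be more self-explanatory. It uses standard Perl idioms that in some cases read almost like English.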

my @all_entries      = readdir(...);
my @matching_entries = ();

foreach my $entry (@all_entries) {

    # split file name
    next unless $entry =~ /^(\d+)-(.*?)-(\d+)\.txt$/;
    my ($sid, $phrase, $id) = ($1, $2, $3);

    # filter
    next unless $sid eq "foo";
    next unless $id == 42 or $phrase eq "bar";
    # more readable filter rules

    # match
    push @matching_entries, $entry;
}

# do something with @matching_entries

If you really want to express something complex as a list transformation, you can write grep code like this:

my @matching_entries = grep {

    # $1 = speaker id, $2 = phrase, $3 = id
    /^(\d+)-(.*?)-(\d+)\.txt$/
    and $1 eq "foo"
    and ($3 == 42 or $2 eq "bar")
    # and so on

} readdir(...);
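
As a concrete sketch of this style, the first list from the question (entries with a given phrase, minus those that both belong to a given speaker and end with one of the given ids) might be written as below. The values of `$speaker_id`, `$phrase` and `@ids`, the helper hash `%is_listed_id`, and the in-memory `@all_entries` are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Names taken from the question's list, used in place of readdir().
my @all_entries = qw(
    2-phrasethree-11.txt 3-phrasethree-1.txt 4-phrasethree-3.txt 6-phraseone-2.txt
);

# Assumed filter values for the example.
my ( $speaker_id, $phrase, @ids ) = ( 3, 'phrasethree', 1, 3 );
my %is_listed_id = map { $_ => 1 } @ids;

my @result1 = grep {

    # $1 = speaker id, $2 = phrase, $3 = id
    /^(\d+)-(.*?)-(\d+)\.txt$/
    and $2 eq $phrase
    and !( $1 eq $speaker_id and $is_listed_id{$3} )

} @all_entries;

print "$_\n" for @result1;
```

Looking the id up in a hash compares whole ids, which sidesteps the 1 versus 10 or 11 problem mentioned in the question.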
