regex - Perl - Regular expression to extract only comma-separated strings

Question

I have a question that I'm hoping someone can help with...

I have a variable that contains the content of a web page (scraped using WWW::Mechanize).

The variable contains data such as the following:

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"

The only bits I'm interested in from the above examples are:

@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger") 
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")

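The problem I'm having:

I'm trying to extract only the comma-separated strings from the variable and store them in an array for later use.

But what is the best way to make sure that I get the strings at the start (i.e. cat_dog) and at the end (i.e. chicken-pig) of the comma-separated list of animals?

Also, because the variable contains web page content, there will inevitably be cases where a comma is immediately followed by a space and then another word, since that is the correct way to use commas in paragraphs and sentences...

For example: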

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

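I'm not interested in cases where the comma is followed by a space, as shown above.

I'm only interested in cases where the comma is NOT followed by a space (i.e. cat_dog,horse,rabbit,chicken-pig).

I've tried a number of ways of doing this but can't work out the best way to construct the regular expression.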

score 8 · Accepted Answer

How about

[^,\s]+(,[^,\s]+)+

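This matches one or more characters that are not a space or a comma ([^,\s]+), followed one or more times by a comma and one or more characters that are not a space or a comma.

Further to the comments:

To match more than one sequence, add the g modifier for global matching.
The following splits each match ($&) on a comma and pushes the results to @matches.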

my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}   

print join("\n",@matches),"\n";
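
As a quick check against the question's data, the same pattern can be run over the sample strings and over the Saturn sentence, which should contribute nothing. This is only a sketch: the helper name extract_lists is made up for illustration, and it captures the whole run in $1 (with a non-capturing inner group) instead of relying on $&.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): returns every item found in
# comma-separated runs that contain no whitespace.
sub extract_lists {
    my ($str) = @_;
    my @items;
    # The outer group captures the whole run; the inner group only repeats.
    while ($str =~ /([^,\s]+(?:,[^,\s]+)+)/g) {
        push @items, split /,/, $1;
    }
    return @items;
}

my @samples = (
    "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
    "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
);

for my $s (@samples) {
    my @found = extract_lists($s);
    print scalar(@found), " item(s): @found\n";
}
```

A comma followed by a space never satisfies ,[^,\s]+, so ordinary prose punctuation is skipped while each run of two or more items is kept.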

score 1 · Accepted Answer

You could probably write a single regex, but a combination of regexes, split, grep and map looks decent:

my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split ' ', $var;

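From right to left:

Split the line on whitespace (split)
Keep only the elements that have no comma at either end but do have one inside (grep)
Split each such element into its parts (map and split)

That way you can easily change the individual parts. For example, to reject elements containing two consecutive commas, add && !/,,/ inside the grep.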
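
The pipeline described above can be sketched as a complete script. The input string is one of the question's samples with two extra tokens appended purely for illustration ("hkew," and "a,,b"), and the grep already includes the optional !/,,/ test.

```perl
use strict;
use warnings;

# Sample from the question; the trailing "," and the "a,,b" token are
# additions made here to exercise the grep conditions.
my $var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew, a,,b";

my @array =
    map  { split /,/ }                        # 3. break each kept token on commas
    grep { !/^,/ && !/,$/ && /,/ && !/,,/ }   # 2. keep tokens with inner commas only
    split ' ', $var;                          # 1. split the line on whitespace

print "$_\n" for @array;
```

Each stage is a separate filter, so a new rule only means another condition in the grep block rather than a change to one large regex.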

score 1 · Accepted Answer

I hope this is clear and suits your needs:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf", 
     "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew", 
     "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
     "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");

    my $regex = qr/
                \s #From your examples, it seems as if every
                   #comma separated list is preceded by a space.
                (
                    (?:
                        [^,\s]+ #Now, not a comma or a space for the
                                 #terms of the list

                        ,        #followed by a comma
                    )+
                    [^,\s]+     #followed by one last term of the list
                )
                /x;

    my @matches = map {
                    # Test the match result itself rather than $1: a failed
                    # match does not reset $1, and a captured "0" is false.
                    if ($_ =~ /$regex/) {
                        [split /,/, $1];
                    }
                    else {
                        [];
                    }
                } @strs;
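
One limitation of the leading \s is that a list at the very start of a string has no preceding space and is therefore missed. A possible variation, offered only as a sketch and not as part of the answer, swaps \s for the negative lookbehind (?<!\S), which also succeeds at the start of the string. The first sample string below is invented for illustration.

```perl
use strict;
use warnings;

my @strs = (
    "cat_dog,horse,rabbit,chicken-pig at the very start",
    "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d",
);

# Assumption: (?<!\S) ("not preceded by a non-space character") stands in
# for the leading \s of the answer's pattern.
my $regex = qr/
    (?<!\S)
    (
        (?: [^,\s]+ , )+   # one or more terms, each followed by a comma
        [^,\s]+            # the last term of the list
    )
/x;

# One array reference per input string; empty when nothing matched.
my @matches = map { $_ =~ /$regex/ ? [ split /,/, $1 ] : [] } @strs;

for my $i (0 .. $#strs) {
    print "$i: ", join(" | ", @{ $matches[$i] }), "\n";
}
```

Like the answer's code, this keeps only the first list found in each string; the g modifier would be needed to collect several lists per string.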
