regex - How to add a dash after a string after matching a pattern in perl.regex

Question

I have data of this type. Please help me: I am new to regular expressions, so please explain each step in your answer. Thanks.

7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

From the lines above, I want to extract only this data:

7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

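Next, take AX1A: if two consecutive letters follow the underscore, they should be written as AX_, and a single digit and a single letter should become -1_ and -A_. Applying this pattern gives AX_-1_-A_, and all other data must stay the same.

Similarly, "W1A" on the next line starts with a single letter "W", which should be converted to -W_. The next character is a single digit, so it should be converted by the same pattern to -1_, and the last character is handled the same way. The result is -W_-1_-A_.

I am only interested in applying the regex to the part that follows the leading digits and underscore.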

_AX1A_

_W1A_

_U1A_

_AV21NA_

The output should look like this:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN

7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

score 1 · Accepted Answer

#!/usr/bin/perl -w
use strict;
while (<>) {
    next if /^\s*$/;
    chomp;
    ## Remove those parts of the line we do not want
    ## You do not specify what, if anything, is constant about
    ## the parts you do not want. One of the following cases should
    ## serve.

    ## i) Remove the string _1X50_ and the next characters between
    ## two underscores:
    s/_1X50_.+?_/_/;

    ## ii) keep the first 2 and last 3 sections of each line.
    ## Uncomment this line and comment the previous one to use this:
    #s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;

    ## The line now contains only those regions we are 
    ## interested in. Split on '_' to collect an array of the
    ## different parts (@a):
    my @a=split(/_/);

    ## $a[1] is the second string, eg AX1A,W1A etc.
    ## We search for zero or more letters, followed by zero or more digits,
    ## followed by zero or more letters. The 'i' modifier makes the match
    ## case insensitive. In list context the match returns the captured
    ## groups, which we store in the @matches array (the 'g' modifier is
    ## harmless here because the pattern is anchored with ^).
    my @matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);

    ## So, for each of the matched strings, if the length of the match
    ## is less than 2, add a '-' to the beginning of the string:
    foreach my $match (@matches) {
        if (length($match)<2) {
            $match="-" . $match;
        }
    }
    ## Now replace the original $a[1] with each string in
    ## @matches, connected by '_':
    $a[1]=join("_", @matches);

    ## Finally, build the string $kk by joining each element
    ## of the line (@a) by a '_', and print:
    my $kk=join("_", @a);
    print "$kk\n";
}

score 1 · Accepted Answer

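use strict;
use warnings;
use feature 'say';   # needed for say() below

# Assumption: records are read from standard input.
my $input = \*STDIN;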

my $match 
    = qr/
    ( \d+          # group of digits
      _            # followed by an underscore
    )              # end group
    ( \p{Alpha}+ ) # group of alphas             
    ( \d+ )        # group of digits
    ( \p{Alpha}* ) # group of alphas
    ( \w+ )        # group of word characters
    /x
    ;

while ( my $record = <$input> ) { # record of input
    # match and capture
    if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
        say $pre 
             # if the alpha has length 1, add a dash before it
          . ( length $pre_alpha == 1 ? '-' : '' )
            # then the alpha
          . $pre_alpha
            # then the underscore
          . '_'
            # test if the length of the number is 1 and the length of the 
            # trailing alpha string is 1 
          . ( length( $num ) == 1 && length( $post_alpha ) == 1
              # if true, apply a dash before each 
            ? "-$num\_-$post_alpha" 
              # otherwise treat as AV21NA in example.
            : "$num\_$post_alpha"
            )
          . $post
          ;

    }
}

score 1 · Accepted Answer

Do you mean something like this:

while (<DATA>) {
    s/1X50_(LI|RE)_//;
    s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
    s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
    s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
    print;
}

__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

Output:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

score 1 · Accepted Answer

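I don't know everything you need to strip out, but I'll take a guess, and you can clarify if this doesn't quite do what you need.

For the first step, removing 1X50_RE_ and 1X50_LI_, you can search for those strings and replace them with nothing.

Next, to break the second letter/digit code into smaller chunks, you can use a pair of substitutions, each with a lookahead. However, since you only want to touch that second chunk of code, first split the whole line, work on the second chunk, and then join the pieces back together.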

while (<$input>) {

    # Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
    s/1X50_(RE|LI)_//;

    my @pieces = split /_/; # split the line into pieces at each underscore

    # Just working with the second chunk. /g, means do it for all matches found
    $pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Convert AX1 -> AX_-1
    $pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Convert 1A -> 1_-A

    # Join the pieces back together again
    $_ = join '_', @pieces;

    print;
}

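$_ is the variable that many Perl operations act on when you don't specify one. In the while condition, <$input> reads the next line from the filehandle held in $input into $_. The s///, split, and print functions all work on $_ when no variable is given. The =~ operator is how you tell Perl to use $pieces[1] (or whatever variable you are working on) instead of $_ for a regex operation. (For split or print you pass the variable as an argument instead, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)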
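To make the role of $_ concrete, here is a minimal sketch that performs the same steps twice, once relying on $_ and once naming the variable explicitly. The sub names strip_implicit and strip_explicit are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Implicit form: s///, split and the return value all act on $_.
# (strip_implicit is a hypothetical name used only for this sketch.)
sub strip_implicit {
    local $_ = shift;
    s/1X50_(RE|LI)_//;            # same as: $_ =~ s/1X50_(RE|LI)_//
    my @pieces = split /_/;       # same as: split /_/, $_
    return join '_', @pieces;
}

# Explicit form: the same steps with the variable named each time.
sub strip_explicit {
    my $line = shift;
    $line =~ s/1X50_(RE|LI)_//;
    my @pieces = split /_/, $line;
    return join '_', @pieces;
}

my $sample = '7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN';
print strip_implicit($sample), "\n";
print strip_explicit($sample), "\n";
```

Both calls should print the same line, which is the point: the implicit form is only shorthand for the explicit one.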

Oh, and to explain the regular expressions a little:

s/1X50_(RE|LI)_//;

This matches anything containing 1X50_RE_ or 1X50_LI_ (the (|) is a list of alternatives) and replaces it with nothing (the empty // at the end).
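As a small illustration of the alternation, the sketch below wraps the substitution in a hypothetical helper, drop_1x50, and uses a non-capturing group (?:...) since the captured value is never used. Lines that contain neither alternative pass through unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: delete "1X50_RE_" or "1X50_LI_" if present.
sub drop_1x50 {
    my ($line) = @_;
    # (?:RE|LI) groups the alternatives without capturing them.
    $line =~ s/1X50_(?:RE|LI)_//;
    return $line;
}

print drop_1x50($_), "\n" for qw(
    7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
    7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
);
```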

Looking at one of the other lines:

s/([A-Z])(?=[0-9])/$1_-/g;

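The plain parentheses (...) around [A-Z] cause $1 to be set to the character matched inside them (here, a letter from A to Z). The (?=...) parentheses create a zero-width positive lookahead assertion: the regex matches only if the next part of the string matches the expression (a digit, 0-9), but that part of the match is not included in the string being replaced.

Using /$1_-/ replaces the matched part of the string with the value captured by the (...) around [A-Z], with _- appended in front of the [0-9] required by the lookahead.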
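The effect of the two lookahead substitutions can be checked in isolation. The following is a sketch only; split_chunk is a hypothetical helper name, and ${1} is written with braces purely to make the variable name unambiguous.

```perl
use strict;
use warnings;

# Hypothetical helper: apply both lookahead substitutions to one chunk.
sub split_chunk {
    my ($chunk) = @_;
    # A letter that is followed by a digit: keep the letter, append "_-".
    $chunk =~ s/([A-Z])(?=[0-9])/${1}_-/g;
    # A digit that is followed by a letter: keep the digit, append "_-".
    $chunk =~ s/([0-9])(?=[A-Z])/${1}_-/g;
    return $chunk;
}

print "$_ -> ", split_chunk($_), "\n" for qw(AX1A W1A AV21NA);
```

Because the lookahead consumes nothing, the digit or letter it checks is still in place for the next substitution. With these two rules alone, W1A becomes W_-1_-A and AV21NA becomes AV_-21_-NA, so the leading dash on a single letter and the dash-free AV_21_NA from the question would need additional rules.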

score -1 · Accepted Answer

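If you are new to regular expressions, zostay's suggestion of splitting the line may make things easier. From a performance standpoint, however, it is best to avoid the split. Here is how to do it without splitting: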

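open IN_FILE, "<", "filename" or die "Whoops!  Can't open file.";
while (<IN_FILE>)
{
     # $_ holds the current line; warn (to STDERR) if the code part is missing
     s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/ 
          or warn "line didn't match: $_";
     s/1X50_(LI|RE)_//;
     print;   # emit the transformed line
}
close IN_FILE;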

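Breaking down the first pattern: s/// is the search-and-replace operator. ^ matches the start of the line. \d{7}_ matches seven digits followed by an underscore. \K is the "keep" operator, which works like a lookbehind: whatever was matched before it is not part of the string that gets replaced. Each set of parentheses () marks a chunk of the match to capture; these are placed, in order, into the match variables $1, $2, and so on. [A-Z]{1,2} means one or two uppercase letters must match. You can probably work out what the other two parenthesized sections mean. -${1}-${2}-${3} replaces what was matched with the first three match variables, each preceded by a dash. The only reason for the braces is to make clear what the variable name is.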
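To see what \K keeps and what gets replaced, here is a minimal sketch that applies the same substitution to a single string. The helper name dash_code is hypothetical, and \K assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: the 7 digits and underscore must match,
# but only the text after \K is replaced.
sub dash_code {
    my ($line) = @_;
    $line =~ s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/;
    return $line;
}

print dash_code('7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN'), "\n";
print dash_code('7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD'), "\n";
```

The leading 7210315_ survives untouched because it sits before \K. The same result could be had without \K by capturing the prefix in its own group and writing it back in the replacement.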
