regex - How can I capture only the surnames from a regex pattern?

Question

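Team,

I have written a Perl program that validates the formatting (punctuation and so on) of surnames, forenames, and years. If an entry does not follow the specified pattern, it is flagged so that it can be corrected.

For example, my input file contains lines of text like this: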

<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

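My program works fine: if an entry does not follow the pattern, the script reports an error. The input text above does not produce an error. The following, however, is an example of an error, because the comma after Rose in "Rose A. J." is missing.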

NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., &amp; Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey &amp; D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>

Is it possible to capture all the surnames and the year from my regex search pattern? If so, can I generate the text with a prefix on each line, as shown below?

<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

My regex search script is as follows:

while(<$INPUT_REF_XML_FH>){
    $line_count += 1;
    chomp;
    if(/

    # bibliomixed XML ID tag and attribute----<START>
    <bibliomixed
    \s+
    id=".*?">
    # bibliomixed XML ID tag and attribute----<END>

    # --------2 OR MORE AUTHOR GROUP--------<START>
    (?:
    (?:
    # pattern for surname----<START>
    (?:(?:[\w\x{2019}\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>
    ,\s)+
    #---------------FINAL AUTHOR GROUP SEPARATOR----<START>
    &amp;\s
    #---------------FINAL AUTHOR GROUP SEPARATOR----<END>

    # --------2 OR MORE AUTHOR GROUP--------<END>
    )? 

    # --------LAST AUTHOR GROUP--------<START>

    # pattern for surname----<START>
    (?:(?:[\w\x{2019}\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>

    (?: # pattern for editor notation----<START>
    \s\(Ed(?:s)?\.\)\.
    )? # pattern for editor notation----<END>

    # --------LAST AUTHOR GROUP--------<END>
    \s
    \(
    # pattern for a year----<START>
    (?:[A-Za-z]+,\s)? # July, 1999
    (?:[A-Za-z]+\s)? # July 1999
    (?:[0-9]{4}\/)? # 1999\/2000
    (?:\w+\s\d+,\s)?# August 18, 2003
    (?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
    (?:[A-Za-z])? # 1999a
    (?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
    (?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
    (?:,\s[A-Za-z]+)? # 1999, Spring
    (?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
    (?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
    (?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
    # pattern for a year----<END>
    \)\.
    /six){
        print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
        $found_count += 1;
    } else{
        print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
        $not_found_count += 1;
    }
}

Thanks for your help,

Prem

score 0 · Accepted Answer

Change this bit:

# pattern for surname----<END>
    ,?\s

This means an optional comma followed by whitespace. It won't work if the person's name is "Bunga Bunga".
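To see what this change does in isolation, here is a minimal sketch. The author_re helper and its surname/initials pattern are simplified assumptions for illustration, not the full pattern from the question.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for one author entry:
# a surname, a separator, then one or more initials.
sub author_re {
    my ($sep) = @_;
    return qr/^[\w'-]+$sep(?:[A-Z]\.\s)*[A-Z]\.$/;
}

my $strict  = author_re(qr/,\s/);    # comma required, as in the question
my $relaxed = author_re(qr/,?\s/);   # comma optional, as suggested above

for my $author ('Rose, A. J.', 'Rose A. J.') {
    printf "%-12s strict=%-8s relaxed=%s\n", $author,
        ($author =~ $strict  ? 'match' : 'no match'),
        ($author =~ $relaxed ? 'match' : 'no match');
}
```

The relaxed pattern accepts both forms, so an entry such as "Rose A. J." would no longer be reported as NOT FOUND. Whether that is desirable depends on whether the missing comma should still be flagged.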

score 0 · Accepted Answer

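Every subpattern that starts with (?: is a non-capturing group. This speeds up the regex for several reasons, one of which is that the subpattern does not have to be captured.

To capture a pattern, you only need to put parentheses around the part you want to capture. So you can either remove the non-capturing marker ?: or place parentheses () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings

I'm not sure, but from your code I think you may be trying to use lookahead assertions. For example, you test for a surname with spaces and, if that fails, test for a surname with hyphens. These tests do not start from the same point each time: the regex either matches the first subpattern or it doesn't, and then moves forward to test the next position against the second surname pattern. Whether the regex will then test the second name against the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind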

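#!/usr/bin/perl

use warnings;
use strict;

# Every match below succeeds (all groups are optional), and a successful
# match resets $1 and $2, so nothing has to be cleared between examples.

# Example 1: two capturing groups, so $1 and $2 are both set.
my $line = '123 456 7antelope89';

$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;

my ($ay, $be) = (defined $1 ? $1 : 'nocapture ', defined $2 ? $2 : 'nocapture ');

print 'a: ', $ay, 'b: ', $be, $/;


# Example 2: only non-capturing groups, so $1 and $2 stay undefined.
$line = '123 456 7bealzelope89';

$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;

($ay, $be) = (defined $1 ? $1 : 'nocapture ', defined $2 ? $2 : 'nocapture ');

print 'a: ', $ay, 'b: ', $be, $/;


# Example 3: capturing groups mixed with non-capturing groups.
# [a-z]+ is used for the inner capture because \w also matches digits
# and would capture "canteloupe8" instead of "canteloupe".
$line = '123 456 7canteloupe89';

$line =~ /((?:\d+\s\d+\s))?(?:\d+([a-z]+)\d+)?/;

($ay, $be) = (defined $1 ? $1 : 'nocapture ', defined $2 ? $2 : 'nocapture ');

print 'a: ', $ay, 'b: ', $be, $/;

exit 0;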

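In the third example, the first pattern makes little sense if you are capturing the whole pattern, because it tells the regex not to capture the group while capturing that same group. Where this is useful is the second pattern, which is a fine-grained capture: the captured pattern is only a part of a larger non-capturing group.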

a: 123 456 b: 7antelope89
a: nocapture b: nocapture 
a: 123 456 b: canteloupe
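Applied to the question, one possible approach is sketched below. A capturing group inside a repeated group keeps only the text of its last repetition, so this sketch captures the whole author list and the year once, and then pulls the surnames out of the author list with a second, global match. The $surname, $initials, $author, and $entry patterns and the bib_prefix helper are simplified assumptions standing in for the question's much longer pattern, and the sample lines are shortened.

```perl
use strict;
use warnings;

# Simplified, hypothetical stand-ins for the patterns in the question.
my $surname  = qr/[\w\x{2019}'-]+(?:\s[\w\x{2019}'-]+)*/;
my $initials = qr/[A-Z]\.(?:[\s-][A-Z]\.)*/;
my $author   = qr/$surname,\s$initials/;

my $entry = qr/
    <bibliomixed\s+id="[^"]*">
    (                                     # group 1: the whole author list
        (?: (?: $author ,\s )+ &amp;\s )? # two or more authors
        $author                           # last (or only) author
    )
    \s \( ( [0-9]{4} [a-z]? ) \) \.       # group 2: the year
/x;

# Returns the <BIB>...</BIB> prefix, or nothing if the line does not validate.
sub bib_prefix {
    my ($line) = @_;
    my ($authors, $year) = $line =~ $entry or return;
    # A global match in list context returns every surname capture.
    my @surnames = $authors =~ /($surname),\s$initials/g;
    return '<BIB>' . join(', ', @surnames, $year) . '</BIB>';
}

my $line = '<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., '
         . '&amp; Machado, A. (2008). Sexual satisfaction.</bibliomixed>';
my $bad  = '<bibliomixed id="bkrmbib120">Asher, S. R., &amp; Rose A. J. '
         . '(1997). Promoting adjustment.</bibliomixed>';

for my $input ($line, $bad) {
    my $prefix = bib_prefix($input);
    print defined $prefix ? "$prefix$input\n" : "NOT FOUND: $input\n";
}
```

Keeping validation and extraction as two separate matches means the validating pattern can stay free of capturing groups apart from the two outer ones.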

A small nitpick:

  id=".*?"

might be better as

  id="\w*?"

since ID names should be alphanumeric plus underscore, IIRC.
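The difference can be sketched with a tag that carries a second attribute. The role attribute is a hypothetical example, and the [^"]* variant is an additional alternative that is not part of the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical tag with an extra attribute after the id.
my $tag = '<bibliomixed id="bk1" role="x">';

# .*? is lazy but may still expand across quotes to make the match succeed.
my $lazy_dot  = $tag =~ /<bibliomixed\s+id=".*?">/   ? 'match' : 'no match';
# \w*? cannot cross the closing quote of the id value.
my $word_only = $tag =~ /<bibliomixed\s+id="\w*?">/  ? 'match' : 'no match';
# [^"]* also stops at the closing quote, but allows non-word characters.
my $no_quote  = $tag =~ /<bibliomixed\s+id="[^"]*">/ ? 'match' : 'no match';

print "lazy dot:  $lazy_dot\n";
print "word only: $word_only\n";
print "no quote:  $no_quote\n";
```

Only the first pattern matches here, because .*? grows past the id value until it finds a "> to end on.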
