regex - Regex to capture the phrase between an article (a/an/the) and a number (1 to 4 digits)

Question

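I have tried dozens of permutations of regex to solve this, with no luck.

I need to iterate through dozens of files, pulling out the specific phrases that sit between "the/a/an" and a number of 1 to 4 digits, ignoring punctuation such as {}()[].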

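Example:

The quick brown fox {15} jumps over the lazy dog [20] in a certain way 4 that is definitely not appropriate for all of the viewers (0012).

Needs to return:

the quick brown fox 15

the lazy dog 20

a certain way 4

the viewers 0012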

I'm fine with stripping out the punctuation first: sed 's/[][{}()]//g'
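
For reference, a minimal sketch of what that pre-processing step does to the sample sentence. The variable names line and stripped are only illustrative.

```shell
# Sample sentence, as used in the answers below.
line='The quick brown fox {15} jumps over the lazy dog [20] in a certain way 4 that is definitely not appropriate for all of the viewers (0012).'

# Inside the bracket expression, a leading ] is literal, so [][{}()] matches
# any one of: ] [ { } ( )
stripped=$(printf '%s\n' "$line" | sed 's/[][{}()]//g')

printf '%s\n' "$stripped"
```

The numbers survive without their brackets, but the phrases still have to be cut out of the sentence, which is the part the answers address.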

Any advice?

score 1 · Accepted Answer

With GNU awk you can split the input into records that end in digits optionally surrounded by punctuation:

$ cat file
The quick brown fox {15} jumps over the lazy dog [20] in a certain way 4 that is definitely not appropriate for all of the viewers (0012).


$ gawk -v RS='[[:punct:]]*[[:digit:]]+[[:punct:]]*' 'RT{print $0 RT}' file
The quick brown fox {15}
 jumps over the lazy dog [20]
 in a certain way 4
 that is definitely not appropriate for all of the viewers (0012).

Then you just need to print the part of each record that you want, followed by the record terminator:

$ gawk -v RS='[[:punct:]]*[[:digit:]]+[[:punct:]]*' 'RT{print gensub(/.*\y(the|a|an)\y/,"\\1",1) gensub(/[[:punct:]]/,"","g",RT)}' file
The quick brown fox 15
the lazy dog 20
a certain way 4
the viewers 0012
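
To see what the two gensub() calls contribute, here is a small sketch that applies them to a single record by hand. It assumes GNU awk is installed as gawk. The names demo_gensub, rec, and rt are hypothetical stand-ins: rec and rt play the role of $0 and RT for the second record.

```shell
# Hypothetical demo, not part of the answer: run both gensub() calls on one record.
demo_gensub() {
    gawk 'BEGIN {
        rec = " jumps over the lazy dog "
        rt  = "[20]"
        # Greedy .* consumes everything before the last article in the record;
        # "\\1" puts the captured article back.
        phrase = gensub(/.*\y(the|a|an)\y/, "\\1", 1, rec)
        # The "g" flag removes every punctuation character from the terminator.
        number = gensub(/[[:punct:]]/, "", "g", rt)
        print phrase number
    }'
}
```

Calling demo_gensub prints: the lazy dog 20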

I just noticed that in your example you convert the output to all lower case. To do that, just insert $0=tolower($0) before the print (this also solves the problem of matching the|a|an case-insensitively):

$ gawk -v RS='[[:punct:]]*[[:digit:]]+[[:punct:]]*' 'RT{$0=tolower($0); print gensub(/.*\y(the|a|an)\y/,"\\1",1) gensub(/[[:punct:]]/,"","g",RT)}' file
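
Since the question mentions iterating through dozens of files, the lowercase variant could be wrapped roughly as follows. This is only a sketch. The function name extract_phrases, the file-name prefix, and the extra check that skips records containing no article are assumptions added for illustration, not part of the answer.

```shell
# Hypothetical wrapper: run the lowercase variant over every file passed in
# and prefix each result with the name of the file it came from.
extract_phrases() {
    gawk -v RS='[[:punct:]]*[[:digit:]]+[[:punct:]]*' '
        RT {
            $0 = tolower($0)
            # Assumption: a number with no article before it is not wanted.
            if ($0 !~ /\y(the|a|an)\y/) next
            phrase = gensub(/.*\y(the|a|an)\y/, "\\1", 1)
            number = gensub(/[[:punct:]]/, "", "g", RT)
            print FILENAME ": " phrase number
        }' "$@"
}

# Example call (the glob is a placeholder for your own files):
#   extract_phrases ./*.txt
```

gawk accepts any number of file arguments and sets FILENAME for each one, so no shell loop is needed.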

score 0 · Accepted Answer

This might work for you (GNU sed):

sed -r '/\b(the|an|a)\b/I!d;s//\n&/;s/[^\n]*\n//;s/\{([0-9]{1,4})\}|\(([0-9]{1,4})\)|\[([0-9]{1,4})\]|\b([0-9]{1,4})\b/\1\2\3\4\n/;P;D' file
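
The one-liner is dense, so here is the same script spread over several lines with comments, as a sketch of how it appears to work. It assumes GNU sed (for \b, the I flag, and \n in the script), uses -E as a synonym for -r, and the wrapper name extract_with_sed is hypothetical.

```shell
# Hypothetical wrapper around the answer's script; reads files or stdin.
extract_with_sed() {
    sed -E '
        # Drop the pattern space if it contains no article (I = ignore case).
        /\b(the|an|a)\b/I!d
        # Empty regex reuses the one above: put a newline before the first article ...
        s//\n&/
        # ... then delete everything up to and including that newline.
        s/[^\n]*\n//
        # Replace the first number, bare or bracketed, with its digits plus a newline.
        s/\{([0-9]{1,4})\}|\(([0-9]{1,4})\)|\[([0-9]{1,4})\]|\b([0-9]{1,4})\b/\1\2\3\4\n/
        # Print up to the first newline, delete that part, and restart the
        # script on whatever is left in the pattern space.
        P
        D
    ' "$@"
}
```

Fed the sample line on standard input, this prints The quick brown fox 15, the lazy dog 20, a certain way 4, and the viewers 0012, one per line. Note that it keeps the text from the first article after the previous number, whereas the gawk answer keeps the text from the last article before the number, so the two can differ when a phrase contains more than one article.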
