python - Python regex: removing square brackets and part of the phrase inside them

Question

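I have a Wikipedia dump and am struggling to find a suitable regex pattern for removing the double square brackets in an expression. Here is an example of such an expression: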

line = 'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the [[herbicide]]s and [[defoliant]]s used by the [[United States armed forces|U.S. military]] as part of its [[herbicidal warfare]] program, [[Operation Ranch Hand]], during the [[Vietnam War]] from 1961 to 1971.'

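I want to remove all square brackets under the following conditions:

If there is no vertical separator (|) inside the square brackets, remove only the brackets.

Example: [[herbicide]]s becomes herbicides.

If there is a vertical separator inside the brackets, remove the brackets and keep only the phrase after the separator.

Example: [[United States armed forces|U.S. military]] becomes U.S. military.

I tried using re.match and re.search, but could not get the desired output.

Thank you for your help!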

score 11 · Accepted Answer

What you need is re.sub. Note that square brackets and pipes are both metacharacters, so they need to be escaped.

re.sub(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]', r'\1', line)

The \1 in the replacement string refers to what was matched inside the parentheses that do not start with ?: (that is, in every case, the text you want to keep).
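As a minimal sketch of how the group numbering works, the snippet below checks that the (?:...) group is not counted and that \1 picks up the kept text. The shortened sample string is an assumption made for illustration, not the full line from the question.

```python
import re

# Pattern from the answer above.
pattern = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')

# Shortened, illustrative input (not the full line from the question).
sample = 'the [[herbicide]]s used by the [[United States armed forces|U.S. military]]'

# (?:...) does not create a numbered group, so the pattern has exactly one.
print(pattern.groups)

# \1 therefore always refers to the text that should be kept.
result = pattern.sub(r'\1', sample)
print(result)
```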

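There are two caveats. First, this allows only one pipe between the opening and closing brackets. If there can be more than one, you need to specify whether you want everything after the first one or everything after the last one. Second, a single ] between the opening and closing brackets is not allowed. If that is a problem, there is still a regex solution, but it becomes considerably more complicated.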
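Both caveats can be checked with a short sketch. The two input strings are hypothetical, chosen only to trigger each case. In both, the pattern finds no match, so re.sub leaves the text untouched rather than raising an error.

```python
import re

pattern = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')

# Hypothetical inputs chosen to trigger each caveat.
two_pipes = 'see [[a|b|c]] here'       # more than one pipe inside the brackets
inner_bracket = 'see [[a]b]] here'     # a single ] inside the brackets

# No match is found in either case, so the input comes back unchanged.
two_pipes_result = pattern.sub(r'\1', two_pipes)
inner_bracket_result = pattern.sub(r'\1', inner_bracket)

print(two_pipes_result)
print(inner_bracket_result)
```

Because unmatched links pass through silently, it may be worth searching the output for leftover [[ when processing a real dump.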

Here is a full explanation of the pattern:

\[\[        # match two literal [
(?:         # start optional non-capturing subpattern for pre-| text
   [^\]|]   # this looks a bit confusing but it is a negated character class
            # allowing any character except for ] and |
   *        # zero or more of those
   \|       # a literal |
)?          # end of subpattern; make it optional
(           # start of capturing group 1 - the text you want to keep
    [^\]|]* # the same character class as above
)           # end of capturing group
\]\]        # match two literal ]
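The annotated layout above can be kept in real code by compiling the pattern with re.VERBOSE, which ignores whitespace and # comments outside character classes. This is a sketch; the names WIKILINK and strip_wikilinks are hypothetical and do not come from the answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments stay in the code.
WIKILINK = re.compile(r"""
    \[\[            # two literal opening brackets
    (?:             # optional non-capturing group for the text before the pipe
        [^\]|]*     # zero or more characters that are neither ] nor |
        \|          # a literal pipe
    )?              # the whole prefix is optional
    (               # capturing group 1: the text to keep
        [^\]|]*     # same character class as above
    )
    \]\]            # two literal closing brackets
""", re.VERBOSE)


def strip_wikilinks(text):
    """Replace each [[target|label]] or [[label]] with just the label."""
    return WIKILINK.sub(r"\1", text)


print(strip_wikilinks('[[Vietnam War]] and [[United States armed forces|U.S. military]]'))
```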

score 3 · Accepted Answer

You can use re.sub to find everything between [[ and ]]. I think it is slightly easier to pass a lambda function to do the replacement (to get everything after the last '|').

>>> import re
>>> re.sub(r'\[\[(.*?)\]\]', lambda L: L.group(1).rsplit('|', 1)[-1], line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'
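The same idea can be written with a named function in place of the lambda, which makes the behavior with several pipes easier to see. This is a sketch; the function names and the file-link input are hypothetical examples, not taken from the question.

```python
import re


def keep_last_segment(match):
    """Return the text after the last '|' in the link, or the whole link text."""
    return match.group(1).rsplit('|', 1)[-1]


def strip_links(text):
    # .*? is non-greedy, so each match stops at the first ]] that follows [[.
    return re.sub(r'\[\[(.*?)\]\]', keep_last_segment, text)


# Hypothetical input with two pipes: only the final segment is kept.
multi_pipe_result = strip_links('[[File:x.png|thumb|A caption]]')
# Without a pipe, rsplit returns a single element, so the text is kept whole.
no_pipe_result = strip_links('[[herbicide]]s')

print(multi_pipe_result)
print(no_pipe_result)
```

Note that . does not match a newline by default, so a link that spans two lines would not be matched unless re.DOTALL is passed.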

score 2 · Accepted Answer

>>> import re
>>> re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)]]', r'\1', line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'

Explanation:

\[\[       # match two opening square brackets
(?:        # start optional non-capturing group
   [^|\]]*   # match any number of characters that are not '|' or ']'
   \|        # match a '|'
)?         # end optional non-capturing group
(          # start capture group 1
   [^\]]*    # match any number of characters that are not ']'
)          # end capture group 1
]]         # match two closing square brackets

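By replacing each match of the regex above with the contents of capture group 1, you get the contents of the square brackets, but only what comes after the separator if one is present.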
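One consequence of allowing '|' inside capture group 1 is that, when a link contains several pipes, this pattern keeps everything after the first one. A short sketch with a hypothetical input illustrates this:

```python
import re

# Pattern from this answer: group 1 excludes only ], so it may contain pipes.
after_first_pipe = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]*)]]')

# Hypothetical input with two pipes.
multi = '[[a|b|c]]'

# The optional prefix consumes 'a|', and group 1 takes the rest up to ]].
result = after_first_pipe.sub(r'\1', multi)
print(result)
```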
