python - Python regex: replacing a pattern in a string

Question

I want to replace some substrings in a string of wiki markup. For example, given the string

some other string before
; Methods
{{columns-list|3|
* [[Anomaly detection|Anomaly/outlier/change detection]]
* [[Association rule learning]]
* [[Statistical classification|Classification]]
* [[Cluster analysis]]
* [[Decision trees]]
* [[Factor analysis]]
* [[Neural Networks]]
* [[Regression analysis]]
* [[Structured data analysis (statistics)|Structured data analysis]]
* [[Sequence mining]]
* [[Text mining]]
}}

; Application domains
{{columns-list|3|
* [[Analytics]]
* [[Bioinformatics]]
* [[Business intelligence]]
* [[Data analysis]]
* [[Data warehouse]]
* [[Decision support system]]
* [[Drug Discovery]]
* [[Exploratory data analysis]]
* [[Predictive analytics]]
* [[Web mining]]
}}
some other string after

I want to replace the original substring with the following:

[[Anomaly detection|Anomaly/outlier/change detection]]
[[Association rule learning]]
[[Statistical classification|Classification]]
[[Cluster analysis]]
[[Decision trees]]
[[Factor analysis]]
[[Neural Networks]]
[[Regression analysis]]
[[Structured data analysis (statistics)|Structured data analysis]]
[[Sequence mining]]
[[Text mining]]
[[Analytics]]
[[Bioinformatics]]
[[Business intelligence]]
[[Data analysis]]
[[Data warehouse]]
[[Decision support system]]
[[Drug Discovery]]
[[Exploratory data analysis]]
[[Predictive analytics]]
[[Web mining]]

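First I tried several regular expressions to extract whatever is inside {{}}, but I always got None.

Added: the catch is that I only care about the [[]] content that is itself inside {{}}. There are other [[]] occurrences in other parts of the string.

So how can I do this with re.sub? Thanks.

Added: my current solution (ugly)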

import re

def regt(matchobj):
    # Store matchobj.group(0) somewhere else; add it back to the string later.
    # Next, another function will remove all {{}} blocks.
    return ''

matches = re.sub(r'\[\[.*?\]\](?=[^{]*\}\})', regt, wiki_string2)

score 0 · Accepted Answer

Try a non-greedy regular expression such as: r"\{\{.*?\}\}"
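
One possible reason for always getting None or an empty result: by default "." does not match a newline, and re.match only matches at the very start of the string. A minimal sketch, assuming a shortened sample string (the variable names are made up for illustration):

```python
import re

# Shortened, hypothetical sample in the same shape as the question's string.
text = (
    "before\n"
    "{{columns-list|3|\n"
    "* [[Cluster analysis]]\n"
    "* [[Decision trees]]\n"
    "}}\n"
    "after"
)

# Without re.DOTALL, "." stops at newlines, so the multi-line block is missed.
no_flag = re.findall(r"\{\{.*?\}\}", text)
print(no_flag)  # []

# re.match anchors at the start of the string, which here is "before".
print(re.match(r"\{\{.*?\}\}", text, flags=re.DOTALL))  # None

# With re.DOTALL the lazy pattern spans lines and stops at the first "}}".
blocks = re.findall(r"\{\{.*?\}\}", text, flags=re.DOTALL)
print(len(blocks))  # 1
```

This only extracts the whole {{...}} block; picking the [[...]] links out of it is a separate step.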

score 0 · Accepted Answer

Instead of replacing, match the links with this pattern:

\[\[.*?\]\](?=[^{]*\}\})

.*? matches lazily, so it stops at the first occurrence of ]].

.* matches greedily, so it stops at the last occurrence of ]].
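
To see the difference between the lazy and the greedy form, here is a small sketch on a made-up line that holds two links:

```python
import re

# Hypothetical one-line input with two links.
line = "[[Cluster analysis]] and [[Decision trees]]"

lazy = re.findall(r"\[\[.*?\]\]", line)
greedy = re.findall(r"\[\[.*\]\]", line)

print(lazy)    # ['[[Cluster analysis]]', '[[Decision trees]]']
print(greedy)  # ['[[Cluster analysis]] and [[Decision trees]]']
```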

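(?=[^{]*\}\}) is a lookahead: it means match [[ ]] only if it is followed by zero or more characters other than { and then by }}.

This is because you want to match [[ ]] only when it is inside {{ }}.

So the characters after ]] can be anything except {, up to }}.

This avoids cases like the following: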

[[xyz]]<-this would not match since { after it
{{
[[xyz]]<-this would match since it is not followed by { and it reaches }}
[[xyz]]<-this would match since it is not followed by { and it reaches }}
}}
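
The case above can be checked with a short sketch. The sample text is invented, and the approach assumes the {{ }} blocks are not nested (a { anywhere between a link and its closing }} would stop the lookahead from matching):

```python
import re

# Hypothetical sample: one link before the block, two inside, one after.
text = (
    "[[outside before]]\n"
    "{{columns-list|3|\n"
    "* [[Cluster analysis]]\n"
    "* [[Decision trees]]\n"
    "}}\n"
    "[[outside after]]\n"
)

# A link counts only if "}}" follows with no "{" in between.
pattern = re.compile(r"\[\[.*?\]\](?=[^{]*\}\})")
links = pattern.findall(text)

print(links)  # ['[[Cluster analysis]]', '[[Decision trees]]']
```

The first link is rejected because a { comes before any }}, and the last one because no }} follows it at all.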

score 0 · Accepted Answer

You can try the following:

In [10]: p = r"\[\[.*?\]\]"
In [11]: s1 = '\n'.join(re.findall(p, s))
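
As a standalone script, the same idea might look like the sketch below. The sample string is made up, and it shows that this pattern on its own collects every link, including those outside {{ }}:

```python
import re

# Hypothetical sample: one link outside the braces and one inside.
s = "see [[Data mining]]\n{{columns-list|3|\n* [[Cluster analysis]]\n}}"

p = r"\[\[.*?\]\]"
s1 = "\n".join(re.findall(p, s))

print(s1)
# [[Data mining]]
# [[Cluster analysis]]
```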

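Update: with the additional constraint (only text inside {{}} should match), you can reach the goal in two steps:

1. Select the text inside the curly braces.
2. Then select the text inside the square brackets.

You can do it like this (using a source string that also contains bracketed text that should not match):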

In [157]: print(s)
some [[other string before]]
; Methods
{{columns-list|3|
* [[Cluster analysis]]
* [[Decision trees]]
* [[Factor analysis]]
}}
; Application domains
{{columns-list|3|
* [[Analytics]]
* [[Bioinformatics]]
* [[Web mining]]
}}
some [[other string after]]

In [158]: p = r"(?:\{\{)[\s\S]*?(?:\}\})"

In [159]: s1 = '\n'.join(re.findall(p, s))

In [160]: print(s1)
{{columns-list|3|
* [[Cluster analysis]]
* [[Decision trees]]
* [[Factor analysis]]
}}
{{columns-list|3|
* [[Analytics]]
* [[Bioinformatics]]
* [[Web mining]]
}}

In [161]: p1 = r"\[\[.*?\]\]"

In [162]: s2 = '\n'.join(re.findall(p1, s1))

In [163]: print(s2)
[[Cluster analysis]]
[[Decision trees]]
[[Factor analysis]]
[[Analytics]]
[[Bioinformatics]]
[[Web mining]]
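
Since the question asks about re.sub, the two steps can also be folded into a single substitution by passing a function as the replacement. This is only a sketch under the same assumption of non-nested {{ }} blocks; BLOCK, LINK and links_only are names invented for the example:

```python
import re

BLOCK = re.compile(r"\{\{.*?\}\}", re.DOTALL)  # one {{ ... }} block, across lines
LINK = re.compile(r"\[\[.*?\]\]")              # one [[ ... ]] link

def links_only(match):
    """Return the links found inside one {{ ... }} block, one per line."""
    return "\n".join(LINK.findall(match.group(0)))

# Shortened, hypothetical sample in the same shape as the string above.
text = (
    "some [[other string before]]\n"
    "{{columns-list|3|\n"
    "* [[Cluster analysis]]\n"
    "* [[Decision trees]]\n"
    "}}\n"
    "some [[other string after]]"
)

# re.sub calls links_only once per block and inserts its return value.
result = BLOCK.sub(links_only, text)
print(result)
# some [[other string before]]
# [[Cluster analysis]]
# [[Decision trees]]
# some [[other string after]]
```

Text outside the blocks, including the links that are not inside {{ }}, is left untouched.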
