regex - Finding repeated-word typos in text with a regular expression

Question

Is it possible to find all repeated-word misprints in a text (in my case, LaTeX source)? For example:

... The Lagrangian that that includes this potential ...
... This is confimided by the the theorem of ...

Can this be done with a regular expression?

Use your favorite tool (sed, grep) or language (Python, Perl, ...).

score 1 · Accepted Answer

This JavaScript example works:

var s = '... The Lagrangian that that includes this potential ... This is confimided by the the theorem of ...'
var result = s.match(/\b(\w+)\s\1\b/gi)

Result:

["that that", "the the"];

The regex:

/\b(\w+)\s\1\b/gi

# /     --> Regex start,
# \b    --> A word boundary,
# (\w+) --> Followed by a word, grouped,
# \s    --> Followed by a space,
# \1    --> Followed by the word in group 1,
# \b    --> Followed by a word boundary,
# /gi   --> End regex, (g)lobal flag, case (i)nsensitive flag.

The word boundaries are there so that the regex does not match strings such as "hot hotel" or "nice ice".
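To see what the word boundaries change, here is a minimal Python sketch that runs the same pattern with and without `\b` on a few sample strings. The names `with_boundaries`, `without_boundaries`, and `repeated` are made up for this illustration and are not part of the answer above.

```python
import re

# The pattern from the answer, plus a variant with the word boundaries removed.
with_boundaries = re.compile(r'\b(\w+)\s\1\b', re.IGNORECASE)
without_boundaries = re.compile(r'(\w+)\s\1', re.IGNORECASE)

def repeated(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

samples = ['hot hotel', 'nice ice', 'the the theorem']
for sample in samples:
    # Print the sample, then the matches with and without word boundaries.
    print(sample, repeated(with_boundaries, sample), repeated(without_boundaries, sample))
```

Without the boundaries, "hot hotel" yields the false match "hot hot" and "nice ice" yields "ice ice"; with them, only the genuine repetition "the the" is reported.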

score 1 · Accepted Answer

Use egrep -w with a backreference in the regex, (\w+)\s+\1:

$ echo "The Lagrangian that that includes this potential" | egrep -ow "(\w+)\s+\1"
that that

$ echo "This is confimided by the the theorem of" | egrep -ow "(\w+)\s+\1"
the the

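Note: the -o option prints only the matching part of each line, which is useful here to show what actually matched. In practice you will probably want to drop that option and use --color instead. The -w option is important because it restricts matches to whole words; without it, the "is is" inside "This is con..." would match.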

(\w+) # Matches & captures one or more word characters ([A-Za-z0-9_])
\s+   # Match one or more whitespace characters 
\1    # The last captured word

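Using egrep -w --color "(\w+)\s+\1" file has the advantage of clearly highlighting the words that may be repeated by mistake. An automatic substitution might not be so wise, since many correct phrases, such as "reggae raggae sauce" or "beautiful beautiful day", would be changed.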
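In the same spirit of reviewing candidates rather than replacing them blindly, here is a minimal Python sketch that only reports repetitions together with their line numbers. It is an illustration under assumptions, not part of the answer: `REPEATED_WORD`, `find_repeats`, and the sample text are hypothetical. Because it scans the whole text at once, `\s+` can also span a line break, a case that grep does not catch when it matches line by line.

```python
import re

# Case-insensitive repeated-word pattern; \s+ may also span a line break.
REPEATED_WORD = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

def find_repeats(text):
    """Return (line_number, matched_text) pairs for each repeated word in text."""
    hits = []
    for match in REPEATED_WORD.finditer(text):
        # 1-based line number of the first word of the pair.
        line_number = text.count('\n', 0, match.start()) + 1
        # Collapse whitespace so a pair split across lines prints on one line.
        hits.append((line_number, ' '.join(match.group(0).split())))
    return hits

# Hypothetical sample: the second repetition is split across two lines.
sample = (
    "The Lagrangian that that includes this potential\n"
    "This is confimided by the\n"
    "the theorem of\n"
)
for line_number, pair in find_repeats(sample):
    print(f"line {line_number}: {pair}")
```

To check a real file, read its contents into a string and pass that to `find_repeats`; nothing is modified, so intentional repetitions can simply be ignored during review.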

score 1 · Accepted Answer

Try this:

grep -E '\b(\w+)\s+\1\b'  myfile.txt

answered 2013-01-08T14:25:37.790

score 0 · Accepted Answer

A Python example showing how to remove the duplicated words:

In [1]: import re

In [2]: s1 = '... The Lagrangian that that includes this potential ...'

In [3]: s2 = '... This is confimided by the the theorem of ...'

In [4]: regex = r'\b(\w+)\s+\1\b'

In [5]: re.sub(regex, r'\g<1>', s1)
Out[5]: '... The Lagrangian that includes this potential ...'

In [6]: re.sub(regex, r'\g<1>', s2)
Out[6]: '... This is confimided by the theorem of ...'
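The pattern in this session is case-sensitive, so a repetition such as "The the" at the start of a sentence would be left alone. A minimal sketch of a case-insensitive variant follows, using `re.subn` to also get the number of replacements; the sample string `s3` is made up for illustration.

```python
import re

# Same pattern as above, compiled with IGNORECASE so "The the" also matches.
regex = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

s3 = 'The the theorem of ...'  # hypothetical sample, not from the question

# subn returns the new string and the number of substitutions made.
cleaned, count = regex.subn(r'\g<1>', s3)
print(cleaned, count)
```

The replacement keeps the first word of the pair, so its capitalization is preserved.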
