regex - grep/regex can't find words with accents

Question

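I'm trying to put together a regular expression that picks out the words in a file whose letters all come from the letters of a given pattern word.

My problem is that the regular expression can't find words with accents, and my text file contains a lot of accented words.

My command line is: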

cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

The file content is:

carroça
éra
éssa
roça
roco
rato
onça
orça
roca

How can I fix it?

score 11 · Accepted Answer

If the file is encoded in ISO-8859-1 but your system locale is UTF-8, this will not work.

Either convert the file to UTF-8 or run grep under an ISO-8859-1 locale.

# Convert from ISO-8859-1 to the environment locale before grep
# The output will be in the current locale
$ iconv -f 8859_1 input/words.txt | grep ...

# Run grep under an ISO-8859-1 locale
# The output will be in ISO-8859-1 encoding
$ cat input/words.txt | env LC_ALL=en_US grep ...
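
If you are not sure which encoding the file is in, looking at the raw bytes of an accented letter settles it. The following is only a sketch: the file names words_latin1.txt and words_utf8.txt and the three sample words are made up for illustration, and it assumes `iconv` and `od` are available.

```shell
# Hypothetical sample file: \351 is é and \347 is ç in ISO-8859-1.
printf '\351ra\nro\347a\nrato\n' > words_latin1.txt

# Dump the first line as hex. A lone e9 byte for é means ISO-8859-1;
# UTF-8 would show the two-byte sequence c3 a9 instead.
head -n 1 words_latin1.txt | od -An -tx1

# Convert once so that a UTF-8 locale can decode the file.
iconv -f ISO-8859-1 -t UTF-8 words_latin1.txt > words_utf8.txt
head -n 1 words_utf8.txt | od -An -tx1
```

Once the file's encoding and the locale agree, the grep commands from the question should treat é and ç as single characters.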

score 1 · Accepted Answer

Assuming everything is UTF-8, I’d usually just use something like

perl -CSAD -Mutf8 -nle 'print if /^carroça{1,3}$/' filenames

because then I know what it’s doing.

score 1 · Accepted Answer

I found a related question here whose answer seems to work.

So, if you try something like:

cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt

Does it produce what you expect?
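
One caveat if you go the C-locale route: grep then works on bytes, not characters, so the interval counts bytes. Below is a small sketch of that effect, assuming GNU grep and a UTF-8 file; the file name and the variables `max4` and `max3` are made up for the example. It uses `LC_ALL` rather than `LANG` because `LC_ALL`, when set, overrides `LANG`.

```shell
# Hypothetical UTF-8 sample: \303\251 is the two-byte encoding of é.
printf '\303\251ra\n\303\251ssa\nrato\n' > words_utf8.txt

# The bracket expression from the question, spelled out as UTF-8 bytes.
max4=$(printf '^[\303\251ra]\\{1,4\\}$')
max3=$(printf '^[\303\251ra]\\{1,3\\}$')

# In the C locale "éra" is 4 bytes (c3 a9 72 61), so \{1,4\} matches it...
LC_ALL=C grep "$max4" words_utf8.txt

# ...but \{1,3\} does not, although the word has only 3 letters.
LC_ALL=C grep "$max3" words_utf8.txt || echo "no match with {1,3}"
```

So with accented words the upper bound may need to be larger than the number of letters, and the pattern has to be typed in the same encoding as the file.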

score 0 · Accepted Answer

As @dule said, try LANG=en_US.iso88591.

cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt
