regex - sed: Remove all non-alphanumeric characters inside quotes

Question

Say I have a string like this:

Output:   
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"

I want to remove only the non-alphanumeric characters inside the quotes, except commas, periods, and spaces:

Desired Output:    
I have some-non-alphanumeric % characters remain here, I "also, have some  .here"

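I tried the following sed command, which matches the line and deletes within the quotes, but it deletes everything inside the quotes, including the quotes themselves: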

sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'

Any help getting the desired output, preferably using sed, would be appreciated. Thanks in advance!
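
A short sketch of why the attempt behaves this way: the three groups together match from the opening quote to the closing quote, and the replacement is empty, so the whole match is dropped. The second command is an illustrative variant (not from the question) that puts \1 and \3 back; because the first group is greedy, it still removes only one character per pass.

```shell
line='I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"'

# Original attempt: the match spans quote to quote and is replaced by nothing.
attempt=$(printf '%s\n' "$line" |
  sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g')

# Illustrative variant: keep groups 1 and 3, dropping only group 2.
one_pass=$(printf '%s\n' "$line" |
  sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)/\1\3/g')

# Brackets make a trailing space visible.
printf '[%s]\n' "$attempt"
printf '[%s]\n' "$one_pass"
```

With the sample line, the first command leaves nothing after "I ", and the second removes only the & while _+ stays, which is why a single substitution is not enough.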

score 2 · Accepted Answer

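To remove all the non-alphanumeric characters, you need to repeat the substitution multiple times, because each pass removes only one run of unwanted characters. Doing such a loop in sed requires a label and the b and t commands.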

sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'

(This will work even without the [^"a-zA-Z0-9,. ]*, but it will be slower on lines that contain many non-alphanumeric characters.)
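
As a rough sketch of how the loop behaves, the same script can be written with -e options and run against the sample line. The function name strip_quoted and the second input line are made up for illustration, and /characters/!b is just a shorter way of skipping lines that do not contain "characters".

```shell
# Hypothetical wrapper around the looped substitution from the answer.
strip_quoted() {
  sed -e '/characters/!b' \
      -e ':repremove' \
      -e 's/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/' \
      -e 't repremove'
}

# First line contains "characters" and is cleaned; second line is passed through.
result=$(printf '%s\n' \
  'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' \
  'this line is skipped "a_b"' | strip_quoted)

printf '%s\n' "$result"
```

On the sample line the loop runs twice, first removing _+ and then &, and stops when the substitution no longer matches.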

The other answer is right, though, in that doing this in perl is much easier.

score 2 · Accepted Answer

Sed is not the right tool for this. Here is one way through Perl.

perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g' file

Example:

$ echo 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g'
I have some-non-alphanumeric % characters remain here, I "also, have some  .here"

Regex demo
