regex - sed - Remove quotes within quotes in a large csv file

Question

I am converting a large amount of text file data (400MB) to csv format using the stream editor sed.

I am very close to finishing, but the outstanding problem is quotes within quotes, in data like this:

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"

The desired output is:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

I have searched for help, but have not come close to a solution. I tried the following sed commands with these regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

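These come from the questions below, but they do not seem to work with sed:

related question for perl

related question for SISS

The original files are *.txt, and I am trying to edit them in place with sed.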

score 2 · Accepted Answer

Here is one way using GNU awk and its FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file

Results:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

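Explanation:

With FPAT, a field is defined as either "anything that is not a comma" or "a double quote, anything that is not a double quote, and a closing double quote". Then, for every line of input, loop through each field, and if the field starts and ends with a double quote, remove all quotes from the field. Finally, wrap the field in double quotes again.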
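The steps described above can be sketched as an expanded, commented version of the same one-liner. This is only an illustration: the wrapper name `strip_field_quotes` is made up for this example, and it assumes GNU awk is installed as `gawk` (FPAT is a gawk extension and is not available in other awk implementations).

```shell
#!/bin/sh
# strip_field_quotes is a hypothetical wrapper name used for illustration.
# It requires GNU awk, because FPAT is a gawk extension.
strip_field_quotes() {
  gawk '
    BEGIN {
      # A field is either a run of non-comma characters,
      # or a quoted span that contains no inner double quote.
      FPAT = "([^,]+)|(\"[^\"]+\")"
      OFS = ","
    }
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^".*"$/) {     # field starts and ends with a double quote
          gsub(/"/, "", $i)      # remove every double quote in the field
          $i = "\"" $i "\""      # wrap the field in one pair of quotes again
        }
      }
      print
    }
  ' "$@"
}
```

It reads standard input when called without arguments, for example `strip_field_quotes < file`, or takes file names like gawk itself. Output goes to standard output, so for in-place editing of the *.txt files the result would have to be written to a temporary file and moved back.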

score 1 · Accepted Answer

sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE

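This looks for strings of the form "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". If a match was found, it repeats, to make sure that all nested strings at depth > 2 are removed.

It also ensures that none of the STRx contain a comma.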
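As a rough illustration of that loop, the sketch below wraps the same substitution in a small function and runs it on two sample rows. The function name `strip_nested_quotes` and the `word4` row are invented for this example; the pattern is the one from the command above, written with a plain `"` instead of `["]` and split into separate `-e` expressions so that the label does not depend on how a particular sed parses `:r` followed by a space.

```shell
#!/bin/sh
# strip_nested_quotes is a hypothetical helper name used for illustration.
strip_nested_quotes() {
  # :r   defines a label
  # s    merges "A"B"C" (no quotes or commas inside A, B, C) into "ABC"
  # t r  jumps back to the label whenever the substitution changed something
  sed -e ':r' \
      -e 's:"\([^",]*\)"\([^",]*\)"\([^",]*\)":"\1\2\3":' \
      -e 't r'
}

printf '%s\n' \
  '3,word3,"description for "word3"","another text","more text and more"' \
  '4,word4,"outer "inner" tail","text, with comma"' |
  strip_nested_quotes
# prints:
# 3,word3,"description for word3","another text","more text and more"
# 4,word4,"outer inner tail","text, with comma"
```

Each pass consumes four double quotes and writes back two, so a field needs an even number of quotes to come out clean. The last field of row 1 in the question starts with a doubled `""` and has five quotes in total; after one pass three remain, the pattern no longer matches, and a stray quote is left before ` some more text`.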
