linux - How to format a date field in a .CSV file when string fields contain multiple commas

Question

I have a .CSV file (file.csv) in which every data value is enclosed in double quotes. A sample of the file looks like this:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"

The 9th field is a date in "DD-MMM-YY" format. I need to convert it to YYYY/MM/DD. I tried the code below, but it does not work.

awk -F, '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 { m=substr($9,4,3)
 $9 = sprintf("%02d/%02d/"20"%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
 print
 }' OFS="," file.csv > temp_file.csv

After running the code above, the output file temp_file.csv looks like this:

column1,column2,column3,column4,column5,column6,column7,Column8,00/00/2000,Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"

As far as I understand, the problem is the commas inside the double quotes, because my code treats them as field separators too... Please advise on the following questions:

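1) Does enclosing every value in every field in double quotes make a difference? If it does, how can I strip the quotes from all values except the strings that contain commas?
2) How can I modify my code so that it converts the 9th field from "DD-MMM-YY" to YYYY/MM/DD?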
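
To make the diagnosis concrete, here is a minimal sketch that shows what a plain `-F,` split does to two of the sample rows. The function name `show_comma_split` is made up for illustration. It prints the number of fields awk sees and whatever ends up in `$9`.

```shell
# Report how many fields a plain comma split produces, and what lands in $9.
show_comma_split() {
    awk -F, '{ print NF " [" $9 "]" }'
}

# Two rows copied from the sample data in the question.
show_comma_split <<'EOF'
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
EOF
```

The rows come out with 13 and 14 fields instead of 10, and `$9` is a fragment of the author column, not the date. That matches the garbled temp_file.csv shown above.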

score 2 · Accepted Answer

I strongly recommend using a proper CSV parser. For example, Text::CSV_XS in Perl handles this job correctly. This one-liner, for instance:

perl -MText::CSV_XS -E'$csv=Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1});@m=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);@m{@m}=(1 .. @m);while(my $row=$csv->getline(ARGV)){($d,$m,$y)=split("-",$row->[8]);$row->[8]=sprintf"%04d/%02d/%02d",2000+$y,$m{$m},$d if $m{$m};$csv->print(STDOUT, $row)}' file.csv > temp_file.csv

score 1 · Accepted Answer

You can try this awk command (it calls GNU date once per row):

awk -F"\"" 'BEGIN { OFS="\"" } NR>1 { cmd = "date -d " $18 " +%Y/%m/%d"; cmd | getline $18; close(cmd) } { print }' yourfile.txt

Output:

"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2,name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,name","333","22","2011/10/13","232"
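
If starting a `date` process for every row is a concern, the same quote-splitting idea can be done in awk alone. The following is only a sketch under two assumptions: every value is double-quoted (so the 9th value is `$18` when splitting on `"`), and two-digit years belong to the 2000s. The function name `convert_dates_quote_split` is made up for illustration.

```shell
# Rewrite the date field without calling date(1): split on the double quote,
# so the 9th quoted value is $18 (quoted values sit in the even-numbered fields).
convert_dates_quote_split() {
    awk -F'"' '
    BEGIN {
        OFS = "\""
        n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
        for (i = 1; i <= n; i++) month[name[i]] = i
    }
    # Only touch rows whose 18th field looks like DD-MMM-YY; other rows,
    # such as the unquoted header, pass through unchanged.
    split($18, d, "-") == 3 && (toupper(d[2]) in month) {
        # Assumption: two-digit years belong to the 2000s.
        $18 = sprintf("20%02d/%02d/%02d", d[3], month[toupper(d[2])], d[1])
    }
    { print }
    '
}

# Usage: reads CSV on stdin, writes the converted CSV to stdout.
# convert_dates_quote_split < file.csv > temp_file.csv
```

Because the month is formatted with `%02d`, single-digit months such as JAN come out zero-padded.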

score 1 · Accepted Answer

You can try the following awk command:

awk '
BEGIN {
    FS = OFS = ","
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, / /)
    for (i=1; i<=12; i++) {
        mm[month[i]]=i
    }
}
NR>1 {
    gsub(/\"/, "", $(NF-1))
    split($(NF-1), d, /-/)
    # Zero-pad the month so that, e.g., JAN becomes 01
    $(NF-1) = sprintf("%s20%s/%02d/%s%s", q, d[3], mm[d[2]], d[1], q)
}1' q='"' file

Output:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"
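
The command above depends on the date being the second-to-last field. A possible alternative, sketched here on the assumption that every value in a data row is double-quoted and that no value contains the three-character sequence `","`, is to use that sequence as the field separator. The commas inside strings then no longer split fields, and the date can be addressed directly as `$9`. The function name `convert_dates_field9` is made up for illustration, and two-digit years are again assumed to belong to the 2000s.

```shell
# Split on the three characters "," so embedded commas stay inside their field.
convert_dates_field9() {
    awk '
    BEGIN {
        FS = OFS = "\",\""
        n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
        for (i = 1; i <= n; i++) month[name[i]] = i
    }
    # The unquoted header has no "," separator, so its $9 is empty and it
    # is printed unchanged.
    split($9, d, "-") == 3 && (d[2] in month) {
        # Assumption: two-digit years belong to the 2000s.
        $9 = sprintf("20%02d/%02d/%02d", d[3], month[d[2]], d[1])
    }
    { print }
    '
}

# Usage: reads CSV on stdin, writes the converted CSV to stdout.
# convert_dates_field9 < file.csv > temp_file.csv
```

This also bears on question 1: with every value quoted, `","` is a usable separator, whereas a file that quotes only some values would need a real CSV parser.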
