I received a CSV dump from SQL Server 2008 with lines like these:
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
parse_dbenhur is pretty, but can it be rewritten to handle fields that contain both commas and quotes? parse_ugly is, well, ugly.
# @dbenhur's excellent answer, which works 100% for what i originally asked for
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def parse_dbenhur(line)
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end
def parse_ugly(line)
  dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
  fields = []
  open = false
  dumb_fields.each_with_index do |v, i|
    open ? fields.last.concat(v) : fields.push(v)
    open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
  end
  fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
end
lines = []
lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'
require 'csv'
lines.each do |line|
  puts
  puts line
  begin
    c = CSV.parse_line(line)
    puts "#{c.to_csv.chomp} (size #{c.length})"
  rescue
    puts "FasterCSV says: #{$!}"
  end
  a = parse_ugly(line)
  puts "#{a.to_csv.chomp} (size #{a.length})"
  b = parse_dbenhur(line)
  puts "#{b.to_csv.chomp} (size #{b.length})"
end
Here is the output when I run it:
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
FasterCSV says: Illegal quoting in line 1.
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
FasterCSV says: Unclosed quoted field on line 1.
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
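The two behaviours of the standard library seen in that output can be reproduced in isolation. The sketch below is only an illustration: try_parse is a hypothetical helper name, and the exact error message text varies between Ruby versions, so only the exception class is inspected.

```ruby
require 'csv'

# Two sample lines copied from the dump above.
stray_quote = 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
well_formed = 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'

# Hypothetical helper: returns the parsed fields, or the exception if parsing fails.
def try_parse(line)
  CSV.parse_line(line)
rescue CSV::MalformedCSVError => e
  e
end

stray_result = try_parse(stray_quote)
good_result  = try_parse(well_formed)

puts stray_result.class   # CSV::MalformedCSVError
puts good_result.length   # 4
puts good_result[2]       # "MERITER","DO IT CTR", "NCR" AND "TRACE" ALTERATION
```

Only the last line doubles every embedded quote, which is why it is the only one the stock parser accepts.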
Update
Note that the CSV uses double quotes when a field contains a comma.
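For comparison, here is a minimal sketch of how Ruby's own csv library writes a field that contains both a comma and quotes. The row values are made up for illustration. A conventional writer wraps the field in quotes and also doubles the embedded quotes, and that second step is what the problematic lines in the dump are missing.

```ruby
require 'csv'

# Illustrative row: the second field contains a comma and embedded quotes.
row = ['Electrical', 'SERVICE, "OUTLETS"']

encoded = row.to_csv               # Array#to_csv is provided by require 'csv'
puts encoded                       # Electrical,"SERVICE, ""OUTLETS"""

decoded = CSV.parse_line(encoded)  # round-trips because the quotes were doubled
puts decoded == row                # true
```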
Update 2
It's fine if the commas get stripped from the affected fields... my parse_ugly method doesn't preserve them.
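As a small sketch of what "stripping the commas" can look like on a single field value: squash_commas is a hypothetical helper name, and its body mirrors the tr/gsub call used in parse_sql_server_2008_csv_line, turning each comma into a space and then collapsing runs of whitespace.

```ruby
# Hypothetical helper name; replaces commas with spaces, then collapses whitespace runs.
def squash_commas(field)
  field.tr(',', ' ').gsub(/\s+/, ' ')
end

puts squash_commas('SERVICE, "OUTLETS" FOOBAR')   # SERVICE "OUTLETS" FOOBAR
puts squash_commas('REPLACE LEAD WATER SERVICE')  # unchanged: no commas to strip
```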
Update 3
I learned from the client that it is SQL Server 2008 that exports this weird CSV. It has been reported to Microsoft here and here.
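Update 4
@dbenhur's answer worked perfectly for what I originally asked, but it pointed out that I had neglected to show lines containing both commas and quotes. I will accept @dbenhur's answer, but I hope it can be improved to work on all of the lines above.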
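To make the failure concrete, the sketch below re-runs the regex from @dbenhur's answer (copied unchanged from the listing above) on one line whose field contains both a comma and quotes. QUOTED matches "SERVICE, " but the next character is not a separator, so the engine falls back to UNQUOTED, which stops at the first comma and splits the field in two.

```ruby
# Regex pieces copied from the listing above.
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

def parse_dbenhur(line)
  line.scan(FIELD)[0...-1].map { |matches| matches[0] || matches[1] }
end

line = 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
fields = parse_dbenhur(line)

puts fields.length   # 5, not the intended 4
p fields[2]          # "\"SERVICE"
p fields[3]          # " \"OUTLETS\"\""
```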
Hopefully final update
This code works (and I would consider it "semantically correct"):
QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
SEPQ = /,(?! )/
UNQUOTED = /([^,]*)/
SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/
def parse_sql_server_2008_csv_line(line)
  line.scan(FIELD)[0...-1].map{ |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
end
Adapted from the answers by @dbenhur and @ghostdog74 to How can I process a CSV file with “bad commas”?