r - Reading a malformed CSV in R - mismatched quotes

Question

I have hundreds of large CSV files (each ranging from 10k to 100k lines), and some of them have malformed descriptions with quotes inside quotes, so they look like this:

ID,Description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486

I need to be able to parse all of these lines cleanly as CSV in R. Here is the data via dput() and my attempt to read it...

txt <- c("ID,Description,x",
    "3434,\"abc\"def\",988",
    "2344,\"fred\",3484", 
    "2345,\"fr\"\"ed\",3485",
    "2346,\"joe,fred\",3486")

read.csv(text=txt[1:4], colClasses='character')
    Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
      incomplete final line found by readTableHeader on 'text'

If I change the quote setting and leave out the last line, the one with the embedded comma, it works fine:

read.csv(text=txt[1:4], colClasses='character', quote='')

However, if I change the quote setting and include the last line with the embedded comma...

read.csv(text=txt[1:5], colClasses='character', quote='')
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
      line 1 did not have 4 elements

Edit x2: Unfortunately, I should have said that some of the descriptions contain commas - the code above has been edited accordingly.

score 5 · Accepted Answer

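Change the quote setting. (The output below appears to predate the question's edit, when the last row's description was just "joe" with no embedded comma.)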

read.csv(text=txt, colClasses='character',quote = "")

    ID Description    x
1 3434   "abc"def"  988
2 2344      "fred" 3484
3 2345    "fr""ed" 3485
4 2346       "joe" 3486
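With the edited data, where one description contains a comma, quote = "" turns every comma into a separator, so that row ends up with one field too many. One possible way to spot such rows before parsing is base R's count.fields; this is only a small diagnostic sketch, and the names con and n_fields are illustrative.

```r
txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# Count comma-separated fields per line with quoting disabled,
# i.e. the same way read.csv(..., quote = "") would see them.
con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)

n_fields
# lines whose field count differs from the header's
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

Here only the last line differs from the header, which is the line read.csv complains about.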

Edit to handle the errant commas:

txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484", 
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

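txt2 <- readLines(textConnection(txt))

# split every line on commas, including the commas inside a description
txt2 <- strsplit(txt2,",")

# keep the first and last fields; paste everything in between back together
txt2 <- lapply(txt2,function(x) c(x[1],paste(x[2:(length(x)-1)],collapse=","),x[length(x)]) )
m <- do.call("rbind",txt2)
df <- as.data.frame(m,stringsAsFactors = FALSE)
# promote the first row to column names, then drop it
names(df) <- df[1,]
df <- df[-1,]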

#     ID Description    x
# 2 3434   "abc"def"  988
# 3 2344      "fred" 3484
# 4 2345    "fr""ed" 3485
# 5 2346  "joe,fred" 3486

I'm not sure whether this is efficient enough for your use case.
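If the strsplit/lapply round trip turns out to be too slow, one possible alternative is a single regular expression per line. This is an unbenchmarked sketch that makes the same assumption as the code above: exactly one messy column sits between a clean first and a clean last column. Lines that do not fit the pattern would need separate handling. The names parts and m are illustrative, and stripping the outer quotes is an optional extra step.

```r
txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# first field: no commas; last field: no commas; the middle takes the rest
parts <- regmatches(txt, regexec("^([^,]*),(.*),([^,]*)$", txt))

# each element is c(full match, group 1, group 2, group 3); drop the full match
m <- do.call(rbind, parts)[, -1, drop = FALSE]

df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# optionally strip the outer pair of quotes from the description
df$Description <- sub('^"(.*)"$', '\\1', df$Description)
df
```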

score 2 · Accepted Answer

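Since there is only one quoted column in this set of nasty files, I can run read.csv() on each side of it to handle the unquoted columns to the left and right of the quoted column. My current solution is therefore based on input from both @agstudy and @roland: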

csv.parser <- function(txt) {
    df <- do.call('rbind', regmatches(txt,gregexpr(',"|",',txt),invert=TRUE))
    # remove the header
    df <- df[-1,]
    # parse the left csv
    df1 <- read.csv(text=df[,1], colClasses='character', comment='', header=FALSE)
    # parse the right csv
    df3 <- read.csv(text=df[,3], colClasses='character', comment='', header=FALSE)
    # put them back together
    dfa <- cbind(df1, df[,2], df3)
    # put the header back in
    names(dfa) <- names(read.csv(text=txt[1], header=TRUE))
    dfa
}

# debug(csv.parser)
csv.parser(txt)

Running this on a wider data set, it thankfully works:

txt <- c("ID,Description,x,y",
         "3434,\"abc\"def\",988,344",
         "2344,\"fred\",3484,3434", 
         "2345,\"fr\"\"ed\",3485,7347",
         "2346,\"joe,fred\",3486,484")
csv.parser(txt)
    ID Description    x    y
1 3434     abc"def  988  344
2 2344        fred 3484 3434
3 2345      fr""ed 3485 7347
4 2346    joe,fred 3486  484
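In the output above the doubled quote in fr""ed is still doubled and every column is character. If the doubled quotes are meant as CSV-style escaped quotes (an assumption, the files may intend otherwise), a possible post-processing step could look like the sketch below. The data frame dfa is rebuilt by hand here so the example runs on its own.

```r
# stand-in for the parser's result, with all columns as character
dfa <- data.frame(ID = c("3434", "2344", "2345", "2346"),
                  Description = c('abc"def', 'fred', 'fr""ed', 'joe,fred'),
                  x = c("988", "3484", "3485", "3486"),
                  y = c("344", "3434", "7347", "484"),
                  stringsAsFactors = FALSE)

# collapse doubled quotes into single ones (assumes "" is an escaped quote)
dfa$Description <- gsub('""', '"', dfa$Description, fixed = TRUE)

# let R guess column types; as.is = TRUE keeps text as character, not factor
dfa[] <- lapply(dfa, type.convert, as.is = TRUE)
str(dfa)
```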

score 1 · Accepted Answer

You can use readLines and then extract the elements between ," and ", using regmatches:

ll <- readLines(textConnection(object='ID,Description,x
  3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486'))
ll<- ll[-1]     ## remove the header
ll <- regmatches(ll,gregexpr(',"|",',ll),invert=TRUE)
do.call(rbind,ll)
       [,1]     [,2]       [,3]  
[1,] "  3434" "abc\"def" "988" 
[2,] "2344"   "fred"     "3484"
[3,] "2345"   "fr\"\"ed" "3485"
[4,] "2346"   "joe,fred" "3486"
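To see what the invert = TRUE call does, it may help to run it on a single line: gregexpr finds the two delimiters ," and ", and regmatches returns the text around them instead of the matches themselves. The sketch below also adds a simple sanity check, since a line that does not split into exactly three pieces would break the later rbind; the names pieces, parts and ok are illustrative.

```r
line <- '2346,"joe,fred",3486'

# invert = TRUE returns the pieces between the matched delimiters
pieces <- regmatches(line, gregexpr(',"|",', line), invert = TRUE)[[1]]
pieces
# [1] "2346"     "joe,fred" "3486"

# sanity check across several lines: each should yield exactly 3 pieces
lines <- c('3434,"abc"def",988', line)
parts <- regmatches(lines, gregexpr(',"|",', lines), invert = TRUE)
ok <- lengths(parts) == 3
ok
```

Any line where ok is FALSE would need to be inspected or handled separately before binding the pieces into a matrix.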
