I've imported a large data frame from a CSV file with oddly formatted numerical data. Here's a reproducible example of the data frame I'm working with:

df <- data.frame("r1" = c(1,2,3,4,5), "r2" = c(1,2.01,-3,"-","2,000"))

'r2' contains values with negative signs, e.g. "-3", and zeros represented as a lone dash "-". To run some numerical analysis on this messy r2 column, I will need to:

  1. Replace the lone "-" with zero "0" without removing the negative sign in front of the negative values.
  2. Avoid coercion of legitimate values like "2,000" to NAs. For some reason, when I run the command df$r2 <- as.numeric(sub("-",0,df$r2)), R coerces the values formatted with commas to NAs, thus corrupting the data in the column.

Here's an example of the output after running df$r2 <- as.numeric(sub("-",0,df$r2)):

Warning message:
NAs introduced by coercion 
  r1   r2
1 1  1.00
2 2  2.01
3 3  3.00
4 4  0.00
5 5   NA

As you can see, "2,000" was coerced to NA, and -3 was erroneously converted to 3 (its minus sign was replaced too). But hey, at least we got rid of the "-" in row 3, right!!!

Here's ultimately what I would like to produce:

  r1      r2
1  1    1.00
2  2    2.01
3  3   -3.00
4  4    0.00
5  5 2000.00

Note that the comma from row 5 is removed. Column r2 should be formatted such that I can run commands like sum(df$r2) on it.

2 Answers

Your approach was sound. You just need two substitutions: one to replace the lone dashes and another to remove the commas.

df$r2<-as.numeric(gsub('^-$','0',gsub(',','',df$r2)))

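Also, in case you aren't familiar with regular expressions, ^-$ matches only strings that consist of start of string (^), a dash, and end of string ($), so "-3" is left alone.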
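
The difference the anchors make can be checked directly on the sample values. This is only a small sketch: the vector x below is a stand-in for df$r2, not part of the original data frame.

```r
# x is an illustrative stand-in for the r2 column
x <- c("1", "2.01", "-3", "-", "2,000")

# Unanchored: the first "-" in each string is replaced, so "-3" becomes "03"
unanchored <- sub("-", "0", x)

# Anchored: only a string that is exactly "-" matches
anchored <- gsub("^-$", "0", x)

# Strip commas first, then replace lone dashes, then convert
r2 <- as.numeric(gsub("^-$", "0", gsub(",", "", x)))

print(unanchored)
print(anchored)
print(r2)
```

With the unanchored pattern "-3" turns into "03", which as.numeric() reads as 3. That is where the sign went in the original attempt. The "2,000" NA is a separate issue: as.numeric() does not accept thousands separators, so the comma has to be removed before converting.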

answered 2013-10-18T02:27:55.220

nograpes' solution is much cooler.

## df <- data.frame("r1" = c(1,2,3,4,5), "r2" = c(1,2.01,-3,"-","2,000"))

df$r2 <- as.numeric(gsub(",", "", df$r2))
df$r2[is.na(df$r2)] <- 0

##   r1      r2
## 1  1    1.00
## 2  2    2.01
## 3  3   -3.00
## 4  4    0.00
## 5  5 2000.00
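
One caveat: this version turns every NA into 0, so a value that fails to parse for some other reason (say "n/a") would silently become 0 as well, and as.numeric() still emits the coercion warning. A possible variant, as a rough sketch only, zeroes just the lone dashes and leaves anything else unparseable as NA. The name clean_numeric is made up for illustration and is not a base R function.

```r
# clean_numeric is a hypothetical helper name, not part of base R
clean_numeric <- function(x) {
  x <- trimws(as.character(x))      # as.character() also handles factor columns
  is_dash <- !is.na(x) & x == "-"   # a lone dash stands for zero
  # Drop thousands separators, then convert; the warning for the dash is expected
  out <- suppressWarnings(as.numeric(gsub(",", "", x)))
  out[is_dash] <- 0
  out                               # other unparseable values stay NA
}

res <- clean_numeric(c("1", "2.01", "-3", "-", "2,000", "n/a"))
print(res)
```

Here the first five values come out as 1, 2.01, -3, 0 and 2000, while "n/a" stays NA, so sum(res, na.rm = TRUE) still works and the bad entry remains visible.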
answered 2013-10-18T02:30:56.787