I've imported a large data frame from a CSV file with oddly formatted numerical data. Here's a reproducible example of the data frame I'm working with:
df <- data.frame("r1" = c(1,2,3,4,5), "r2" = c(1,2.01,-3,"-","2,000"))
'r2' contains negative values written with a minus sign (e.g. "-3") as well as zeros represented by standalone dashes ("-"). To run numerical analysis on this messy r2 column, I need to:
- Replace the standalone "-" cells with "0" without stripping the minus sign from the negative values.
- Avoid coercing legitimate comma-formatted values like "2,000" to NA. For some reason, when I run the command:
df$r2 <- as.numeric(sub("-", 0, df$r2))
R coerces the comma-formatted values to NA, corrupting the data in the column.
Here's an example of the output after running df$r2 <- as.numeric(sub("-", 0, df$r2)):
Warning message:
NAs introduced by coercion
r1 r2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 NA
As you can see, "2,000" was coerced to NA, and -3 was erroneously converted to 3 because sub removed its minus sign. But hey, at least we got rid of the "-" in row 3, right!!!
Here's ultimately what I would like to produce:
r1 r2
1 1 1.00
2 2 2.01
3 3 -3.00
4 4 0.00
5 5 2000.00
Note that the comma in row 5 is also removed. Column r2 should end up numeric so that I can run commands like sum(df$r2) on it.
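For reference, here's the kind of approach I imagine might work, though I'm not sure it's idiomatic: strip the commas first with gsub, then replace only the cells that are exactly "-" (so "-3" is left alone), and finally convert to numeric.

```r
# Reproducible example; mixing numbers and strings makes r2 a character column
df <- data.frame(r1 = c(1, 2, 3, 4, 5),
                 r2 = c(1, 2.01, -3, "-", "2,000"),
                 stringsAsFactors = FALSE)

# Strip thousands separators so "2,000" survives as.numeric()
df$r2 <- gsub(",", "", df$r2)

# Replace only cells that are exactly "-"; "-3" keeps its minus sign
df$r2[df$r2 == "-"] <- "0"

df$r2 <- as.numeric(df$r2)
sum(df$r2)  # 2000.01
```

Exact-match indexing (df$r2 == "-") seems safer here than sub("-", ...), which replaces the first "-" anywhere in the string, including the minus sign of a negative number.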