I have three dataframes, and I want to add columns to the first dataframe that count the number of times each pair of values in its first two columns appears in each of the other dataframes, e.g.

dataframe - x
a b
1 1
1 2
2 1
2 2

dataframe - y
a b
1 1
1 1
1 2
2 2
2 2

dataframe - z
a b
1 2
2 1
2 1
2 2

So the first dataframe would become
a b y z
1 1 2 0
1 2 1 1
2 1 0 2
2 2 2 1

I have ways to do this, e.g. I am currently doing

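# For each row of x, count the rows of y with the same (a, b) pair
x$y <- sapply(seq_len(nrow(x)), function(i) {
    sum(y$a == x$a[i] & y$b == x$b[i])
  })

# Same for z
x$z <- sapply(seq_len(nrow(x)), function(i) {
    sum(z$a == x$a[i] & z$b == x$b[i])
  })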

But my dataframe is very large and my approach takes a while to complete, so I was wondering what the quickest way to do this is.

Please ask if anything is unclear.

Thanks in advance

3 Answers

To avoid the double loop, I would use the function match, which is optimized for finding the elements of one vector in another. To count the occurrences, I propose tabulating the variables first and then matching against the table.

My guess is that this would significantly reduce the time complexity: the method you propose is quadratic (one loop goes over the rows of x, and for each of them an inner loop goes over the rows of y), whereas match and table are based on sorting (I think), which is closer to n*log(n).

We first turn the data frames into vectors with paste, as in Josh's answer:

# Recreate your data
x <- data.frame(a=c(1,1,2,2), b=c(1,2,1,2))
y <- data.frame(a=c(1,1,1,2,2), b=c(1,1,2,2,2))
z <- data.frame(a=c(1,2,2,2), b=c(2,1,1,2))

# Use paste to combine the two columns
X <- do.call(paste, c(x, sep="_"))
Y <- do.call(paste, c(y, sep="_"))
Z <- do.call(paste, c(z, sep="_"))

Then we tabulate and match against the tabulation.

x$y <- table(Y)[match(X, names(table(Y)))]
x$y[is.na(x$y)] <- 0

x$z <- table(Z)[match(X, names(table(Z)))]
x$z[is.na(x$z)] <- 0

x
  a b y z
1 1 1 2 0
2 1 2 1 1
3 2 1 0 2
4 2 2 2 1

You could put table(Y) in an intermediate variable if you want to avoid tabulating two times.
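
As a minimal sketch of that suggestion, the tabulate-then-match steps can be wrapped in a small helper so that each table is built only once. The function name count_pairs and its cols argument are hypothetical, introduced here only for illustration; the sketch assumes the key columns never contain the "_" separator.

```r
# Hypothetical helper: count how often each row of `x` occurs in `other`,
# comparing only the columns named in `cols`.
count_pairs <- function(x, other, cols = c("a", "b")) {
  # Build one string key per row (assumes values never contain "_")
  key_x     <- do.call(paste, c(x[cols], sep = "_"))
  key_other <- do.call(paste, c(other[cols], sep = "_"))

  counts <- table(key_other)                        # tabulate once
  out <- as.integer(counts[match(key_x, names(counts))])
  out[is.na(out)] <- 0L                             # keys absent from `other`
  out
}

x <- data.frame(a = c(1, 1, 2, 2), b = c(1, 2, 1, 2))
y <- data.frame(a = c(1, 1, 1, 2, 2), b = c(1, 1, 2, 2, 2))
z <- data.frame(a = c(1, 2, 2, 2), b = c(2, 1, 1, 2))

x$y <- count_pairs(x, y)
x$z <- count_pairs(x, z)
```

Converting with as.integer also drops the table class and names, so the new columns are plain integer vectors.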

answered 2012-05-23T14:39:24.450

This will likely be faster:

# Recreate your data
x <- data.frame(a=c(1,1,2,2), b=c(1,2,1,2))
y <- data.frame(a=c(1,1,1,2,2), b=c(1,1,2,2,2))
z <- data.frame(a=c(1,2,2,2), b=c(2,1,1,2))

# Use paste to combine the two columns in each data.frame
X <- do.call(paste, c(x, sep="-"))
Y <- do.call(paste, c(y, sep="-"))
Z <- do.call(paste, c(z, sep="-"))

# Count number of times each element of X appears in Y and Z
x$y <- sapply(X, function(string) sum(string==Y))
x$z <- sapply(X, function(string) sum(string==Z))
x
#   a b y z
# 1 1 1 2 0
# 2 1 2 1 1
# 3 2 1 0 2
# 4 2 2 2 1
answered 2012-05-23T14:11:15.033

You said your dataframe was very large, so this is the data.table way:

> require(data.table)
> x <- data.table(a=c(1,1,2,2), b=c(1,2,1,2))
> y <- data.table(a=c(1,1,1,2,2), b=c(1,1,2,2,2))
> z <- data.table(a=c(1,2,2,2), b=c(2,1,1,2)) 
> 
> setkey(x,a,b)    # sort and mark as sorted by a,b
> setkey(y,a,b)    # same for y
> setkey(z,a,b)    # same for z
> x[,y:=y[x,.N][[3]]]  
       # join to y from x, using the key.
       # .N = number of matching rows
       # := means assign by reference back to column y in x, no copy at all
       # [[3]] can be understood by running `y[x,.N]` on its own
     a b y
[1,] 1 1 2
[2,] 1 2 1
[3,] 2 1 0
[4,] 2 2 2
> x[,z:=z[x,.N][[3]]]   # same for z
     a b y z
[1,] 1 1 2 0     # bug in v1.8.0 gave z=1 on this row, fixed in v1.8.1
[2,] 1 2 1 1
[3,] 2 1 0 2
[4,] 2 2 2 1

That doesn't copy the large objects at all, even once. The larger they are, the more significant that might be.
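
The transcript above reflects data.table as it behaved in 2012. In more recent releases, my understanding is that a join no longer groups by each row of i implicitly, so the per-row count has to be requested with by = .EACHI. The following is a sketch under that assumption, not a tested replacement for the code above; the intermediate names n_y and n_z are hypothetical.

```r
library(data.table)

x <- data.table(a = c(1, 1, 2, 2), b = c(1, 2, 1, 2))
y <- data.table(a = c(1, 1, 1, 2, 2), b = c(1, 1, 2, 2, 2))
z <- data.table(a = c(1, 2, 2, 2), b = c(2, 1, 1, 2))

# Join each table to x on (a, b); by = .EACHI evaluates .N once per row of x,
# giving 0 where x's pair does not occur in the other table.
n_y <- y[x, on = .(a, b), .N, by = .EACHI]$N
n_z <- z[x, on = .(a, b), .N, by = .EACHI]$N

# Assign by reference, so x itself is not copied
x[, `:=`(y = n_y, z = n_z)]
```

Computing the counts before the assignment keeps the names y and z unambiguous: inside x[...], a column named y would otherwise take precedence over the table y once it exists.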

answered 2012-05-25T11:50:39.883