r - Find perfectly correlated/redundant numeric and string columns

Question

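I have a dataset with a few hundred columns. It contains mailing-list data, and several of the columns appear to be exact duplicates of one another, just in a different form.

For example: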

rowNum    StateCode       StateName      StateAbbreviation
  1          01             UTAH               UT
  2          01             UTAH               UT
  3          03             TEXAS              TX
  4          03             TEXAS              TX
  5          03             TEXAS              TX
  6          44             OHIO               OH
  7          44             OHIO               OH
  8          44             OHIO               OH
 ...         ...            ...                ...

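I would like to remove the duplicated data and, if possible, keep only the numeric column, so that the same information is held in just one column. The example above would then become: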

rowNum    StateCode
      1          01 
      2          01   
      3          03  
      4          03  
      5          03 
      6          44
      7          44
      8          44 
     ...         ...

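I tried using cor(), but it only works on numeric variables. I also tried caret::nearZeroVar(), but it only examines each column on its own.

Any suggestions for finding perfectly correlated columns that contain non-numeric data?

Thanks.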

score 8 · Accepted Answer

Here's a fun and fast solution. It first converts the data.frame to a suitably structured integer-class matrix, and then uses cor() to identify the redundant columns.

## Read in the data
df <- read.table(text="rowNum    StateCode       StateName      StateAbbreviation
  1          01             UTAH               UT
  2          01             UTAH               UT
  3          03             TEXAS              TX
  4          03             TEXAS              TX
  5          03             TEXAS              TX
  6          44             OHIO               OH
  7          44             OHIO               OH
  8          44             OHIO               OH", header=TRUE)

## Convert data.frame to a matrix with a convenient structure
## (have a look at m to see where this is headed)
l <- lapply(df, function(X) as.numeric(factor(X, levels=unique(X))))
m <- as.matrix(data.frame(l))

## Identify pairs of perfectly correlated columns    
M <- (cor(m,m)==1)
M[lower.tri(M, diag=TRUE)] <- FALSE

## Extract the names of the redundant columns
colnames(M)[colSums(M)>0]
[1] "StateName"         "StateAbbreviation"
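Testing a floating-point correlation for exact equality with 1 can be fragile, and cor() returns NA (with a warning) for a constant column. As a variation on the same idea, the first-appearance integer codes can be compared directly. This is a minimal sketch, not the answer's own method; the names `codes`, `redundant` and `df_reduced` are introduced here for illustration, and the sample data is rebuilt inline so the block runs on its own.

```r
## Sample data from the question (StateCode is read as a number, so 01 becomes 1)
df <- data.frame(
  rowNum            = 1:8,
  StateCode         = c(1, 1, 3, 3, 3, 44, 44, 44),
  StateName         = c("UTAH", "UTAH", "TEXAS", "TEXAS", "TEXAS", "OHIO", "OHIO", "OHIO"),
  StateAbbreviation = c("UT", "UT", "TX", "TX", "TX", "OH", "OH", "OH"),
  stringsAsFactors  = FALSE
)

## Recode every column as integers in order of first appearance;
## columns that map one-to-one onto each other get identical codes
codes <- lapply(df, function(x) match(x, unique(x)))

## A column is redundant if an earlier column has exactly the same codes
redundant <- duplicated(codes)
names(df)[redundant]

## Drop the redundant columns, keeping the first column of each group
df_reduced <- df[!redundant]
names(df_reduced)
```

Because the first column of each group is the one that is kept, the numeric column survives here only because StateCode happens to come before the text columns.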

score 3 · Accepted Answer

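Would this work? It is based on the idea that if you call table(col1, col2) and the two columns are duplicates of each other, every column of the resulting table will contain exactly one non-zero value. For example: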

     OHIO TEXAS UTAH
  1     0     0    2
  3     0     3    0
  44    3     0    0

So something like this:

dup.cols <- read.table(text='rowNum    StateCode       StateName      StateAbbreviation
  1          01             UTAH               UT
  2          01             UTAH               UT
  3          03             TEXAS              TX
  4          03             TEXAS              TX
  5          03             TEXAS              TX
  6          44             OHIO               OH
  7          44             OHIO               OH
  8          44             OHIO               OH', header=T)
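library(plyr)
## Every unordered pair of column indices
combs <- combn(ncol(dup.cols), 2)
adply(combs, 2, function(x) {
  ## Cross-tabulate the two columns of this pair
  tab <- table(dup.cols[ , x[1]], dup.cols[ , x[2]])
  ## Duplicates if every column of the table has exactly one non-zero cell
  if (all(aaply(tab, 2, function(col) {sum(col != 0) == 1}))) {
    paste("Column numbers ", x[1], x[2], "are duplicates")
  }
})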
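The check above looks only at the columns of the table, so it can also fire when the second variable is merely finer-grained than the first (every rowNum value falls under exactly one StateCode, for instance). A stricter variant requires a single non-zero cell in every row as well as every column. The sketch below uses base R only; `one_to_one` is a hypothetical helper name, and the data is rebuilt inline so the block is self-contained.

```r
## Hypothetical helper: TRUE when x and y determine each other one-to-one.
## Note that table() ignores NA values by default.
one_to_one <- function(x, y) {
  tab <- table(x, y)
  all(rowSums(tab != 0) == 1) && all(colSums(tab != 0) == 1)
}

dup.cols <- data.frame(
  rowNum    = 1:8,
  StateCode = c(1, 1, 3, 3, 3, 44, 44, 44),
  StateName = c("UTAH", "UTAH", "TEXAS", "TEXAS", "TEXAS", "OHIO", "OHIO", "OHIO"),
  stringsAsFactors = FALSE
)

## StateCode and StateName carry the same information
one_to_one(dup.cols$StateCode, dup.cols$StateName)

## A column-only check accepts (StateCode, rowNum) ...
all(colSums(table(dup.cols$StateCode, dup.cols$rowNum) != 0) == 1)

## ... but the two-sided check rejects it
one_to_one(dup.cols$StateCode, dup.cols$rowNum)
```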

score 1 · Accepted Answer

This returns a map showing which variables match one another.

## dat is the data.frame being checked
check.dup <- expand.grid(names(dat), names(dat), stringsAsFactors = FALSE) #find all variable pairs
check.dup <- check.dup[check.dup$Var1 != check.dup$Var2,] #take out self-reference
check.dup$id <- mapply(function(x,y) {
        x <- as.character(x); y <- as.character(y)
            #if number of levels is different, discard; keep the number for later
        if ((n <- length(unique(dat[,x]))) != length(unique(dat[,y])))  {
            return(FALSE)
            }
            #subset just the variables in question to get pairs
        d <- dat[,c(x,y)]
            #find unique pairs
        d <- unique(d)
            #if number of unique pairs is the number of levels from before,
            #then the pairings are one-to-one
        if( nrow(d) == n ) {
            return(TRUE)
        } else return(FALSE)
    },
    check.dup$Var1,
    check.dup$Var2
)
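The same unique-pairs test can be used to actually drop the redundant columns while preferring numeric ones, as the question asks. This is a rough sketch under stated assumptions: `is_one_to_one`, `keep` and `dat_reduced` are names made up for illustration, and `dat` is rebuilt from the question's sample data.

```r
dat <- data.frame(
  rowNum            = 1:8,
  StateCode         = c(1, 1, 3, 3, 3, 44, 44, 44),
  StateName         = c("UTAH", "UTAH", "TEXAS", "TEXAS", "TEXAS", "OHIO", "OHIO", "OHIO"),
  StateAbbreviation = c("UT", "UT", "TX", "TX", "TX", "OH", "OH", "OH"),
  stringsAsFactors  = FALSE
)

## One-to-one when the number of distinct (x, y) pairs equals the number
## of distinct values in x and in y
is_one_to_one <- function(x, y) {
  n_pairs <- nrow(unique(data.frame(x, y)))
  n_pairs == length(unique(x)) && n_pairs == length(unique(y))
}

## Visit numeric columns first so that they are the ones that get kept
cols <- names(dat)[order(!vapply(dat, is.numeric, logical(1)))]

keep <- character(0)
for (nm in cols) {
  ## Does this column duplicate any column we already decided to keep?
  matches_kept <- vapply(keep, function(k) is_one_to_one(dat[[k]], dat[[nm]]),
                         logical(1))
  if (!any(matches_kept)) keep <- c(keep, nm)
}

## Keep the surviving columns in their original order
dat_reduced <- dat[, names(dat) %in% keep, drop = FALSE]
names(dat_reduced)
```

Under this test, any two columns whose values are all distinct (a row id and an e-mail address, say) count as one-to-one, so identifier-like columns may need to be excluded before running it.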

score 0 · Accepted Answer

 dat <- read.table(text="rowNum    StateCode       StateName     
   1          01             UTAH
   2          01             UTAH
   3          03             TEXAS
   4          03             TEXAS 
   5          03             TEXAS 
   6          44             OHIO
   7          44             OHIO
   8          44             OHIO", header=TRUE)

 ## keep the first row for each distinct (StateCode, StateName) pair
 dat[!duplicated(dat[, 2:3]), ]
#------------
  rowNum StateCode StateName
1      1         1      UTAH
3      3         3     TEXAS
6      6        44      OHIO
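One possible use of this result, offered as an assumption since the answer gives no explanation, is as a lookup table: the distinct pairs are stored once, the text column is dropped from the main data, and the names can be joined back when needed. The names `lookup`, `slim` and `restored` are introduced here for illustration.

```r
dat <- data.frame(
  rowNum    = 1:8,
  StateCode = c(1, 1, 3, 3, 3, 44, 44, 44),
  StateName = c("UTAH", "UTAH", "TEXAS", "TEXAS", "TEXAS", "OHIO", "OHIO", "OHIO"),
  stringsAsFactors = FALSE
)

## Distinct (StateCode, StateName) pairs, stored once as a lookup table
lookup <- unique(dat[, c("StateCode", "StateName")])

## The main table keeps only the numeric code
slim <- dat[, c("rowNum", "StateCode")]

## The names can be recovered by joining on the code;
## merge() sorts by the key, so restore the row and column order afterwards
restored <- merge(slim, lookup, by = "StateCode")
restored <- restored[order(restored$rowNum), names(dat)]
```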
