r - How to convert genotyping data

Question

I have a dataframe called mydf (dim is about 446664 x 234; dput provided below). This dataframe has the columns REF and ALT.

REF has only one letter in every row, but ALT can contain one, two, or three letters separated by commas (","). The remaining columns (the sample columns) are the ones where all the work has to be done.

Treating the letter in REF as 0, the first letter in ALT as 1, the second as 2, and the third as 3, I need to write a function that does the following:

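Replace the numbers in all of the sample columns (i.e. every column except REF and ALT) with the corresponding letters.
Fill missing calls ("./.") with NA/NA, then collapse the "/" so that every cell holds a pair of letters.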
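As a concrete illustration of this encoding, here is a minimal sketch for a single cell, using the second row of the sample data (REF = "T", ALT = "C,G", genotype "0/1"). The variable names are hypothetical and only serve the example.

```r
# One cell from the sample data: row 2, column X860
ref <- "T"
alt <- "C,G"
genotype <- "0/1"

# Position 1 holds REF (code 0); positions 2..4 hold the ALT letters (codes 1..3)
alleles <- c(ref, strsplit(alt, ",", fixed = TRUE)[[1]])

# Split the genotype on "/" and shift the 0-based codes to R's 1-based positions
idx <- as.integer(strsplit(genotype, "/", fixed = TRUE)[[1]]) + 1L

# Look up both letters and collapse them into one pair
pair <- paste(alleles[idx], collapse = "")
print(pair)  # "TC"
```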

Finally, all the sample columns need to be flipped into rows (transposed), as shown in the expected output. Thank you!

mydf<-
structure(list(REF = structure(c(1L, 4L, 3L, 2L, 3L), .Label = c("A", 
"C", "G", "T"), class = "factor"), ALT = structure(c(6L, 6L, 
1L, 9L, 1L), .Label = c("A", "A,C", "A,G", "A,T", "C", "C,G", 
"C,T", "G", "G,T", "T"), class = "factor"), X860 = structure(c(1L, 
3L, 2L, 1L, 1L), .Label = c("./.", "0/0", "0/1", "0/2", "1/1"
), class = "factor"), X861 = structure(c(1L, 6L, 2L, 1L, 1L), .Label = c("./.", 
"0/0", "0/1", "0/2", "1/1", "1/2"), class = "factor"), X862 = structure(c(6L, 
3L, 1L, 2L, 1L), .Label = c("./.", "0/0", "0/1", "0/2", "1/1", 
"2/2"), class = "factor")), .Names = c("REF", "ALT", "X860", 
"X861", "X862"), row.names = c(NA, -5L), class = "data.frame")

Expected output:

X860 NANA TC GG NANA NANA
X861 NANA CG GG NANA NANA 
X862 GG TC NANA CC NANA

score 4 · Accepted Answer

I came up with this, but I'm not sure about its performance:

letters <- strsplit(paste(mydf$REF,mydf$ALT,sep=","),",") # concatenate the letters to have an index to work on from the numbers
values <- t(mydf[,3:ncol(mydf)]) # let's work on each column needing values
nbval <- ncol(values) # Save time for later and save the length of values 

#Prepare the two temp vectors used later
chars <- vector("character",2) 
ret <- vector("character",nbval)

#Loop over the rows (and transpose the result)
t(sapply(rownames(values),
   function(x) { 
     indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes

     for(i in 1:nbval) { # Loop over the number of columns :/
       for (j in 1:2) { # Loop over the pair 
         chars[j] <- ifelse(indexes[[i]][j] == ".", "NA",letters[[i]][as.integer(indexes[[i]][j])+1]) # Get NA if . or the letter with the correct index at this position
       }
       ret[i] <- paste0(chars[1],chars[2]) # concatenate the two chars
     }
     return(ret) # return this for this row
   }
))

Output with the sample data:

     [,1]   [,2] [,3]   [,4]   [,5]  
X860 "NANA" "TC" "GG"   "NANA" "NANA"
X861 "NANA" "CG" "GG"   "NANA" "NANA"
X862 "GG"   "TC" "NANA" "CC"   "NANA"

Updated version of the function, following the comments (the rest of the code is unchanged):

#Loop over the rows (and transpose the result)
t(sapply(rownames(values),
   function(x) {
     indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes
     for(i in 1:nbval) { # Loop over the number of columns :/
       if (values[x,i] == "./.") { # test if we have ./. and if yes, set to NA
         ret[i] <- "NA"
       } else { # if it's not ./. then try to find the corresponding letters
         for (j in 1:2) { # Loop over the pair 
           chars[j] <- ifelse(indexes[[i]][j] == ".", "NA",letters[[i]][as.integer(indexes[[i]][j])+1]) # Get NA if . or the letter with the correct index at this position
         }
         ret[i] <- paste0(chars[1],chars[2]) # concatenate the two chars
       }
     }
     return(ret) # return this for this row
   }
))

Output:

     [,1] [,2] [,3] [,4] [,5]
X860 "NA" "TC" "GG" "NA" "NA"
X861 "NA" "CG" "GG" "NA" "NA"
X862 "GG" "TC" "NA" "CC" "NA"
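If the nested loops turn out to be slow on the full data, one possible alternative is to look up all letters at once with matrix indexing. The following is only a sketch and has not been benchmarked. It assumes diploid calls of the form "a/b" and at most three ALT letters, as described in the question. The names allele_mat, gt, a1, a2, first and second are hypothetical, and the small mydf here is a plain-character stand-in for the sample data.

```r
# Stand-in for the sample data, with character columns instead of factors
mydf <- data.frame(
  REF  = c("A", "T", "G", "C", "G"),
  ALT  = c("C,G", "C,G", "A", "G,T", "A"),
  X860 = c("./.", "0/1", "0/0", "./.", "./."),
  X861 = c("./.", "1/2", "0/0", "./.", "./."),
  X862 = c("2/2", "0/1", "./.", "0/0", "./."),
  stringsAsFactors = FALSE
)

# One row per variant, four columns: REF, ALT1, ALT2, ALT3 (padded with NA)
alleles <- strsplit(paste(mydf$REF, mydf$ALT, sep = ","), ",", fixed = TRUE)
allele_mat <- t(vapply(alleles, function(a) a[1:4], character(4)))

# Genotype calls as a character matrix (variants x samples)
gt <- as.matrix(mydf[, -(1:2)])

# Codes before and after the "/"; "." becomes NA (the coercion warning is expected)
a1 <- suppressWarnings(as.integer(sub("/.*", "", gt)))
a2 <- suppressWarnings(as.integer(sub(".*/", "", gt)))

# Matrix indexing: each (row, column) pair picks one letter; NA indexes yield NA
rows <- as.vector(row(gt))
first  <- allele_mat[cbind(rows, a1 + 1L)]
second <- allele_mat[cbind(rows, a2 + 1L)]

# paste0() turns NA into "NA", so missing calls become "NANA"; then transpose
result <- t(matrix(paste0(first, second), nrow = nrow(gt), dimnames = dimnames(gt)))
print(result)
```

With the sample data this gives the same "NANA"-style cells as the first version above. To get a single "NA" per missing call, as in the updated version, the "NANA" entries could be replaced afterwards with result[result == "NANA"] <- "NA".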
