r - Fastest way to split a character vector into new rows in a data frame

Question

I wasn't sure how to phrase this when searching, so apologies if it has a simple answer.

I have 58 data frames that I'm reading from .csv files, each with up to 25,000 rows. They look like this:

Probe.Id     Gene.Id             Score.d
1418126_at   6352                28.52578
145119_a_at  2192                24.87866
1423477_at   NA                  24.43532
1434193_at   100506144///9204    6.22395

Ideally, I'd like to split the IDs on "///" and put them on new rows, like so:

Probe.Id     Gene.Id             Score.d
1418126_at   6352                28.52578
145119_a_at  2192                24.87866
1423477_at   NA                  24.43532
1434193_at   100506144           6.22395
1434193_at   9204                6.22395

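Using strsplit I can get Gene.Id as a list of character vectors, but once I have that, I'm not sure of the most efficient way to get each individual ID onto its own row with the correct values from the other columns. Ideally I would not just loop over 25,000 rows.

If anyone knows the right way to do this, I'd really appreciate it.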

Edit: I should add that there is a complicating factor: some rows have IDs like this:

333932///126961///653604///8350///8354///8355///8356///8968///8352///8358///8351///8353///8357

And I don't know what the maximum number of IDs in a row is.

score 6 · Accepted Answer

Edit: New solution after the OP's comment. It's very simple using data.table:

df <- structure(list(Probe.Id = c("1418126_at", "145119_a_at", "1423477_at", 
        "1434193_at", "100_at"), Gene.Id = c("6352", "2192", NA, 
        "100506144///9204", "100506144///100506146///100506148///100506150"), 
         Score.d = c(28.52578, 24.87866, 24.43532, 6.22395, 6.22395)), 
        .Names = c("Probe.Id", "Gene.Id", "Score.d"), row.names = c(NA, 5L), 
        class = "data.frame")

require(data.table)
dt <- data.table(df)
dt.out <- dt[, list(Probe.Id = Probe.Id, 
          Gene.Id = unlist(strsplit(Gene.Id, "///")), 
          Score.d = Score.d), by=1:nrow(dt)]

> dt.out

#    nrow    Probe.Id   Gene.Id  Score.d
# 1:    1  1418126_at      6352 28.52578
# 2:    2 145119_a_at      2192 24.87866
# 3:    3  1423477_at        NA 24.43532
# 4:    4  1434193_at 100506144  6.22395
# 5:    4  1434193_at      9204  6.22395
# 6:    5      100_at 100506144  6.22395
# 7:    5      100_at 100506146  6.22395
# 8:    5      100_at 100506148  6.22395
# 9:    5      100_at 100506150  6.22395

If /// is a fixed pattern, you can add fixed = TRUE to the strsplit expression to speed it up further.
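To illustrate what fixed = TRUE changes, here is a minimal sketch on a made-up vector of IDs (the ids vector is only an illustration, not the OP's data). The separator is matched as a literal string instead of being parsed as a regular expression, and for "///" the result is the same either way:

```r
# Illustrative input only: one plain ID, one NA, one multi-ID entry
ids <- c("6352", NA, "100506144///9204")

split_regex <- strsplit(ids, "///")                # pattern parsed as a regex
split_fixed <- strsplit(ids, "///", fixed = TRUE)  # pattern matched literally

# Both calls return the same list for this separator
identical(split_regex, split_fixed)
```

Any speed gain will depend on the data, so it is worth timing on your own files.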

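Alternative, again using data.table. strsplit is a vectorised operation, and running it on the whole Gene.Id column is much faster than running it one row at a time (even though data.table does run it very fast). So we can gain some more speed by splitting the previous code into two steps: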

# first split using strsplit (data.table can hold list in its columns!!)
dt[, Gene.Id_split := strsplit(dt$Gene.Id, "///", fixed=TRUE)]
# then just unlist them
dt.2 <- dt[, list(Probe.Id = Probe.Id, 
                  Gene.Id = unlist(Gene.Id_split), 
                  Score.d = Score.d), by = 1:nrow(dt)]

I replicated the data.table shown in this example many times until I had 295245 rows. Then I ran a benchmark using rbenchmark:

# first function
DT1 <- function() {
    dt.1 <- dt[, list(Probe.Id = Probe.Id, 
             Gene.Id = unlist(strsplit(Gene.Id, "///", fixed = TRUE)), 
             Score.d = Score.d), by=1:nrow(dt)]
}

# expected to be faster function
DT2 <- function() {
    dt[, Gene.Id_split := strsplit(dt$Gene.Id, "///", fixed=TRUE)]
    # then just unlist them
    dt.2 <- dt[, list(Probe.Id = Probe.Id, Gene.Id = unlist(Gene.Id_split), Score.d = Score.d), by = 1:nrow(dt)]
}

require(rbenchmark)
benchmark(DT1(), DT2(), replications=10, order="elapsed")

#    test replications elapsed relative user.self sys.self
# 2 DT2()           10  15.708    1.000    14.390    0.391
# 1 DT1()           10  24.957    1.589    23.723    0.436

In this example it is about 1.6 times faster. However, this depends on the number of entries that contain ///. Hope this helps.
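The same "split once over the whole column" idea can also be sketched in base R, without data.table. This is not part of the benchmark above and no timing is claimed for it; it assumes Gene.Id is a character column and relies on lengths() (available in R 3.2.0 and later) to count the IDs per row:

```r
# Small example data in the same shape as the question (character columns)
df <- data.frame(
  Probe.Id = c("1418126_at", "1423477_at", "1434193_at"),
  Gene.Id  = c("6352", NA, "100506144///9204"),
  Score.d  = c(28.52578, 24.43532, 6.22395),
  stringsAsFactors = FALSE
)

ids <- strsplit(df$Gene.Id, "///", fixed = TRUE)  # list of character vectors
n   <- lengths(ids)                               # number of IDs in each row

# Repeat the other columns once per ID, then flatten the list of IDs
out <- data.frame(
  Probe.Id = rep(df$Probe.Id, times = n),
  Gene.Id  = unlist(ids),
  Score.d  = rep(df$Score.d, times = n),
  stringsAsFactors = FALSE
)
```

Because rep() uses the per-row counts, this handles any number of IDs in a row, and rows with NA are kept as a single row.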

Old solution (kept for continuity):

One way is to: 1) find the positions where /// occurs, 2) extract, 3) duplicate, 4) sub, and 5) combine them.

df <- structure(list(Probe.Id = structure(c(1L, 4L, 2L, 3L), 
         .Label = c("1418126_at", "1423477_at", "1434193_at", "145119_a_at"), 
         class = "factor"), Gene.Id = structure(c(3L, 2L, NA, 1L), 
         .Label = c("100506144///9204", "2192", "6352"), class = "factor"), 
         Score.d = c(28.52578, 24.87866, 24.43532, 6.22395)), 
         .Names = c("Probe.Id", "Gene.Id", "Score.d"), 
         class = "data.frame", row.names = c(NA, -4L))

# 1) get the positions of "///"
idx <- grepl("[/]{3}", df$Gene.Id)

# 2) create 3 data.frames
df1 <- df[!idx, ] # don't touch this.
df2 <- df[idx, ] # we need to work on this

# 3) duplicate
df3 <- df2 # duplicate it.

# 4) sub
df2$Gene.Id <- sub("[/]{3}.*$", "", df2$Gene.Id) # replace the end
df3$Gene.Id <- sub("^.*[/]{3}", "", df3$Gene.Id) # replace the beginning

# 5) combine/put them back
df.out <- rbind(df1, df2, df3)

# if necessary sort them here.

score 2 · Accepted Answer

Here is a solution using strsplit and merge:

dat <- read.table(text ='Probe.Id     Gene.Id             Score.d
1418126_at   6352                28.52578
145119_a_at  2192                24.87866
1423477_at   NA                  24.43532
1434193_at   100506144///9204    6.22395',header=T,stringsAsFactors=F)
dat1 <- dat
xx <- do.call(rbind,strsplit(dat$Gene.Id,split='///'))
dat[which(xx[,1]!=xx[,2]),2]  <- xx[which(xx[,1]!=xx[,2]),1]
dat1[which(xx[,1]!=xx[,2]),2]  <- xx[which(xx[,1]!=xx[,2]),2]
  merge(dat,dat1,all.y=T,all.x=T)
     Probe.Id   Gene.Id  Score.d
1  1418126_at      6352 28.52578
2  1423477_at      <NA> 24.43532
3  1434193_at 100506144  6.22395
4  1434193_at      9204  6.22395
5 145119_a_at      2192 24.87866

score 2 · Accepted Answer

Here is a method that uses the data.frame constructor, relying on its "feature" of silently recycling input vectors.

do.call(rbind, 
        apply(dat, 1, function(x) 
                         data.frame(Probe.ID=x['Probe.Id'], 
                                    Gene.Id=strsplit(x['Gene.Id'], '///'),
                                    Score.d=x['Score.d'],
                                    row.names=NULL
                                   )
             )
        )
##      Probe.ID   Gene.Id  Score.d
## 1  1418126_at      6352 28.52578
## 2 145119_a_at      2192 24.87866
## 3  1423477_at      <NA> 24.43532
## 4  1434193_at 100506144  6.22395
## 5  1434193_at      9204  6.22395
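The recycling this answer relies on can be seen in isolation with a minimal sketch (the values are taken from the example row; the variable name rec is only for illustration). A length-1 argument is repeated to match the longest column:

```r
# Probe.Id and Score.d have length 1; Gene.Id has length 2,
# so the length-1 values are recycled to fill both rows
rec <- data.frame(Probe.Id = "1434193_at",
                  Gene.Id  = c("100506144", "9204"),
                  Score.d  = 6.22395)
nrow(rec)
```

This is why one data.frame() call per input row is enough, however many IDs that row holds.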
