r - Extract values from a transition matrix using the columns of a data.frame in R

Question

I have a transition matrix holding the cost of moving from one state to another.

cost <- data.frame( a=c("aa","ab"),b=c("ba","bb"))

(Pretend that the string "aa" is the cost of moving from a to a.)

I also have the following data.frame of states:

transitions <- data.frame( from=c("a","a","b"), to=c("a","b","b") )

I want to add a column to transitions holding the cost of each transition, so that I end up with this:

  from to cost
1    a  a   aa
2    a  b   ab
3    b  b   bb

I'm sure there is an R-ish way to do this. I ended up using a for loop:

n <- dim(data)[1]
v <- vector("numeric",n)
for( i in 1:n ) 
{ 
    z<-data[i,c(col1,col2),with=FALSE]
    za <- z[[col1]]
    zb <- z[[col2]]
    v[i] <- dist[za,zb]
}
data <- cbind(data,d=v)
names(data)[dim(data)[2]] <- colName
data

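But this is rather ugly and incredibly slow: it takes about 20 minutes on a 2M-row data.frame (an operation that computes the distances between elements of the same table takes less than a second).

Is there a simple, fast one- or two-line command that produces the cost column above?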

score 3 · Accepted Answer

Update: taking the known states into account

A data.table solution:

require(utils)
require(data.table)

## Data generation
N <- 2e6
set.seed(1)
states <- c("a","b")
cost <- data.frame(a=c("aa","ab"),b=c("ba","bb"))
transitions <- data.frame(from=sample(states, N, replace=TRUE), 
                            to=sample(states, N, replace=TRUE))

## Expanded cost matrix construction
f <- expand.grid(states, states)
f <- f[order(f$Var1, f$Var2),]
f$cost <- unlist(cost)

## Prepare data.table
dt <- data.table(transitions)
setkey(dt, from, to)

## Routine itself  
dt[,cost:=as.character("")] # You don't need this line if cost is numeric
apply(f, 1, function(x) dt[J(x[1],x[2]),cost:=x[3]])

For a transitions table with 2M rows, this takes about 0.3 seconds.
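The same lookup can also be written as one update join rather than one call per state pair. This is only a minimal sketch: it assumes a data.table version that supports the on= argument, and the name cost_long is made up here for illustration.

```r
library(data.table)

states <- c("a", "b")
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"),
                   stringsAsFactors = FALSE)

# Long-format cost table (hypothetical name): one row per (from, to) pair.
# Columns of cost are the "from" state and rows are the "to" state,
# so unlist() walks the pairs in the order a->a, a->b, b->a, b->b.
cost_long <- data.table(
  from = rep(states, each  = length(states)),
  to   = rep(states, times = length(states)),
  cost = unlist(cost, use.names = FALSE)
)

dt <- data.table(from = c("a", "a", "b"), to = c("a", "b", "b"))

# Update join: for every matching (from, to) pair, copy the cost into dt.
# The row order of dt is left unchanged.
dt[cost_long, cost := i.cost, on = c("from", "to")]
```

Because the join matches on the values of from and to, it does not depend on how the rows of either table are ordered.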

score 2 · Accepted Answer

Here is one way (at least it works for this example, and I think it should work on larger data as well; if not, please write back with an example):

# load both cost and transition with stringsAsFactors = FALSE
# so that strings are NOT by default loaded as factors
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb"), stringsAsFactors=FALSE)
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b"), 
                                      stringsAsFactors = FALSE)

# convert cost to vector: it'll have names a1, a2, b1, b2. we'll exploit that.
cost.vec <- unlist(cost)
# convert "to" to factor and create id column with "from" and as.integer(to)
# the as.integer(to) will convert it into its levels
transitions$to <- as.factor(transitions$to)
transitions$id <- paste0(transitions$from, as.integer(transitions$to))

# now, you'll have a1, a2 etc. here as well; look each id up by name in the vector
# (indexing by name returns the costs in the row order of transitions)
transitions$val <- unname(cost.vec[transitions$id])

#   from to id val
# 1    a  a a1  aa
# 2    a  b a2  ab
# 3    b  b b2  bb

Of course, you can drop the id column afterwards. If this doesn't work in some case, let me know and I'll try to fix it.
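A related way to look costs up by name, without depending on factor levels, is to index the cost matrix with a two-column character matrix. This is a minimal sketch in base R. It assumes the rows of cost are labelled with the state names (set by hand here) and that columns are the "from" state and rows the "to" state, as the example in the question implies. The name cost_mat is made up for illustration.

```r
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"),
                   stringsAsFactors = FALSE)
rownames(cost) <- c("a", "b")  # assumption: rows follow the same state order

transitions <- data.frame(from = c("a", "a", "b"), to = c("a", "b", "b"),
                          stringsAsFactors = FALSE)

# Convert to a character matrix so it can be indexed by (row, column) names.
cost_mat <- as.matrix(cost)

# A two-column character matrix used as an index is matched against the
# dimnames: the first column picks the row ("to"), the second the column ("from").
transitions$cost <- cost_mat[cbind(transitions$to, transitions$from)]
```

The lookup is fully vectorized, so no explicit loop over the rows of transitions is needed.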

score 2 · Accepted Answer

Starting from Arun's answer, I ended up with the following:

library(reshape)
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb") )
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b") )
row.names(cost) <- c("a","b") #Normally get this from the csv file
cost$from <- row.names(cost)
m <- melt(cost, id.vars=c("from"))
m$transition = paste(m$from,m$variable)
transitions$transition=paste(transitions$from,transitions$to)
merge(m, transitions, by.x="transition",by.y="transition")

It's a few more lines, but I don't quite trust the ordering of factors as an index. It also means that, if they are data.tables, you can do:

setkey(m,transition)
setkey(transitions,transition)
m[transitions]

I haven't benchmarked it, but on large datasets I'm confident the data.table join will be faster than merge or the vector-scan approaches.
