r - CSV to Vowpal input format - optimizing slow R code

Question

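I am trying to build a fast translator from CSV to the Vowpal input format. I found some good code related to libsvm and based my example on it. It works well on the small Titanic data set provided below, but my real data set has over 4.5 million observations with more than 200 features. With the code provided, that would take 3 days on a powerful server.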

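Is there a way to remove the single loop here? Keep in mind that Vowpal has its own sparse format, so the code has to check the indexes on every row to leave out the 0s and NAs (unlike a data frame, Vowpal doesn't need every row to carry the same number of features). I am fine with writing each line to a file rather than holding all the lines in memory. Any solutions would be much appreciated!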

# sample data set
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt',sep='\t')
titanicDF  <- titanicDF  [c("PClass", "Age", "Sex", "Survived")]

# target variable
y <- titanicDF$Survived
lineHolders <- c()
for ( i in 1:nrow( titanicDF  )) {

    # find indexes of nonzero values - anything 
    # with zero for that row needs to be ignored!
    indexes = which( as.logical( titanicDF [i,] ))
    indexes <- names(titanicDF [indexes])

    # nonzero values
    values = titanicDF [i, indexes]

    valuePairs = paste( indexes, values, sep = ":", collapse = " " )

    # add label in the front and newline at the end
    output_line = paste0(y[i], " |f ", valuePairs, "\n", sep = "" )

    lineHolders <- c(lineHolders, output_line)
}

score 1 · Accepted Answer

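To address your initial question about looping over rows: up to a point, it seems to be faster to process this by the columns of the data frame rather than by the rows. I've put your code into a function called func_Row, as shown below.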

func_Row <- function(titanicDF) {
    # target variable
    y <- titanicDF$Survived
    lineHolders <- c()
    for ( i in 1:nrow( titanicDF  )) {
        # find indexes of nonzero values - anything 
        # with zero for that row needs to be ignored!
        indexes = which( as.logical( titanicDF [i,] ))
        indexes <- names(titanicDF [indexes])
        # nonzero values
        values = titanicDF [i, indexes]
        valuePairs = paste( indexes, values, sep = ":", collapse = " " )
        # add label in the front and newline at the end
        output_line = paste0(y[i], " |f ", valuePairs, "\n", sep = "" )
        lineHolders <- c(lineHolders, output_line)
    }
    return(lineHolders)
}

I then put together another function that processes by column:

 func_Col <- function(titanicDF) {
 lineHolders <- paste(titanicDF$Survived, "|f")
 for( ic in 1:ncol(titanicDF)) {
   nonzeroes <- which(as.logical(as.numeric(titanicDF[,ic]))) 
   lineHolders[nonzeroes] <- paste(lineHolders[nonzeroes]," ",names(titanicDF)[ic], ":", as.numeric(titanicDF[nonzeroes,ic]),sep="") 
 }
 lineHolders <- paste(lineHolders,"\n",sep="")
 return(lineHolders)
 }
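To make the column-wise idea concrete, here is a minimal sketch on a toy data frame. The name vw_lines, its label_col argument, and the toy values are hypothetical and not part of the answer. The sketch assumes all-numeric columns, leaves the label column out of the features (func_Col does not do that), and omits the trailing newline:

```r
# Hypothetical helper: build Vowpal-style lines one column at a time.
# Assumes every column is numeric; zeros and NAs are left out of a row.
vw_lines <- function(df, label_col) {
  lines <- paste(df[[label_col]], "|f")
  for (nm in setdiff(names(df), label_col)) {
    x <- df[[nm]]
    keep <- which(!is.na(x) & x != 0)   # rows where this feature is present
    if (length(keep) == 0) next
    lines[keep] <- paste0(lines[keep], " ", nm, ":", x[keep])
  }
  lines
}

# Toy data (hypothetical values): y is the label, a and b are features.
toy <- data.frame(y = c(1, 0, 1), a = c(2, 0, NA), b = c(0, 5, 3))
vw_lines(toy, "y")
# [1] "1 |f a:2" "0 |f b:5" "1 |f b:3"
```

Each pass of the loop touches one whole column with vectorized operations, so the number of R-level iterations is the number of columns rather than the number of rows.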

Comparing these two functions using microbenchmark gives the following results:

microbenchmark( func_Row(titanicDF), func_Col(titanicDF), times=10)
Unit: milliseconds
            expr        min         lq     median         uq       max neval
func_Row(titanicDF) 370.396605 375.210624 377.044896 385.097586 443.14042    10
func_Col(titanicDF)   6.626192   6.661266   6.675667   6.798711  10.31897    10

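Note that the results for this data set are in milliseconds, so processing by columns is roughly 50 times faster than processing by rows (median about 6.7 ms versus 377 ms). It is fairly straightforward to address the memory problem while keeping the benefit of column-wise processing by reading the data in blocks of rows. I created a file of about 5,300,000 rows based on the Titanic data as follows: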

titanicDF_big <- titanicDF
for( i in 1:12 )  titanicDF_big <- rbind(titanicDF_big, titanicDF_big)
write.table(titanicDF_big, "titanic_big.txt", row.names=FALSE )
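As a quick check of the arithmetic behind the loop above: each rbind of a data frame with itself doubles the row count, so 12 passes multiply it by 2^12 = 4096. A small sketch with a hypothetical three-row stand-in data frame:

```r
# Hypothetical stand-in for the Titanic data frame.
small <- data.frame(x = 1:3)

big <- small
for (i in 1:12) big <- rbind(big, big)  # each pass doubles the number of rows

nrow(big) / nrow(small)
# [1] 4096
```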

This file can then be read in blocks of rows using the following function:

read_blocks <- function(file_name, row_max = 6000000L, row_block = 5000L ) {
#   Version of code using func_Col to process data by columns
blockDF = NULL
for( row_num in seq(1, row_max, row_block)) { 
  if( is.null(blockDF) )  {
    # first block: read the header line plus row_block data rows
    blockDF <- read.table(file_name, header=TRUE, nrows=row_block)
    lineHolders <- func_Col(blockDF)
  }  
  else  {
    # later blocks: skip the header line plus the row_num - 1 data rows
    # already read, so the last row of the previous block is not repeated
    blockDF <- read.table(file_name, header=FALSE, col.names=names(blockDF),
                            nrows=row_block, skip = row_num)
    lineHolders <- c(lineHolders, func_Col(blockDF))
  }
}
return(lineHolders)
}

Benchmark results using this version of read_blocks, which uses func_Col to process the data by columns, are shown below for reading the entire expanded Titanic data file with block sizes ranging from 500,000 to 2,000,000 rows:

Unit: seconds
                                                                 expr      min       lq       median       uq      max neval
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 2000000L) 39.43244 39.43244 39.43244 39.43244 39.43244     1
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 1000000L) 46.66375 46.66375 46.66375 46.66375 46.66375     1
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 500000L) 62.51387 62.51387 62.51387 62.51387 62.51387     1

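Larger block sizes give noticeably better times but require more memory. However, these results show that, by processing the data by columns, the entire 5.3 million row expanded Titanic data file can be read in about a minute, even with a block size equal to roughly 10% of the file size. Again, results will depend on the number of columns of data and on system characteristics.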
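Since the question says that writing each line to a file is acceptable, one possible variation is to stream every block straight to an output file instead of accumulating all lines in memory. The following is only a sketch under stated assumptions: stream_to_vw, vw_lines, label_col, and the toy data are hypothetical names and values, the input is assumed to be a write.table-style file with numeric columns, and the label column is left out of the features. Keeping the input connection open lets each block continue where the previous one stopped, so no lines are skipped and re-read.

```r
# Hypothetical helper: column-wise line builder (assumes numeric columns).
vw_lines <- function(df, label_col) {
  lines <- paste(df[[label_col]], "|f")
  for (nm in setdiff(names(df), label_col)) {
    x <- df[[nm]]
    keep <- which(!is.na(x) & x != 0)   # drop zeros and NAs for this feature
    if (length(keep) == 0) next
    lines[keep] <- paste0(lines[keep], " ", nm, ":", x[keep])
  }
  lines
}

# Hypothetical driver: convert in_file to out_file, `block` rows at a time.
stream_to_vw <- function(in_file, out_file, label_col, block = 5000L) {
  con_in <- file(in_file, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_file, open = "w")
  on.exit(close(con_out), add = TRUE)

  # Header line as written by write.table(): quoted, space-separated names.
  col_names <- scan(text = readLines(con_in, n = 1), what = "", quiet = TRUE)

  repeat {
    raw <- readLines(con_in, n = block)  # continues from the current position
    if (length(raw) == 0) break          # end of file reached
    df <- read.table(text = raw, header = FALSE, col.names = col_names)
    writeLines(vw_lines(df, label_col), con_out)  # writeLines adds the newline
  }
  invisible(out_file)
}

# Usage on a toy file (hypothetical values), with a block size of 2 rows.
toy <- data.frame(y = c(1, 0, 1), a = c(2, 0, NA), b = c(0, 5, 3))
in_file <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".vw")
write.table(toy, in_file, row.names = FALSE)
stream_to_vw(in_file, out_file, label_col = "y", block = 2L)
readLines(out_file)
# [1] "1 |f a:2" "0 |f b:5" "1 |f b:3"
```

With this layout, peak memory is governed by the block size rather than by the total number of rows. The timings reported above do not apply to this sketch, which has not been benchmarked.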
