r - Efficiency of converting counts to percentages and index scores

Question

I currently have the following code, which produces the results I need (Data_Index and Data_Percentages):

Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))

Rows <-nrow(Input_Data)
Vars <-ncol(Input_Data) - 1

#Total population column
TotalCount <- Input_Data[1]

#Total population sum
TotalCountSum  <- sum(TotalCount)
Input_Data[1]  <- NULL
VarNames       <- colnames(Input_Data)
Data_Per_Row   <- c()
Data_Index_Row <- c()

for (i in 1:Rows) {

    #Proportion of all areas population found in this row
    OAPer <- TotalCount[i, ] / TotalCountSum * 100

    Data_Per_Col   <- c()
    Data_Index_Col <- c()

    for(u in 1:Vars) {
        # For every column value in the selected row 
        # the percentage of that value compared to the 
        # total population (TotalCount) for that row is calculated
        VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100

        # Once the percentage is calculated the index 
        # score is calculated by dividing this percentage 
        # by the proportion of the total population in that 
        # area compared to all areas
        VarIndex <- VarPer / OAPer * 100

        # Binds results for all columns in the row
        Data_Per_Col   <- cbind(Data_Per_Col, VarPer)
        Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
    }

    # Binds results for completed row with previously completed rows
    Data_Per_Row   <- rbind(Data_Per_Row, Data_Per_Col) 
    Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col) 
}
colnames(Data_Per_Row)   <- VarNames
colnames(Data_Index_Row) <- VarNames

# Changes the index scores to range from -1 to 1
OldRange   <- (max(Data_Index_Row) - min(Data_Index_Row))  
NewRange   <- (1 - -1)  
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row

# Final outputs
Data_Index
Data_Percentages

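The problem I have is that the code is very slow. I want to be able to use it on a dataset with 200,000 rows and 200 columns (with the current code this would take around 4 days). I am sure there must be a way to speed this process up, but I am not sure exactly how.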

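What the code does is take (in this example) a table of population counts split by age band and area, and convert it into percentages and index scores. At the moment there are two loops, so every value in every row and column is selected individually and has the calculations performed on it. I assume these loops are what makes it slow. Is there an alternative that produces the same results faster? Thanks for any help.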

score 0 · Accepted Answer

Use apply to compute OAPer, which gets rid of the "i" loop:

 OAPer<-apply(TotalCount,1,
                   function(x,tcs)x/tcs*100,
                   tcs = TotalCountSum)

Similarly, you can vectorize the work inside the "u" loop. Some comments in your code would be appreciated.
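As a rough illustration of what vectorizing both loops could look like, here is a minimal sketch on a made-up two-row table. The data frame `toy` and its values are hypothetical, and `sweep` is just one of several base R ways to divide each row by its own total; the variable names are not taken from the question.

```r
# Toy counts (hypothetical values): the first column is the row total
toy <- data.frame(
  Total = c(100, 300),
  Young = c(40, 90),
  Old   = c(60, 210),
  row.names = c("AreaA", "AreaB")
)

total_count <- toy$Total             # numeric vector, one total per row
counts      <- as.matrix(toy[, -1])  # numeric matrix of the remaining columns

# Share of the overall population found in each row (replaces the "i" loop)
oa_per <- total_count / sum(total_count) * 100

# Each count as a percentage of its row total (replaces the "u" loop)
var_per <- sweep(counts, 1, total_count, "/") * 100

# Index score: row percentage divided by that row's overall share
var_index <- sweep(var_per, 1, oa_per, "/") * 100
```

Because every step operates on the whole matrix at once, no per-cell `cbind`/`rbind` growth is needed, which is usually where most of the time goes in the loop version.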

score 0 · Accepted Answer

Here is the whole code. The for-loop is not necessary, and neither is apply. The division can be implemented by dividing the whole matrix at once.

df <- Input_Data

total_count <- df[, 1]
total_sum   <- sum(total_count)

df <- df[, -1]

# equivalent of your for-loop
oa_per <- total_count/total_sum * 100
Data_Per_Row <- df/matrix(rep(total_count, each=ncol(df)), ncol=ncol(df), byrow=TRUE)*100
Data_Index_Row <- Data_Per_Row/oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)

# rest of your code: identical
OldRange = max(Data_Index_Row) - min(Data_Index_Row)
NewRange = (1 - -1)
Data_Index = (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
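For a table with many columns it may be convenient to wrap the steps above in a function that does not depend on the column count. The following is only a sketch: `counts_to_scores` and the `small` example data are hypothetical names and values, and it assumes the first column holds the row totals, as in the question. It relies on R recycling a vector down the columns of a matrix, so row i is divided by the i-th total.

```r
# Hypothetical helper: assumes the first column of `df` holds the row totals
counts_to_scores <- function(df) {
  total_count <- df[[1]]
  counts      <- as.matrix(df[, -1, drop = FALSE])

  # Share of the overall population in each row
  oa_per <- total_count / sum(total_count) * 100

  # The vector is recycled down each column, so row i is divided by total_count[i]
  percentages <- counts / total_count * 100
  index_raw   <- percentages / oa_per * 100

  # Rescale the raw index linearly onto [-1, 1]
  rng   <- range(index_raw)
  index <- (index_raw - rng[1]) / (rng[2] - rng[1]) * 2 - 1

  list(percentages = percentages, index = index)
}

# Small made-up example
small <- data.frame(Total = c(100, 300), Young = c(40, 90), Old = c(60, 210))
res   <- counts_to_scores(small)
res$percentages
res$index
```

Working on a numeric matrix rather than a data frame tends to be noticeably faster for arithmetic on large tables, though the actual gain on the 200,000 x 200 dataset would need to be measured.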
