python - R で数千のファイルを読み込んで処理する

Question

私はRに少し慣れていないので、ここで初心者の色合いを許してください...

各ファイルのデータに対して関数を実行し、結果の値をベクトルに格納するスクリプトで、保存された数千のデータフレーム (ファイル) をロードするコードを R で書いています。さまざまな機能でこれを何度も行う必要があり、現時点では非常に時間がかかっています。

マルチコア mclapply を使用してプロセスを並列化しようとしていますが、残念ながら、2 コアから 8 コアまでは、単に 1 コアで実行するよりもさらに時間がかかるようです。

ディスク I/O の制限により、この考えは根本的に不健全ですか? マルチコア、あるいは R でさえ、適切なソリューションではないのでしょうか? Python などでファイルを開き、内容に対して R 関数を実行するのは、R よりも優れたパスでしょうか?

これに関するガイダンスや考えをいただければ幸いです -

わかりやすくするために追加されたコード:

    library(multicore)

    project.path = "/pathtodata/"

    #This function reads the file location and name, then loads it and runs a simple statistic
    running_station_stats <- function(nsrdb_stations)
    {
      varname <- "column_name"
      load(file = paste(project.path, "data/",data_set_list[1], sep = ""))
      tempobj <- as.data.frame(coredata(get(data_set_list[2])))
      mean(tempobj[[varname]],na.rm=TRUE)
    }



    options(cores = 2)

    #This file has a list of R data files data_set_list[1] and the names they were created with data_set_list[2]
    load(file = paste(project.path, "data/data_set_list.RData", sep = ""))

    thelist <- list()

    thelist[[1]] <- data_set_list[1:50,]

    thelist[[2]] <- data_set_list[51:100,]

    thelist[[3]] <- data_set_list[101:150,]

    thelist[[4]] <- data_set_list[151:200,]


    #All three of these are about the same speed to run regardless of the num of cores
    system.time(
    {
      apply(nsrdb_stations[which(nsrdb_stations$org_stations==TRUE),][1:200,],1,running_station_stats)
    })

    system.time(
      lapply(thelist, apply, 1, running_station_stats)
     )

    system.time(
      mclapply(thelist, apply, 1, running_station_stats)
    )

score 1 · Accepted Answer

Python と R はどちらも、計算処理などに複数のコアを使用しようとします。多数のファイルの読み取りには役立ちません。複数のスレッドも答えではありません(re python GIL )。

いくつかの可能な解決策 (単純なものではありません) は次のとおりです。

ファイル io async の (一部の) 可能な場所で twisted のようなものを使用します。プログラミングが難しく、numpy フレンドリーではありません。
Celery またはその他の自作のマスタースレーブソリューションを使用します。独自のアクションをたくさん転がしてください。
Ipython (w/ ipcluster ) を使用して、python が再結合する複数のプロセスを生成します (BEST ソリューション IMO)

score 0 · Accepted Answer

まず、古き良きマルチプロセッシングを Python で試してみます。上記のオプションもすべて可能です。multiprocessing モジュールを使用してバッチジョブを実行する例を次に示します。


import multiprocessing as mp
import time

def worker(x):
    time.sleep(0.2)
    print "x= %s, x squared = %s" % (x, x*x)
    return x*x

def apply_async():
    pool = mp.Pool()
    for i in range(100):
        pool.apply_async(worker, args = (i, ))
    pool.close()
    pool.join()

if name == 'main':
    apply_async()

出力は次のようになります。


x= 0, x squared = 0
x= 1, x squared = 1
x= 2, x squared = 4
x= 3, x squared = 9
x= 4, x squared = 16
x= 6, x squared = 36
x= 5, x squared = 25
x= 7, x squared = 49
x= 8, x squared = 64
x= 10, x squared = 100
x= 11, x squared = 121
x= 9, x squared = 81
x= 12, x squared = 144

ご覧のとおり、数値は非同期で実行されているため、順序が正しくありません。上記の worker() 関数を変更して処理を行い、mp.Pool(10) または mp.Pool(15) などで同時プロセスの数を変更するだけです。そのようなものは、比較的まっすぐ進むべきです。. .

python - R で数千のファイルを読み込んで処理する

2 に答える 2

Related

Reference