python - "Embarrassingly parallel" programming using python and PBS on a cluster

Question

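I have a function (a neural network model) which produces figures. I want to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python, using PBS on a standard cluster with Torque.

Note: I tried parallelpython, ipython and the like, and was never completely satisfied. The cluster has a fixed configuration that I cannot change, and a solution integrating python + qsub like this would certainly benefit the community.

To keep things simple, assume a simple function such as: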

import pylab
import myModule

def model(input, a=1., N=100):
    # Run the (possibly hours-long) computation, then save the resulting figure.
    myModule.do_lots_number_crunching(input, a, N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')

where input is an object representing the input, input.name is a string, and do_lots_number_crunching may last hours.

My question is: is there a correct way to transform something like a scan of parameters such as

for a in pylab.linspace(0., 1., 100):
    model(input, a)

into "something" that would launch a PBS script like the following for every call to the model function?

#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py

I was thinking of a function that would include the PBS template and be called from the python script, but I have not been able to work it out yet (a decorator?).

score 4 · Accepted Answer

pbs_python[1] could work for this. If experiment_model.py takes 'a' as an argument, you could do

import os
import pbs
import pylab

# Connect to the default PBS server.
server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)

# Resource requests, mirroring the #PBS directives in the question.
attropl = pbs.new_attropl(4)
attropl[0].name  = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'

attropl[1].name  = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'

attropl[2].name  = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'

attropl[3].name = pbs.ATTR_V

script = '''
cd /data/work/
python experiment_model.py %f
'''

jobs = []

for a in pylab.linspace(0., 1., 100):
    # Write a one-off job script for this value of a, submit it, then clean up.
    script_name = 'experiment_model.job' + str(a)
    with open(script_name, 'w') as scriptf:
        scriptf.write(script % a)
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)

print(jobs)
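
The loop above passes a to experiment_model.py on the command line, so that script has to read it from there. A minimal sketch of how it might do that is below. The names parse_args and run_from_argv and the optional --N flag are my own illustrative choices, not part of the original code, and the script is assumed to receive a as its first positional argument.

```python
import argparse


def parse_args(argv):
    """Parse the scanned parameter `a` (and an optional N) from a list of strings."""
    parser = argparse.ArgumentParser(description="Run one point of the parameter scan")
    parser.add_argument("a", type=float, help="value of the scanned parameter")
    parser.add_argument("--N", type=int, default=100, help="same default as model()")
    return parser.parse_args(argv)


def run_from_argv(argv, model_fn):
    """Call model_fn(a, N) with the values found in argv and return its result."""
    args = parse_args(argv)
    return model_fn(args.a, args.N)


# In experiment_model.py you would finish with something like:
#     if __name__ == "__main__":
#         run_from_argv(sys.argv[1:], lambda a, N: model(my_input, a, N))
# where my_input is whatever input object your own code builds (an assumption).
```

Keeping the parsing in a function that takes the argument list makes it easy to check on the login node before any job is submitted.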

[1]: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage pbs_python

score 3 · Accepted Answer

You can do this easily using jug (which I developed for a similar setup).

You'd write in file (e.g., model.py):

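import numpy as np
from jug import TaskGenerator

@TaskGenerator
def model(param1, param2):
    # complex_computation and pyplot.coolgraph stand in for your own code
    res = complex_computation(param1, param2)
    pyplot.coolgraph(res)


for param1 in np.linspace(0, 1., 100):
    for param2 in range(2000):
        model(param1, param2)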

And that's it!

Now you can launch "jug jobs" on your queue: jug execute model.py and this will parallelise automatically. What happens is that each job will, in a loop, do something like:

while not all_done():
    for t in tasks_that_i_can_run():
        if t.lock_for_me(): t.run()

(It's actually more complicated than that, but you get the point).

It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.
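
To make the locking idea concrete, here is a rough sketch of claiming tasks through lock files on a shared directory. This is not jug's actual implementation. The names try_lock and run_unclaimed are made up for illustration, and whether exclusive file creation is reliable on a particular network filesystem depends on how that filesystem is set up.

```python
import os
import tempfile


def try_lock(lock_dir, task_name):
    """Atomically claim a task by creating its lock file; return True on success."""
    path = os.path.join(lock_dir, task_name + ".lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # process can create it and thereby own the task.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def run_unclaimed(tasks, lock_dir):
    """Run every task in `tasks` (a dict name -> callable) that nobody has claimed."""
    results = {}
    for name, func in tasks.items():
        if try_lock(lock_dir, name):
            results[name] = func()
    return results


# Two "workers" scanning the same task list: the second finds nothing left to do.
lock_dir = tempfile.mkdtemp()
tasks = {"a=0.0": lambda: 0.0, "a=0.5": lambda: 0.25}
first = run_unclaimed(tasks, lock_dir)
second = run_unclaimed(tasks, lock_dir)
```

Every job runs the same script and simply skips whatever another job has already locked, which is why no central dispatcher is needed.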

This is not exactly what you asked for, but I believe it's a cleaner architecture to separate this from the job queueing system.

score 2 · Accepted Answer

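I seem to be a little late to the party, but a few years ago I had the same question about how to map embarrassingly parallel problems onto a cluster in python, and I wrote my own solution. I recently uploaded it to github here: https://github.com/plediii/pbs_util

To write your program with pbs_util, first create a pbs_util.ini in the working directory containing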

[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00

Then write a python script like this

import pbs_util.pbs_map as ppm

import pylab
import myModule

class ModelWorker(ppm.Worker):

    def __init__(self, input, N):
        self.input = input
        self.N = N

    def __call__(self, a):
        myModule.do_lots_number_crunching(self.input, a, self.N)
        pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')



# You need "main" protection like this since pbs_map will import this file on the compute nodes
if __name__ == "__main__":
    input, N = something, picklable
    # Use list to force the iterator
    list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
                     startup_args=(input, N),
                     num_clients=100))

And that would do it.
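
Before submitting a hundred jobs, it can help to exercise the worker serially on a handful of values. The sketch below is a plain-Python stand-in for that purpose. local_map is a hypothetical helper and not part of pbs_util, and it assumes, as the example above suggests, that the worker is built once from startup_args and then called once per work item.

```python
def local_map(worker_class, work, startup_args=()):
    """Serial stand-in: build one worker, then call it on each work item."""
    worker = worker_class(*startup_args)
    for item in work:
        yield worker(item)


class SquareWorker(object):
    """Toy worker with the same __init__/__call__ shape as ModelWorker."""

    def __init__(self, offset):
        self.offset = offset

    def __call__(self, a):
        return a * a + self.offset


# Dry run on three values; swap in ModelWorker and (input, N) for a real check.
results = list(local_map(SquareWorker, [0.0, 0.5, 1.0], startup_args=(1.0,)))
```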

score 0 · Accepted Answer

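I am just getting started with clusters and EP applications. My goal (I'm with the library) is to learn enough to help other researchers on campus access HPC with EP applications, especially researchers outside of STEM. I am still very new, but I thought it might help this question to point out that GNU Parallel can be used in a PBS script to launch basic python scripts with varying arguments. In the .pbs file, there are two lines to point out: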

module load gnu-parallel # this is required on my environment

parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*

# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}`   this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.' 
#     will be substituted into the python call as the second(3rd) argument where the
#     `{}` resides.  These can be simple text files that you use in your 'simple.py'
#     script to pass the parameter sets, filenames, etc.

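As a newcomer to EP supercomputing I do not yet understand all the other options of "parallel", but this command allowed me to launch python scripts in parallel with different parameters. It works well if you can generate, ahead of time, a set of parameter files that parallelize your problem: for example, running simulations across a parameter space, or processing many files with the same code.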
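
As a rough illustration of generating the parameter files ahead of time, the sketch below writes one small file per parameter set using the input.* naming that the parallel command globs for. The JSON layout and the function names write_parameter_files and read_parameters are assumptions of mine. Any format that simple.py can read would do.

```python
import json
import os
import tempfile


def write_parameter_files(directory, a_values, N=100):
    """Write one JSON parameter file per value of `a`, named input.000, input.001, ..."""
    paths = []
    for index, a in enumerate(a_values):
        path = os.path.join(directory, "input.%03d" % index)
        with open(path, "w") as handle:
            json.dump({"a": a, "N": N}, handle)
        paths.append(path)
    return paths


def read_parameters(path):
    """What simple.py could do with the file name it receives in place of {}."""
    with open(path) as handle:
        return json.load(handle)


# Example: a 5-point scan written to a temporary directory.
input_dir = tempfile.mkdtemp()
paths = write_parameter_files(input_dir, [i / 4.0 for i in range(5)])
```

In a real run you would write the files into $INPUT_DIR instead of a temporary directory, so that the glob in the .pbs file picks them up.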
