I'm interested in distributing the training of my DNN model. However, I want the communication to happen via AWS S3 instead of over the local network. Why? I have a great batch/async compute cluster set up on Hashicorp Nomad. I'd love to distribute the model training by simply creating new batch jobs (e.g., one job per subsample/mini-batch), adding them to the Nomad job queue, and letting the cluster auto-scale to take on the work and send results back to a main parameter server. So I guess I'm trying to avoid needing to know all of the machines upfront, their network identities, etc. More of a serverless approach.
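To make that concrete, here's a rough sketch of what I imagine each worker job would look like. The bucket name, key layout, and the toy `compute_gradients` are all placeholders I made up for illustration, not anything that exists yet:

```python
import pickle

import boto3

BUCKET = "my-training-bucket"  # placeholder bucket name
s3 = boto3.client("s3")

def load_params(key="params/current.pkl"):
    """Fetch the latest parameters published by the parameter server."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return pickle.loads(body)

def save_gradients(grads, job_id, step):
    """Upload this worker's gradients for the parameter server to aggregate."""
    key = f"grads/step-{step}/{job_id}.pkl"
    s3.put_object(Bucket=BUCKET, Key=key, Body=pickle.dumps(grads))

def compute_gradients(params, minibatch):
    """Placeholder: in reality this would be a TensorFlow training step."""
    x, y = minibatch  # numpy arrays: x is (n, d), y is (n,)
    preds = x @ params["w"]
    return {"w": 2.0 * x.T @ (preds - y) / len(x)}  # MSE gradient, linear model

def run_job(job_id, step, minibatch):
    """One Nomad batch job: pull params, compute a gradient, push it back."""
    params = load_params()
    grads = compute_gradients(params, minibatch)
    save_gradients(grads, job_id, step)
```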

I'm already using batch compute jobs for the necessary preprocessing and some limited feature extraction, but can distributed training itself be framed as jobs in a queue with a fluctuating number of workers?
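On the parameter-server side, I picture something that periodically sweeps S3 for whatever gradients have landed, averages them, and publishes updated parameters. Again, the key scheme and the plain SGD update are just placeholders for the idea:

```python
import pickle

import boto3
import numpy as np

BUCKET = "my-training-bucket"  # placeholder bucket name
s3 = boto3.client("s3")

def aggregate_step(step, params, lr=0.01):
    """Average whatever gradients workers have uploaded for this step,
    apply a plain SGD update, and publish the new parameters."""
    prefix = f"grads/step-{step}/"
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    grads = []
    for obj in listing.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        grads.append(pickle.loads(body))
    if not grads:
        return params  # no workers have reported in yet; try again later

    # Average across however many workers happened to finish.
    avg = {k: np.mean([g[k] for g in grads], axis=0) for k in grads[0]}
    new_params = {k: v - lr * avg[k] for k, v in params.items()}

    s3.put_object(Bucket=BUCKET, Key="params/current.pkl",
                  Body=pickle.dumps(new_params))
    return new_params
```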

Is this even a thing? Or is it a bad idea because of the overhead of exchanging data via something like S3? I'm currently focused on TensorFlow, but we're early enough in the project that switching frameworks isn't off the table.
