tensorflow - TensorFlow Extended Kubeflow multiple workers

Question

I have a problem with TFX on the Kubeflow DAG Runner: each run starts only one pod. I can't find any configuration for "workers" other than the Apache Beam arguments, and those didn't help.

Running the CSV load in a single pod fails with an OOMKilled error because the file is larger than 5 GB. I tried splitting the file into 100 MB parts, but that didn't help either.
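For reference, the splitting step can be sketched roughly as follows. This is a minimal illustration, not the exact script used: the helper name `split_csv`, the `part-00000.csv` naming scheme, and splitting by row count (rather than by a 100 MB byte budget) are all assumptions made for the example. The header is repeated in every part because CsvExampleGen expects all CSV files in the input directory to share the same header.

```python
import csv
import os


def split_csv(src_path, out_dir, rows_per_part):
    """Stream src_path into part files of at most rows_per_part data rows.

    Hypothetical helper: every part file starts with the original header row.
    Returns the list of part file paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    with open(src_path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return part_paths  # empty input, nothing to split
        out = None
        writer = None
        rows_in_part = 0
        try:
            for row in reader:
                # Open a new part file when none is open or the current one is full.
                if out is None or rows_in_part >= rows_per_part:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, 'part-%05d.csv' % len(part_paths))
                    out = open(path, 'w', newline='')
                    writer = csv.writer(out)
                    writer.writerow(header)
                    part_paths.append(path)
                    rows_in_part = 0
                writer.writerow(row)
                rows_in_part += 1
        finally:
            if out is not None:
                out.close()
    return part_paths
```

Because the file is read row by row, the split itself needs very little memory. It does not change how many pods the component runs in, which would be consistent with the split not helping.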

So my question is: how can I run a TFX job/stage on Kubeflow with multiple "worker" pods, or is that even possible?

Here is the code I'm using:

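# Imports assumed for TFX 0.26.0 (not shown in the original snippet).
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner
from tfx.utils.dsl_utils import external_input

# data_root, pipeline_name and pipeline_root are defined elsewhere (not shown).
examples = external_input(data_root)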
example_gen = CsvExampleGen(input=examples)
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

dsl_pipeline = pipeline.Pipeline(
  pipeline_name=pipeline_name,
  pipeline_root=pipeline_root,
  components=[
      example_gen, statistics_gen
  ],
  enable_cache=True,
  beam_pipeline_args=['--num_workers=%d' % 5]
)


if __name__ == '__main__':
    tfx_image = 'custom-aws-imgage:tfx-0.26.0'
    config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=kubeflow_dag_runner.get_default_kubeflow_metadata_config(),
        tfx_image=tfx_image)
    kfp_runner = kubeflow_dag_runner.KubeflowDagRunner(config=config)
    # KubeflowDagRunner compiles the DSL pipeline object into KFP pipeline package.
    # By default it is named <pipeline_name>.tar.gz
    kfp_runner.run(dsl_pipeline)
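One possible reason `--num_workers` has no visible effect here: no `--runner` is passed, so Beam falls back to its in-process DirectRunner, and the component stays inside a single pod. As a hedged sketch rather than a confirmed fix, the DirectRunner flags below let that one pod use several worker processes. The helper name `direct_runner_args` is hypothetical. Spreading the work over several pods would, as far as I understand, mean pointing Beam at a distributed runner (for example Flink, Spark, or Dataflow) instead.

```python
def direct_runner_args(num_workers):
    """Build Beam flags that make the DirectRunner use multiple processes.

    Hypothetical helper. The flags are Apache Beam DirectRunner options;
    they parallelize work inside one pod and do not create extra pods.
    """
    if num_workers < 0:
        raise ValueError('num_workers must be >= 0')
    return [
        '--direct_running_mode=multi_processing',
        '--direct_num_workers=%d' % num_workers,
    ]


# Would replace the value passed as beam_pipeline_args in pipeline.Pipeline(...).
beam_pipeline_args = direct_runner_args(5)
```

More processes in the same pod can also increase memory use, so this alone may not avoid the OOMKilled error.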

Environment:

Docker image: tensorflow/tfx:0.26.0 with boto3 installed (for AWS-related issues)
Kubernetes: AWS EKS (latest)
Kubeflow: 1.0.4
