
I'm using Dataflow to process files stored in GCS and write the results to BigQuery tables. Below are my requirements:

  1. Input files contain event records; each record pertains to one eventType.
  2. Records need to be partitioned by eventType.
  3. For each eventType, records should be written to a corresponding BigQuery table, one table per eventType.
  4. The set of event types present varies from one batch of input files to the next.

I'm thinking of applying transforms such as `GroupByKey` and `Partition`; however, it seems that I have to know the number (and types) of events at development time in order to determine the partitions.
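For illustration, the static version I have in mind looks roughly like the sketch below (Beam/Dataflow Java SDK). The bucket path, the hard-coded list of event types, the parsing `DoFn`, and the table names are all placeholders; the point is that `Partition` forces me to enumerate the event types when I construct the pipeline.

```java
import java.util.Arrays;
import java.util.List;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class StaticPartitionPipeline {

  // Placeholder: the full set of event types has to be hard-coded here,
  // which is exactly the limitation I want to avoid.
  private static final List<String> KNOWN_EVENT_TYPES =
      Arrays.asList("click", "purchase", "signup");

  // Placeholder parser: each input line is an event record with an "eventType" field.
  static class ParseEventFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      TableRow row = new TableRow();
      // ...parse c.element() and populate row, including row.set("eventType", ...)
      c.output(row);
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<TableRow> events = p
        .apply("ReadFromGCS", TextIO.read().from("gs://my-bucket/input/*"))
        .apply("ParseEvents", ParDo.of(new ParseEventFn()));

    // Partition needs a fixed partition count at pipeline-construction time.
    // An eventType not in KNOWN_EVENT_TYPES would yield index -1 and fail at run time.
    PCollectionList<TableRow> byType = events.apply("SplitByEventType",
        Partition.of(KNOWN_EVENT_TYPES.size(),
            (TableRow row, int numPartitions) ->
                KNOWN_EVENT_TYPES.indexOf((String) row.get("eventType"))));

    // One BigQueryIO sink per pre-declared event type (placeholder schema/table names).
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("eventType").setType("STRING")));
    for (int i = 0; i < KNOWN_EVENT_TYPES.size(); i++) {
      byType.get(i).apply("Write_" + KNOWN_EVENT_TYPES.get(i),
          BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events_" + KNOWN_EVENT_TYPES.get(i))
              .withSchema(schema)
              .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    }

    p.run();
  }
}
```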

Does anyone have a good way to do the partitioning dynamically, i.e. so that the partitions can be determined at run time?

