google cloud dataflow - PipeLine with multiple Transformations -
i trying understand lifecycle of transforms within pipeline.
i have pipleline several transforms.
pipeline p = pipeline.create(options); p.apply(textio.read.named("readlines").from(inputfile)) .apply(new readdata()) .apply(new match()) .apply(new record()) .apply(bigqueryio.write .to(tableref) .withschema(getschema()) .withcreatedisposition(bigqueryio.write.createdisposition.create_if_needed) .withwritedisposition(bigqueryio.write.writedisposition.write_truncate));
inside each of these transformations single dofn. entire batch node processing need complete before moving next transformation?
what observing @ least directpipelinerunner entire dataset read before match transformation run.
with directpipelinerunner, transforms executed entirely serially observed. when running dataflowpipelinerunner without --streaming set, many transforms can fused , run simultaneously. --streaming, data continually stream through entire pipeline, , transforms active.
Comments
Post a Comment