Google Cloud Dataflow - Partition data coming from CSV so I can process larger batches rather than individual lines
I'm getting started with Google Cloud Dataflow and have written a simple flow that reads a CSV file from Cloud Storage. One of the steps involves calling a web service to enrich the results. The web service in question performs much better when it is sent several hundred requests in bulk.

In looking at the API, I don't see a great way to aggregate 100 elements of a PCollection into a single ParDo execution. The results would then need to be split up again to handle the last step of the flow, which writes to a BigQuery table.

I'm not sure if I need to use windowing for this. Most of the windowing examples I've seen are geared more towards counting over a given time period.
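For reference, here is a minimal sketch of the kind of pipeline I have today (the Dataflow 1.x Java SDK; callService and ToTableRowFn are placeholders for my own code):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

p.apply(TextIO.Read.from("gs://my-bucket/input.csv"))
 // Today this makes one web service call per CSV line; this is the
 // part I would like to batch into groups of ~100 elements.
 .apply(ParDo.of(new DoFn<String, String>() {
     @Override
     public void processElement(ProcessContext c) {
       c.output(callService(c.element()));
     }
 }))
 // Convert each enriched result into a TableRow for BigQuery.
 .apply(ParDo.of(new ToTableRowFn()))
 .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table"));

p.run();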
You can buffer elements in a local member variable of your DoFn, call the web service whenever the buffer is large enough, and flush any remaining elements in finishBundle. For example:
import java.util.ArrayList;
import java.util.List;

import com.google.cloud.dataflow.sdk.transforms.DoFn;

class CallServiceFn extends DoFn<String, String> {
  private static final int MAX_CALL_SIZE = 100;
  private final List<String> elements = new ArrayList<>();

  @Override
  public void processElement(ProcessContext c) {
    elements.add(c.element());
    // Once the buffer is full, send the batch to the web service
    // and emit the enriched results.
    if (elements.size() >= MAX_CALL_SIZE) {
      for (String result : callServiceWithData(elements)) {
        c.output(result);
      }
      elements.clear();
    }
  }

  @Override
  public void finishBundle(Context c) {
    // Flush whatever is left over at the end of the bundle.
    for (String result : callServiceWithData(elements)) {
      c.output(result);
    }
    elements.clear();  // DoFn instances may be reused for later bundles
  }
}
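You would then apply this DoFn with ParDo in place of the per-element call; a sketch, assuming your input is a PCollection<String> named lines and callServiceWithData is your own bulk-call helper:

PCollection<String> enriched = lines.apply(ParDo.of(new CallServiceFn()));

Note that bundle sizes are chosen by the service, so finishBundle may flush batches smaller than MAX_CALL_SIZE; the web service call just needs to accept batches of any size up to the maximum.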