Writing to BigQuery from Dataflow - JSON files are not deleted when a job finishes


One of our Dataflow jobs writes its output to BigQuery. From my understanding of how this is implemented under the hood, Dataflow writes the results (sharded) in JSON format to GCS, and then kicks off a BigQuery load job to import that data.
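For context, here is a minimal sketch of roughly what such a pipeline looks like with the Dataflow Java SDK 1.x; the bucket, dataset, table name, and schema below are placeholders for illustration, not our actual job:

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    import java.util.Arrays;

    public class WriteToBigQueryExample {
      public static void main(String[] args) {
        // --stagingLocation points at a GCS bucket; this is also where the
        // sharded JSON files produced for the BigQuery load job end up.
        DataflowPipelineOptions options = PipelineOptionsFactory
            .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        Pipeline p = Pipeline.create(options);

        // Placeholder schema: a single STRING column.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("line").setType("STRING")));

        p.apply(TextIO.Read.from("gs://<bucket>/input/*"))
         .apply(ParDo.of(new DoFn<String, TableRow>() {
           @Override
           public void processElement(ProcessContext c) {
             c.output(new TableRow().set("line", c.element()));
           }
         }))
         .apply(BigQueryIO.Write
             .to("<project_id>:<dataset>.<table>")
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }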

However, we've noticed that the JSON files are not deleted after the job finishes, regardless of whether it succeeds or fails. There is no warning or suggestion in the error message that the files will not be deleted. When we noticed this, we had a look at our bucket and it contained hundreds of large JSON files from failed jobs (mostly during development).

I would have thought that Dataflow should handle the cleanup even if the job fails, and that when it succeeds the files should definitely be deleted. Leaving these files around after the job has finished incurs significant storage costs!

Is this a bug?

Example job ID of a job that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089


Because this is still happening, we decided we'll clean up ourselves after the pipeline has finished executing. We run the following command to delete everything that is not a JAR or ZIP:

gsutil ls -p <project_id> gs://<bucket> | grep -vE '\.(zip|jar)$' | xargs -n 1 gsutil -m rm -r
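For completeness, here is a sketch of how that cleanup could be triggered from the driver program once the pipeline has finished (this assumes a blocking runner, so the cleanup only runs after the job completes); the bucket name is a placeholder, and the command is simply shelled out via ProcessBuilder:

    import java.io.IOException;

    public class CleanupAfterRun {
      // Placeholder bucket used for illustration.
      private static final String BUCKET = "gs://<bucket>";

      public static void main(String[] args) throws IOException, InterruptedException {
        // ... build the pipeline and call p.run() here; with a blocking runner
        // this returns only once the Dataflow job has finished ...

        // Remove everything in the bucket except the staged .jar and .zip files.
        ProcessBuilder pb = new ProcessBuilder(
            "bash", "-c",
            "gsutil ls " + BUCKET + " | grep -vE '\\.(zip|jar)$' | xargs -n 1 gsutil -m rm -r");
        pb.inheritIO();
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
          throw new IOException("Cleanup command exited with code " + exitCode);
        }
      }
    }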
