hadoop - Migrate data to new data format for data already in HDFS


The process and tools for bringing CSV data from an external source into HDFS and storing it in a particular format are well known; however, how do you convert data formats for data that already exists in HDFS?

I am working with an existing data set (multiple TB) on HDFS in uncompressed JSON format. How can I convert that data to, say, Parquet on the same cluster while minimizing the cluster resources used?

Options:

  • Temporarily get a second cluster of the same size, move the data onto it while converting, and then move the data back?
  • Temporarily supplement the existing cluster with additional nodes? How do I ensure they are used only for this migration?
  • ??

Thanks,

Matt

You can write Java code to convert an existing CSV file to Parquet using the ParquetOutputFormat class. Look here for the Parquet implementation.

The code would look like this:

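    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    // Older Parquet releases use the package parquet.hadoop instead.
    import org.apache.parquet.hadoop.ParquetOutputFormat;

    public class CsvToParquet {
        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {

            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJobName("CSV to Parquet");
            // Ship the jar that contains this driver class.
            job.setJarByClass(CsvToParquet.class);

            // Identity mapper and reducer; a real conversion needs its own mapper.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setNumReduceTasks(1);

            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Note: ParquetOutputFormat also needs a WriteSupport (for example via
            // ParquetOutputFormat.setWriteSupportClass), which is left out here.
            job.setOutputFormatClass(ParquetOutputFormat.class);
            job.setInputFormatClass(TextInputFormat.class);

            TextInputFormat.addInputPath(job, new Path("/csv"));
            ParquetOutputFormat.setOutputPath(job, new Path("/parquet"));

            job.waitForCompletion(true);
        }
    }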

/csv is the HDFS path to the CSV file, and /parquet is the HDFS path for the new Parquet file.
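One step the driver does not show is how a line of CSV text becomes a record with named columns. Here is a minimal sketch in plain Java, assuming a fixed column list and double-quoted fields with "" as the escape. `CsvLineParser` and the column names are hypothetical and not part of Hadoop or Parquet. In a real job, logic like this would probably sit inside a custom Mapper, and the resulting record would still have to be handed to a Parquet WriteSupport, which is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns one CSV line into a column-name -> value record. */
final class CsvLineParser {
    private final String[] columns;

    CsvLineParser(String... columns) {
        this.columns = columns.clone();
    }

    /** Splits a line on commas, honouring double-quoted fields and "" escapes. */
    static List<String> splitFields(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());  // last field has no trailing comma
        return fields;
    }

    /** Maps the fields of one line onto the configured column names, in order. */
    Map<String, String> parse(String line) {
        List<String> fields = splitFields(line);
        if (fields.size() != columns.length) {
            throw new IllegalArgumentException(
                "expected " + columns.length + " fields but found " + fields.size());
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            record.put(columns[i], fields.get(i));
        }
        return record;
    }
}
```

For example, `new CsvLineParser("id", "name").parse("1,alice")` gives a record with `id` mapped to `1` and `name` mapped to `alice`. Rejecting lines with the wrong field count is a design choice made here so that malformed rows surface early instead of ending up as misaligned columns in the output.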

source

