hadoop - Migrate data to new data format for data already in HDFS
The process and tools for bringing CSV data from an external source into HDFS and storing it in a particular format are well known; however, how do you convert data formats for data that already exists in HDFS?
I am working with an existing data set (multiple TB) on HDFS, stored as uncompressed JSON. How can I convert that data to, say, Parquet, on the same cluster, while minimizing the use of cluster resources?
Options:
- Temporarily stand up a cluster of the same size, move the data over while converting it, then move the data back?
- Temporarily supplement the existing cluster with additional nodes? How do I ensure they are used only for the migration?
- ??
Thanks,
Matt
You can write Java code to convert the existing CSV files to Parquet using the ParquetOutputFormat class. Here is a Parquet-based implementation.
The code looks like this:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;

public class CsvToParquet {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("CSV to Parquet");
        job.setJarByClass(Mapper.class);

        // Identity mapper/reducer shown here; in practice, plug in your own classes
        // that parse the CSV lines and emit records matching your Parquet schema.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Read plain text (CSV) and write Parquet. Note that ParquetOutputFormat
        // also needs a WriteSupport implementation and a schema configured on the
        // job before it can write records.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(ParquetOutputFormat.class);

        TextInputFormat.addInputPath(job, new Path("/csv"));
        ParquetOutputFormat.setOutputPath(job, new Path("/parquet"));

        job.waitForCompletion(true);
    }
}
Here /csv is the HDFS path to the CSV file and /parquet is the HDFS path for the new Parquet file.
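Since the data in the question is actually uncompressed JSON rather than CSV, another route, if Spark is available on the cluster, is to read the JSON files and write them back out as Parquet in place. Below is a minimal sketch under that assumption; the /json and /parquet paths and the JsonToParquet class name are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json to parquet")
                .getOrCreate();

        // Read the existing uncompressed JSON data set from HDFS (placeholder path).
        Dataset<Row> data = spark.read().json("/json");

        // Write the same data back to HDFS as Parquet (placeholder path).
        data.write().mode(SaveMode.Overwrite).parquet("/parquet");

        spark.stop();
    }
}

This keeps the conversion on the same cluster and lets Spark infer the schema from the JSON; the main cost is holding both copies of the data in HDFS until the JSON originals are deleted.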