hadoop - Migrate data to new data format for data already in HDFS
The process and tools for bringing CSV data from an external source into HDFS and storing it in a particular format are well known; however, how do you convert data formats for data that already exists in HDFS?
I am working with an existing data set (multiple TB) on HDFS, stored as uncompressed JSON. How can I convert that data to, say, Parquet, on the same cluster, while minimizing the use of cluster resources?
Options:
- Temporarily stand up a cluster of the same size, move the data over while converting it, then move the data back?
- Temporarily supplement the existing cluster with additional nodes? How do I ensure they are used only for the migration?
- ??
Thanks,
Matt
You can write Java code to convert the existing CSV files to Parquet using the ParquetOutputFormat class. Here is a Parquet-based implementation.
The code looks like this:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;

public class CsvToParquet {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("CSV to Parquet");
        job.setJarByClass(Mapper.class);

        // Identity mapper/reducer shown here; in practice, plug in your own classes
        // that parse the CSV lines and emit records matching your Parquet schema.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Read plain text (CSV) and write Parquet. Note that ParquetOutputFormat
        // also needs a WriteSupport implementation and a schema configured on the
        // job before it can write records.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(ParquetOutputFormat.class);

        TextInputFormat.addInputPath(job, new Path("/csv"));
        ParquetOutputFormat.setOutputPath(job, new Path("/parquet"));

        job.waitForCompletion(true);
    }
}
Here /csv is the HDFS path to the CSV file and /parquet is the HDFS path for the new Parquet file.
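Since the data in the question is actually uncompressed JSON rather than CSV, another route, if Spark is available on the cluster, is to read the JSON files and write them back out as Parquet in place. Below is a minimal sketch under that assumption; the /json and /parquet paths and the JsonToParquet class name are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json to parquet")
                .getOrCreate();

        // Read the existing uncompressed JSON data set from HDFS (placeholder path).
        Dataset<Row> data = spark.read().json("/json");

        // Write the same data back to HDFS as Parquet (placeholder path).
        data.write().mode(SaveMode.Overwrite).parquet("/parquet");

        spark.stop();
    }
}

This keeps the conversion on the same cluster and lets Spark infer the schema from the JSON; the main cost is holding both copies of the data in HDFS until the JSON originals are deleted.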