amazon web services - What controls the number of partitions when reading Parquet files? -
my setup:
two spark clusters. 1 on ec2 , 1 on amazon emr. both spark 1.3.1.
the emr cluster installed emr-bootstrap-actions. ec2 cluster installed spark's default ec2 scripts.
the code:
read folder containing 12 parquet files , count number of partitions
val logs = sqlcontext.parquetfile(“s3n://mylogs/”) logs.rdd.partitions.length observations:
- on ec2 code gives me 12 partitions (one per file, makes sense).
- on emr code gives me 138 (!) partitions.
question:
what controls number of partitions when reading parquet files?
i read exact same folder on s3, exact same spark release. leads me believe there might configuration settings control how partitioning happens. have more info on this?
insights appreciated.
thanks.
update:
it seems many partitions created emr's s3 file system implementation (com.amazon.ws.emr.hadoop.fs.emrfilesystem).
when removing
<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.emrfilesystem</value></property> from core-site.xml (hereby reverting hadoop's s3 filesystem), end 12 partitions.
when running emrfilesystem, seems number of partitions can controlled with:
<property><name>fs.s3n.block.size</name><value>xxx</value></property> could there cleaner way of controlling # of partitions when using emrfilesystem?
Comments
Post a Comment