amazon web services - What controls the number of partitions when reading Parquet files? -


my setup:

two spark clusters. 1 on ec2 , 1 on amazon emr. both spark 1.3.1.

the emr cluster installed emr-bootstrap-actions. ec2 cluster installed spark's default ec2 scripts.

the code:

read folder containing 12 parquet files , count number of partitions

val logs = sqlcontext.parquetfile(“s3n://mylogs/”) logs.rdd.partitions.length 

observations:

  • on ec2 code gives me 12 partitions (one per file, makes sense).
  • on emr code gives me 138 (!) partitions.

question:

what controls number of partitions when reading parquet files?

i read exact same folder on s3, exact same spark release. leads me believe there might configuration settings control how partitioning happens. have more info on this?

insights appreciated.

thanks.

update:

it seems many partitions created emr's s3 file system implementation (com.amazon.ws.emr.hadoop.fs.emrfilesystem).

when removing

<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.emrfilesystem</value></property> 

from core-site.xml (hereby reverting hadoop's s3 filesystem), end 12 partitions.

when running emrfilesystem, seems number of partitions can controlled with:

<property><name>fs.s3n.block.size</name><value>xxx</value></property> 

could there cleaner way of controlling # of partitions when using emrfilesystem?


Comments

Popular posts from this blog

IF statement in MySQL trigger -

c++ - What does MSC in "// appease MSC" comments mean? -

javascript - Blogger related post gadget image Resize s72-c [ Need Expert Help ] -