Amazon Web Services - What controls the number of partitions when reading Parquet files?

My setup:

Two Spark clusters: one on EC2 and one on Amazon EMR. Both run Spark 1.3.1.

The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.

The code:

Read a folder containing 12 Parquet files and count the number of partitions:

val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length

Observations:

On EC2, this code gives me 12 partitions (one per file, which makes sense).
On EMR, this code gives me 138 (!) partitions.

Question:

What controls the number of partitions when reading Parquet files?

I read the exact same folder on S3, with the exact same Spark release. This leads me to believe that there might be some configuration settings that control how partitioning happens. Does anyone have more info on this?

Any insights would be appreciated.

Thanks.

Update:

It seems that the many partitions are created by EMR's S3 file system implementation (com.amazon.ws.emr.hadoop.fs.EmrFileSystem).

When removing

<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>

from core-site.xml (thereby reverting to Hadoop's S3 filesystem), I end up with 12 partitions.

When running with EmrFileSystem, it seems that the number of partitions can be controlled with:

<property><name>fs.s3n.block.size</name><value>xxx</value></property>
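To see why a block size setting could change the partition count, here is a minimal sketch of the usual Hadoop-style split arithmetic: each file is cut into pieces of roughly one block. This is a simplified model and not EmrFileSystem's actual logic. It ignores Parquet row-group boundaries, the minimum and maximum split size settings, and Hadoop's split slop factor, so real counts can differ. The names `SplitEstimate` and `estimateSplits` are hypothetical, and the file sizes in the usage note are made up for illustration.

```scala
object SplitEstimate {
  // Simplified model (an assumption, not the real input-format code):
  // one split per block-sized chunk, and at least one split per file.
  def estimateSplits(fileSizes: Seq[Long], blockSize: Long): Long = {
    require(blockSize > 0, "blockSize must be positive")
    fileSizes.map { size =>
      // Ceiling division: number of block-sized chunks needed to cover the file.
      math.max(1L, (size + blockSize - 1) / blockSize)
    }.sum
  }
}
```

For example, under this model 12 files of 100 MB each give 12 splits with a 1 GB block size and 24 splits with a 64 MB block size, so a filesystem that reports a smaller block size produces more partitions for the same folder.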

Could there be a cleaner way of controlling the number of partitions when using EmrFileSystem?
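One possible approach, offered as a sketch and not as a confirmed answer: leave the filesystem settings alone and adjust the partition count after the read with `coalesce` on the underlying RDD. This assumes the Spark 1.3.x API used in the code above (`SQLContext.parquetFile` and `DataFrame.rdd`). The object name `ParquetPartitions` and the helper `readWithPartitions` are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

object ParquetPartitions {
  // Hypothetical helper: read a Parquet folder, then merge its partitions
  // down to at most `numPartitions` without a shuffle.
  def readWithPartitions(sqlContext: SQLContext,
                         path: String,
                         numPartitions: Int): RDD[Row] = {
    val logs = sqlContext.parquetFile(path)
    // coalesce without shuffle can only reduce the partition count;
    // use coalesce(numPartitions, shuffle = true) to increase it.
    logs.rdd.coalesce(numPartitions)
  }
}
```

In a spark-shell session this could be called as `ParquetPartitions.readWithPartitions(sqlContext, "s3n://mylogs/", 12)`. The read itself still produces whatever splits the filesystem dictates; the coalesce only groups them afterwards, so it does not change how the files are scanned.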
