Is there an "Explain RDD" in Spark?
In particular, if I say
rdd3 = rdd1.join(rdd2)
then when I call rdd3.collect, depending on the Partitioner used, either data is moved between node partitions, or the join is done locally on each partition (or, for all I know, something else entirely). This depends on what the RDD paper calls "narrow" and "wide" dependencies, but who knows how good the optimizer is in practice.
Anyway, I can kind of glean from the trace output which thing actually happened, but it would be nice to be able to call rdd3.explain.
Does such a thing exist?
I think toDebugString will appease your curiosity.
scala> val data = sc.parallelize(List((1,2)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21

scala> val joinedData = data join data
joinedData: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[11] at join at <console>:23

scala> joinedData.toDebugString
res4: String =
(8) MapPartitionsRDD[11] at join at <console>:23 []
 |  MapPartitionsRDD[10] at join at <console>:23 []
 |  CoGroupedRDD[9] at join at <console>:23 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []

Each level of indentation is a stage, so this should run as 2 stages.
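To see how the Partitioner changes the picture, you can compare the lineage of a plain join with one whose inputs already share a partitioner. This is only a minimal sketch: the object name, the sample data, the local[2] master and the choice of 4 partitions are all made up for illustration, and the exact text toDebugString prints varies between Spark versions.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object, not part of Spark.
object JoinLineageSketch {
  // Join two small pair RDDs without giving Spark any partitioning information.
  def plainJoin(sc: SparkContext): RDD[(Int, (String, String))] = {
    val left = sc.parallelize(List((1, "a"), (2, "b")))
    val right = sc.parallelize(List((1, "x"), (2, "y")))
    left.join(right)
  }

  // Partition both inputs with the same partitioner, then join with it.
  def coPartitionedJoin(sc: SparkContext): RDD[(Int, (String, String))] = {
    val partitioner = new HashPartitioner(4) // 4 is an arbitrary choice
    val left = sc.parallelize(List((1, "a"), (2, "b"))).partitionBy(partitioner)
    val right = sc.parallelize(List((1, "x"), (2, "y"))).partitionBy(partitioner)
    left.join(right, partitioner)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder settings for a local run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("join-lineage-sketch")
    val sc = new SparkContext(conf)
    try {
      println(plainJoin(sc).toDebugString)
      println(coPartitionedJoin(sc).toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

In the second lineage the shuffles belong to the partitionBy calls, and the join itself should not need a further one, since both parents already use the partitioner the join asks for. Whether that is a win depends on how often the partitioned RDDs get reused.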
Also, the optimizer is fairly decent, but I would suggest using DataFrames if you are on 1.3+, as the optimizer there is even better in many cases :)
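DataFrames do have the explain call the question asks for. Here is a rough sketch of what that looks like for a join. It assumes Spark 2.0 or later, where SparkSession is the entry point (on 1.3 to 1.6 a SQLContext plays that role); the object name, column names and data are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object, not part of Spark.
object DataFrameExplainSketch {
  // Build two one-row DataFrames and join them on column k.
  def joinedFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val left = Seq((1, 2)).toDF("k", "a")
    val right = Seq((1, 3)).toDF("k", "b")
    left.join(right, left("k") === right("k"))
  }

  def main(args: Array[String]): Unit = {
    // Placeholder settings for a local run.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("df-explain-sketch")
      .getOrCreate()
    try {
      // extended = true prints the logical plans as well as the physical plan.
      joinedFrame(spark).explain(true)
    } finally {
      spark.stop()
    }
  }
}
```

The physical plan in that output names the join strategy that was picked and shows where data is exchanged between partitions, which is roughly the information toDebugString gives you for an RDD.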