apache pig - What's the effective way to count rows in Pig?


In Pig, what is an effective way to count rows? I can use GROUP ALL, but it is given only 1 reducer. When the data size is very large, say n terabytes, can I use multiple reducers somehow?

    dataCount = FOREACH (GROUP data ALL) GENERATE
        'count' AS metric,
        COUNT(data) AS value;

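Instead of using GROUP ALL directly, you could divide the work into two steps. First, group by some field and count the number of rows in each group. Then, perform a GROUP ALL to get the sum of these counts. This way, the first step can count rows in parallel, and only the small set of partial counts goes to the single reducer.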

Note, however, that if the field you use in the first GROUP BY does not have duplicates, the resulting counts will all be 1, so there won't be any difference. Try using a field that has many duplicates to improve performance.

See this example:

    a;1
    a;2
    b;3
    b;4
    b;5

If you first group by the first field, which has duplicates, the final count will deal with 2 rows instead of 5:

    A = LOAD 'data' USING PigStorage(';');
    B = GROUP A BY $0;
    C = FOREACH B GENERATE COUNT(A);
    DUMP C;
    (2)
    (3)
    D = GROUP C ALL;
    E = FOREACH D GENERATE SUM(C.$0);
    DUMP E;
    (5)

However, if you group by the second one, which is unique, the final count will deal with 5 rows:

    A = LOAD 'data' USING PigStorage(';');
    B = GROUP A BY $1;
    C = FOREACH B GENERATE COUNT(A);
    DUMP C;
    (1)
    (1)
    (1)
    (1)
    (1)
    D = GROUP C ALL;
    E = FOREACH D GENERATE SUM(C.$0);
    DUMP E;
    (5)
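The arithmetic behind the two-step count can be checked outside Pig. The following is only a rough sketch in plain Python, not Pig code: it simulates the per-group counts and the final sum on the five sample rows. The function name `two_step_count` is made up for illustration.

```python
from collections import Counter

# Sample rows from the example: (first field, second field).
rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]


def two_step_count(rows, key_index):
    """Return (number of partial counts, total row count) for a grouping field."""
    # Step 1: count rows per key. In Pig this is the GROUP BY step,
    # which is the part that can be spread across several reducers.
    partial = Counter(row[key_index] for row in rows)
    # Step 2: sum the partial counts. In Pig this is the GROUP ALL step,
    # which only has to handle one row per distinct key.
    return len(partial), sum(partial.values())


print(two_step_count(rows, 0))  # (2, 5): duplicated field, 2 partial counts
print(two_step_count(rows, 1))  # (5, 5): unique field, 5 partial counts
```

Both groupings give the same total of 5. What changes is how many partial counts the final single-reducer step has to sum, which is why a field with many duplicates is the better choice.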
