apache pig - What's the effective way to count rows in Pig?

In Pig, what is an effective way to count rows? I can use GROUP ALL, but that uses only 1 reducer. When the data size is very large, say n terabytes, can I use multiple reducers somehow?

  datacount = FOREACH (GROUP data ALL) GENERATE
      'count' AS metric,
      COUNT(data) AS value;

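Instead of using GROUP ALL directly, you can divide the job into 2 steps. First, group by some field and count the number of rows in each group. Then, perform a GROUP ALL and sum these counts. This way, the first step can count rows in parallel, and only the small set of per-group counts reaches the single reducer.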

Note, however, that if the field you use in the first GROUP BY has no duplicates, every resulting count will be 1 and there won't be any difference. Try using a field that has many duplicates to improve performance.

See this example:

  a;1
  a;2
  b;3
  b;4
  b;5

If you first group by the first field, which has duplicates, the final count will deal with 2 rows instead of 5:

  a = LOAD 'data' USING PigStorage(';');
  b = GROUP a BY $0;
  c = FOREACH b GENERATE COUNT(a);
  DUMP c;
  (2)
  (3)
  d = GROUP c ALL;
  e = FOREACH d GENERATE SUM(c.$0);
  DUMP e;
  (5)

However, if you group by the second field, which is unique, it will deal with 5 rows:

  a = LOAD 'data' USING PigStorage(';');
  b = GROUP a BY $1;
  c = FOREACH b GENERATE COUNT(a);
  DUMP c;
  (1)
  (1)
  (1)
  (1)
  (1)
  d = GROUP c ALL;
  e = FOREACH d GENERATE SUM(c.$0);
  DUMP e;
  (5)
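To see why the choice of grouping field matters, the two steps can be imitated outside Pig. The following is only a plain-Python sketch of the idea, not Pig code: the function name `two_step_count` is hypothetical, and the rows are the five sample records from the example. It returns how many partial counts the final GROUP ALL step would have to sum, along with the total.

```python
from collections import Counter

# The five sample rows from the example: (first field, second field).
rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]


def two_step_count(rows, key_index):
    """Count rows by counting per group first, then summing the partial counts.

    Returns (number of partial counts, total row count).
    """
    # Step 1: one count per distinct key. In Pig this is the GROUP BY stage,
    # which is the part that can be spread over several reducers.
    partial = Counter(row[key_index] for row in rows)
    # Step 2: sum the partial counts. In Pig this is the GROUP ALL stage,
    # which sees one record per distinct key rather than one per input row.
    return len(partial), sum(partial.values())


print(two_step_count(rows, 0))  # (2, 5): two partial counts to sum
print(two_step_count(rows, 1))  # (5, 5): five partial counts, no reduction
```

Grouping on the first field leaves only 2 records for the final step, while grouping on the unique second field leaves all 5, which matches the two Pig runs.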
