Apache Pig - What's the effective way to count rows in Pig?
In Pig, what is an effective way to count rows? I can use GROUP ALL, but it is given only 1 reducer. When the data size is very large, say n terabytes, can I use multiple reducers somehow?
datacount = FOREACH (GROUP data ALL) GENERATE 'count' AS metric, COUNT(data) AS value;
Instead of using GROUP ALL directly, you can divide the work into two steps. First, group by some field and count the number of rows in each group. Then perform a GROUP ALL to get the sum of these counts. This way, the rows are counted in parallel.
Note, however, that if the field you use in the first GROUP BY does not have duplicates, every resulting count will be 1 and there won't be any difference. Try using a field that has many duplicates to improve performance.
See this example, with the following input data:
a;1
a;2
b;3
b;4
b;5
If you first group by the first field, which has duplicates, the final count will deal with 2 rows instead of 5:
a = LOAD 'data' USING PigStorage(';');
b = GROUP a BY $0;
c = FOREACH b GENERATE COUNT(a);
DUMP c;
(2)
(3)
d = GROUP c ALL;
e = FOREACH d GENERATE SUM(c.$0);
DUMP e;
(5)
However, if you group by the second field, which is unique, the final count will deal with 5 rows:
a = LOAD 'data' USING PigStorage(';');
b = GROUP a BY $1;
c = FOREACH b GENERATE COUNT(a);
DUMP c;
(1)
(1)
(1)
(1)
(1)
d = GROUP c ALL;
e = FOREACH d GENERATE SUM(c.$0);
DUMP e;
(5)
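To actually spread the first step over several reducers, the grouping can carry Pig's PARALLEL clause. The following is only a minimal sketch under assumptions: the input path 'data' and the ';' delimiter are taken from the example above, the reducer count of 10 is an arbitrary illustrative value, and the field names cnt and total are made up for readability.

```pig
-- Assumed input: 'data', semicolon-delimited, as in the example above.
a = LOAD 'data' USING PigStorage(';');

-- Step 1: partial counts per key. PARALLEL asks for 10 reducers
-- (an arbitrary value chosen for illustration).
b = GROUP a BY $0 PARALLEL 10;
c = FOREACH b GENERATE COUNT(a) AS cnt;

-- Step 2: GROUP ALL over the partial counts, which are far fewer
-- than the input rows when the key has many duplicates, then sum them.
d = GROUP c ALL;
e = FOREACH d GENERATE SUM(c.cnt) AS total;
DUMP e;
```

A suitable reducer count depends on the cluster and on how many distinct values the grouping field has, so treat the number here as a placeholder to tune.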