Cassandra: low cardinality partition


Let's say I have a table like this:

create table users (
  user uuid,
  seq int,
  group text,
  time bigint,
  primary key ((user), seq)
);

This follows the desired pattern for Cassandra, with good distribution across partitions (assuming the default Murmur3 hash partitioner).

However, I also need to (rarely) perform range queries in time order. That doesn't seem to be possible in Cassandra. In reality I need to access the data by group, so (group, time) would be acceptable. Since there doesn't seem to be a way to have a secondary index with multiple columns, I guess the right thing to do is to denormalize, like this:

create table usersbygrouptime (
  user uuid,
  seq int,
  group text,
  time bigint,
  primary key ((group), time)
) with clustering order by (time asc);

This works entirely as it should, except that group has low cardinality, let's say ('a','b','c'), with an uneven distribution across users. Since queries on this table are rare, I'm not worried about hot nodes, but I am worried about uneven data distribution, perhaps with a single node getting it all.

Is this a common scenario, and is there a way to mitigate it, or are there alternative solutions?

One technique to avoid hot spots in Cassandra time series models is to make use of a "time bucket." Determine a "happy medium" level of time precision that provides adequate data distribution, while also being known ahead of time and semi-convenient to query by.

For the purposes of this example, I'll choose year and month ("yyyymm"). Note: I have no idea if year and month will work for you... it's just an example. Once you determine your time bucket, add it as an additional partition key, like this:

create table usersbygrouptime (
  user uuid,
  seq int,
  group text,
  time timeuuid,
  yearmonth bigint,
  primary key ((group, yearmonth), time)
) with clustering order by (time desc);

After inserting some rows, queries like this will work:

aploetz@cqlsh:stackoverflow2> select group, yearmonth, dateof(time), time, seq, user from usersbygrouptime where group='b' and yearmonth=201505;

 group | yearmonth | dateof(time)             | time                                 | seq | user
-------+-----------+--------------------------+--------------------------------------+-----+--------------------------------------
     b |    201505 | 2015-05-16 10:04:10-0500 | ceda56f0-fbdc-11e4-bd43-21b264d4c94d |   1 | d57ba8a4-db24-440c-a983-b1dd6b0d2e27
     b |    201505 | 2015-05-16 10:04:09-0500 | ce1cac40-fbdc-11e4-bd43-21b264d4c94d |   1 | 66d07cbb-a2ff-4d56-8fa1-14dfaf684474
     b |    201505 | 2015-05-16 10:04:08-0500 | cd525760-fbdc-11e4-bd43-21b264d4c94d |   1 | 07b589ac-4d5f-401e-a34f-e3479e269e01
     b |    201505 | 2015-05-16 10:04:06-0500 | cc76c470-fbdc-11e4-bd43-21b264d4c94d |   1 | 984f85b5-ea58-4cf8-b512-43abacb227c9

(4 rows)
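Because yearmonth is now part of the partition key, the client has to supply it on every read and write. The following is a minimal Python sketch of that client-side step, assuming the "yyyymm" bucket from the example; the helper names year_month_bucket and buckets_between are hypothetical and not part of any Cassandra driver.

```python
from datetime import datetime, timezone


def year_month_bucket(ts: datetime) -> int:
    """Return the 'yyyymm' bucket for a timestamp as an integer, e.g. 201505."""
    return ts.year * 100 + ts.month


def buckets_between(start: datetime, end: datetime) -> list[int]:
    """List every yyyymm bucket touched by the inclusive range [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        # Advance one calendar month, rolling over into the next year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


start = datetime(2015, 3, 20, tzinfo=timezone.utc)
end = datetime(2015, 5, 16, tzinfo=timezone.utc)
print(year_month_bucket(end))       # 201505
print(buckets_between(start, end))  # [201503, 201504, 201505]
```

A time range that spans several months would then be read by running one query per (group, yearmonth) pair and concatenating the results in bucket order. Within each bucket the time clustering column can still be restricted, for example with the CQL minTimeuuid and maxTimeuuid functions.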

Now, that may or may not work for you query-wise, and you will need to spend some time ensuring that you pick an appropriate time bucket. But in terms of data distribution in the ring, you can see the effect with the token function:

aploetz@cqlsh:stackoverflow2> select group, yearmonth, token(group,yearmonth) from usersbygrouptime;

 group | yearmonth | token(group, yearmonth)
-------+-----------+-------------------------
     a |    201503 |    -3784784210711042553
     a |    201504 |     -610775546464185720
     b |    201505 |     6232834565276653514
     b |    201505 |     6232834565276653514
     b |    201505 |     6232834565276653514
     b |    201505 |     6232834565276653514
     a |    201505 |     8281745497436252453
     a |    201505 |     8281745497436252453
     a |    201505 |     8281745497436252453
     a |    201505 |     8281745497436252453
     a |    201505 |     8281745497436252453
     a |    201505 |     8281745497436252453

(12 rows)

Notice how a different token is generated for each group/yearmonth pair, even though several of them have the same group ("a").
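To see why the extra partition key component helps, here is a rough Python sketch that counts how many distinct partitions each key design produces and how many nodes they could land on. It is only an illustration under loud assumptions: MD5 stands in for the Murmur3 partitioner, a simple modulo stands in for token ranges, and the 12-node cluster and the node_for helper are hypothetical.

```python
import hashlib
from itertools import product


def node_for(partition_key: tuple, node_count: int) -> int:
    """Map a partition key to a node index using a stand-in hash (MD5, not Murmur3)."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    # Modulo placement is a simplification of Cassandra's token ranges.
    return int.from_bytes(digest[:8], "big") % node_count


groups = ["a", "b", "c"]
months = [201501 + i for i in range(12)]  # 201501 through 201512

by_group = {(g,) for g in groups}              # partition key: (group)
by_group_month = set(product(groups, months))  # partition key: (group, yearmonth)

nodes = 12  # hypothetical cluster size
for label, keys in (("group", by_group), ("group, yearmonth", by_group_month)):
    used = {node_for(k, nodes) for k in keys}
    print(f"({label}): {len(keys)} partitions on {len(used)} of {nodes} nodes")
```

With only three groups there are only three partitions, so at most three nodes (before replication) can ever own the data. Adding the bucket multiplies the number of partitions, which gives the partitioner more keys to spread around the ring.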

