r - data.table merge - Result is larger than the input data tables -


I have two data tables, df1 and ref_df.

Their internal structures (df1 first, then ref_df) are as follows:

Classes ‘data.table’ and 'data.frame':  10153986 obs. of  18 variables:
 $ chr_no        : chr  "1" "1" "1" "1" ...
 $ pos           : int  238 324 340 353 355 357 380 420 435 571 ...
 $ ref           : chr  "c" "a" "g" "t" ...
 $ id            : logi  NA NA NA NA NA NA ...
 $ alt           : chr  NA NA NA NA ...
 $ af            : num  NA NA NA NA 0.807 NA NA 0.877 NA 0.868 ...
 $ cases_hom     : int  NA NA NA NA 50 NA NA 58 NA 59 ...
 $ cases_het     : int  NA NA NA NA 15 NA NA 7 NA 6 ...
 $ cases_count   : int  NA NA NA NA 115 NA NA 123 NA 124 ...
 $ controls_hom  : int  NA NA NA NA 48 NA NA 55 NA 56 ...
 $ controls_het  : int  NA NA NA NA 13 NA NA 6 NA 5 ...
 $ controls_count: int  NA NA NA NA 109 NA NA 116 NA 117 ...
 $ cc_trend      : num  NA NA NA NA 0.812 ...
 $ cc_geno       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cc_all        : num  NA NA NA NA 0.492 ...
 $ cc_dom        : num  NA NA NA NA 0.491 ...
 $ cc_rec        : num  NA NA NA NA 1 NA NA 1 NA 1 ...
 $ cmh_p_val     : num  0.9267 0.0672 0.0279 0.3939 0.2522 ...
 - attr(*, ".internal.selfref")=<externalptr>

Classes ‘data.table’ and 'data.frame':  9915916 obs. of  5 variables:
 $ chr_no  : chr  "10" "10" "10" "10" ...
 $ pos     : int  86 126 148 208 232 396 413 413 454 1173 ...
 $ snp_name: chr  "rs459413697" "rs446265986" "rs460495236" "rs437891922" ...
 $ ref     : chr  "g" "g" "t" "g" ...
 $ alt     : chr  "c,t" "a,t" "c,g" "t" ...
 - attr(*, ".internal.selfref")=<externalptr>

I perform a left outer join with all.x = TRUE:

result_data <- merge(x = df1, y = ref_df, all.x = TRUE, by = c("chr_no", "pos", "ref"), suffixes = c(".study", ".ref"))

The resulting data table has 779 more rows than df1, even though neither input contains fully duplicated rows:

> dim(result_data)
[1] 10154765       20
> sum(duplicated(df1))
[1] 0
> sum(duplicated(ref_df))
[1] 0

So I am not sure what is happening. I have looked at https://github.com/rdatatable/data.table/issues/508 and I am using the latest data.table, 1.9.5.

Try:

> sum(duplicated(df1[, c("chr_no", "pos", "ref"), with=FALSE]))

This gives the total number of rows that are duplicates on the join keys alone, rather than on all columns.

> table(duplicated(df1[, c("chr_no", "pos", "ref"), with=FALSE]))

This gives the counts of duplicate and non-duplicate records based on the join keys.

Similarly, for the other data table:

> table(duplicated(ref_df[, c("chr_no", "pos", "ref"), with=FALSE]))
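To see why duplicated join keys matter here: a left join keeps every row of df1, and a df1 row that matches several ref_df rows is repeated once per match. The str() output for ref_df above shows pos 413 twice in a row, so repeated keys there seem plausible, although that alone does not prove the ref values match too. The following is a minimal sketch on toy data; the values, the snp_name labels, and the helper names keys and dup_keys are made up for illustration and are not taken from the real tables.

```r
library(data.table)

# Toy stand-ins for df1 and ref_df (hypothetical values, not the real data).
df1 <- data.table(chr_no = c("1", "1", "1"),
                  pos    = c(238L, 324L, 340L),
                  ref    = c("c", "a", "g"),
                  af     = c(NA, 0.807, 0.877))

# No row of ref_df is a full duplicate, but the key ("1", 324, "a")
# appears twice with different snp_name and alt values.
ref_df <- data.table(chr_no   = c("1", "1", "1"),
                     pos      = c(238L, 324L, 324L),
                     snp_name = c("rs_a", "rs_b", "rs_c"),
                     ref      = c("c", "a", "a"),
                     alt      = c("t", "g", "c"))

keys <- c("chr_no", "pos", "ref")

sum(duplicated(ref_df))                        # 0: no whole-row duplicates
sum(duplicated(ref_df[, keys, with = FALSE]))  # 1: one duplicate on the join keys

result_data <- merge(x = df1, y = ref_df, all.x = TRUE, by = keys,
                     suffixes = c(".study", ".ref"))
nrow(result_data)                              # 4: one more row than df1

# List the join keys that occur more than once in ref_df.
dup_keys <- ref_df[, .N, by = keys][N > 1]
dup_keys
```

If dup_keys is non-empty on the real ref_df, those are the keys responsible for the extra rows. Whether to collapse them, keep one entry per key, or add alt to the join columns depends on what the duplicated reference entries mean.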
