r - data.table merge - Result is larger than the input data tables
I have two data tables, df1 and ref_df.
Their internal structures are as follows:
> str(df1)
Classes ‘data.table’ and 'data.frame': 10153986 obs. of 18 variables:
 $ chr_no        : chr "1" "1" "1" "1" ...
 $ pos           : int 238 324 340 353 355 357 380 420 435 571 ...
 $ ref           : chr "c" "a" "g" "t" ...
 $ id            : logi NA NA NA NA NA NA ...
 $ alt           : chr NA NA NA NA ...
 $ af            : num NA NA NA NA 0.807 NA NA 0.877 NA 0.868 ...
 $ cases_hom     : int NA NA NA NA 50 NA NA 58 NA 59 ...
 $ cases_het     : int NA NA NA NA 15 NA NA 7 NA 6 ...
 $ cases_count   : int NA NA NA NA 115 NA NA 123 NA 124 ...
 $ controls_hom  : int NA NA NA NA 48 NA NA 55 NA 56 ...
 $ controls_het  : int NA NA NA NA 13 NA NA 6 NA 5 ...
 $ controls_count: int NA NA NA NA 109 NA NA 116 NA 117 ...
 $ cc_trend      : num NA NA NA NA 0.812 ...
 $ cc_geno       : num NA NA NA NA NA NA NA NA NA NA ...
 $ cc_all        : num NA NA NA NA 0.492 ...
 $ cc_dom        : num NA NA NA NA 0.491 ...
 $ cc_rec        : num NA NA NA NA 1 NA NA 1 NA 1 ...
 $ cmh_p_val     : num 0.9267 0.0672 0.0279 0.3939 0.2522 ...
 - attr(*, ".internal.selfref")=<externalptr>

> str(ref_df)
Classes ‘data.table’ and 'data.frame': 9915916 obs. of 5 variables:
 $ chr_no  : chr "10" "10" "10" "10" ...
 $ pos     : int 86 126 148 208 232 396 413 413 454 1173 ...
 $ snp_name: chr "rs459413697" "rs446265986" "rs460495236" "rs437891922" ...
 $ ref     : chr "g" "g" "t" "g" ...
 $ alt     : chr "c,t" "a,t" "c,g" "t" ...
 - attr(*, ".internal.selfref")=<externalptr>
I perform a left outer join (all.x = TRUE):
result_data <- merge(x = df1, y = ref_df, all.x = TRUE, by = c("chr_no", "pos", "ref"), suffixes = c(".study", ".ref"))
The resulting data table has more rows than df1, even though neither input contains duplicated rows:
> dim(result_data)
[1] 10154765 20
> sum(duplicated(df1))
[1] 0
> sum(duplicated(ref_df))
[1] 0
So I am not sure what is happening. I have looked at https://github.com/rdatatable/data.table/issues/508 and I am using the latest data.table, 1.9.5.
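Try:
> sum(duplicated(df1[, c("chr_no", "pos", "ref"), with=FALSE]))
This gives the total number of duplicates based on the joining keys only. Calling duplicated() on the whole table flags only rows that are identical in every column, so it can return 0 even when a key combination repeats. A left join returns one row for every matching pair, so a key in df1 that matches several rows in ref_df appears several times in the result.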
> table(duplicated(df1[, c("chr_no", "pos", "ref"), with=FALSE]))
This gives the number of duplicated and non-duplicated records based on the joining keys.
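To see which key combinations repeat, rather than only how many, one option is to count rows per key and keep the groups with more than one row. The sketch below uses a made-up table named `toy` (its name and values are assumptions for illustration, not data from the question):

```r
library(data.table)

# Hypothetical toy table; the values are invented for illustration only
toy <- data.table(
  chr_no   = c("10", "10", "10"),
  pos      = c(413L, 413L, 454L),
  ref      = c("g", "g", "t"),
  snp_name = c("rsA", "rsB", "rsC")
)
keys <- c("chr_no", "pos", "ref")

# Count rows per key combination, then keep combinations seen more than once
dup_keys <- toy[, .N, by = keys][N > 1]
print(dup_keys)
```

On the real data the same expression would be applied to df1 or ref_df in place of `toy`.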
Similarly, for the other data table:
> table(duplicated(ref_df[, c("chr_no", "pos", "ref"), with=FALSE]))
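The effect can be reproduced on a small scale. The following is a minimal sketch with invented tables `x` and `y` standing in for df1 and ref_df (names and values are assumptions). It shows that `y` has no fully duplicated rows, yet a repeated key makes the left join larger than `x`. The last step, keeping one row per key with `unique(..., by = )`, is only one possible remedy and discards the extra reference rows, so whether it is appropriate depends on the data.

```r
library(data.table)

# Hypothetical stand-ins for df1 (x) and ref_df (y); values are made up
x <- data.table(chr_no = c("1", "1", "2"),
                pos    = c(238L, 324L, 86L),
                ref    = c("c", "a", "g"),
                af     = c(0.1, 0.2, 0.3))
y <- data.table(chr_no   = c("1", "1", "2"),
                pos      = c(238L, 238L, 86L),
                ref      = c("c", "c", "g"),
                snp_name = c("rsA", "rsB", "rsC"),
                alt      = c("t", "g", "a"))
keys <- c("chr_no", "pos", "ref")

# Whole rows are all distinct, but one key combination occurs twice in y
n_dup_rows <- sum(duplicated(y))             # 0
n_dup_keys <- sum(duplicated(y, by = keys))  # 1

# Left join: the x row with key (1, 238, c) matches two y rows
res <- merge(x, y, by = keys, all.x = TRUE)
nrow(res)                                    # 4, one more than nrow(x)

# One option: keep the first y row per key before merging
y_unique <- unique(y, by = keys)
res_unique <- merge(x, y_unique, by = keys, all.x = TRUE)
nrow(res_unique)                             # 3, same as nrow(x)
```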