performance - Fastest way to filter a data.frame list column contents in R / Rcpp


I have a data.frame:

df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", "c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")

with a list column (each element is a character vector):

> str(df)
'data.frame':   3 obs. of  2 variables:
 $ id  : int  1 2 3
 $ vars:List of 3
  ..$ : chr "a"
  ..$ : chr  "a" "b" "c"
  ..$ : chr  "b" "c"

I want to filter the data.frame according to setdiff(vars, remove_this):

library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))

Which gets me this:

> res
  id vars
1  1
2  2 b, c
3  3 b, c

But to drop the character(0) vars I have to do something like:

res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...

Actual datasets:

  • 560k rows and 3800k rows, which also have 10 more columns (to carry along).

(This is quite slow, which leads to the question...)

What is the fastest way to do this in R?

  • Is there a dplyr / data.table / other faster method?
  • How would this be done with Rcpp?

Update/extension:

  • Can the column modification be done in place rather than by copying the lapply(vars, setdiff(... result?

  • What's the most efficient way to filter out vars == character(0) if it must be a separate step?

Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:

library(data.table)
dt = as.data.table(df)  # or use setDT to convert in place

dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
#   id  vars newcol
#1:  2 a,b,c    b,c
#2:  3   b,c    b,c

You can also delete the original column (at basically zero cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
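As a rough sketch of the overwrite-then-filter variant described above: the snippet below rebuilds the example data, replaces vars by reference, and then drops the empty elements. Using lengths() (base R) instead of sapply(vars, length) is my own substitution, not part of the original answer, and the name res is only illustrative.

```r
library(data.table)

# Rebuild the example data with a list column.
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)

# Overwrite the list column by reference (no copy of the other columns).
dt[, vars := lapply(vars, setdiff, "a")]

# Drop rows whose element is now character(0).
# lengths() is an assumed replacement for sapply(vars, length).
res <- dt[lengths(vars) > 0L]
print(res)
```

Note that dt itself still has all three rows after this; only res is filtered.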


Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
#   id vars
#1:  2  b,c
#2:  3  b,c
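To make the idea behind that one-liner easier to follow, here is a minimal base R sketch of the same three steps: flatten to one row per (id, value) pair, filter all values in a single vectorised pass, then regroup. It is only an illustration of the approach, not the data.table code itself, and the names long_id, long_val, keep and regrouped are made up for this example.

```r
# Rebuild the example data with a list column.
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))

# Step 1: flatten, repeating each id once per element of its vector.
long_id  <- rep(df$id, lengths(df$vars))
long_val <- unlist(df$vars, use.names = FALSE)

# Step 2: one vectorised filter over every value at once.
keep <- !(long_val %in% "a")

# Step 3: regroup by id. Ids with no surviving values never appear,
# so the character(0) rows are dropped without a separate step.
regrouped <- split(long_val[keep], long_id[keep])

res <- data.frame(id = as.integer(names(regrouped)))
res$vars <- unname(regrouped)
print(res)
```

The filter runs once over a flat vector rather than once per row, which is where the speed-up is expected to come from.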

To carry along the other columns, I think it's easiest to simply merge back:

dt[, othercol := 5:7]

# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
#   id vars i.vars othercol
#1:  2  b,c  a,b,c        6
#2:  3  b,c    b,c        7
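A possible variation on the merge-back, sketched under the assumption that you would rather name the join column explicitly with on = "id" than rely on the key set by keyby. Selecting only the carried-along columns before joining also avoids the duplicate i.vars column. The names filtered, others and res are hypothetical.

```r
library(data.table)

# Rebuild the example data, including one extra column to carry along.
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)
dt[, othercol := 5:7]

# Flatten, filter and regroup, as in the approach above.
filtered <- dt[, .(v = unlist(vars)), by = id][
  v != "a", .(vars = list(v)), by = id]

# Keep only the columns to carry along, then join them onto the
# filtered ids. Rows follow the order of `filtered`.
others <- dt[, .(id, othercol)]
res <- others[filtered, on = "id"]
print(res)
```

With the real data, others would hold the ten extra columns instead of just othercol.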
