performance - Fastest way to filter a data.frame list column contents in R / Rcpp

I have a data.frame:

    df <- structure(list(id = 1:3,
                         vars = list("a", c("a", "b", "c"), c("b", "c"))),
                    .names = c("id", "vars"),
                    row.names = c(NA, -3L),
                    class = "data.frame")

with a list column (each element is a character vector):

    > str(df)
    'data.frame':   3 obs. of  2 variables:
     $ id  : int  1 2 3
     $ vars:List of 3
      ..$ : chr "a"
      ..$ : chr  "a" "b" "c"
      ..$ : chr  "b" "c"

I want to filter the data.frame according to setdiff(vars, remove_this):

    library(dplyr)
    library(tidyr)
    res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))

Which gets me this:

    > res
      id vars
    1  1
    2  2 b, c
    3  3 b, c

But to drop the character(0) vars I have to do something like:

    res %>% unnest(vars)  # and then the equivalent of nest(vars) again after...

Actual datasets:

560K rows and 3800K rows that also have 10 more columns (to carry along).

(This is quite slow, which leads to my question...)

What is the fastest way to do this in `R`?

Is there a dplyr / data.table / other faster method?
How would one do this with Rcpp?

Update/extension:

Can the column modification be done in place rather than by copying the lapply(vars, setdiff(... result?
What's the most efficient way to filter out vars == character(0) if it must be a separate step?

Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:

    library(data.table)
    dt = as.data.table(df)  # or use setDT to convert in place

    dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
    #   id  vars newcol
    #1:  2 a,b,c    b,c
    #2:  3   b,c    b,c

You can also delete the original column (at essentially zero cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
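As a minimal sketch of the overwrite-in-place variant combined with a separate filtering step, the following rebuilds the example data and uses base R's lengths() (available since R 3.2.0) instead of sapply(newcol, length). The name `res` is only illustrative, and no claim is made here about how this performs on the real datasets.

```r
library(data.table)

# Rebuild the example data from the question.
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)

# Overwrite the list column by reference; the other columns are not copied.
dt[, vars := lapply(vars, setdiff, "a")]

# Separate step: keep only rows whose element is not character(0).
# lengths() returns the length of each list element as an integer vector.
res <- dt[lengths(vars) > 0]
print(res)
```

Note that the filter dt[lengths(vars) > 0] returns a new data.table, so only the column update happens by reference, not the row removal.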

Now, as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
    #   id vars
    #1:  2  b,c
    #2:  3  b,c
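To make the chained call above easier to follow, here is a sketch that splits it into three steps: unnest to long format, filter, and regroup. The names `long`, `value`, and `res` are illustrative and not part of the original answer; `remove_this` is borrowed from the question.

```r
library(data.table)

# Rebuild the example data from the question.
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)

remove_this <- "a"

# Step 1: one row per (id, value) pair.
long <- dt[, .(value = unlist(vars)), by = id]

# Step 2: drop the unwanted values; ids left with no values vanish here,
# which is why no separate character(0) filter is needed.
long <- long[!value %in% remove_this]

# Step 3: collapse back to one row per id with a list column.
res <- long[, .(vars = list(value)), by = id]
print(res)
```

Because the filtering happens on a flat character column, `remove_this` can just as well be a vector of several values.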

To carry along the other columns, I think it's easiest to simply merge back:

    dt[, othercol := 5:7]

    # notice the keyby
    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    #   id vars i.vars othercol
    #1:  2  b,c  a,b,c        6
    #2:  3  b,c    b,c        7
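The merge above relies on keyby setting a key on the filtered result, so that the join with dt knows which column to match on. As an alternative sketch, the join column can be named explicitly with on = "id", and the leftover i.vars column dropped afterwards. The names `filtered` and `res` are illustrative.

```r
library(data.table)

# Rebuild the example data, including the extra column to carry along.
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)
dt[, othercol := 5:7]

# Unnest, filter, and regroup; no key is set because the join uses `on`.
filtered <- dt[, .(value = unlist(vars)), by = id][
  !value %in% "a", .(vars = list(value)), by = id]

# Inner join back to dt to pick up the remaining columns.
# nomatch = 0 drops ids that have no row in `filtered`.
res <- filtered[dt, on = "id", nomatch = 0]

# The original (unfiltered) list column arrives as i.vars; remove it.
res[, i.vars := NULL]
print(res)
```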
