performance - Fastest way to filter a data.frame list column contents in R / Rcpp
I have a data.frame:
df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", "c"))),
                .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")
with a list column (each element is a character vector):
> str(df)
'data.frame':	3 obs. of  2 variables:
 $ id  : int  1 2 3
 $ vars:List of 3
  ..$ : chr "a"
  ..$ : chr  "a" "b" "c"
  ..$ : chr  "b" "c"
I want to filter the data.frame according to setdiff(vars, remove_this):
library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))
which gets me this:
> res
  id vars
1  1
2  2 b, c
3  3 b, c
But to drop the character(0) entries in vars I have to do something like:
res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...
Actual datasets:
- 560K rows and 3800K rows, which also have 10 more columns (to carry along).
(This is quite slow, which leads to the question...)
What is the fastest way to do this in R?
- Is there a dplyr / data.table / other faster method?
- How would this be done with Rcpp?
Update/extension:
- Can the column modification be done in place rather than by copying the lapply(vars, setdiff(... result?
- What's the most efficient way to filter out vars == character(0) if it must be a separate step?
Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster, because you won't have to copy the entire thing just to add a column:
library(data.table)
dt = as.data.table(df) # or use setDT to convert in place
dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
#   id  vars newcol
#1:  2 a,b,c    b,c
#2:  3   b,c    b,c
You can also delete the original column (at essentially 0 cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
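As a minimal sketch of the overwrite-in-place variant, assuming the example data from the question and using base R's lengths() (available since R 3.2.0) in place of sapply(vars, length), the modification and the filter can be chained like this:

```r
library(data.table)

# Example data from the question
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))

dt <- as.data.table(df)

# Overwrite the list column by reference (the other columns are not copied),
# then keep only the rows whose vector is non-empty
res <- dt[, vars := lapply(vars, setdiff, "a")][lengths(vars) > 0]

print(res)
```

Note that dt itself keeps all three rows after this call; only its vars column is changed, and the filtered rows are returned as a new table in res.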
Now, as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
#   id vars
#1:  2  b,c
#2:  3  b,c
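To make it easier to see why the filtering comes for free, here is the same unlist-then-regroup idea split into separate steps. This is only an illustrative sketch on the example data; the intermediate names long and kept are made up for the illustration.

```r
library(data.table)

# Example data from the question
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)

# Step 1: one row per (id, element); the unnamed result column is called V1
long <- dt[, unlist(vars), by = id]

# Step 2: drop the unwanted values; an id with nothing left has no rows now
kept <- long[!V1 %in% "a"]

# Step 3: collapse back into a list column, one row per remaining id
res <- kept[, .(vars = list(V1)), by = id]

print(res)
```

Because id 1 only contained "a", it has no rows left after step 2 and so never forms a group in step 3, which is why no separate character(0) filter is needed.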
To carry along the other columns, I think it's easiest to simply merge back:
dt[, othercol := 5:7]
# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
#   id vars i.vars othercol
#1:  2  b,c  a,b,c        6
#2:  3  b,c    b,c        7
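A possible variation on the merge-back step, sketched under the assumption that a data.table version supporting the on = argument (1.9.6 or later) is installed: name the join column explicitly instead of relying on the key, and drop the old list column first so that no i.vars column appears in the result. The names filtered and carry are hypothetical.

```r
library(data.table)

# Example data from the question plus one carry-along column
df <- data.frame(id = 1:3, othercol = 5:7)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)

# Filter the list column via the long format, as above
filtered <- dt[, unlist(vars), by = id][!V1 %in% "a", .(vars = list(V1)), by = id]

# Everything except the old list column, to be carried along
carry <- dt[, !"vars", with = FALSE]

# Inner join on id; ids that were filtered out are dropped by nomatch = 0
res <- filtered[carry, on = "id", nomatch = 0]

print(res)
```

With ten or more carry-along columns this avoids dragging the original vars column through the join, though whether it is measurably faster on the real data would need to be benchmarked.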