r - Size of nested vs. unnested (tidy) data.frame?
This question uses a data.frame that contains list-columns (nested). It had me wondering why/if there's an advantage to working this way. I assumed you would want to minimize the amount of memory each table uses... but when I checked, I was surprised:
Compare table sizes in nested vs. tidy format:
1. Generate nested/tidy versions of a 2-col and a 5-col data.frame:
    library(pryr)
    library(dplyr)
    library(tidyr)
    library(ggvis)

    n <- 1:1e6

    # 2-col: an id plus a list-column holding 1 to 26 random letters per row
    df <- data_frame(id = n,
                     vars = lapply(n, function(x) sample(letters, sample(1:26, 1))))
    dfu <- df %>% unnest(vars)

    # 5-col: same as above, with three extra integer columns
    df_morecols <- data_frame(id = n, other1 = n, other2 = n, other3 = n,
                              vars = lapply(n, function(x) sample(letters, sample(1:26, 1))))
    dfu_morecols <- df_morecols %>% unnest(vars)
They look like:
    head(df)
    #> Source: local data frame [6 x 2]
    #>   id      vars
    #> 1  1 <chr[16]>
    #> 2  2  <chr[4]>
    #> 3  3 <chr[26]>
    #> 4  4  <chr[9]>
    #> 5  5 <chr[11]>
    #> 6  6 <chr[18]>

    head(dfu)
    #> Source: local data frame [6 x 2]
    #>   id vars
    #> 1  1    k
    #> 2  1    d
    #> 3  1    s
    #> 4  1    j
    #> 5  1    m
    #> 6  1    t

    head(df_morecols)
    #> Source: local data frame [6 x 5]
    #>   id other1 other2 other3      vars
    #> 1  1      1      1      1  <chr[4]>
    #> 2  2      2      2      2 <chr[22]>
    #> 3  3      3      3      3 <chr[24]>
    #> 4  4      4      4      4  <chr[6]>
    #> 5  5      5      5      5 <chr[15]>
    #> 6  6      6      6      6 <chr[11]>

    head(dfu_morecols)
    #> Source: local data frame [6 x 5]
    #>   id other1 other2 other3 vars
    #> 1  1      1      1      1    r
    #> 2  1      1      1      1    p
    #> 3  1      1      1      1    s
    #> 4  1      1      1      1    w
    #> 5  2      2      2      2    l
    #> 6  2      2      2      2    j
2. Calculate object sizes and column sizes
From lapply(list(df, dfu, df_morecols, dfu_morecols), object_size):
170 MB vs. 162 MB for the nested vs. tidy 2-col df
170 MB vs. 324 MB for the nested vs. tidy 5-col df
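As a rough sanity check on the two tidy figures, here is a back-of-envelope sketch in base R. It rests on assumptions that are not in the post: a 64-bit R build, 4 bytes per integer element, 8 bytes per character-vector element (a pointer to a cached string), and negligible vector headers. The variable names are made up for illustration.

```r
# Back-of-envelope estimate of the tidy (unnested) table sizes.
# Assumptions (not from the original post): 64-bit R, 4 bytes per integer,
# 8 bytes per character-vector element, vector headers ignored.
n_ids    <- 1e6
mean_len <- mean(1:26)         # expected length of each sampled vector
n_rows   <- n_ids * mean_len   # expected number of rows after unnest()

bytes_int <- 4
bytes_chr <- 8

# 2-col tidy table: 1 integer column + 1 character column
tidy_2col <- n_rows * (1 * bytes_int + bytes_chr)
# 5-col tidy table: 4 integer columns + 1 character column
tidy_5col <- n_rows * (4 * bytes_int + bytes_chr)

estimate_mb <- c(tidy_2col = tidy_2col, tidy_5col = tidy_5col) / 1e6
print(estimate_mb)
```

Under those assumptions the estimates come out at about 162 and 324 (in units of 10^6 bytes), which lines up with the tidy sizes reported above. The sketch says nothing about the nested tables.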
    # c() on data frames concatenates their columns into one named list
    col_sizes <- sapply(c(df, dfu, df_morecols, dfu_morecols), object_size)
    col_names <- names(col_sizes)
    parent_obj <- c(rep(c('df', 'dfu'), each = 2),
                    rep(c('df_morecols', 'dfu_morecols'), each = 5))
    res <- data_frame(parent_obj, col_names, col_sizes) %>%
      unite(elementof, parent_obj, col_names, remove = FALSE)
3. Plot column sizes coloured by parent object:
    res %>%
      ggvis(y = ~elementof, x = ~0, x2 = ~col_sizes, fill = ~parent_obj) %>%
      layer_rects(height = band())
Questions:
- What explains the smaller footprint of the tidy 2-col df compared to the nested one?
- Why doesn't this effect carry over to the 5-col df?
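One way to probe the first question at a smaller scale is to compare a list of many short character vectors against the same values stored in one long vector. This is only a sketch. It assumes base `object.size()` as a stand-in for `pryr::object_size()`, so the absolute numbers will not match those above, because `object.size()` does not account for memory shared between objects. The names `ids`, `nested_bytes`, `flat_bytes` and `overhead_per_element` are illustrative.

```r
# Small-scale probe of per-vector overhead, using base R only.
# Assumption: object.size() stands in for pryr::object_size(); it does not
# detect sharing, so treat the numbers as indicative only.
set.seed(42)
ids  <- 1:1000
vars <- lapply(ids, function(i) sample(letters, sample(1:26, 1)))

nested_bytes <- as.numeric(object.size(vars))          # 1000 separate character vectors
flat_bytes   <- as.numeric(object.size(unlist(vars)))  # one long character vector

# Average extra bytes per list element in the nested layout,
# as measured by object.size()
overhead_per_element <- (nested_bytes - flat_bytes) / length(vars)

print(c(nested = nested_bytes,
        flat = flat_bytes,
        per_element = overhead_per_element))
```

If the nested layout carries a fixed cost per stored vector, it should show up here as a positive `per_element` value. Whether that cost outweighs the repeated `id` values in the tidy layout depends on how many columns have to be repeated.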