R - colMeans in a data frame by factor variable
I'm trying to get the mean of some variables inside a data frame for different factors. Say I have:
      time geo var1 var2 var3 var4
    1 1990  at    1    7   13   19
    2 1991  at    2    8   14   20
    3 1992  at    3    9   15   21
    4 1990  de    4   10   16   22
    5 1991  de    5   11   17   23
    6 1992  de    6   12   18   24

and I want:
      time geo var1 var2 var3 var4 m_var2 m_var3
    1 1990  at    1    7   13   19      8     14
    2 1991  at    2    8   14   20      8     14
    3 1992  at    3    9   15   21      8     14
    4 1990  de    4   10   16   22     11     17
    5 1991  de    5   11   17   23     11     17
    6 1992  de    6   12   18   24     11     17

I've tried a few things with by() and lapply(), but I think it goes in the direction of ddply:
    require(plyr)
    dataset <- data.frame(time = rep(c(1990:1992), 2),
                          geo = c(rep("at", 3), rep("de", 3)),
                          var1 = as.numeric(c(1:6)),
                          var2 = as.numeric(c(7:12)),
                          var3 = as.numeric(c(13:18)),
                          var4 = as.numeric(c(19:24)))
    newvars <- c("var2", "var3")
    newdata <- dataset[, c("geo", newvars)]

Currently, I can choose between two errors:
    ddply(newdata, newdata[, "geo"], colMeans)  # where R apparently thinks "at" is a variable?
    ddply(newdata, "geo", colMeans)             # where R worries about the factor variable not being numeric?

My lapply attempts got me quite far, but left me with a list and not a data frame:
    lapply(newvars, function(x) {
      by(dataset[x], dataset[, "geo"],
         function(x) rep(colMeans(x, na.rm = TRUE), length(unique(dataset[, "time"]))))
    })

I think it must be possible with merge and filters, as in "lapply in dataframe on different variables using filters", but I can't put it all together. Any help is appreciated!
One option is to use data.table. We can convert the data.frame to a data.table (setDT(df1)), get the mean (lapply(.SD, mean)) of the selected columns ('var2' and 'var3') by specifying their column index in .SDcols, grouped by 'geo'. The new columns are created by assigning the output (:=) to the new column names (paste('m', names(df1)[4:5], sep="_")).
    library(data.table)
    setDT(df1)[, paste('m', names(df1)[4:5], sep = "_") := lapply(.SD, mean),
               by = geo, .SDcols = 4:5]
    #   time geo var1 var2 var3 var4 m_var2 m_var3
    #1: 1990  at    1    7   13   19      8     14
    #2: 1991  at    2    8   14   20      8     14
    #3: 1992  at    3    9   15   21      8     14
    #4: 1990  de    4   10   16   22     11     17
    #5: 1991  de    5   11   17   23     11     17
    #6: 1992  de    6   12   18   24     11     17

NOTE: This method is more general. We can create the mean columns for hundreds of variables without any major change in the code, i.e. if we need the mean of columns 4:100, change .SDcols=4:100 and, in the paste call, use paste('m', names(df1)[4:100], sep="_").
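For comparison, the same per-group mean columns can be sketched in base R without any extra packages. This is only an illustrative alternative, not part of the data.table approach above: it rebuilds the example data from the question, and the loop over `newvars` with `ave()` is one assumed way to do it among several.

```r
# Example data as defined in the question
dataset <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("at", "de"), each = 3),
  var1 = as.numeric(1:6),
  var2 = as.numeric(7:12),
  var3 = as.numeric(13:18),
  var4 = as.numeric(19:24)
)
newvars <- c("var2", "var3")

# ave() returns a vector as long as its input, where each value is replaced
# by FUN applied to all values in the same group (here: the same geo)
for (v in newvars) {
  dataset[[paste("m", v, sep = "_")]] <- ave(dataset[[v]], dataset$geo, FUN = mean)
}

dataset
```

Because `ave()` keeps the original row order and length, the result can be assigned straight back as a new column, so no merge step is needed.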
Data

    df1 <- structure(list(time = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L),
        geo = c("at", "at", "at", "de", "de", "de"), var1 = 1:6, var2 = 7:12,
        var3 = 13:18, var4 = 19:24), .Names = c("time", "geo", "var1", "var2",
        "var3", "var4"), class = "data.frame", row.names = c("1", "2", "3",
        "4", "5", "6"))
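As a side note on the second ddply() error in the question: ddply() hands each group's rows to the function as a data frame that still contains the grouping column, and colMeans() rejects any non-numeric column. The sketch below reproduces that failure with base R only and then shows an aggregate() plus merge() route, which is roughly what the question was reaching for. The object names `res`, `grp_means`, and `merged` are illustrative, and the data are a reduced copy of the question's example.

```r
# Reduced copy of the question's example data
dataset <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("at", "de"), each = 3),
  var2 = as.numeric(7:12),
  var3 = as.numeric(13:18)
)
newvars <- c("var2", "var3")

# colMeans() fails as soon as a non-numeric column (here geo) is included;
# tryCatch() captures the error object instead of stopping the script
res <- tryCatch(colMeans(dataset[, c("geo", newvars)]), error = function(e) e)
inherits(res, "error")

# Group means computed on the numeric columns only, one row per geo
grp_means <- aggregate(dataset[newvars], by = list(geo = dataset$geo), FUN = mean)
names(grp_means)[-1] <- paste("m", newvars, sep = "_")

# Join the group means back onto the original rows
merged <- merge(dataset, grp_means, by = "geo")
merged
```

Note that merge() puts the key column first and sorts rows by it, so the column and row order may differ from the original data frame.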