R - Create a character vector column of predefined text and bind it to an existing data frame using rbind or bind_rows
Good day,
I present two [likely] puny problems for your excellent review.
Problem #1
I have a relatively tidy df (dat) with dim 10299 x 563. The 563 variables common to both datasets [that created] dat are 'subject' (numeric), 'label' (numeric), and 3:563 (variable names from a text file). Observations 1:2947 are from a 'test' dataset, whereas observations 2948:10299 are from a 'training' dataset.
I'd like to insert a column (header = 'type') into dat, with rows 1:2947 comprised of the string test and rows 2948:10299 of the string train, so that I can group later on dataset or use other similar aggregate functions in dplyr/tidyr.
I created a test df (testdf = 1:10299; dim(testdf) = 10299 x 1), then:
    testdat[1:2947, "type"] <- c("test")
    testdat[2948:10299, "type"] <- c("train")
    > head(ds, 2); tail(ds, 2)
      x1.10299 type
    1        1 test
    2        2 test
          x1.10299  type
    10298    10298 train
    10299    10299 train
So I don't want that column of x1.10299 to be there.
Questions:
- Is there a better, more expedient way to create a column that has what I'm looking for, based upon the use case above?
- What is the best way to insert that column into 'dat' so that I can use it later for grouping with dplyr?
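One possible approach to the use case above, as a minimal sketch rather than a definitive answer: build the whole label vector with base R's rep() and assign it straight to the data frame, which avoids the helper data frame entirely. The toy `dat` below and the names `n_test` and `n_train` are assumptions for illustration; the real data would use 2947 and 7352.

```r
# Toy stand-in for `dat`: 3 "test" rows followed by 5 "train" rows.
# With the real data, n_test would be 2947 and n_train would be 7352.
n_test  <- 3
n_train <- 5
dat <- data.frame(subject = c(2, 2, 2, 1, 1, 1, 1, 1),
                  label   = 5)

# rep() with `times` repeats each string the requested number of times,
# so the column is created in one step with no extra index column.
dat$type <- rep(c("test", "train"), times = c(n_test, n_train))

# Quick check of how many rows carry each label.
table(dat$type)
```

Because `type` is an ordinary character column, it should be usable afterwards as a grouping variable, for example with dplyr's group_by().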
Problem #2
The way I arrived at my [nearly] tidy df (dat) above was to take two dfs (test and train) of dim 2947 x 563 and 7352 x 563, respectively, and rbind them together.
I confirmed that all of the variable names are present after my binding effort with this:
    test.names <- names(test)
    train.names <- names(train)
    identical(test.names, train.names)
    > TRUE
What is interesting, and of primary concern, is that if I try to use the bind_rows function from 'dplyr' to perform the same binding exercise:
    dat <- bind_rows(test, train)
It returns a data frame that apparently keeps all of the observations (x: 10299), but the variable count is reduced from 563 to 470!
Questions:
- Does anyone know why my variables are being chopped?
- Is this the best way to combine two dfs of the same structure for later slicing/dicing with dplyr/tidyr?
Thank you for your time and consideration of these matters.
Sample test/train dfs for review (the leftmost numbers are the df indices):
test df: test[1:10, 1:5]
       subject labels tbodyacc-mean()-x tbodyacc-mean()-y tbodyacc-mean()-z
    1        2      5         0.2571778       -0.02328523       -0.01465376
    2        2      5         0.2860267       -0.01316336       -0.11908252
    3        2      5         0.2754848       -0.02605042       -0.11815167
    4        2      5         0.2702982       -0.03261387       -0.11752018
    5        2      5         0.2748330       -0.02784779       -0.12952716
    6        2      5         0.2792199       -0.01862040       -0.11390197
    7        2      5         0.2797459       -0.01827103       -0.10399988
    8        2      5         0.2746005       -0.02503513       -0.11683085
    9        2      5         0.2725287       -0.02095401       -0.11447249
    10       2      5         0.2757457       -0.01037199       -0.09977589
train df: train[1:10, 1:5]
       subject label tbodyacc-mean()-x tbodyacc-mean()-y tbodyacc-mean()-z
    1        1     5         0.2885845      -0.020294171        -0.1329051
    2        1     5         0.2784188      -0.016410568        -0.1235202
    3        1     5         0.2796531      -0.019467156        -0.1134617
    4        1     5         0.2791739      -0.026200646        -0.1232826
    5        1     5         0.2766288      -0.016569655        -0.1153619
    6        1     5         0.2771988      -0.010097850        -0.1051373
    7        1     5         0.2794539      -0.019640776        -0.1100221
    8        1     5         0.2774325      -0.030488303        -0.1253604
    9        1     5         0.2772934      -0.021750698        -0.1207508
    10       1     5         0.2805857      -0.009960298        -0.1060652
Actual code (ignore the function calls; I'm doing most of the testing via the console).
The data set I'm using with the code: http://archive.ics.uci.edu/ml/machine-learning-databases/00240/
    run_analysis <- function() {
      # vars available for use throughout the function that should be preserved
      vars <- read.table("features.txt", header = FALSE, sep = "")
      lookup_table <- data.frame(activitynum = c(1, 2, 3, 4, 5, 6),
                                 activity_label = c("walking", "walking_up",
                                                    "walking_down", "sitting",
                                                    "standing", "laying"))
      test <- test_read_process(vars, lookup_table)
      train <- train_read_process(vars, lookup_table)
    }

    test_read_process <- function(vars, lookup_table) {
      # read in the 3 documents for cbinding later
      test.sub <- read.table("test/subject_test.txt", header = FALSE)
      test.labels <- read.table("test/y_test.txt", header = FALSE)
      test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
      # cbind the cols and set the remaining colnames to the var names in vars
      test.dat <- cbind(test.sub, test.labels, test.obs)
      colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
      # use lookup_table to set the "test_labels" string values that correspond
      # to the integer ids
      # test.lookup <- merge(test, lookup_table, by.x = "labels",
      #                      by.y = "activitynum", all.x = TRUE)
      # remove temporary symbols from globalenv/memory
      rm(test.sub, test.labels, test.obs)
      # return
      return(test.dat)
    }

    train_read_process <- function(vars, lookup_table) {
      # read in the 3 documents for cbinding
      train.sub <- read.table("train/subject_train.txt", header = FALSE)
      train.labels <- read.table("train/y_train.txt", header = FALSE)
      train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
      # cbind the cols and set the remaining colnames to the var names in vars
      train.dat <- cbind(train.sub, train.labels, train.obs)
      colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))
      # clean up temporary symbols from globalenv/memory
      rm(train.sub, train.labels, train.obs, vars)
      return(train.dat)
    }
The problem you're facing stems from the fact that you have duplicated names in the variable list that you're using to create your data frame objects. If you ensure that the column names are unique and shared between the objects, the code will run. I've included a working example based on the code you used above (with fixes and various edits noted in the comments):
    vars <- read.table(file="features.txt", header=FALSE, stringsAsFactors=FALSE)

    ## FRS: This is the source of your original problem:
    duplicated(vars[,2])
    vars[317:340,2]
    duplicated(vars[317:340,2])
    vars[396:419,2]

    ## FRS: I edited the following to account for both the data and the
    ## variable name issues:
    test_read_process <- function() {
      # read in the 3 documents for cbinding later
      test.sub <- read.table("test/subject_test.txt", header = FALSE)
      test.labels <- read.table("test/y_test.txt", header = FALSE)
      test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
      # cbind the cols and set the remaining colnames to unique placeholder names
      test.dat <- cbind(test.sub, test.labels, test.obs)
      #colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
      colnames(test.dat) <- c("subject", "labels", paste0("v", 1:nrow(vars)))
      return(test.dat)
    }

    train_read_process <- function() {
      # read in the 3 documents for cbinding
      train.sub <- read.table("train/subject_train.txt", header = FALSE)
      train.labels <- read.table("train/y_train.txt", header = FALSE)
      train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
      # cbind the cols and set the remaining colnames to unique placeholder names
      train.dat <- cbind(train.sub, train.labels, train.obs)
      #colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
      colnames(train.dat) <- c("subject", "labels", paste0("v", 1:nrow(vars)))
      return(train.dat)
    }

    test_df <- test_read_process()
    train_df <- train_read_process()
    identical(names(test_df), names(train_df))

    library("dplyr")

    ## FRS: These could be piped together, but I've kept them separate for clarity:
    train_df %>% mutate(test="train") -> train_df
    test_df %>% mutate(test="test") -> test_df
    test_df %>% bind_rows(train_df) -> out_df
    head(out_df)
    out_df

    ## FRS: You can set the column names to those of the original
    ## variable list, but you still have duplicates to deal with:
    names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")
    duplicated(names(out_df))
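One way to deal with the remaining duplicates while keeping descriptive names, as a hedged sketch and not part of the code above: pass the feature names through base R's make.unique() before assigning them, so that repeated names get a numeric suffix. The names `feature_names` and `unique_names` and the tiny data frames are hypothetical stand-ins for `vars[,2]` and the real test/train data, and rbind() is used here only to keep the sketch free of package dependencies.

```r
# Hypothetical feature names containing one duplicate, standing in for vars[,2].
feature_names <- c("energy-1", "energy-2", "energy-1")

# anyDuplicated() returns the index of the first repeated entry (0 if none).
first_dup <- anyDuplicated(feature_names)

# make.unique() leaves the first occurrence alone and appends .1, .2, ...
# to later repeats, so every column name ends up distinct.
unique_names <- make.unique(feature_names)

# One-row toy frames standing in for the test and train data.
test_df  <- data.frame(2, 5, 0.1, 0.2, 0.3)
train_df <- data.frame(1, 5, 0.4, 0.5, 0.6)
names(test_df)  <- c("subject", "labels", unique_names)
names(train_df) <- c("subject", "labels", unique_names)

# Tag each source, then stack the rows.
test_df$type  <- "test"
train_df$type <- "train"
out_df <- rbind(test_df, train_df)

names(out_df)
```

With unique names in place, dplyr's bind_rows(test_df, train_df) could presumably be swapped in for rbind() without columns being lost.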