r - Cross-referencing data frames without using for loops
I'm having an issue with the speed of using for loops to cross-reference two data frames. The overall aim is to identify rows in data frame 2 that lie between the coordinates specified in data frame 1 (and meet other criteria). E.g. df1:
        chr     start      stop strand
    1  chr1 179324331 179327814      +
    2 chr21  45176033  45182188      +
    3  chr5 126887642 126890780      +
    4  chr5 148730689 148734146      +
df2:
        chr     start strand
    1  chr1 179326331      +
    2 chr21  45175033      +
    3  chr5 126886642      +
    4  chr5 148729689      +
My current code is:

    for (index in 1:nrow(df1)) {
      found_mirnas <- ""
      curr_row = df1[index, ]
      for (index2 in 1:nrow(df2)) {
        curr_target = df2[index2, ]
        if (curr_row$chr == curr_target$chr &
            curr_row$start < curr_target$start &
            curr_row$stop > curr_target$start &
            curr_row$strand == curr_target$strand) {
          found_mirnas <- paste(found_mirnas, curr_target$start, sep=":")
        }
      }
      curr_row$mirnas <- found_mirnas
      found_log <- rbind(mcf7_short_autrs2, curr_row)
    }
My actual data frames have 400 rows in df1 and more than 100,000 rows in df2, and I am hoping to run 500 iterations, so, as you can imagine, this is unworkably slow. I'm relatively new to R, so any hints about functions that may increase the efficiency of this would be great.
You've run into two of the most common mistakes people make when coming to the R programming language: using for loops instead of vector-based operations, and dynamically appending to a data object. As you become more fluent, I'd suggest taking the time to read Patrick Burns' The R Inferno, which provides interesting insight into these and other problems.
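To make those two points concrete, here is a minimal sketch contrasting the two styles. The vector `x` and the threshold of 1 are made up for illustration and are not taken from the question:

```r
set.seed(1)
x <- rnorm(10000)   # hypothetical data, for illustration only

## Slow pattern: visit each element and grow the result on every hit
hits_loop <- c()
for (i in seq_along(x)) {
  if (x[i] > 1) {
    hits_loop <- c(hits_loop, i)   # copies the whole vector each time it grows
  }
}

## Vectorized pattern: one comparison over the whole vector
hits_vec <- which(x > 1)

## Both approaches select the same positions
identical(hits_loop, hits_vec)
```

The loop does the same work as `which(x > 1)`, but it pays for an R-level iteration per element and for a fresh copy of `hits_loop` on every append, and that cost grows quickly with the size of the data.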
As @David Arenburg and @zx8754 have pointed out in the comments above, there are specialized packages that can solve this problem, and the data.table package and @David's approach can be very efficient for larger datasets. For your case, though, base R can do what you need efficiently as well. I'll document one approach here, with a few more steps than necessary for clarity, in case you're interested:
    set.seed(1001)
    ranges <- data.frame(beg=rnorm(400))
    ranges$end <- ranges$beg + 0.005
    test <- data.frame(value=rnorm(100000))

    ## Add an id field for duplicate removal:
    test$id <- 1:nrow(test)

    ## This is where you'd set your criteria. The apply() function is just
    ## a wrapper for a for() loop over the rows in the ranges data.frame:
    out <- apply(ranges, MARGIN=1, function(x) test[ (x[1] < test$value & x[2] > test$value), "id"])

    ## Flatten the per-range results and drop ids matched by more than one range:
    selected <- unlist(out)
    selected <- unique( selected )
    selection <- test[ selected, ]
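Applied to the data frames in the question, the same idea might look like the sketch below. It assumes the column names shown in the example (`chr`, `start`, `stop`, `strand`) and rebuilds the four example rows so that it runs on its own. Unlike the original loop, it joins the matches with ":" without a leading separator:

```r
## Example data copied from the question
df1 <- data.frame(
  chr    = c("chr1", "chr21", "chr5", "chr5"),
  start  = c(179324331, 45176033, 126887642, 148730689),
  stop   = c(179327814, 45182188, 126890780, 148734146),
  strand = c("+", "+", "+", "+"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  chr    = c("chr1", "chr21", "chr5", "chr5"),
  start  = c(179326331, 45175033, 126886642, 148729689),
  strand = c("+", "+", "+", "+"),
  stringsAsFactors = FALSE
)

## One pass over the rows of df1; each pass is a single vectorized
## test against every row of df2, so there is no inner loop.
df1$mirnas <- vapply(seq_len(nrow(df1)), function(i) {
  hit <- df2$chr == df1$chr[i] &
         df2$strand == df1$strand[i] &
         df2$start > df1$start[i] &
         df2$start < df1$stop[i]
  paste(df2$start[hit], collapse = ":")   # "" when nothing matches
}, character(1))

df1
```

Because the result is written straight into a new column of `df1`, there is no need to build up a log with `rbind()` inside the loop.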