R - Cross-referencing data frames without using for loops

I'm having an issue with the speed of using for loops to cross-reference two data frames. The overall aim is to identify rows in data frame 2 that lie between the coordinates specified in data frame 1 (and meet other criteria). For example, df1:

        chr     start       stop        strand
    1   chr1    179324331   179327814   +
    2   chr21   45176033    45182188    +
    3   chr5    126887642   126890780   +
    4   chr5    148730689   148734146   +

df2:

        chr     start       strand
    1   chr1    179326331   +
    2   chr21   45175033    +
    3   chr5    126886642   +
    4   chr5    148729689   +

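My current code is:

    # Accumulator for the annotated rows (the original snippet bound rows
    # onto an object named mcf7_short_autrs2 that was never defined).
    found_log <- NULL

    for (index in 1:nrow(df1)) {
      found_mirnas <- ""
      curr_row <- df1[index, ]
      for (index2 in 1:nrow(df2)) {
        curr_target <- df2[index2, ]
        # Column is named chr in the data frames shown above (was chrm).
        if (curr_row$chr == curr_target$chr &&
            curr_row$start < curr_target$start &&
            curr_row$stop > curr_target$start &&
            curr_row$strand == curr_target$strand) {
          found_mirnas <- paste(found_mirnas, curr_target$start, sep = ":")
        }
      }
      curr_row$mirnas <- found_mirnas
      # Growing a data frame row by row is one source of the slowness.
      found_log <- rbind(found_log, curr_row)
    }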

My actual data frames have 400 lines for df1 and > 100 000 lines for df2, and I am hoping to do 500 iterations, so, as you can imagine, this is unworkably slow. I'm relatively new to R, so any hints on functions that may increase the efficiency of this would be great.

You've run into two of the most common mistakes people make when coming to R from another programming language: using for loops instead of vector-based operations, and dynamically appending to a data object. I'd suggest that, as you get more fluent, you take some time to read Patrick Burns' R Inferno, which provides some interesting insight into these and other problems.
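To make those two points concrete, here is a minimal sketch on made-up toy data (the vector `x` and the names `hits_loop`, `hits_vec`, `pieces` and `combined` are illustrative only, not part of the question). It contrasts an element-by-element loop with a single vectorized comparison, and shows one common way to avoid growing an object inside a loop when a loop really is needed.

```r
# Hypothetical toy data, not the asker's real data.
x <- c(5, 12, 7, 20, 3)

# Loop version: tests one element at a time and grows the result vector.
hits_loop <- c()
for (i in seq_along(x)) {
  if (x[i] > 4 && x[i] < 15) {
    hits_loop <- c(hits_loop, x[i])
  }
}

# Vectorized version: one comparison over the whole vector at once.
hits_vec <- x[x > 4 & x < 15]

# If a loop is unavoidable, fill a preallocated list and combine once
# at the end instead of calling rbind() on every iteration.
pieces <- vector("list", 3)
for (i in 1:3) {
  pieces[[i]] <- data.frame(id = i, square = i^2)
}
combined <- do.call(rbind, pieces)
```

Both versions select the same elements; the vectorized one simply pushes the loop down into compiled code, which is usually where the speed-up comes from.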

As @David Arenburg and @zx8754 have pointed out in the comments above, there are specialized packages that can solve this problem, and the data.table package and @David's approach can be very efficient for larger datasets. For your case, though, base R can do what you need efficiently as well. I'll document one approach here, with a few more steps than necessary for clarity, just in case you're interested:

    set.seed(1001)

    ranges <- data.frame(beg = rnorm(400))
    ranges$end <- ranges$beg + 0.005

    test <- data.frame(value = rnorm(100000))

    ##  Add an id field for duplicate removal:
    test$id <- 1:nrow(test)

    ##  This is where you'd set your criteria.  The apply() function is
    ##      a wrapper for a for() loop over the rows in the ranges data.frame:
    out <- apply(ranges, MARGIN = 1, function(x) test[(x[1] < test$value & x[2] > test$value), "id"])

    ##  Flatten the per-range results and drop ids matched by more than one range:
    selected <- unlist(out)
    selected <- unique(selected)

    selection <- test[selected, ]
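The same idea can be carried over to the data frames from the question. The following is only a sketch under some assumptions: the columns are named `chr`, `start`, `stop` and `strand` as in the tables shown, the bounds are exclusive as in the original `if` condition, and the matches are wanted as a colon-separated string in a `mirnas` column. It keeps one pass over the 400 rows of df1 but replaces the inner loop over df2 with a single vectorized comparison.

```r
# Example data copied from the question.
df1 <- data.frame(
  chr    = c("chr1", "chr21", "chr5", "chr5"),
  start  = c(179324331, 45176033, 126887642, 148730689),
  stop   = c(179327814, 45182188, 126890780, 148734146),
  strand = c("+", "+", "+", "+"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  chr    = c("chr1", "chr21", "chr5", "chr5"),
  start  = c(179326331, 45175033, 126886642, 148729689),
  strand = c("+", "+", "+", "+"),
  stringsAsFactors = FALSE
)

# For each row of df1, compare against every row of df2 in one step
# and paste the matching df2 start positions together.
df1$mirnas <- vapply(seq_len(nrow(df1)), function(i) {
  hit <- df2$chr == df1$chr[i] &
    df2$strand == df1$strand[i] &
    df2$start > df1$start[i] &
    df2$start < df1$stop[i]
  paste(df2$start[hit], collapse = ":")
}, character(1))
```

Unlike the original loop, this does not produce a leading ":" in the result, and rows with no match get an empty string. Because the result is written straight into a new column, nothing is appended with rbind().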
