I have a large data set, and I am trying to summarize my question with the small example below.

Let's say I have a 3x3 matrix named x, with column names a, b, and c.

    x = (1, 10, 0.1,
         2, 20, 0.2,
         3, 30, 0.3)

where a = c(1, 2, 3) gives the number of times to repeat, b = c(10, 20, 30) gives the actual values to repeat, and c = c(0.1, 0.2, 0.3) gives the values to fill with if the number of times in a is less than 4 (the number of columns of matrix y).

My goal is to generate a 3x4 matrix y, which should look like this:

    y = (10, 0.1, 0.1, 0.1,
         20,  20, 0.2, 0.2,
         30,  30,  30, 0.3)

I understand there might be many ways to do this for the example, but since my real data is very large (x has a million rows, and y has 480 columns), I have to do it without loops (like 480 iterations). I have tried using the function rep, but I still could not do this.

Solution

It wasn't easy, but I figured out a way to accomplish this task using a single vectorized call to rep(), plus some scaffolding code:

    xr <- 3; yc <- 4;
    x <- matrix(c(1:xr%%(yc+1),seq(10,by=10,length.out=xr),seq(0.1,by=0.1,length.out=xr)),xr,dimnames=list(NULL,c('rep','val','fill')));
    x;
    ##      rep val fill
    ## [1,]   1  10  0.1
    ## [2,]   2  20  0.2
    ## [3,]   3  30  0.3
    y <- matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,]   10  0.1  0.1  0.1
    ## [2,]   20 20.0  0.2  0.2
    ## [3,]   30 30.0 30.0  0.3

(Minor point: I opted to assign the column names rep, val, and fill to x, rather than a, b, and c as specified in the question, and I used those column names in my solution when indexing x (rather than numeric indexes), because I generally prefer maximizing human-readability wherever possible. This detail is negligible with respect to the correctness and performance of the solution.)

Performance

This has a significant performance benefit over @josilber's solution, because he uses apply(), which internally loops over the rows of the matrix (traditionally called a "hidden loop" in R-speak), whereas the core of my solution is a single vectorized call to rep(). I don't say this to knock @josilber's solution, which is a good one (and I gave him an upvote!); it's just not the best possible solution to this problem.

Here's a demo of the performance benefit using the hefty parameters you indicated in your question:

    xr <- 1e6; yc <- 480;
    x <- matrix(c(1:xr%%(yc+1),seq(10,by=10,length.out=xr),seq(0.1,by=0.1,length.out=xr)),xr,dimnames=list(NULL,c('rep','val','fill')));
    x;
    ##        rep  val fill
    ##   [1,]   1   10  0.1
    ##   [2,]   2   20  0.2
    ##   [3,]   3   30  0.3
    ##   [4,]   4   40  0.4
    ##   [5,]   5   50  0.5
    ##   [6,]   6   60  0.6
    ##   [7,]   7   70  0.7
    ##   [8,]   8   80  0.8
    ##   [9,]   9   90  0.9
    ##  [10,]  10  100  1.0
    ##  [11,]  11  110  1.1
    ##  [12,]  12  120  1.2
    ##  [13,]  13  130  1.3
    ##
    ## ... (snip) ...
    ##
    ## [477,] 477 4770 47.7
    ## [478,] 478 4780 47.8
    ## [479,] 479 4790 47.9
    ## [480,] 480 4800 48.0
    ## [481,]   0 4810 48.1
    ## [482,]   1 4820 48.2
    ## [483,]   2 4830 48.3
    ## [484,]   3 4840 48.4
    ## [485,]   4 4850 48.5
    ## [486,]   5 4860 48.6
    ## [487,]   6 4870 48.7
    ## [488,]   7 4880 48.8
    ## [489,]   8 4890 48.9
    ## [490,]   9 4900 49.0
    ## [491,]  10 4910 49.1
    ## [492,]  11 4920 49.2
    ##
    ## ... (snip) ...
    ##
    ## [999986,] 468  9999860  99998.6
    ## [999987,] 469  9999870  99998.7
    ## [999988,] 470  9999880  99998.8
    ## [999989,] 471  9999890  99998.9
    ## [999990,] 472  9999900  99999.0
    ## [999991,] 473  9999910  99999.1
    ## [999992,] 474  9999920  99999.2
    ## [999993,] 475  9999930  99999.3
    ## [999994,] 476  9999940  99999.4
    ## [999995,] 477  9999950  99999.5
    ## [999996,] 478  9999960  99999.6
    ## [999997,] 479  9999970  99999.7
    ## [999998,] 480  9999980  99999.8
    ## [999999,]   0  9999990  99999.9
    ## [1e+06,]    1 10000000 100000.0
    josilber <- function() t(apply(x,1,function(x) rep(x[2:3],c(x[1],yc-x[1]))));
    bgoldst <- function() matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    system.time({ josilber(); });
    ##    user  system elapsed
    ##  65.719   3.828  71.623
    system.time({ josilber(); });
    ##    user  system elapsed
    ##  60.375   2.609  66.724
    system.time({ bgoldst(); });
    ##    user  system elapsed
    ##   5.422   0.593   6.033
    system.time({ bgoldst(); });
    ##    user  system elapsed
    ##   5.203   0.797   6.002

And just to prove that @josilber and I are getting the exact same result, even for this large input:

    identical(bgoldst(),josilber());
    ## [1] TRUE

Explanation

Now I shall attempt to explain how the solution works. For the explanation, I'll use the following input:

    xr <- 6; yc <- 4;
    x <- matrix(c(1:xr%%(yc+1),seq(10,by=10,length.out=xr),seq(0.1,by=0.1,length.out=xr)),xr,dimnames=list(NULL,c('rep','val','fill')));
    x;
    ##      rep val fill
    ## [1,]   1  10  0.1
    ## [2,]   2  20  0.2
    ## [3,]   3  30  0.3
    ## [4,]   4  40  0.4
    ## [5,]   0  50  0.5
    ## [6,]   1  60  0.6

for which the solution is:

    y <- matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,] 10.0  0.1  0.1  0.1
    ## [2,] 20.0 20.0  0.2  0.2
    ## [3,] 30.0 30.0 30.0  0.3
    ## [4,] 40.0 40.0 40.0 40.0
    ## [5,]  0.5  0.5  0.5  0.5
    ## [6,] 60.0  0.6  0.6  0.6

At a high level, the solution is built around forming a single vector that combines the val and fill vectors, then repeating that combined vector in a certain way, and then building a new matrix out of the result.

The repetition step can be done using a single call to rep() because it supports vectorized repetition counts. In other words, given a vector input x, it can take a vector input for times that specifies how many times to repeat each element of x. Thus, the challenge just becomes constructing the appropriate x and times arguments.
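To make the vectorized-counts behavior concrete, here is a minimal sketch on a toy vector. The names v, whole, and per_elt are made up for illustration and are not part of the solution:

```r
# A scalar 'times' repeats the whole vector; a vector 'times' of the same
# length as the input repeats each element by its own count.
v <- c(10, 20, 30)
whole   <- rep(v, times = 2)            # whole vector, twice
per_elt <- rep(v, times = c(1, 0, 3))   # 10 once, 20 never, 30 three times
print(whole)
## [1] 10 20 30 10 20 30
print(per_elt)
## [1] 10 30 30 30
```

Note that a count of 0 simply drops the element, which is what allows a row with rep equal to 0 to consist entirely of fill values.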

So, the solution begins by extracting just the val and fill columns of x:

    x[,c('val','fill')];
    ##      val fill
    ## [1,]  10  0.1
    ## [2,]  20  0.2
    ## [3,]  30  0.3
    ## [4,]  40  0.4
    ## [5,]  50  0.5
    ## [6,]  60  0.6

As you can see, since we've indexed two columns, we still have a matrix, even though we didn't specify drop=FALSE for the index operation (see R: Extract or Replace Parts of an Object). This is convenient, as will be seen.

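In R, underneath the "matrix persona" of a matrix is really just a plain old atomic vector, and the "vector persona" of the matrix can be leveraged for vectorized operations. This is how we can pass the val and fill data to rep() and have those elements repeated appropriately.

However, when doing this, it is important to understand exactly how the matrix is treated as a vector. The answer is that the vector is formed by following elements across rows first (that is, down each column, with the row index varying fastest) and only then across columns. This is column-major order. (For higher-dimensional arrays the subsequent dimensions are then followed. IOW, the order of the vector is across rows, then columns, then z-slices, etc.)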
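As a quick sanity check of that ordering, the sketch below flattens a small toy matrix (m is a made-up example, not the data from the question) both directly and after transposing:

```r
# matrix() fills column by column, so m is
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)
flat   <- c(m)      # walks down column 1, then column 2, then column 3
flat_t <- c(t(m))   # transposing first yields the row-by-row order of m
print(flat)
## [1] 1 2 3 4 5 6
print(flat_t)
## [1] 1 3 5 2 4 6
```

Transposing before flattening is therefore a cheap way to read a matrix out row by row.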

If you look at the val/fill extract shown above, you'll see that it cannot be used as-is as our x argument to rep(), because all the vals would be followed first, and then all the fills. We could fairly easily construct a times argument that would repeat each element the correct number of times, but the resulting vector would be completely out-of-order, and there would be no way to reshape it into the desired matrix y.

Actually, why don't I quickly demonstrate this before moving on with the explanation:

    rep(x[,c('val','fill')],times=c(x[,'rep'],yc-x[,'rep']))
    ##  [1] 10.0 20.0 20.0 30.0 30.0 30.0 40.0 40.0 40.0 40.0 60.0  0.1  0.1  0.1  0.2  0.2  0.3  0.5  0.5  0.5  0.5  0.6  0.6  0.6

Although the above vector has all the right elements in all the right repetitions, the order is such that it cannot form the desired output matrix y.

So, we can solve this by first transposing the extract:

    t(x[,c('val','fill')]);
    ##      [,1] [,2] [,3] [,4] [,5] [,6]
    ## val  10.0 20.0 30.0 40.0 50.0 60.0
    ## fill  0.1  0.2  0.3  0.4  0.5  0.6

Now we have the val and fill vectors interleaved with one another. When the matrix is flattened to a vector, which happens whenever we pass it as an argument to a function that internally uses it as a vector (as we will do with rep()'s x argument), we'll get each val and its corresponding fill value in the proper order for rebuilding a matrix out of them. Let me demonstrate this by explicitly flattening the matrix to a vector to show what this looks like (as you can see, this "flattening" can be done with a simple c() call):

    c(t(x[,c('val','fill')]));
    ##  [1] 10.0  0.1 20.0  0.2 30.0  0.3 40.0  0.4 50.0  0.5 60.0  0.6

So, we have our x argument. Now we just need to construct the times argument.

This was actually fairly tricky to figure out. First, we can recognize that the repetition counts for the val values are provided directly in the rep column of x, so we have that in x[,'rep']. And the repetition counts for the fill values can be computed from the difference between the number of columns in the output matrix y, which I've captured in yc, and the aforementioned repetition counts for each val, or IOW, yc-x[,'rep']. The problem is, we need to interleave these two vectors to line up with our x argument.

I am not aware of any "built-in" way to interleave two vectors in R; there doesn't appear to be any function that does it. When working on this problem, I came up with two different possible solutions for this task, one of which appears to be better in terms of both performance and concision. But since I wrote my original solution to use the "worse" one, and only later (while writing this explanation, actually) thought of the second and "better" one, I'll explain both approaches here, starting with the first and worse one.

Interleaving solution #1

Interleaving two vectors can be done by combining the vectors sequentially, and then indexing that combined vector with a carefully crafted index vector that jumps back-and-forth between the first half and the second half of the combined vector, sequentially pulling out each element of each half in an alternating fashion.

To construct this index vector, I begin with a sequential vector of length equal to half the length of the combined vector, with each element repeated once (so that it appears twice):

    rep(1:nrow(x),each=2);
    ##  [1] 1 1 2 2 3 3 4 4 5 5 6 6

Next, I add a two-element vector consisting of 0 and half the length of the combined vector:

    nrow(x)*0:1;
    ## [1] 0 6

The second addend is automatically recycled across the first addend, achieving the interleaving we need:

    rep(1:nrow(x),each=2)+nrow(x)*0:1;
    ##  [1]  1  7  2  8  3  9  4 10  5 11  6 12

And so we can index the combined repetition vector to get our times argument:

    c(x[,'rep'],yc-x[,'rep'])[rep(1:nrow(x),each=2)+nrow(x)*0:1];
    ##  [1] 1 3 2 2 3 1 4 0 0 4 1 3

Interleaving solution #2

Interleaving two vectors can also be accomplished by combining the two vectors into a matrix and then flattening it once again, in such a way that the elements naturally become interleaved. I believe the easiest way to do this is to rbind() them and then immediately flatten the result with c():

    c(rbind(x[,'rep'],yc-x[,'rep']));
    ##  [1] 1 3 2 2 3 1 4 0 0 4 1 3

Based on some cursory performance testing, it appears that solution #2 is more performant, and it can clearly be seen that it's more concise. Also, additional vectors could very easily be tacked on to the rbind() call, whereas it would be a little more involved to tack them on to solution #1 (a couple of increments).
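To illustrate how the rbind() approach extends to more than two vectors, here is a small sketch. The helper name interleave is hypothetical (it is not a base R function), and it assumes all inputs have the same length:

```r
# Hypothetical helper: interleave any number of equal-length vectors by
# stacking them as rows and then flattening in column-major order.
interleave <- function(...) c(rbind(...))

out <- interleave(1:3, 4:6, 7:9)
print(out)
## [1] 1 4 7 2 5 8 3 6 9
```

Each column of the intermediate matrix holds the i-th element of every input, so flattening emits the first elements of all inputs, then the second elements, and so on.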

Performance testing (using the large dataset):

    il1 <- function() c(x[,'rep'],yc-x[,'rep'])[rep(1:nrow(x),each=2)+nrow(x)*0:1];
    il2 <- function() c(rbind(x[,'rep'],yc-x[,'rep']));
    identical(il1(),il2());
    ## [1] TRUE
    system.time({ replicate(30,il1()); });
    ##    user  system elapsed
    ##   3.750   0.000   3.761
    system.time({ replicate(30,il1()); });
    ##    user  system elapsed
    ##   3.810   0.000   3.815
    system.time({ replicate(30,il2()); });
    ##    user  system elapsed
    ##   1.516   0.000   1.512
    system.time({ replicate(30,il2()); });
    ##    user  system elapsed
    ##   1.500   0.000   1.503

And so the full rep() call gives us our data in the proper order:

    rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep'])));
    ##  [1] 10.0  0.1  0.1  0.1 20.0 20.0  0.2  0.2 30.0 30.0 30.0  0.3 40.0 40.0 40.0 40.0  0.5  0.5  0.5  0.5 60.0  0.6  0.6  0.6

The last step is to build a matrix out of it, using byrow=TRUE, because that's the order in which the data is returned from rep(). We must also specify the required number of rows, which is the same as for the input matrix, xr (alternatively, we could specify the number of columns, yc, or even both, if we wanted):

    y <- matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,] 10.0  0.1  0.1  0.1
    ## [2,] 20.0 20.0  0.2  0.2
    ## [3,] 30.0 30.0 30.0  0.3
    ## [4,] 40.0 40.0 40.0 40.0
    ## [5,]  0.5  0.5  0.5  0.5
    ## [6,] 60.0  0.6  0.6  0.6

And we're done!
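For reuse, the steps above could be wrapped in a function along the following lines. This is only a sketch: the name expand_rows, its parameters, and the up-front validation are my own assumptions rather than part of the original answer. It is applied here to the 3x3 example from the question, using the original a, b, c column layout:

```r
# Hypothetical wrapper around the rep()-based construction.
# rep_col, val_col and fill_col are column indexes (or names) in x.
expand_rows <- function(x, ncol_out, rep_col = 1, val_col = 2, fill_col = 3) {
  reps <- x[, rep_col]
  # rep() rejects negative counts, and each output row must hold exactly
  # ncol_out elements, so every count has to lie in [0, ncol_out].
  stopifnot(all(reps >= 0), all(reps <= ncol_out))
  matrix(
    rep(t(x[, c(val_col, fill_col), drop = FALSE]),
        times = c(rbind(reps, ncol_out - reps))),
    nrow = nrow(x), byrow = TRUE
  )
}

# The example from the question: a = repeat counts, b = values, c = fills.
x <- cbind(a = c(1, 2, 3), b = c(10, 20, 30), c = c(0.1, 0.2, 0.3))
y <- expand_rows(x, 4)
print(y)
```

The validation matters because a count larger than ncol_out would make the fill count negative, which rep() refuses with an error about an invalid times argument; checking up front gives a clearer failure.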
