R - How to create a matrix with different repeats of values in a vector
I have a large data set, and I'm trying to summarize my question with the small example below.
Let's say we have a 3x3 matrix named x, with column names a, b, and c.
x = (1, 10, 0.1, 2, 20, 0.2, 3, 30, 0.3)
where a = c(1, 2, 3) gives the number of times to repeat, b = c(10, 20, 30) gives the actual values to repeat, and c = c(0.1, 0.2, 0.3) gives the values to fill in if the number of times in a is less than 4 (the number of columns of matrix y).
My goal is to generate a 3x4 matrix y, which should look like this:
y = (10, 0.1, 0.1, 0.1, 20, 20, 0.2, 0.2, 30, 30, 30, 0.3)
I understand there might be many ways to do this for the example, but since my real data is large (x has a million rows, and y has 480 columns), I have to do it without loops (like 480 iterations). I have tried using the function rep, but still could not do this.
Solution
It wasn't easy, but I figured out a way to accomplish this task using a single vectorized call to rep(), plus some scaffolding code:
    xr <- 3; yc <- 4;
    x <- matrix(c(1:xr%%(yc+1),seq(10,by=10,length.out=xr),seq(0.1,by=0.1,length.out=xr)),xr,dimnames=list(NULL,c('rep','val','fill')));
    x;
    ##      rep val fill
    ## [1,]   1  10  0.1
    ## [2,]   2  20  0.2
    ## [3,]   3  30  0.3
    y <- matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,]   10  0.1  0.1  0.1
    ## [2,]   20 20.0  0.2  0.2
    ## [3,]   30 30.0 30.0  0.3
(Minor point: I opted to assign the column names rep, val, and fill to x, rather than a, b, and c as specified in the question, and I used those column names in my solution when indexing x (rather than using numeric indexes). The reason is that I prefer maximizing human-readability wherever possible, but this detail is negligible with respect to the correctness and performance of the solution.)
Performance
This has a significant performance benefit over @josilber's solution, because his uses apply(), which internally loops over the rows of the matrix (traditionally called a "hidden loop" in R-speak), whereas the core of my solution is a single vectorized call to rep(). I don't say this to knock @josilber's solution, which is a good one (and I gave him an upvote!); it's just not the best possible solution to this problem.
Here's a demo of the performance benefit using the hefty parameters you indicated in your question:
    xr <- 1e6; yc <- 480;
    x <- matrix(c(1:xr%%(yc+1),seq(10,by=10,length.out=xr),seq(0.1,by=0.1,length.out=xr)),xr,dimnames=list(NULL,c('rep','val','fill')));
    x;
    ##           rep      val     fill
    ## [1,]        1       10      0.1
    ## [2,]        2       20      0.2
    ## [3,]        3       30      0.3
    ## [4,]        4       40      0.4
    ## [5,]        5       50      0.5
    ## [6,]        6       60      0.6
    ## [7,]        7       70      0.7
    ## [8,]        8       80      0.8
    ## [9,]        9       90      0.9
    ## [10,]      10      100      1.0
    ## [11,]      11      110      1.1
    ## [12,]      12      120      1.2
    ## [13,]      13      130      1.3
    ##
    ## ... (snip) ...
    ##
    ## [477,]    477     4770     47.7
    ## [478,]    478     4780     47.8
    ## [479,]    479     4790     47.9
    ## [480,]    480     4800     48.0
    ## [481,]      0     4810     48.1
    ## [482,]      1     4820     48.2
    ## [483,]      2     4830     48.3
    ## [484,]      3     4840     48.4
    ## [485,]      4     4850     48.5
    ## [486,]      5     4860     48.6
    ## [487,]      6     4870     48.7
    ## [488,]      7     4880     48.8
    ## [489,]      8     4890     48.9
    ## [490,]      9     4900     49.0
    ## [491,]     10     4910     49.1
    ## [492,]     11     4920     49.2
    ##
    ## ... (snip) ...
    ##
    ## [999986,] 468  9999860  99998.6
    ## [999987,] 469  9999870  99998.7
    ## [999988,] 470  9999880  99998.8
    ## [999989,] 471  9999890  99998.9
    ## [999990,] 472  9999900  99999.0
    ## [999991,] 473  9999910  99999.1
    ## [999992,] 474  9999920  99999.2
    ## [999993,] 475  9999930  99999.3
    ## [999994,] 476  9999940  99999.4
    ## [999995,] 477  9999950  99999.5
    ## [999996,] 478  9999960  99999.6
    ## [999997,] 479  9999970  99999.7
    ## [999998,] 480  9999980  99999.8
    ## [999999,]   0  9999990  99999.9
    ## [1e+06,]    1 10000000 100000.0
    josilber <- function() t(apply(x,1,function(x) rep(x[2:3],c(x[1],yc-x[1]))));
    bgoldst <- function() matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    system.time({ josilber(); });
    ##    user  system elapsed
    ##  65.719   3.828  71.623
    system.time({ josilber(); });
    ##    user  system elapsed
    ##  60.375   2.609  66.724
    system.time({ bgoldst(); });
    ##    user  system elapsed
    ##   5.422   0.593   6.033
    system.time({ bgoldst(); });
    ##    user  system elapsed
    ##   5.203   0.797   6.002
And to prove that @josilber and I are getting the exact same result, even for this large input:
    identical(bgoldst(),josilber());
    ## [1] TRUE
Explanation
Now I shall attempt to explain how the solution works. For the explanation, I'll use the following input:
    xr <- 6; yc <- 4;
    x <- matrix(c(1:xr%%(yc+1),seq(10,by=10,length.out=xr),seq(0.1,by=0.1,length.out=xr)),xr,dimnames=list(NULL,c('rep','val','fill')));
    x;
    ##      rep val fill
    ## [1,]   1  10  0.1
    ## [2,]   2  20  0.2
    ## [3,]   3  30  0.3
    ## [4,]   4  40  0.4
    ## [5,]   0  50  0.5
    ## [6,]   1  60  0.6
For which the solution is:
    y <- matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,] 10.0  0.1  0.1  0.1
    ## [2,] 20.0 20.0  0.2  0.2
    ## [3,] 30.0 30.0 30.0  0.3
    ## [4,] 40.0 40.0 40.0 40.0
    ## [5,]  0.5  0.5  0.5  0.5
    ## [6,] 60.0  0.6  0.6  0.6
At a high level, the solution is built around forming a single vector that combines the val and fill vectors, then repeating that combined vector in a certain way, and then building a new matrix out of the result.
The repetition step can be done using a single call to rep() because it supports vectorized repetition counts. In other words, given a vector input x, it can take a vector input for times that specifies how many times to repeat each element of x. Thus, the challenge just becomes constructing the appropriate x and times arguments.
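As a minimal sketch of what vectorized repetition counts look like, here is rep() on a toy input. The vectors vals and counts are hypothetical and not part of the question's data:

```r
# Toy input (hypothetical, not from the question).
vals <- c('a', 'b', 'c')
counts <- c(2, 0, 3)  # one repetition count per element of vals

# Each element is repeated by its own count; a count of 0 drops the element.
out <- rep(vals, times = counts)
print(out)

# The result length is always the sum of the counts.
stopifnot(length(out) == sum(counts))
```

Note that a count of 0 is legal, which is what lets a row consist entirely of fill values.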
So, the solution begins by extracting the val and fill columns of x:
    x[,c('val','fill')];
    ##      val fill
    ## [1,]  10  0.1
    ## [2,]  20  0.2
    ## [3,]  30  0.3
    ## [4,]  40  0.4
    ## [5,]  50  0.5
    ## [6,]  60  0.6
As you can see, since we've indexed two columns, we still have a matrix, even though we didn't specify drop=FALSE for the index operation (see R: Extract or Replace Parts of an Object). This is convenient, as will be seen.
In R, underneath the "matrix persona" of a matrix is really just a plain old atomic vector, and the "vector persona" of the matrix can be leveraged for vectorized operations. This is how we can pass the val and fill data to rep() and have those elements repeated appropriately.
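However, when doing this, it's important to understand exactly how the matrix is treated as a vector. The answer is that the vector is formed by following elements across rows first (that is, down each column, with the row index varying fastest) and only then across columns. (For higher-dimensional arrays the subsequent dimensions are then followed. IOW, the order of the vector is across rows, then columns, then z-slices, etc.)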
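To make the flattening order concrete, here is a small sketch on a hypothetical 2x3 matrix m (not the question's data), showing the order before and after a transpose:

```r
# Hypothetical 2x3 matrix; matrix() fills column by column by default.
m <- matrix(1:6, nrow = 2)
print(m)

# Flattening walks down column 1, then column 2, then column 3.
flat <- c(m)
print(flat)

# After transposing, the same walk visits the original matrix row by row.
flat_t <- c(t(m))
print(flat_t)
```

The second result is the reason a transpose helps when the data must come out row by row.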
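If you look at the extracted val/fill matrix above, you'll see that it cannot be used as-is as our x argument to rep(), because the vals would be followed first, then the fills. We could fairly easily construct a times argument that would repeat each element the correct number of times, but the resulting vector would be completely out-of-order, and there would be no way to reshape it into the desired matrix y.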
Actually, why don't I quickly demonstrate this before moving on with the explanation:
    rep(x[,c('val','fill')],times=c(x[,'rep'],yc-x[,'rep']))
    ## [1] 10.0 20.0 20.0 30.0 30.0 30.0 40.0 40.0 40.0 40.0 60.0 0.1 0.1 0.1 0.2 0.2 0.3 0.5 0.5 0.5 0.5 0.6 0.6 0.6
Although the above vector has all the right elements in all the right repetitions, the order is such that it cannot form the desired output matrix y.
So, we can solve this by first transposing the extract:
    t(x[,c('val','fill')]);
    ##      [,1] [,2] [,3] [,4] [,5] [,6]
    ## val  10.0 20.0 30.0 40.0 50.0 60.0
    ## fill  0.1  0.2  0.3  0.4  0.5  0.6
Now we have the val and fill vectors interleaved with one another. When the matrix is flattened to a vector, which happens whenever we pass it as an argument to a function that internally uses it as a vector (such as rep()'s x argument), we'll get the val and corresponding fill values in the proper order for rebuilding a matrix out of them. Let me demonstrate this by explicitly flattening the matrix to a vector to show what it looks like (as you can see, this "flattening" can be done with a simple c() call):
    c(t(x[,c('val','fill')]));
    ## [1] 10.0  0.1 20.0  0.2 30.0  0.3 40.0  0.4 50.0  0.5 60.0  0.6
So, we have our x argument. Now we just need to construct the times argument.
This was actually tricky to figure out. First we can recognize that the repetition counts for the val values are provided directly in the rep column of x, so we have that in x[,'rep']. And the repetition counts for the fill values can be computed as the difference between the number of columns in the output matrix y, which I've captured in yc, and the aforementioned repetition counts for val, or IOW, yc-x[,'rep']. The problem is, we need to interleave those two vectors to line up with our x argument.
I'm not aware of any "built-in" way to interleave two vectors in R; there doesn't appear to be a function to do it. When I was working on this problem, I came up with two different possible solutions for this task, one of which appears to be better in terms of both performance and concision. Since I wrote my original solution to use the "worse" one, and only later (while writing this explanation, actually) thought of the second and "better" one, I'll explain both approaches here, starting with the first and worse one.
Interleaving Solution #1
Interleaving two vectors can be done by combining the vectors sequentially, and then indexing that combined vector with a carefully crafted index vector that jumps back-and-forth from the first half to the second half of the combined vector, sequentially pulling out each element of each half in an alternating fashion.
To construct this index vector, I begin with a sequential vector of length equal to half the length of the combined vector, with each element repeated once (so that it appears twice):
    rep(1:nrow(x),each=2);
    ## [1] 1 1 2 2 3 3 4 4 5 5 6 6
Next, I add to it a two-element vector consisting of 0 and half the length of the combined vector:
    nrow(x)*0:1;
    ## [1] 0 6
The second addend is cycled (recycled) through the first addend, achieving the interleaving we need:
    rep(1:nrow(x),each=2)+nrow(x)*0:1;
    ## [1]  1  7  2  8  3  9  4 10  5 11  6 12
And so we can index the combined repetition vector to get our times argument:
    c(x[,'rep'],yc-x[,'rep'])[rep(1:nrow(x),each=2)+nrow(x)*0:1];
    ## [1] 1 3 2 2 3 1 4 0 0 4 1 3
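The same index trick can be packaged for any two equal-length vectors. This is only a sketch: interleave_by_index is a hypothetical helper name, not something from the question or from base R:

```r
# Hypothetical helper: interleave two equal-length vectors via a crafted index.
interleave_by_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  n <- length(a)
  # 1 1 2 2 ... n n, plus the recycled offsets 0 n 0 n ...
  # gives 1, n+1, 2, n+2, ..., which alternates between the two halves.
  idx <- rep(seq_len(n), each = 2) + n * 0:1
  c(a, b)[idx]
}

res <- interleave_by_index(c(1, 2, 3), c(10, 20, 30))
print(res)
```

Using seq_len(n) rather than 1:n keeps the helper safe for zero-length input.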
Interleaving Solution #2
Interleaving two vectors can also be accomplished by combining the two vectors into a matrix and then flattening them once again, in such a way that they naturally become interleaved. I believe the easiest way to do this is to rbind() them and then flatten them with c():
    c(rbind(x[,'rep'],yc-x[,'rep']));
    ## [1] 1 3 2 2 3 1 4 0 0 4 1 3
Based on some cursory performance testing, it appears that solution #2 is more performant, and it can clearly be seen that it's more concise. Also, additional vectors could easily be tacked on to the rbind() call, whereas there would be a bit more involved in tacking them on to solution #1 (a couple of increments).
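To illustrate the point about tacking on additional vectors, here is a sketch that interleaves three hypothetical vectors (a, b, and d are made up for this example) by adding one more argument to rbind():

```r
# Three hypothetical equal-length vectors.
a <- c(1, 2, 3)
b <- c(10, 20, 30)
d <- c(100, 200, 300)

# rbind() stacks them as rows; c() then reads the matrix column by column,
# which yields a[1], b[1], d[1], a[2], b[2], d[2], ...
out3 <- c(rbind(a, b, d))
print(out3)
```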
Performance testing (using the large dataset):
    il1 <- function() c(x[,'rep'],yc-x[,'rep'])[rep(1:nrow(x),each=2)+nrow(x)*0:1];
    il2 <- function() c(rbind(x[,'rep'],yc-x[,'rep']));
    identical(il1(),il2());
    ## [1] TRUE
    system.time({ replicate(30,il1()); });
    ##    user  system elapsed
    ##   3.750   0.000   3.761
    system.time({ replicate(30,il1()); });
    ##    user  system elapsed
    ##   3.810   0.000   3.815
    system.time({ replicate(30,il2()); });
    ##    user  system elapsed
    ##   1.516   0.000   1.512
    system.time({ replicate(30,il2()); });
    ##    user  system elapsed
    ##   1.500   0.000   1.503
And so the full rep() call gives us our data in the proper order:
    rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep'])));
    ## [1] 10.0 0.1 0.1 0.1 20.0 20.0 0.2 0.2 30.0 30.0 30.0 0.3 40.0 40.0 40.0 40.0 0.5 0.5 0.5 0.5 60.0 0.6 0.6 0.6
The last step is to build a matrix out of it, using byrow=TRUE, because that's how the data ended up being returned from rep(). And we must also specify the required number of rows, which is the same as for the input matrix, xr (alternatively, we could specify the number of columns, yc, or even both, if we wanted):
    y <- matrix(rep(t(x[,c('val','fill')]),times=c(rbind(x[,'rep'],yc-x[,'rep']))),xr,byrow=TRUE);
    y;
    ##      [,1] [,2] [,3] [,4]
    ## [1,] 10.0  0.1  0.1  0.1
    ## [2,] 20.0 20.0  0.2  0.2
    ## [3,] 30.0 30.0 30.0  0.3
    ## [4,] 40.0 40.0 40.0 40.0
    ## [5,]  0.5  0.5  0.5  0.5
    ## [6,] 60.0  0.6  0.6  0.6
And we're done!
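For reuse, the steps above could be wrapped in a small function. This is a sketch under some assumptions: expand_rows is a hypothetical name, the input is assumed to be a numeric matrix with columns named rep, val, and fill, and the range check on rep is my own addition rather than part of the one-liner:

```r
# Hypothetical wrapper around the one-liner, with a basic input check.
expand_rows <- function(x, yc) {
  reps <- x[, 'rep']
  # rep() needs non-negative counts, so rep must lie in 0..yc.
  stopifnot(all(reps >= 0), all(reps <= yc))
  matrix(
    rep(t(x[, c('val', 'fill'), drop = FALSE]),  # val/fill interleaved per row
        times = c(rbind(reps, yc - reps))),      # matching interleaved counts
    nrow = nrow(x), byrow = TRUE
  )
}

# The small example from the question.
x <- matrix(c(1, 2, 3, 10, 20, 30, 0.1, 0.2, 0.3), 3,
            dimnames = list(NULL, c('rep', 'val', 'fill')))
y <- expand_rows(x, 4)
print(y)
```

Passing drop = FALSE keeps the extract a matrix even if x has a single row, so the transpose still produces the val/fill pair in the expected order.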