c++ - CUDA: Copy host data to shared memory array


I have a struct defined on both the host and the device. On the host I initialize an array of this struct with values.

    mystruct *h_s = (mystruct *) malloc(objsize * sizeof(mystruct));
    h_s[0] = ...

    mystruct *d_s;
    cudaMalloc(&d_s, objsize * sizeof(mystruct));
    cudaMemcpy(d_s, h_s, objsize * sizeof(mystruct), cudaMemcpyHostToDevice);
    init<<< gridsize, blocksize >>>(d_s);

In the kernel code I have 7 functions that should use the array. Some of them are global functions and some are simple device functions. For simplicity and efficiency I want to use a shared memory array.

    __shared__ mystruct *d_s;

    __global__ void init(mystruct *thestructarray) {
        // how do I allocate memory for d_s?
        // how do I copy thestructarray to d_s?
    }

So my question is: how can I allocate memory for the shared array and set its values from the function parameter?

Edit: I am trying to write the smallpt code in CUDA.

    struct sphere {
        double rad;       // radius
        vec p, e, c;      // position, emission, color
        refl_t refl;      // reflection type (diffuse, specular, refractive)

        sphere() {
            rad = 16.5;
            p = (vec(27,16.5,47) + vec(73,16.5,78))*0.5;
            e = vec();
            c = vec(0.75, 0.75, 0.75);
            refl = diff;
        }

        sphere(double rad_, vec p_, vec e_, vec c_, refl_t refl_):
            rad(rad_), p(p_), e(e_), c(c_), refl(refl_) {}

        __device__ double intersect(const ray &r) const { // returns distance, 0 if nohit
            vec op = p-r.o; // solve t^2*d.d + 2*t*(o-p).d + (o-p).(o-p)-r^2 = 0
            double t, eps=1e-4, b=op.dot(r.d), det=b*b-op.dot(op)+rad*rad;
            if (det<0) return 0; else det=sqrt(det);
            return (t=b-det)>eps ? t : ((t=b+det)>eps ? t : 0);
        }
    };

If you understand the scope and size limitations of shared memory, then the question appears to be:

  1. how to dynamically reserve memory for the shared memory array
  2. how to use that dynamic shared memory within the kernel

Your kernel becomes something like this:

    // unsized extern array: its size is supplied at kernel launch
    extern __shared__ mystruct d_s[];

    __global__ void init(mystruct *thestructarray) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;

        // load the shared memory array
        // assumes mystruct has correct copy assignment semantics
        d_s[threadIdx.x] = thestructarray[tid];

        __syncthreads();

        // each thread has now loaded one value into the
        // block-scoped shared array
    }

[Disclaimer: code written in the browser, never compiled or tested, and note the caveat in the comments about copy assignment]
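Since the question mentions several device functions that need the array, here is a minimal sketch of how a block-scoped shared array might be handed to a `__device__` helper as an ordinary pointer. The names `item`, `sum_values`, `block_sum` and `run_block_sum` are hypothetical and not part of the asker's code; `item` is a trivially copyable stand-in for `mystruct`, and the bounds check against `n` is an addition that the kernel above does not have.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical, trivially copyable stand-in for mystruct.
struct item {
    float value;
};

// A __device__ helper can take the shared array as a plain pointer.
__device__ float sum_values(const item *items, int count) {
    float total = 0.0f;
    for (int i = 0; i < count; ++i) total += items[i].value;
    return total;
}

__global__ void block_sum(const item *in, float *out, int n) {
    // Sized at launch by the third <<< >>> argument.
    extern __shared__ item s_items[];

    int block_start = (int)(blockDim.x * blockIdx.x);
    int tid = block_start + (int)threadIdx.x;

    // Guard against reading past the end of the input.
    if (tid < n) s_items[threadIdx.x] = in[tid];
    __syncthreads();

    // Number of valid elements held by this block.
    int remaining = n - block_start;
    int count = remaining < (int)blockDim.x ? remaining : (int)blockDim.x;

    // One thread per block sums the block's slice via the helper.
    if (threadIdx.x == 0) out[blockIdx.x] = sum_values(s_items, count);
}

// Hypothetical host wrapper: fills per_block with one sum per block.
bool run_block_sum(const std::vector<item> &h_in, int block_size,
                   std::vector<float> &per_block) {
    int n = (int)h_in.size();
    if (n == 0 || block_size <= 0) return false;
    int grid_size = (n + block_size - 1) / block_size;

    item *d_in = nullptr;
    float *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(item)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, grid_size * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(item), cudaMemcpyHostToDevice);

    // Third launch argument: bytes of dynamic shared memory per block.
    block_sum<<<grid_size, block_size, block_size * sizeof(item)>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();

    per_block.resize(grid_size);
    if (err == cudaSuccess) {
        err = cudaMemcpy(per_block.data(), d_out, grid_size * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

The helper never needs to know that the pointer refers to shared memory, so the same pattern should extend to the other device functions, as long as they are called after the `__syncthreads()` that follows the load.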

The calling host code needs to add an additional argument to the kernel call to reserve memory for the shared array:

    mystruct *h_s = (mystruct *) malloc(objsize * sizeof(mystruct));
    h_s[0] = ...

    mystruct *d_s;
    cudaMalloc(&d_s, objsize * sizeof(mystruct));
    cudaMemcpy(d_s, h_s, objsize * sizeof(mystruct), cudaMemcpyHostToDevice);
    init<<< gridsize, blocksize, blocksize * sizeof(mystruct) >>>(d_s);

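Note the third argument to the <<< >>> stanza of the kernel call. It specifies the number of bytes of shared memory reserved per block. There are hardware-dictated limits on the size of the shared memory allocations you can make, and even allocations within those limits can affect performance, because a large per-block reservation reduces the number of blocks that can be resident on a multiprocessor at once.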
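The per-block limit can be checked at run time before launching. The following is a minimal sketch using `cudaGetDeviceProperties` and the `sharedMemPerBlock` field of `cudaDeviceProp`; the helper name `shared_request_fits` and its parameters are hypothetical.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: reports whether block_size elements of elem_size
// bytes fit within the device's per-block shared memory limit.
// If limit_out is non-null, it receives that limit in bytes.
bool shared_request_fits(int device, size_t block_size, size_t elem_size,
                         size_t *limit_out) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    if (limit_out) *limit_out = prop.sharedMemPerBlock;
    return block_size * elem_size <= prop.sharedMemPerBlock;
}
```

A caller could test `shared_request_fits(0, blocksize, sizeof(mystruct), nullptr)` before the launch and fall back to a smaller block size if it returns false. Any statically declared shared memory in the kernel counts against the same limit, so this check is only a lower bound on what the kernel needs.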

Shared memory is a well documented feature of CUDA. I would recommend Mark Harris's blog and Stack Overflow questions on the topic as good starting points on the mechanics of shared memory in CUDA.

