c++ - CUDA: Copy host data to shared memory array
I have a struct defined on the host and on the device. On the host I initialize an array of this struct with values.

    mystruct *h_s = (mystruct *) malloc(objsize * sizeof(mystruct));
    h_s[0] = ...
    mystruct *d_s;
    cudaMalloc(&d_s, objsize * sizeof(mystruct));
    cudaMemcpy(d_s, h_s, objsize * sizeof(mystruct), cudaMemcpyHostToDevice);
    init<<< gridsize, blocksize >>>(d_s);

In the kernel I have 7 functions that should use the array. Some of them are global and some are simple device functions. For simplicity and efficiency, I want to use a shared memory array.
    __shared__ mystruct *d_s;

    __global__ void init(mystruct *thestructarray) {
        // how do I allocate memory for d_s?
        // how do I copy thestructarray to d_s?
    }

So my question is: how can I allocate memory for the shared array and set its values from the function parameter?
Edit: I am trying to write the smallpt code in CUDA.
    struct sphere {
        double rad;   // radius
        vec p, e, c;  // position, emission, color
        refl_t refl;  // reflection type (diffuse, specular, refractive)

        sphere() {
            rad = 16.5;
            p = (vec(27,16.5,47) + vec(73,16.5,78))*0.5;
            e = vec();
            c = vec(0.75, 0.75, 0.75);
            refl = diff;
        }

        sphere(double rad_, vec p_, vec e_, vec c_, refl_t refl_):
            rad(rad_), p(p_), e(e_), c(c_), refl(refl_) {}

        __device__ double intersect(const ray &r) const { // returns distance, 0 if no hit
            vec op = p-r.o; // solve t^2*d.d + 2*t*(o-p).d + (o-p).(o-p)-r^2 = 0
            double t, eps=1e-4, b=op.dot(r.d), det=b*b-op.dot(op)+rad*rad;
            if (det<0) return 0; else det=sqrt(det);
            return (t=b-det)>eps ? t : ((t=b+det)>eps ? t : 0);
        }
    };
If you understand the scope and size limitations of shared memory, then the question appears to be:
- how to dynamically reserve memory for a shared memory array
- how to use that dynamic shared memory within the kernel
Your kernel becomes something like this:

    // unsized array: the size in bytes is supplied at kernel launch
    extern __shared__ mystruct d_s[];

    __global__ void init(mystruct *thestructarray) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;

        // load shared memory array
        // assumes mystruct has correct copy assignment semantics
        d_s[threadIdx.x] = thestructarray[tid];
        __syncthreads();

        // each thread has loaded 1 value into the block
        // scoped shared array
    }

[Disclaimer: code written in browser, never compiled or tested, and note the caveat in the comments about copy assignment]
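If every block needs the whole array rather than one element per thread (which seems plausible for a short list of spheres, though that is an assumption about the use case), one variation is to let the threads of each block copy all the elements cooperatively. The following is only a sketch under that assumption. Item is a hypothetical, trivially copyable stand-in for mystruct, and s_items, radius_sum, sum_kernel and run_shared_demo are made-up names. It also illustrates that an ordinary __device__ function can read the file-scope shared array.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for mystruct; trivially copyable, so plain
// assignment into shared memory is safe.
struct Item {
    float radius;
    int   id;
};

// Unsized shared array; its byte size comes from the third launch argument.
// Declared at file scope so __global__ and __device__ functions can both use it.
extern __shared__ Item s_items[];

// A device function reading the block's shared copy.
__device__ float radius_sum(int count)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
        sum += s_items[i].radius;
    return sum;
}

__global__ void sum_kernel(const Item *items, int n, float *out)
{
    // Cooperative load: the block's threads stride over all n elements,
    // so n does not have to match the block size.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s_items[i] = items[i];
    __syncthreads();  // the copy must finish before any thread reads it

    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    out[tid] = radius_sum(n);
}

// Host helper (hypothetical): returns true if every thread computed the
// expected sum of the radii.
bool run_shared_demo()
{
    const int n = 8, block = 32, grid = 2;
    std::vector<Item> h(n);
    for (int i = 0; i < n; ++i) { h[i].radius = 1.0f; h[i].id = i; }

    Item  *d_items = nullptr;
    float *d_out   = nullptr;
    if (cudaMalloc(&d_items, n * sizeof(Item)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, grid * block * sizeof(float)) != cudaSuccess) {
        cudaFree(d_items);
        return false;
    }
    cudaMemcpy(d_items, h.data(), n * sizeof(Item), cudaMemcpyHostToDevice);

    // Each block gets room for the whole array: n * sizeof(Item) bytes.
    sum_kernel<<<grid, block, n * sizeof(Item)>>>(d_items, n, d_out);

    std::vector<float> out(grid * block);
    cudaError_t err = cudaMemcpy(out.data(), d_out, out.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_items);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (float v : out)
        if (v != 8.0f) return false;  // eight radii of 1.0 each
    return true;
}
```

Calling run_shared_demo() from main on a machine with a CUDA device should be enough to try it. The strided loop is used here so that the array length and the block size stay independent of each other.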
The calling host code needs to add an additional argument to the kernel call to reserve memory for the shared array:

    mystruct *h_s = (mystruct *) malloc(objsize * sizeof(mystruct));
    h_s[0] = ...
    mystruct *d_s;
    cudaMalloc(&d_s, objsize * sizeof(mystruct));
    cudaMemcpy(d_s, h_s, objsize * sizeof(mystruct), cudaMemcpyHostToDevice);
    init<<< gridsize, blocksize, blocksize * sizeof(mystruct) >>>(d_s);

Note the third argument in the <<< >>> stanza of the kernel call. It specifies the number of bytes of shared memory reserved per block. There are hardware-dictated limits on the size of the shared memory allocations you can make, and the amount you reserve might have an additional effect on performance beyond those hard limits, since larger per-block allocations can reduce how many blocks are resident on a multiprocessor at once.
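Since the per-block limit depends on the hardware, it may be worth querying it before launching. The sketch below uses cudaDeviceGetAttribute with cudaDevAttrMaxSharedMemoryPerBlock from the CUDA runtime API. The helper name fits_in_shared is hypothetical, and the check ignores any statically declared shared memory the kernel also uses, so treat it as a rough guard rather than a complete one.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: true if `bytes` of dynamic shared memory per block
// is within the default per-block limit reported for `device`.
// Returns false if the attribute cannot be queried (for example, no device).
bool fits_in_shared(size_t bytes, int device = 0)
{
    int limit = 0;  // limit in bytes
    if (cudaDeviceGetAttribute(&limit, cudaDevAttrMaxSharedMemoryPerBlock,
                               device) != cudaSuccess)
        return false;
    return bytes <= static_cast<size_t>(limit);
}
```

On the host this could guard the launch, for example by testing fits_in_shared(blocksize * sizeof(mystruct)) and falling back to reading the array from global memory when it returns false.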
Shared memory is a well-documented feature of CUDA. I would recommend Mark Harris's blog and the related Stack Overflow questions as good starting points on the mechanics of shared memory in CUDA.