C++ - Most efficient way to test a 256-bit YMM AVX register element for less than or equal to zero
I'm implementing a particle system using Intel AVX intrinsics. When the y-position of a particle is less than or equal to 0, I want to reset the particle.
The particle system is laid out in an SoA (structure of arrays) pattern like this:
    class particlesystem
    {
    private:
        float* mxposition;
        float* myposition;
        float* mzposition;
        // .... rest of code not important for this question
    };

My initial approach was to iterate through the myposition array and check for the case stated at the beginning. Perhaps performance improvements can be made to this approach?
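For reference, that scalar approach could look roughly like the sketch below. The function name `reset_non_positive_scalar` and the `reset_y` parameter are made up for illustration, and it assumes that "resetting" a particle just means writing a new y value, which the real system may not do.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar baseline (names are illustrative, not from the post):
// walk the y-position array and reset every element that is <= 0.
// Returns how many particles were reset.
std::size_t reset_non_positive_scalar(float* y, std::size_t n, float reset_y)
{
    assert(y != nullptr || n == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (y[i] <= 0.0f)   // particle at or below the ground plane
        {
            y[i] = reset_y; // assumed reset: just overwrite y
            ++count;
        }
    }
    return count;
}
```

This is one compare and one branch per particle, which is the cost any AVX version would be trying to beat.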
The question is whether there is an efficient way to implement this using AVX intrinsics. Thank you!
If elements <= 0 are relatively sparse, then one simple approach is to test 8 at a time using AVX and drop into scalar code when you identify a vector that contains one or more such elements, e.g.:
    #include <immintrin.h>                      // AVX intrinsics

    const __m256 vk0 = _mm256_setzero_ps();     // const vector of zeros
    int i;                                      // declared here because the tail loop reuses it

    for (i = 0; i + 8 <= n; i += 8)
    {
        __m256 vy = _mm256_loadu_ps(&myposition[i]);      // load 8 x floats
        __m256 vcmp = _mm256_cmp_ps(vy, vk0, _CMP_LE_OS); // compare for <= 0
        int mask = _mm256_movemask_ps(vcmp);    // get MS bits from comparison result
        if (mask != 0)                          // if any bits set
        {                                       // then we have 1 or more elements <= 0
            for (int k = 0; k < 8; ++k)         // test each element in vector
            {                                   // using scalar code...
                if ((mask & 1) != 0)
                {
                    // found element at index i + k
                    // do something with it...
                }
                mask >>= 1;
            }
        }
    }
    // deal with any remaining elements in case n is not a multiple of 8
    for (int j = i; j < n; ++j)
    {
        if (myposition[j] <= 0.0f)
        {
            // found element at index j
            // do something with it...
        }
    }

Of course, if the matching elements are not sparse, i.e. if you are typically finding one or more in every vector of 8, then this isn't going to buy you any performance gain. However, if the elements are sparse, such that most vectors can be skipped, then you should see a significant benefit.
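For the non-sparse case, a different technique from the one above may be worth trying: a branchless blend. The sketch below is not from the original answer. The function name `reset_non_positive_blend`, the `reset_y` parameter and the `AVX_TARGET` macro are made up for illustration. It assumes that a reset only needs to overwrite the y-position with a fixed value. The macro uses a GCC/Clang-specific attribute so the function builds without `-mavx`; with other compilers it expands to nothing and AVX has to be enabled through the compiler's own options.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>  // AVX intrinsics

// Assumption: on GCC/Clang, enable AVX for this one function only.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper (not from the original answer): overwrite every
// y-position that is <= 0 with reset_y, without a per-element branch.
AVX_TARGET
void reset_non_positive_blend(float* y, std::size_t n, float reset_y)
{
    assert(y != nullptr || n == 0);
    const __m256 vzero  = _mm256_setzero_ps();     // vector of zeros
    const __m256 vreset = _mm256_set1_ps(reset_y); // reset value in all 8 lanes
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256 vy   = _mm256_loadu_ps(&y[i]);               // load 8 x floats
        __m256 vcmp = _mm256_cmp_ps(vy, vzero, _CMP_LE_OS); // all-ones where y <= 0
        vy = _mm256_blendv_ps(vy, vreset, vcmp);            // pick vreset where mask is set
        _mm256_storeu_ps(&y[i], vy);                        // write all 8 lanes back
    }
    // scalar tail in case n is not a multiple of 8
    for (; i < n; ++i)
    {
        if (y[i] <= 0.0f)
            y[i] = reset_y;
    }
}
```

This does the same amount of work for every vector regardless of how many elements match, so it is unlikely to beat the skip-ahead version when matches are rare. If a reset has to touch more than the y-position, the mask-and-scalar loop is probably still the easier fit.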