FastC++: Coding Cpp Efficiently: Fast 3D-vector matrix transformation using SSE4

Thursday, July 28, 2011

Fast 3D-vector matrix transformation using SSE4

In this blog-post we'd like to show how to efficiently transform multiple 3D vectors using an affine transformation matrix. Each vector has three coordinates (x, y and z) and the matrix consists of three rows each with 4 elements (3 for rotation/scale + 1 for translation). In order to multiply a 3-element vector with a 3x4 matrix, we add an additional 1 at the end of the 3D vector:

| T00 T01 T02 T03 |   | x |   | x' |
| T10 T11 T12 T13 | * | y | = | y' |
| T20 T21 T22 T23 |   | z |   | z' |
                      | 1 |

Using SSE, the matrix-vector product can be performed straight forward. However, we want to process an array of float3 elements and exploit 128-bit and 64-bit loads and stores. Therefore, we process two float3 elements at the same time by using loop unrolling. At each iteration, we load the vectors, set the 4-th element to 1, matrix-multiply and store two results.

static const __m128 MM_ONES = _mm_set1_ps(1.0f);

void multhom(float3* dst, const float4* matrix, const float3* src, int num) const
{
 const __m128 r0 = _mm_loadu_ps(matrix+0);
 const __m128 r1 = _mm_loadu_ps(matrix+1);
 const __m128 r2 = _mm_loadu_ps(matrix+2);

 // unrolled loop
 int i = 0;
 for(; i < ROUND_DOWN(num, 2); i+=2)
 {
  // Load [X0,Y0,Z0,1] into xyz1  
  // Load [X1,Y1,Z1,1] into uvw1  
  __m128 xyzu = _mm_loadu_ps(&src[i].x);  // X0 Y0 Z0 X1
  __m128 vw11 = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)&src[i+1].y); // Y1 Z1 1 1    
         vw11 = _mm_loadh_pi(vw11, (const __m64*)&MM_ONES);  
  // convert xyzuvw11 to xyz1 and uvw1
  __m128 zu1v = _mm_shuffle_ps(xyzu, vw11, _MM_SHUFFLE(0, 2, 3, 2)); // Z0 X1 1 Y1  
  __m128 xyz1 = _mm_shuffle_ps(xyzu, zu1v, _MM_SHUFFLE(2, 0, 1, 0)); // X0 Y0 Z0 1  
  __m128 uvw1 = _mm_shuffle_ps(zu1v, vw11, _MM_SHUFFLE(3, 1, 3, 1)); // X1 Y1 Z1 1

  // Perform matrix multiplication
  __m128 x = _mm_mul_ps(xyz1, r0);
  __m128 y = _mm_mul_ps(xyz1, r1);
  __m128 z = _mm_mul_ps(xyz1, r2);

  __m128 u = _mm_mul_ps(uvw1, r0);
  __m128 v = _mm_mul_ps(uvw1, r1);
  __m128 w = _mm_mul_ps(uvw1, r2);

  __m128 xy = _mm_hadd_ps(x, y);
  __m128 uv = _mm_hadd_ps(u, v);
  __m128 zw = _mm_hadd_ps(z, w);

  __m128 xyzw = _mm_hadd_ps(xy, zw);
  __m128 uvzw = _mm_hadd_ps(uv, zw);

  // Prepare the 6 output values (x'y'z'u'v'w')
  __m128 out_zuvw = _mm_shuffle_ps(uvzw, uvzw, _MM_SHUFFLE(3, 1, 0, 2));
  __m128 out_xyzu = _mm_movelh_ps(xyzw, out_zuvw);
    
  _mm_storeu_ps(&dst[i].x, out_xyzu);           // write X,Y,Z,U
  _mm_storeh_pi((__m64*)&dst[i+1].y, out_zuvw); // write V,W
 }
 // tail for loop unrolling
 if (i < num)
 {
  __m128 v = _mm_loadl_pi(MM_ONES, (__m64*)&src[i].x);
  v = _mm_insert_ps(v, _mm_load_ss(&src[i].z), 0x20);
  __m128 x = _mm_mul_ps(r0, v);
  __m128 y = _mm_mul_ps(r1, v);
  __m128 z = _mm_mul_ps(r2, v);
  __m128 xy = _mm_hadd_ps(x, y);
  __m128 zw = _mm_hadd_ps(z, y);
  __m128 out = _mm_hadd_ps(xy, zw); // x,y,z,?
  _mm_storel_pi((__m64*)&dst[i].x, out);
  _mm_store_ss(&dst[i].z, _mm_movehl_ps(out, out));  
 }
}

This methods allows to transform an arbitrary number of float3 vectors. The following example shows how to call it:

float4 matrix[3] = {/* fill the matrix */};
const float3* src = /* pointer to input values */;
float3* dst = /* pointer to output values */;

multhom(dst, matrix, src, numElements);

5 comments:

Thri KreenAugust 1, 2011 at 7:36 PM
Very interesting. Can this be easily adapted to SSE3?
ReplyDelete
Replies
theowl84August 2, 2011 at 8:52 AM
Yes, it is not so hard to do. The only SSE4 command is the _mm_insert_ps(), so just replace its functionality using SSE3 intrinsics. I rewrote the save-part of the unrolled loop using shuffle and unpacklo. The v-vector of the tail can be loaded using something like:
__m128 xy00 = _mm_loadl_pi(_mm_setzero_ps(), (__m64*)&src[i].x);
__m128 z100 = _mm_unpacklo_ps(_mm_load_ss(&src[i].z), MM_ONES);
__m128 v = _mm_movelh_ps(xy00, z100);
which takes a few more cycles, but since it's called at most once, that should not cause any performance problem.
ReplyDelete
Replies
theowl84August 8, 2011 at 8:36 AM
Note that the code using the _mm_loadl_pi/_mm_insert_ps statement for loading a homogeneous vector might not work in Release build with Visual C++ 2010 compiler. There is a bug in the compiler which I reported recently: https://connect.microsoft.com/VisualStudio/feedback/details/683152/wrong-release-code-generation-for-sse4-intrinsics
ReplyDelete
Replies
AnonymousDecember 19, 2012 at 12:39 PM
theowl84,
Thanks for your researches.
I repeated your code.
Now I stay in a state of shock:
fpu quicker than sse...

for(size_t i = 0; i < num; i++)
{
pDest[i].x = pSrc[i].x*pMatrix[0].x+pSrc*pMatrix[i].y[0].y+pSrc[i].z*pMatrix[0].z + pMatrix[0].u;
pDest[i].y = pSrc[i].x*pMatrix[1].x+pSrc[i].y*pMatrix[1].y+pSrc[i].z*pMatrix[1].z + pMatrix[1].u;
pDest[i].z = pSrc[i].x*pMatrix[2].x+pSrc[i].y*pMatrix[2].y+pSrc[i].z*pMatrix[2].z + pMatrix[2].u;}

nCicl = 1000 nPoints = 1000000
fpu time 15 sec
sse time 17 sec

ReplyDelete
Replies

Add comment

FastC++: Coding Cpp Efficiently

Thursday, July 28, 2011

Fast 3D-vector matrix transformation using SSE4

5 comments:

About Me

Useful Links