Sunday, March 27, 2011

Loading two consecutive 3D vectors into SSE registers

In the previous post we showed how to load a single float3 vector into an SSE register. This required either one 64-bit and one 32-bit load or, if the data is not aligned on an 8-byte address, three 32-bit loads. This post shows how to use more efficient loads when loading two float3 values into separate SSE registers.

Suppose we are given an array of two float3 vectors and want to add them using the SSE intrinsic _mm_add_ps. Loading each vector individually, we could use the following code:
float3 v[2] = { /* some data */ };
__m128 sum = _mm_add_ps(LoadFloat3(v[0]), LoadFloat3(v[1]));
// ...
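For readers without the previous post at hand, here is a minimal sketch of what the individual-load variant could look like. The float3 layout (three tightly packed floats) and the body of LoadFloat3 are assumptions made for illustration; the helper from the previous post may be implemented differently. This version uses the three 32-bit loads that work for any alignment, and SumIndividually is a hypothetical wrapper name.

```cpp
#include <xmmintrin.h>

// Assumed layout: three packed floats, 12 bytes, no padding.
struct float3 { float x, y, z; };

// Hypothetical stand-in for the LoadFloat3 helper of the previous post.
// Uses three 32-bit loads, so it makes no alignment assumptions.
static inline __m128 LoadFloat3(const float3& v)
{
    __m128 x  = _mm_load_ss(&v.x);      // = {X, 0, 0, 0}
    __m128 y  = _mm_load_ss(&v.y);      // = {Y, 0, 0, 0}
    __m128 z  = _mm_load_ss(&v.z);      // = {Z, 0, 0, 0}
    __m128 xy = _mm_unpacklo_ps(x, y);  // = {X, Y, 0, 0}
    return _mm_movelh_ps(xy, z);        // = {X, Y, Z, 0}
}

// Adds v[0] and v[1], loading each vector on its own (six 32-bit loads).
static inline __m128 SumIndividually(const float3* v)
{
    return _mm_add_ps(LoadFloat3(v[0]), LoadFloat3(v[1]));
}
```

With this sketch the two-vector sum costs six scalar loads in total, which is the baseline the approach below improves on.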
Since both vectors are stored consecutively in memory, we can instead use one 128-bit load, two additional 32-bit loads, and two shuffles to fill our registers:
float3 v[2] = { /* some data */ };
__m128 xyza = _mm_loadu_ps(&v[0].x); // = {X0, Y0, Z0, X1}
__m128 b =    _mm_load_ss(&v[1].y);  // = {Y1, 0, 0, 0}
__m128 c =    _mm_load_ss(&v[1].z);  // = {Z1, 0, 0, 0}
__m128 ab =  _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0));
// ab = {Y1, Y1, X1, X0}
__m128 abc = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));
// abc = {X1, Y1, Z1, 0}
__m128 sum = _mm_add_ps(xyza, abc);
// ...
Note that the uppermost float in xyza is equal to v[1].x. Since we only add two 3-element vectors, we do not care about the content of the fourth element (in sum it ends up holding v[1].x). The code above does not require the data to be aligned. If the address of v is 16-byte aligned, xyza can instead be loaded with the faster _mm_load_ps.
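The listing above can be wrapped into a small self-contained function to check the lane contents. This is only a sketch: the float3 definition is assumed to be three tightly packed floats (the 128-bit load relies on v[1].x directly following v[0].z), and the names SumTwoFloat3 and StoreFloat3 are hypothetical helpers, not part of the original code.

```cpp
#include <xmmintrin.h>

// Assumed layout: three packed floats, 12 bytes, no padding.
struct float3 { float x, y, z; };
static_assert(sizeof(float3) == 12, "float3 must be tightly packed");

// Adds v[0] and v[1] using one 128-bit load and two 32-bit loads.
// Lanes 0..2 hold the sum; lane 3 holds v[1].x and should be ignored.
static inline __m128 SumTwoFloat3(const float3* v)
{
    __m128 xyza = _mm_loadu_ps(&v[0].x);  // = {X0, Y0, Z0, X1}
    __m128 b    = _mm_load_ss(&v[1].y);   // = {Y1, 0, 0, 0}
    __m128 c    = _mm_load_ss(&v[1].z);   // = {Z1, 0, 0, 0}
    // Low two lanes come from b, high two lanes from xyza.
    __m128 ab   = _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0));
    // ab = {Y1, Y1, X1, X0}
    // Low two lanes come from ab, high two lanes from c.
    __m128 abc  = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));
    // abc = {X1, Y1, Z1, 0}
    return _mm_add_ps(xyza, abc);
}

// Hypothetical helper: write back only the three meaningful lanes.
static inline float3 StoreFloat3(__m128 r)
{
    float tmp[4];
    _mm_storeu_ps(tmp, r);
    return float3{ tmp[0], tmp[1], tmp[2] };
}
```

For v = { {1, 2, 3}, {10, 20, 30} }, StoreFloat3(SumTwoFloat3(v)) yields {11, 22, 33}, while the discarded fourth lane of the register contains 10, the value of v[1].x.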

When four 3D vectors lie consecutively in memory, all 12 float values can be loaded with just three 128-bit loads. This topic is covered in a follow-up post.
