## Sunday, March 27, 2011

In the previous post we showed how to load a single `float3` vector into an SSE register. This required either one 64-bit and one 32-bit load, or three 32-bit loads if the data is not aligned on an 8-byte address. This blog entry shows how to use more efficient loads when loading two `float3` values into separate SSE registers.

Suppose we are given an array of two `float3` vectors and want to add them using the SSE intrinsic `_mm_add_ps`. One option is to load each vector individually, using the technique from the previous post.
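As a point of comparison, here is a minimal sketch of that per-vector approach. The `float3` struct and the helper names `load_float3` and `add_float3_pair_separate` are assumptions made for illustration, and the sketch uses the three 32-bit loads variant because it works for any alignment:

```c
#include <xmmintrin.h>

/* Assumed layout: three tightly packed floats (12 bytes, no padding). */
typedef struct { float x, y, z; } float3;

/* Hypothetical helper: load one float3 with three 32-bit loads. */
static __m128 load_float3(const float3 *p)
{
    __m128 x  = _mm_load_ss(&p->x);      /* {x, 0, 0, 0} */
    __m128 y  = _mm_load_ss(&p->y);      /* {y, 0, 0, 0} */
    __m128 z  = _mm_load_ss(&p->z);      /* {z, 0, 0, 0} */
    __m128 xy = _mm_unpacklo_ps(x, y);   /* {x, y, 0, 0} */
    return _mm_movelh_ps(xy, z);         /* {x, y, z, 0} */
}

/* Load both vectors separately, then add them: six loads in total. */
static __m128 add_float3_pair_separate(const float3 v[2])
{
    return _mm_add_ps(load_float3(&v[0]), load_float3(&v[1]));
}
```

This costs six loads for the two vectors, which is the baseline the rest of this entry tries to improve on.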
Since both vectors are stored consecutively in memory, we can instead use one 128-bit load and two additional 32-bit loads, followed by two shuffles, to fill our registers:
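```c
// v is the array of two float3 vectors; register contents are listed from lowest to highest element
__m128 xyza = _mm_loadu_ps(&v[0].x); // = {X0, Y0, Z0, X1}
__m128 b    = _mm_load_ss(&v[1].y);  // = {Y1, 0, 0, 0}
__m128 c    = _mm_load_ss(&v[1].z);  // = {Z1, 0, 0, 0}
// take elements 0, 0 from b and elements 3, 0 from xyza
__m128 ab   = _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0));
// ab = {Y1, Y1, X1, X0}
// take elements 2, 0 from ab and elements 0, 3 from c
__m128 abc  = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));
// abc = {X1, Y1, Z1, 0}
```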
Note that the uppermost float in `xyza` is equal to `v[1].x`. Since we only add two 3-element vectors, we do not care about the content of the fourth element. The code above does not require the data to be aligned. If the address of `v` is 16-byte aligned, `xyza` can be loaded with the faster `_mm_load_ps` instead.
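Putting the pieces together, the whole operation might be wrapped up as follows. This is only a sketch: the `float3` struct layout (12 bytes, no padding) and the function name `add_float3_pair` are assumptions, and it always uses the unaligned load so that it works for any address.

```c
#include <xmmintrin.h>

/* Assumed layout: three tightly packed floats (12 bytes, no padding). */
typedef struct { float x, y, z; } float3;

/* Hypothetical helper: returns v[0] + v[1] in elements 0..2.
   Element 3 holds X1 + 0 and should be ignored by the caller. */
static __m128 add_float3_pair(const float3 v[2])
{
    /* One 128-bit load covers all of v[0] plus v[1].x. */
    __m128 xyza = _mm_loadu_ps(&v[0].x);  /* {X0, Y0, Z0, X1} */
    __m128 b    = _mm_load_ss(&v[1].y);   /* {Y1, 0, 0, 0} */
    __m128 c    = _mm_load_ss(&v[1].z);   /* {Z1, 0, 0, 0} */

    /* Move X1 next to Y1, then assemble {X1, Y1, Z1, 0}. */
    __m128 ab   = _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0)); /* {Y1, Y1, X1, X0} */
    __m128 abc  = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));   /* {X1, Y1, Z1, 0} */

    return _mm_add_ps(xyza, abc);         /* {X0+X1, Y0+Y1, Z0+Z1, X1} */
}
```

Compared with loading each vector on its own, this needs three loads instead of six, at the cost of two shuffles. If `v` is known to be 16-byte aligned, `_mm_loadu_ps` can be swapped for `_mm_load_ps`.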