Saturday, March 26, 2011

Loading a 3D vector into an SSE register

In this blog entry I will show you how to load a three-element float vector into an SSE register using C++ intrinsics. This tutorial is based on the float3 datatype, which holds three floats (see Common Datatypes). An SSE register can store four float values, and there are intrinsics to load one, two or all four values - but not three. Therefore we need to split the load into two parts and combine them. First, we use plain SSE code:
inline __m128 LoadFloat3(const float3& value) {
 // load the x and y element of the float3 vector using a 64 bit load
 // and set the upper 64 bits to zero (00YX)
 __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)&value);
 // now load the z element using a 32 bit load (000Z)
 __m128 z = _mm_load_ss(&value.z);

 // finally, combine both by moving the z component to the high part
 // of the register, while keeping x and y components in the low part
 return _mm_movelh_ps(xy, z); // (0ZYX)
}
Note that we need to pass an additional value to _mm_loadl_pi in order to define the high part of the result. We pass zero, which we have to generate with an additional intrinsic, _mm_setzero_ps. We can avoid this overhead by using an SSE2 intrinsic that automatically sets the high part of the loaded register to zero: _mm_loadl_epi64. This intrinsic is intended for loading one 64-bit integer, but since we just load arbitrary binary data into the register, this does not matter for our application (except that we need to tell the compiler that we are actually dealing with floats by casting the register later on).
inline __m128 LoadFloat3(const float3& value) {
 // load x, y with a 64 bit integer load (00YX)
 __m128i xy = _mm_loadl_epi64((const __m128i*)&value);

 // now load the z element using a 32 bit float load (000Z)
 __m128 z = _mm_load_ss(&value.z);

 // we now need to cast the __m128i register into a __m128 one (0ZYX)
 return _mm_movelh_ps(_mm_castsi128_ps(xy), z);
}
The compiler now generates only three assembly instructions from the SSE2 code (MOVQ, MOVSS, MOVLHPS) when loading a single float3 value. However, a 64-bit MOVQ load can be slow when the address of the data is not 8-byte aligned. Therefore, if you cannot guarantee the alignment of the address, it is generally faster to use three 32-bit loads, which require only 4-byte alignment (most compilers ensure this alignment for float values and for structs containing floats).
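inline __m128 LoadFloat3(const float3& value) {
 // load each element with its own 32 bit load (000X), (000Y), (000Z)
 __m128 x = _mm_load_ss(&value.x);
 __m128 y = _mm_load_ss(&value.y);
 __m128 z = _mm_load_ss(&value.z);
 // move y into the high part of x (0Y0X)
 __m128 xy = _mm_movelh_ps(x, y);
 // pick X and Y from xy, then Z and a zero from z (0ZYX)
 return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0));
}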
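To convince yourself that all three variants produce the same register content, the following minimal sketch puts them side by side and copies the result back to memory. The float3 struct here is only an assumed stand-in for the datatype from Common Datatypes (three tightly packed floats), and the function names LoadFloat3_SSE, LoadFloat3_SSE2, LoadFloat3_3x32 and the helper LanesEqual are made up for this example.

```cpp
#include <emmintrin.h> // SSE2; also provides the SSE float intrinsics

// Assumed stand-in for the post's float3: three tightly packed floats.
struct float3 { float x, y, z; };

// Variant 1: plain SSE, 64 bit load of x and y plus a 32 bit load of z.
inline __m128 LoadFloat3_SSE(const float3& value) {
    __m128 xy = _mm_loadl_pi(_mm_setzero_ps(),
                             reinterpret_cast<const __m64*>(&value));
    __m128 z = _mm_load_ss(&value.z);
    return _mm_movelh_ps(xy, z); // (0ZYX)
}

// Variant 2: SSE2, the 64 bit integer load zeroes the high part itself.
inline __m128 LoadFloat3_SSE2(const float3& value) {
    __m128i xy = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&value));
    __m128 z = _mm_load_ss(&value.z);
    return _mm_movelh_ps(_mm_castsi128_ps(xy), z); // (0ZYX)
}

// Variant 3: three 32 bit loads, needs only 4-byte alignment.
inline __m128 LoadFloat3_3x32(const float3& value) {
    __m128 x = _mm_load_ss(&value.x);
    __m128 y = _mm_load_ss(&value.y);
    __m128 z = _mm_load_ss(&value.z);
    __m128 xy = _mm_movelh_ps(x, y); // (0Y0X)
    return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0)); // (0ZYX)
}

// Hypothetical helper: stores the register to memory and compares
// lanes 0..3 against the expected x, y, z, w values.
inline bool LanesEqual(__m128 r, float x, float y, float z, float w) {
    float lanes[4];
    _mm_storeu_ps(lanes, r);
    return lanes[0] == x && lanes[1] == y && lanes[2] == z && lanes[3] == w;
}
```

For a float3 holding (1, 2, 3), LanesEqual(LoadFloat3_SSE(v), 1.0f, 2.0f, 3.0f, 0.0f) should hold, and likewise for the other two variants. Note that the (0ZYX) notation in the comments lists the highest lane first, so X ends up in lane 0.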
When loading an array of float3 values that are stored in consecutive memory, it is possible to optimize the load operations by loading 6 or even 12 floats in a batch. This topic will be handled in a future post.
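Until then, here is a rough sketch of what such a batch load could look like. This is only one possible approach and not necessarily the one a later post would use. It assumes that float3 is three tightly packed floats, that at least four of them are stored consecutively, and that the fourth lane of each result should be zero. The name LoadFloat3x4 is made up for this example.

```cpp
#include <emmintrin.h> // SSE2; also provides the SSE float intrinsics

// Assumed stand-in for the post's float3: three tightly packed floats.
struct float3 { float x, y, z; };
static_assert(sizeof(float3) == 3 * sizeof(float), "float3 must be tightly packed");

// Loads four consecutive float3 values (12 floats) with three unaligned
// 128 bit loads and rearranges them into four (0ZYX) registers.
inline void LoadFloat3x4(const float3* src, __m128 out[4]) {
    const float* p = reinterpret_cast<const float*>(src);

    // Lane 0 is listed first in these comments.
    __m128 a = _mm_loadu_ps(p + 0); // x0 y0 z0 x1
    __m128 b = _mm_loadu_ps(p + 4); // y1 z1 x2 y2
    __m128 c = _mm_loadu_ps(p + 8); // z2 x3 y3 z3

    // Mask that keeps lanes 0..2 and clears lane 3.
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));

    // First vector: already in place, only lane 3 (x1) has to go.
    out[0] = _mm_and_ps(a, mask);

    // Second vector: gather x1 from a and y1, z1 from b, then reorder.
    __m128 t = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 3, 3)); // x1 x1 y1 z1
    __m128 v1 = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 2, 0)); // x1 y1 z1 z1
    out[1] = _mm_and_ps(v1, mask);

    // Third vector: x2, y2 from b and z2 from c.
    __m128 v2 = _mm_shuffle_ps(b, c, _MM_SHUFFLE(0, 0, 3, 2)); // x2 y2 z2 z2
    out[2] = _mm_and_ps(v2, mask);

    // Fourth vector: shift c down by one lane.
    __m128 v3 = _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 3, 2, 1)); // x3 y3 z3 z3
    out[3] = _mm_and_ps(v3, mask);
}
```

The three loads read exactly the 48 bytes occupied by the four vectors, so nothing beyond the array is touched. If the fourth lane does not matter for your computation, the _mm_and_ps calls can be dropped.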


  1. why not
    __m128 vec = _mm_setr_ps(value.x,value.y,value.z,0.0f);

    or is that slow?

    1. I don't know whether it is slower or not. Basically it does the same as the code above but uses four loads (the 0 must be set as well). _mm_setr_ps is a convenience intrinsic that the compiler expands into multiple instructions. However, it is often more convenient to use _mm_setr_ps (or _mm_set_ps) since it is only a one-liner.

  2. Without 0
    __m128 vec = _mm_setr_ps(value.x,value.y,value.z,value.z);

    Thanks for blog!!!