Wednesday, February 15, 2012

Calculating the Length of a 3D Vector using SSE

This article explains how to calculate the length of a single 3D float vector stored in a SSE register. The length or norm of a vector is defined as the square root of the dot product of the vector with itself:
|v| = length3(v) = sqrt(v.x^2 + v.y^2 + v.z^2)
A single SSE register can be used to hold a 3D vector (the highest 32 bits are unused). In a previous article we show how to load a struct containing 3 float values into a SSE register. You may as well use the _mm_setr_ps(x, y, z, 0) intrinsic. SSE4 introduced the DPPS instruction (accessible via the _mm_dp_ps intrinsic) which allows to calculate the dot product of up to four float values. We will now use this intrinsic to calculate the length of a 3D vector with minimal instructions.
// sqrt(x^2 + y^2 + z^2)
inline float length3(__m128 v)
  return _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(v, v, 0x71)));
Note that _mm_dp_ps uses three parameters: two times the vector v and the value 0x71 which tells the CPU to multiply the three lower floats of v with themselves and store the sum of the product in the lowest float of the result (0x71 = 0111 0001). Since only the lowest float of the result is used, we can use the single float square root _mm_sqrt_ss and convert the result to the native C++ datatype.
Another common function is the normalization of a 3D vector:
normalize(v) = v / |v|
Thus we need to calculate the norm of vector v and divide each component of v by this value. Instead of shuffling and dividing it is possible to use more specialized SSE intrinsics for this task: _mm_dp_ps can be used to store the dot product in all four floats of a register and there is a special reciprocal square root, which can be computed faster than the square root itself and allows to use a product instead of the much slower division.
// (x,y,z) / sqrt(x^2 + y^2 + z^2)
inline __m128 normalize3(__m128 v)
  __m128 inverse_norm = _mm_rsqrt_ps(_mm_dp_ps(v, v, 0x77));
  return _mm_mul_ps(v, inverse_norm);
Note that we use the parameter 0x77 for _mm_dp_ps which tells the CPU to copy the dot product three times in the lower three registers, thus we also can use the packed version of the reciprocal square root _mm_rsqrt_ps. The reason for the speed of the reciprocal square root is that it only is a good approximation and thus not entirely accurate. It therefore will happen, that the length of the resulting vector is only close to 1 (length3(normalize3(v)) <> 1). A more accurate version of the normalized vector is obtained by
// (x,y,z) / sqrt(x^2 + y^2 + z^2)
inline __m128 normalize3_accurate(__m128 v)
  __m128 norm = _mm_sqrt_ps(_mm_dp_ps(v, v, 0x7F));
  return _mm_div_ps(v, norm);
which is much slower though. Note that here we use 0x7F to avoid division by 0.


  1. I have been gaining a lot of usable and exemplifying stuff in your website.

  2. Sehr interessantes Blog hast Du.


  3. The only issue I find with this is that it does not take into account the distinct possibility of the vector being of nearly zero length, thus there is a big potential for issues with division by zero and use of infinity, in the "accurate" normalization.