## Wednesday, February 15, 2012

### Calculating the Length of a 3D Vector using SSE

This article explains how to calculate the length of a single 3D float vector stored in a SSE register. The length or norm of a vector is defined as the square root of the dot product of the vector with itself:
```|v| = length3(v) = sqrt(v.x^2 + v.y^2 + v.z^2)
```
A single SSE register can be used to hold a 3D vector (the highest 32 bits are unused). In a previous article we show how to load a `struct` containing 3 float values into a SSE register. You may as well use the `_mm_setr_ps(x, y, z, 0)` intrinsic. SSE4 introduced the `DPPS` instruction (accessible via the `_mm_dp_ps` intrinsic) which allows to calculate the dot product of up to four float values. We will now use this intrinsic to calculate the length of a 3D vector with minimal instructions.
```// sqrt(x^2 + y^2 + z^2)
inline float length3(__m128 v)
{
return _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(v, v, 0x71)));
}
```
Note that `_mm_dp_ps` uses three parameters: two times the vector `v` and the value `0x71` which tells the CPU to multiply the three lower floats of `v` with themselves and store the sum of the product in the lowest float of the result (`0x71 = 0111 0001`). Since only the lowest float of the result is used, we can use the single float square root `_mm_sqrt_ss` and convert the result to the native `C++` datatype.
Another common function is the normalization of a 3D vector:
```normalize(v) = v / |v|
```
Thus we need to calculate the norm of vector `v` and divide each component of `v` by this value. Instead of shuffling and dividing it is possible to use more specialized SSE intrinsics for this task: `_mm_dp_ps` can be used to store the dot product in all four floats of a register and there is a special reciprocal square root, which can be computed faster than the square root itself and allows to use a product instead of the much slower division.
```// (x,y,z) / sqrt(x^2 + y^2 + z^2)
inline __m128 normalize3(__m128 v)
{
__m128 inverse_norm = _mm_rsqrt_ps(_mm_dp_ps(v, v, 0x77));
return _mm_mul_ps(v, inverse_norm);
}
```
Note that we use the parameter `0x77` for `_mm_dp_ps` which tells the CPU to copy the dot product three times in the lower three registers, thus we also can use the packed version of the reciprocal square root `_mm_rsqrt_ps`. The reason for the speed of the reciprocal square root is that it only is a good approximation and thus not entirely accurate. It therefore will happen, that the length of the resulting vector is only close to 1 (`length3(normalize3(v)) <> 1`). A more accurate version of the normalized vector is obtained by
```// (x,y,z) / sqrt(x^2 + y^2 + z^2)
inline __m128 normalize3_accurate(__m128 v)
{
__m128 norm = _mm_sqrt_ps(_mm_dp_ps(v, v, 0x7F));
return _mm_div_ps(v, norm);
}
```
which is much slower though. Note that here we use `0x7F` to avoid division by 0.

#### 1 comment:

1. I have been gaining a lot of usable and exemplifying stuff in your website.
Allvectors.Com