|v| = length3(v) = sqrt(v.x^2 + v.y^2 + v.z^2)

A single SSE register can hold a 3D vector (the highest 32 bits are unused). A previous article showed how to load a `struct` containing three float values into an SSE register. You can also use the `_mm_setr_ps(x, y, z, 0)` intrinsic.
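As a small illustration of the register layout, the following sketch packs three floats with `_mm_setr_ps` and reads them back. The helper names `make_vec3` and `store4` are hypothetical and only serve this example; the point is that `_mm_setr_ps` places its first argument in the lowest float of the register.

```cpp
#include <xmmintrin.h>  // SSE: _mm_setr_ps, _mm_storeu_ps

// Hypothetical helper (not from the article): packs (x, y, z) into one
// register with x in the lowest float and 0 in the unused highest float.
inline __m128 make_vec3(float x, float y, float z) {
    return _mm_setr_ps(x, y, z, 0.0f);
}

// Hypothetical helper: copies the four floats of a register to memory,
// lowest float first, so they can be inspected from plain C++.
inline void store4(__m128 v, float out[4]) {
    _mm_storeu_ps(out, v);
}
```

Note that `_mm_set_ps` takes its arguments in the opposite order (highest float first), which is why `_mm_setr_ps` is the more natural choice here.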
SSE4.1 introduced the `DPPS` instruction (accessible via the `_mm_dp_ps` intrinsic), which computes the dot product of up to four float values. We will now use this intrinsic to calculate the length of a 3D vector with a minimal number of instructions.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps

// sqrt(x^2 + y^2 + z^2)
inline float length3(__m128 v) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(v, v, 0x71)));
}
```

Note that `_mm_dp_ps` takes three parameters: the vector `v` twice and the mask `0x71`, which tells the CPU to multiply the three lower floats of `v` with themselves and store the sum of the products in the lowest float of the result (`0x71 = 0111 0001`). Since only the lowest float of the result is used, we can use the scalar square root `_mm_sqrt_ss` and convert the result to a native C++ `float` with `_mm_cvtss_f32`.
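To make the meaning of the mask concrete, here is a plain C++ model of what `_mm_dp_ps(a, b, mask)` computes. It is a sketch for illustration only: `dp_ps_model` is a hypothetical name, and the model adds the products one after another, so its rounding may differ slightly from the real instruction.

```cpp
#include <array>
#include <cassert>

// Scalar model of _mm_dp_ps(a, b, mask). Illustration only, not a
// replacement for the intrinsic. Index 0 is the lowest float.
inline std::array<float, 4> dp_ps_model(const std::array<float, 4>& a,
                                        const std::array<float, 4>& b,
                                        int mask) {
    assert(mask >= 0 && mask <= 0xFF);  // the intrinsic takes an 8-bit immediate

    // Bits 4..7 select which products a[i] * b[i] enter the sum.
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (mask & (0x10 << i)) sum += a[i] * b[i];
    }

    // Bits 0..3 select which floats of the result receive the sum;
    // all other floats are set to 0.
    std::array<float, 4> result{};
    for (int i = 0; i < 4; ++i) {
        if (mask & (0x01 << i)) result[i] = sum;
    }
    return result;
}
```

With `v = (1, 2, 3, 4)`, the mask `0x71` yields `(14, 0, 0, 0)`: the fourth component is excluded from the sum and only the lowest float is written.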
Another common function is the normalization of a 3D vector:

normalize(v) = v / |v|

Thus we need to calculate the norm of the vector `v` and divide each component of `v` by this value. Instead of shuffling and dividing, we can use more specialized SSE intrinsics for this task: `_mm_dp_ps` can store the dot product in all four floats of a register, and there is a special reciprocal square root, which is computed faster than the square root itself and lets us use a multiplication instead of the much slower division.

```cpp
// (x,y,z) / sqrt(x^2 + y^2 + z^2)
inline __m128 normalize3(__m128 v) {
    __m128 inverse_norm = _mm_rsqrt_ps(_mm_dp_ps(v, v, 0x77));
    return _mm_mul_ps(v, inverse_norm);
}
```

Note that we use the mask `0x77` for `_mm_dp_ps`, which tells the CPU to copy the dot product into the three lower floats of the result, so we can also use the packed version of the reciprocal square root, `_mm_rsqrt_ps`.
The reciprocal square root is fast because it is only an approximation and thus not entirely accurate. As a result, the length of the resulting vector is only close to 1 (in general `length3(normalize3(v)) != 1`).
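The effect on the length can be made explicit. Assume the reciprocal square root returns `(1 + ε) / |v|` for some small relative error `ε`, and ignore the rounding of the dot product and of the multiplication. The bound on `ε` used below is the maximum relative error Intel documents for `RSQRTPS`; other CPUs may behave differently.

```latex
\[
\bigl|\,\mathrm{normalize3}(v)\,\bigr|
  = |v| \cdot \frac{1 + \varepsilon}{|v|}
  = 1 + \varepsilon,
\qquad
|\varepsilon| \le 1.5 \cdot 2^{-12} \approx 3.7 \cdot 10^{-4}
\]
```

Under these assumptions the normalized vector is accurate to roughly three decimal digits.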
A more accurate version of the normalized vector is obtained by

```cpp
// (x,y,z) / sqrt(x^2 + y^2 + z^2)
inline __m128 normalize3_accurate(__m128 v) {
    __m128 norm = _mm_sqrt_ps(_mm_dp_ps(v, v, 0x7F));
    return _mm_div_ps(v, norm);
}
```

which is much slower, though. Note that here we use the mask `0x7F`, which writes the dot product to all four floats of the result, so that the unused highest float is not divided by 0.
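A vector of zero (or nearly zero) length still leads to a division by zero in all floats. One possible guard is sketched below. It is an assumption-laden example rather than part of the original code: the name `normalize3_safe`, the threshold `min_len_sq`, and the choice to return the zero vector are all hypothetical, and the `SSE41_TARGET` macro only exists so that GCC and Clang accept the function without `-msse4.1` on the command line.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps

// Lets GCC/Clang compile the function below without -msse4.1.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Hypothetical helper (not from the article): returns v / |v|, or the zero
// vector when |v|^2 is not greater than min_len_sq (an assumed threshold).
SSE41_TARGET inline __m128 normalize3_safe(__m128 v, float min_len_sq = 1e-12f) {
    // |v|^2 of the three lower floats, written to all four floats.
    __m128 len_sq = _mm_dp_ps(v, v, 0x7F);
    // All bits set in every float where the length is large enough.
    __m128 ok = _mm_cmpgt_ps(len_sq, _mm_set1_ps(min_len_sq));
    // May contain NaN or infinity if the length is too small.
    __m128 result = _mm_div_ps(v, _mm_sqrt_ps(len_sq));
    // Clear the result entirely where the length check failed.
    return _mm_and_ps(result, ok);
}
```

Whether a zero vector, a default axis, or an error is the right fallback depends on the application; the mask-and-`AND` pattern avoids a branch in either case.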
I have been gaining a lot of usable and exemplifying stuff in your website.

Allvectors.Com

You have a very interesting blog.

lg.

The only issue I find with this is that it does not take into account the distinct possibility of the vector being of nearly zero length, thus there is a big potential for issues with division by zero and use of infinity, in the "accurate" normalization.
