|v| = length3(v) = sqrt(v.x^2 + v.y^2 + v.z^2)A single SSE register can be used to hold a 3D vector (the highest 32 bits are unused). In a previous article we show how to load a
struct
containing 3 float values into a SSE register. You may as well use the _mm_setr_ps(x, y, z, 0)
intrinsic.
SSE4 introduced the DPPS
instruction (accessible via the _mm_dp_ps
intrinsic) which allows to calculate the dot product of up to four float values. We will now use this intrinsic to calculate the length of a 3D vector with minimal instructions.// sqrt(x^2 + y^2 + z^2) inline float length3(__m128 v) { return _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(v, v, 0x71))); }Note that
_mm_dp_ps
uses three parameters: two times the vector v
and the value 0x71
which tells the CPU to multiply the three lower floats of v
with themselves and store the sum of the product in the lowest float of the result (0x71 = 0111 0001
). Since only the lowest float of the result is used, we can use the single float square root _mm_sqrt_ss
and convert the result to the native C++
datatype.
Another common function is the normalization of a 3D vector:
normalize(v) = v / |v|Thus we need to calculate the norm of vector
v
and divide each component of v
by this value. Instead of shuffling and dividing it is possible to use more specialized SSE intrinsics for this task: _mm_dp_ps
can be used to store the dot product in all four floats of a register and there is a special reciprocal square root, which can be computed faster than the square root itself and allows to use a product instead of the much slower division.
// (x,y,z) / sqrt(x^2 + y^2 + z^2) inline __m128 normalize3(__m128 v) { __m128 inverse_norm = _mm_rsqrt_ps(_mm_dp_ps(v, v, 0x77)); return _mm_mul_ps(v, inverse_norm); }Note that we use the parameter
0x77
for _mm_dp_ps
which tells the CPU to copy the dot product three times in the lower three registers, thus we also can use the packed version of the reciprocal square root _mm_rsqrt_ps
.
The reason for the speed of the reciprocal square root is that it only is a good approximation and thus not entirely accurate.
It therefore will happen, that the length of the resulting vector is only close to 1 (length3(normalize3(v)) <> 1
).
A more accurate version of the normalized vector is obtained by
// (x,y,z) / sqrt(x^2 + y^2 + z^2) inline __m128 normalize3_accurate(__m128 v) { __m128 norm = _mm_sqrt_ps(_mm_dp_ps(v, v, 0x7F)); return _mm_div_ps(v, norm); }which is much slower though. Note that here we use
0x7F
to avoid division by 0.
I have been gaining a lot of usable and exemplifying stuff in your website.
ReplyDeleteAllvectors.Com
Sehr interessantes Blog hast Du.
ReplyDeletelg.
The only issue I find with this is that it does not take into account the distinct possibility of the vector being of nearly zero length, thus there is a big potential for issues with division by zero and use of infinity, in the "accurate" normalization.
ReplyDelete