float3 vector into an SSE register. This required either one 64-bit and one 32-bit load, or three 32-bit loads if the data is not aligned on an 8-byte address. This blog entry shows how to use more efficient loads when loading two float3 values into separate SSE registers.

Suppose we are given an array with two float3 vectors and want to sum both vectors using the SSE addition intrinsic _mm_add_ps. When loading each vector individually, we could use the following code:

    float3 v[2] = { /* some data */ };
    __m128 sum = _mm_add_ps(LoadFloat3(v[0]), LoadFloat3(v[1]));
    // ...

Since both vectors are stored consecutively in memory, we can instead use one 128-bit load and two additional 32-bit loads, plus shuffling, to fill our registers:

    float3 v[2] = { /* some data */ };
    __m128 xyza = _mm_loadu_ps(&v[0].x); // = {X0, Y0, Z0, X1}
    __m128 b = _mm_load_ss(&v[1].y);     // = {Y1, 0, 0, 0}
    __m128 c = _mm_load_ss(&v[1].z);     // = {Z1, 0, 0, 0}
    __m128 ab = _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0)); // ab = {Y1, Y1, X1, X0}
    __m128 abc = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));  // abc = {X1, Y1, Z1, 0}
    __m128 sum = _mm_add_ps(xyza, abc);
    // ...

Note that the uppermost float in the xyza register is equal to v[1].x. Since we only add two 3-element vectors, we do not care about the content of the fourth element. The code above does not require the data to be aligned. If the address of v is 16-byte aligned, it is possible to load the xyza register using the aligned load _mm_load_ps, which can be faster, particularly on older processors.

When four 3D vectors lie consecutively in memory, all 12 float values can be loaded using only three 128-bit loads. This topic is handled in a follow-up post.
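
To try the two-vector load outside of a snippet, here is a minimal self-contained sketch. It assumes float3 is a plain struct of three tightly packed floats (12 bytes), which is what the &v[0].x and &v[1].y accesses above imply. The names LoadTwoFloat3 and SumTwoFloat3 are hypothetical helpers introduced only for this illustration; they are not part of any library.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_load_ss, _mm_shuffle_ps, ...

// Assumption: float3 is three tightly packed floats, so v[1].x directly follows v[0].z.
struct float3 { float x, y, z; };

// Hypothetical helper: loads v[0] and v[1] into two registers using one
// 128-bit load and two 32-bit loads. The fourth lane of `first` holds v[1].x
// and the fourth lane of `second` holds 0; callers are expected to ignore both.
inline void LoadTwoFloat3(const float3* v, __m128& first, __m128& second)
{
    __m128 xyza = _mm_loadu_ps(&v[0].x);  // {X0, Y0, Z0, X1}, no alignment required
    __m128 b    = _mm_load_ss(&v[1].y);   // {Y1, 0, 0, 0}
    __m128 c    = _mm_load_ss(&v[1].z);   // {Z1, 0, 0, 0}

    // Lanes 0-1 come from b, lanes 2-3 from xyza: {Y1, Y1, X1, X0}
    __m128 ab = _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0));
    // Lanes 0-1 come from ab, lanes 2-3 from c: {X1, Y1, Z1, 0}
    __m128 abc = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));

    first  = xyza;
    second = abc;
}

// Hypothetical helper: adds v[0] and v[1] component-wise and returns the
// first three lanes of the result as a float3.
inline float3 SumTwoFloat3(const float3* v)
{
    __m128 a, b;
    LoadTwoFloat3(v, a, b);
    __m128 sum = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, sum);  // unaligned store, so `out` needs no special alignment
    return float3{out[0], out[1], out[2]};
}
```

Calling SumTwoFloat3 on an array holding {1, 2, 3} and {10, 20, 30} should return {11, 22, 33}; the fourth lane of the intermediate sum contains v[1].x and is simply dropped when the result is written back.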