`float3` vector into an SSE register. This required either one 64-bit and one 32-bit load or, if the data is not aligned to an 8-byte address, three 32-bit loads. This blog entry will show how to use more efficient loads when loading two `float3` values into separate SSE registers.

Suppose we are given an array of two `float3` vectors and want to sum them using the SSE add intrinsic `_mm_add_ps`. When loading each vector individually, we could use the following code:

    float3 v[2] = { /* some data */ };
    __m128 sum = _mm_add_ps(LoadFloat3(v[0]), LoadFloat3(v[1]));
    // ...

Since both vectors are stored in consecutive memory locations, we can use one 128-bit load and two additional 32-bit loads, plus shuffling, to fill our registers:

    float3 v[2] = { /* some data */ };
    __m128 xyza = _mm_loadu_ps(&v[0].x);  // = {X0, Y0, Z0, X1}
    __m128 b = _mm_load_ss(&v[1].y);      // = {Y1, 0, 0, 0}
    __m128 c = _mm_load_ss(&v[1].z);      // = {Z1, 0, 0, 0}
    __m128 ab = _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0));  // ab = {Y1, Y1, X1, X0}
    __m128 abc = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));   // abc = {X1, Y1, Z1, 0}
    __m128 sum = _mm_add_ps(xyza, abc);
    // ...

Note that the uppermost float in `xyza` is equal to `v[1].x`. Since we only add two 3-element vectors, we do not care about the content of the fourth element. The code above does not require the data to be aligned. If the address of `v` is 16-byte aligned, it is possible to load the `xyza` register using the faster `_mm_load_ps` load.

When four 3D vectors lie in consecutive memory locations, it is possible to load all 12 float values using only three 128-bit loads. This topic is handled in a follow-up post.
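To try both variants side by side, here is a rough, self-contained sketch. Several things in it are assumptions rather than part of the code above: the `float3` layout (three tightly packed floats), the body of `LoadFloat3` (shown here as the three 32-bit load variant), and the helper names `AddSecondVector`, `SumTwoFloat3`, `SumTwoFloat3Aligned`, and `SumTwoFloat3Separately`, which are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Assumed layout: three tightly packed floats (12 bytes, no padding).
struct float3 { float x, y, z; };
static_assert(sizeof(float3) == 3 * sizeof(float), "float3 must be tightly packed");

// Hypothetical stand-in for LoadFloat3: three 32-bit loads, valid for any alignment.
inline __m128 LoadFloat3(const float3& v) {
    __m128 x = _mm_load_ss(&v.x);       // {X, 0, 0, 0}
    __m128 y = _mm_load_ss(&v.y);       // {Y, 0, 0, 0}
    __m128 z = _mm_load_ss(&v.z);       // {Z, 0, 0, 0}
    __m128 xy = _mm_unpacklo_ps(x, y);  // {X, Y, 0, 0}
    return _mm_movelh_ps(xy, z);        // {X, Y, Z, 0}
}

// Baseline: load each vector individually, then add.
inline __m128 SumTwoFloat3Separately(const float3* v) {
    return _mm_add_ps(LoadFloat3(v[0]), LoadFloat3(v[1]));
}

// Shared tail: given xyza = {X0, Y0, Z0, X1}, rebuild v[1] and add it.
inline __m128 AddSecondVector(__m128 xyza, const float3* v) {
    __m128 b = _mm_load_ss(&v[1].y);                               // {Y1, 0, 0, 0}
    __m128 c = _mm_load_ss(&v[1].z);                               // {Z1, 0, 0, 0}
    __m128 ab = _mm_shuffle_ps(b, xyza, _MM_SHUFFLE(0, 3, 0, 0));  // {Y1, Y1, X1, X0}
    __m128 abc = _mm_shuffle_ps(ab, c, _MM_SHUFFLE(3, 0, 0, 2));   // {X1, Y1, Z1, 0}
    return _mm_add_ps(xyza, abc);                                  // lane 3 holds X1 + 0
}

// One unaligned 128-bit load plus two 32-bit loads; no alignment requirement.
inline __m128 SumTwoFloat3(const float3* v) {
    return AddSecondVector(_mm_loadu_ps(&v[0].x), v);
}

// Same, but with the aligned 128-bit load; v must be 16-byte aligned.
inline __m128 SumTwoFloat3Aligned(const float3* v) {
    assert(reinterpret_cast<std::uintptr_t>(v) % 16 == 0);
    return AddSecondVector(_mm_load_ps(&v[0].x), v);
}
```

When storing the result, only the first three lanes are meaningful. In the combined variants the fourth lane ends up holding `v[1].x`, while the per-vector variant leaves it at zero, so the two results should only be compared on x, y, and z. For the aligned variant, declaring the array as `alignas(16) float3 v[2]` is one way to satisfy the precondition.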
