## Monday, April 11, 2011

In the previous two posts, we showed you how to load one or two three-element float vectors using SSE intrinsics and a combination of 128-bit, 64-bit and 32-bit loads. When loading four 3D vectors that are tightly packed in memory, we need to load 12 float values in total, which takes only three 128-bit load operations. To perform mathematical operations on them, it's practical to have one vector per register. The following code loads the 12 float values and distributes them across four SSE registers, so that each register contains an X, Y and Z value (the fourth component is unused).
```
p = [X1Y1Z1][X2Y2Z2][X3Y3Z3][X4Y4Z4]
```
```
v[0-3] = [X1Y1Z1??][X2Y2Z2??][X3Y3Z3??][X4Y4Z4??]
```
```
#include <xmmintrin.h> // SSE intrinsics (_mm_loadu_ps, _mm_shuffle_ps)

const float* p = /* pointer to 4 3-element vectors */;
__m128 p0 = _mm_loadu_ps(p + 0); // X1Y1Z1X2
__m128 p1 = _mm_loadu_ps(p + 4); // Y2Z2X3Y3
__m128 p2 = _mm_loadu_ps(p + 8); // Z3X4Y4Z4

__m128 tmp = _mm_shuffle_ps(p0, p1, _MM_SHUFFLE(0, 1, 0, 3)); // X2??Z2Y2

__m128 v0 = p0;                                                // X1Y1Z1
__m128 v1 = _mm_shuffle_ps(tmp, tmp, _MM_SHUFFLE(0, 2, 3, 0)); // X2Y2Z2
__m128 v2 = _mm_shuffle_ps(p1, p2, _MM_SHUFFLE(0, 0, 3, 2));   // X3Y3Z3
__m128 v3 = _mm_shuffle_ps(p2, p2, _MM_SHUFFLE(0, 3, 2, 1));   // X4Y4Z4
```
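To check the lane order of the shuffles, the snippet above can be wrapped in a small function and run against known input. This is a minimal sketch: the names `load_4x_vec3` and `check_load_4x_vec3` are hypothetical helpers introduced here for illustration, and the test data is arbitrary. Recall that `_mm_shuffle_ps(a, b, _MM_SHUFFLE(z, y, x, w))` produces `[a[w], a[x], b[y], b[z]]`, so the two low lanes always come from the first operand and the two high lanes from the second.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical helper (not part of the original post): loads four tightly
   packed 3-float vectors from p and writes one vector per register to v. */
static void load_4x_vec3(const float* p, __m128 v[4])
{
    assert(p && v);

    /* Three unaligned loads cover exactly the 12 floats, no over-read. */
    __m128 p0 = _mm_loadu_ps(p + 0); /* X1 Y1 Z1 X2 */
    __m128 p1 = _mm_loadu_ps(p + 4); /* Y2 Z2 X3 Y3 */
    __m128 p2 = _mm_loadu_ps(p + 8); /* Z3 X4 Y4 Z4 */

    /* Gather the pieces of vector 2, which straddles p0 and p1. */
    __m128 tmp = _mm_shuffle_ps(p0, p1, _MM_SHUFFLE(0, 1, 0, 3)); /* X2 X1 Z2 Y2 */

    v[0] = p0;                                                /* X1 Y1 Z1 (X2) */
    v[1] = _mm_shuffle_ps(tmp, tmp, _MM_SHUFFLE(0, 2, 3, 0)); /* X2 Y2 Z2 (X2) */
    v[2] = _mm_shuffle_ps(p1, p2, _MM_SHUFFLE(0, 0, 3, 2));   /* X3 Y3 Z3 (Z3) */
    v[3] = _mm_shuffle_ps(p2, p2, _MM_SHUFFLE(0, 3, 2, 1));   /* X4 Y4 Z4 (Z3) */
}

/* Returns 1 if lanes 0..2 of every register match the source vectors.
   The fourth lane is deliberately ignored, since it holds leftover data. */
static int check_load_4x_vec3(void)
{
    const float src[12] = { 1, 2, 3,   4, 5, 6,   7, 8, 9,   10, 11, 12 };
    __m128 v[4];
    float out[4];
    int i, j;

    load_4x_vec3(src, v);
    for (i = 0; i < 4; ++i) {
        _mm_storeu_ps(out, v[i]);
        for (j = 0; j < 3; ++j) {
            if (out[j] != src[3 * i + j]) {
                return 0;
            }
        }
    }
    return 1;
}
```

The values in parentheses are what ends up in the unused fourth lane with these particular shuffle masks. Code that later sums across all four lanes, for example in a dot product, would need to mask or overwrite that lane first.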