## Friday, March 25, 2011

### Matrix vector multiplication using SSE3

A common operation is the matrix vector product. In the following example we multiply a 4-by-4 matrix and a 4-element vector. For multiplication of a 3x4 matrix with a 3D vector click here.

Using standard `C++` code the matrix-vector product would result in 16 multiplications and 12 add operations:
```void A_times_x(float4& out_y,
const float4* in_A,
const float4& in_x) {
out_y.x = dot(in_A, in_x);
out_y.y = dot(in_A, in_x);
out_y.z = dot(in_A, in_x);
out_y.w = dot(in_A, in_x);
}
```
The function above expects as input pointers to `float4` variables (see Common Datatypes). As the matrix `A` you need to pass a pointer to an array `float4 A`.

On a modern processor supporting the SSE3 instruction set it is possible to perform the same work using only 4 vector multiplications and 3 horizontal vector additions:
```void A_times_x(float4& out_y,
const float4* in_A,
const float4& in_x) {
// Load matrix A and vector x into SSE registers
__m128 A0 = _mm_loadu_ps((const float*)(in_A + 0));
__m128 A1 = _mm_loadu_ps((const float*)(in_A + 1));
__m128 A2 = _mm_loadu_ps((const float*)(in_A + 2));
__m128 A3 = _mm_loadu_ps((const float*)(in_A + 3));

// Multiply each matrix row with the vector x
__m128 m0 = _mm_mul_ps(A0, x);
__m128 m1 = _mm_mul_ps(A1, x);
__m128 m2 = _mm_mul_ps(A2, x);
__m128 m3 = _mm_mul_ps(A3, x);

// Finally, store the result
_mm_storeu_ps((float*)&out_y, result);
}
```
Note that the majority of operations are concerned with loading and storing the matrix and vector data. If more than one vector `x` needs to be multiplied with the matrix `A`, we can keep the matrix in the SSE registers:
```void A_times_x(float4* out_y,
const float4* in_A,
const float4* in_x,
int N) {
// Load matrix A into SSE registers
__m128 A0 = _mm_loadu_ps((const float*)(in_A + 0));
__m128 A1 = _mm_loadu_ps((const float*)(in_A + 1));
__m128 A2 = _mm_loadu_ps((const float*)(in_A + 2));
__m128 A3 = _mm_loadu_ps((const float*)(in_A + 3));

for(int i = 0; i < N; i++)
{
__m128 m0 = _mm_mul_ps(A0, x);
__m128 m1 = _mm_mul_ps(A1, x);
__m128 m2 = _mm_mul_ps(A2, x);
__m128 m3 = _mm_mul_ps(A3, x);
Now the function is able to multiply the matrix `A` with an array of `float4` vectors. To improve performance even further, ensure that all `float4` values are allocated at a 16-byte aligned address. Then it is possible to replace all `_mm_loadu_ps` loads and the `_mm_storeu_ps` store with their aligned counterparts `_mm_load_ps` and `_mm_store_ps`.
1. 