## Friday, March 25, 2011

### Matrix vector multiplication using SSE3

A common operation is the matrix vector product. In the following example we multiply a 4-by-4 matrix and a 4-element vector. For multiplication of a 3x4 matrix with a 3D vector click here.

Using standard `C++` code the matrix-vector product would result in 16 multiplications and 12 add operations:
```void A_times_x(float4& out_y,
const float4* in_A,
const float4& in_x) {
out_y.x = dot(in_A[0], in_x);
out_y.y = dot(in_A[1], in_x);
out_y.z = dot(in_A[2], in_x);
out_y.w = dot(in_A[3], in_x);
}
```
The function above expects as input pointers to `float4` variables (see Common Datatypes). As the matrix `A` you need to pass a pointer to an array `float4 A[4]`.

On a modern processor supporting the SSE3 instruction set it is possible to perform the same work using only 4 vector multiplications and 3 horizontal vector additions:
```void A_times_x(float4& out_y,
const float4* in_A,
const float4& in_x) {
// Load matrix A and vector x into SSE registers
__m128 x =  _mm_loadu_ps((const float*)&in_x);
__m128 A0 = _mm_loadu_ps((const float*)(in_A + 0));
__m128 A1 = _mm_loadu_ps((const float*)(in_A + 1));
__m128 A2 = _mm_loadu_ps((const float*)(in_A + 2));
__m128 A3 = _mm_loadu_ps((const float*)(in_A + 3));

// Multiply each matrix row with the vector x
__m128 m0 = _mm_mul_ps(A0, x);
__m128 m1 = _mm_mul_ps(A1, x);
__m128 m2 = _mm_mul_ps(A2, x);
__m128 m3 = _mm_mul_ps(A3, x);

// Using HADD, we add four floats at a time
__m128 sum_01 = _mm_hadd_ps(m0, m1);
__m128 sum_23 = _mm_hadd_ps(m2, m3);
__m128 result = _mm_hadd_ps(sum_01, sum_23);

// Finally, store the result
_mm_storeu_ps((float*)&out_y, result);
}
```
Note that the majority of operations are concerned with loading and storing the matrix and vector data. If more than one vector `x` needs to be multiplied with the matrix `A`, we can keep the matrix in the SSE registers:
```void A_times_x(float4* out_y,
const float4* in_A,
const float4* in_x,
int N) {
// Load matrix A into SSE registers
__m128 A0 = _mm_loadu_ps((const float*)(in_A + 0));
__m128 A1 = _mm_loadu_ps((const float*)(in_A + 1));
__m128 A2 = _mm_loadu_ps((const float*)(in_A + 2));
__m128 A3 = _mm_loadu_ps((const float*)(in_A + 3));

for(int i = 0; i < N; i++)
{
__m128 x =  _mm_loadu_ps((const float*)&in_x[i]);
__m128 m0 = _mm_mul_ps(A0, x);
__m128 m1 = _mm_mul_ps(A1, x);
__m128 m2 = _mm_mul_ps(A2, x);
__m128 m3 = _mm_mul_ps(A3, x);
__m128 sum_01 = _mm_hadd_ps(m0, m1);
__m128 sum_23 = _mm_hadd_ps(m2, m3);
__m128 result = _mm_hadd_ps(sum_01, sum_23);
_mm_storeu_ps((float*)&out_y[i], result);
}
}
```
Now the function is able to multiply the matrix `A` with an array of `float4` vectors. To improve performance even further, ensure that all `float4` values are allocated at a 16-byte aligned address. Then it is possible to replace all `_mm_loadu_ps` loads and the `_mm_storeu_ps` store with their aligned counterparts `_mm_load_ps` and `_mm_store_ps`.

#### 1 comment:

1. You are noob. First of all: why do not use float [4][4] variables?...