Friday, March 25, 2011

Matrix vector multiplication using SSE3

A common operation is the matrix vector product. In the following example we multiply a 4-by-4 matrix and a 4-element vector. For multiplication of a 3x4 matrix with a 3D vector click here.

Using standard C++ code the matrix-vector product would result in 16 multiplications and 12 add operations:
void A_times_x(float4& out_y, 
               const float4* in_A, 
               const float4& in_x) {
 out_y.x = dot(in_A[0], in_x); 
 out_y.y = dot(in_A[1], in_x);
 out_y.z = dot(in_A[2], in_x);
 out_y.w = dot(in_A[3], in_x);
}
The function above expects as input pointers to float4 variables (see Common Datatypes). As the matrix A you need to pass a pointer to an array float4 A[4].

On a modern processor supporting the SSE3 instruction set it is possible to perform the same work using only 4 vector multiplications and 3 horizontal vector additions:
void A_times_x(float4& out_y, 
               const float4* in_A, 
               const float4& in_x) {
 // Load matrix A and vector x into SSE registers
 __m128 x =  _mm_loadu_ps((const float*)&in_x);
 __m128 A0 = _mm_loadu_ps((const float*)(in_A + 0));
 __m128 A1 = _mm_loadu_ps((const float*)(in_A + 1));
 __m128 A2 = _mm_loadu_ps((const float*)(in_A + 2));
 __m128 A3 = _mm_loadu_ps((const float*)(in_A + 3));

 // Multiply each matrix row with the vector x
 __m128 m0 = _mm_mul_ps(A0, x);
 __m128 m1 = _mm_mul_ps(A1, x);
 __m128 m2 = _mm_mul_ps(A2, x);
 __m128 m3 = _mm_mul_ps(A3, x);

 // Using HADD, we add four floats at a time
 __m128 sum_01 = _mm_hadd_ps(m0, m1); 
 __m128 sum_23 = _mm_hadd_ps(m2, m3);
 __m128 result = _mm_hadd_ps(sum_01, sum_23);
 
 // Finally, store the result
 _mm_storeu_ps((float*)&out_y, result);
}
Note that the majority of operations are concerned with loading and storing the matrix and vector data. If more than one vector x needs to be multiplied with the matrix A, we can keep the matrix in the SSE registers:
void A_times_x(float4* out_y, 
               const float4* in_A, 
               const float4* in_x, 
               int N) {
 // Load matrix A into SSE registers
 __m128 A0 = _mm_loadu_ps((const float*)(in_A + 0));
 __m128 A1 = _mm_loadu_ps((const float*)(in_A + 1));
 __m128 A2 = _mm_loadu_ps((const float*)(in_A + 2));
 __m128 A3 = _mm_loadu_ps((const float*)(in_A + 3));

 for(int i = 0; i < N; i++)
 {
   __m128 x =  _mm_loadu_ps((const float*)&in_x[i]);
   __m128 m0 = _mm_mul_ps(A0, x);
   __m128 m1 = _mm_mul_ps(A1, x);
   __m128 m2 = _mm_mul_ps(A2, x);
   __m128 m3 = _mm_mul_ps(A3, x);
   __m128 sum_01 = _mm_hadd_ps(m0, m1); 
   __m128 sum_23 = _mm_hadd_ps(m2, m3);
   __m128 result = _mm_hadd_ps(sum_01, sum_23);
   _mm_storeu_ps((float*)&out_y[i], result);
 }
}
Now the function is able to multiply the matrix A with an array of float4 vectors. To improve performance even further, ensure that all float4 values are allocated at a 16-byte aligned address. Then it is possible to replace all _mm_loadu_ps loads and the _mm_storeu_ps store with their aligned counterparts _mm_load_ps and _mm_store_ps.

1 comment:

  1. You are noob. First of all: why do not use float [4][4] variables?...

    ReplyDelete