Using standard

`C++`

code the matrix-vector product would result in 16 multiplications and 12 add operations:void A_times_x(float4& out_y, const float4* in_A, const float4& in_x) { out_y.x = dot(in_A[0], in_x); out_y.y = dot(in_A[1], in_x); out_y.z = dot(in_A[2], in_x); out_y.w = dot(in_A[3], in_x); }The function above expects as input pointers to

`float4`

variables (see Common Datatypes). As the matrix `A`

you need to pass a pointer to an array `float4 A[4]`

.On a modern processor supporting the SSE3 instruction set it is possible to perform the same work using only 4 vector multiplications and 3 horizontal vector additions:

void A_times_x(float4& out_y, const float4* in_A, const float4& in_x) { // Load matrix A and vector x into SSE registers __m128 x = _mm_loadu_ps((const float*)&in_x); __m128 A0 = _mm_loadu_ps((const float*)(in_A + 0)); __m128 A1 = _mm_loadu_ps((const float*)(in_A + 1)); __m128 A2 = _mm_loadu_ps((const float*)(in_A + 2)); __m128 A3 = _mm_loadu_ps((const float*)(in_A + 3)); // Multiply each matrix row with the vector x __m128 m0 = _mm_mul_ps(A0, x); __m128 m1 = _mm_mul_ps(A1, x); __m128 m2 = _mm_mul_ps(A2, x); __m128 m3 = _mm_mul_ps(A3, x); // Using HADD, we add four floats at a time __m128 sum_01 = _mm_hadd_ps(m0, m1); __m128 sum_23 = _mm_hadd_ps(m2, m3); __m128 result = _mm_hadd_ps(sum_01, sum_23); // Finally, store the result _mm_storeu_ps((float*)&out_y, result); }Note that the majority of operations are concerned with loading and storing the matrix and vector data. If more than one vector

`x`

needs to be multiplied with the matrix `A`

, we can keep the matrix in the SSE registers:void A_times_x(float4* out_y, const float4* in_A, const float4* in_x, int N) { // Load matrix A into SSE registers __m128 A0 = _mm_loadu_ps((const float*)(in_A + 0)); __m128 A1 = _mm_loadu_ps((const float*)(in_A + 1)); __m128 A2 = _mm_loadu_ps((const float*)(in_A + 2)); __m128 A3 = _mm_loadu_ps((const float*)(in_A + 3)); for(int i = 0; i < N; i++) { __m128 x = _mm_loadu_ps((const float*)&in_x[i]); __m128 m0 = _mm_mul_ps(A0, x); __m128 m1 = _mm_mul_ps(A1, x); __m128 m2 = _mm_mul_ps(A2, x); __m128 m3 = _mm_mul_ps(A3, x); __m128 sum_01 = _mm_hadd_ps(m0, m1); __m128 sum_23 = _mm_hadd_ps(m2, m3); __m128 result = _mm_hadd_ps(sum_01, sum_23); _mm_storeu_ps((float*)&out_y[i], result); } }Now the function is able to multiply the matrix

`A`

with an array of `float4`

vectors. To improve performance even further, ensure that all `float4`

values are allocated at a 16-byte aligned address. Then it is possible to replace all `_mm_loadu_ps`

loads and the `_mm_storeu_ps`

store with their aligned counterparts `_mm_load_ps`

and `_mm_store_ps`

.
You are noob. First of all: why do not use float [4][4] variables?...

ReplyDelete