Using standard
C++
code the matrix-vector product would result in 16 multiplications and 12 add operations:void A_times_x(float4& out_y, const float4* in_A, const float4& in_x) { out_y.x = dot(in_A[0], in_x); out_y.y = dot(in_A[1], in_x); out_y.z = dot(in_A[2], in_x); out_y.w = dot(in_A[3], in_x); }The function above expects as input pointers to
float4
variables (see Common Datatypes). As the matrix A
you need to pass a pointer to an array float4 A[4]
.On a modern processor supporting the SSE3 instruction set it is possible to perform the same work using only 4 vector multiplications and 3 horizontal vector additions:
void A_times_x(float4& out_y, const float4* in_A, const float4& in_x) { // Load matrix A and vector x into SSE registers __m128 x = _mm_loadu_ps((const float*)&in_x); __m128 A0 = _mm_loadu_ps((const float*)(in_A + 0)); __m128 A1 = _mm_loadu_ps((const float*)(in_A + 1)); __m128 A2 = _mm_loadu_ps((const float*)(in_A + 2)); __m128 A3 = _mm_loadu_ps((const float*)(in_A + 3)); // Multiply each matrix row with the vector x __m128 m0 = _mm_mul_ps(A0, x); __m128 m1 = _mm_mul_ps(A1, x); __m128 m2 = _mm_mul_ps(A2, x); __m128 m3 = _mm_mul_ps(A3, x); // Using HADD, we add four floats at a time __m128 sum_01 = _mm_hadd_ps(m0, m1); __m128 sum_23 = _mm_hadd_ps(m2, m3); __m128 result = _mm_hadd_ps(sum_01, sum_23); // Finally, store the result _mm_storeu_ps((float*)&out_y, result); }Note that the majority of operations are concerned with loading and storing the matrix and vector data. If more than one vector
x
needs to be multiplied with the matrix A
, we can keep the matrix in the SSE registers:void A_times_x(float4* out_y, const float4* in_A, const float4* in_x, int N) { // Load matrix A into SSE registers __m128 A0 = _mm_loadu_ps((const float*)(in_A + 0)); __m128 A1 = _mm_loadu_ps((const float*)(in_A + 1)); __m128 A2 = _mm_loadu_ps((const float*)(in_A + 2)); __m128 A3 = _mm_loadu_ps((const float*)(in_A + 3)); for(int i = 0; i < N; i++) { __m128 x = _mm_loadu_ps((const float*)&in_x[i]); __m128 m0 = _mm_mul_ps(A0, x); __m128 m1 = _mm_mul_ps(A1, x); __m128 m2 = _mm_mul_ps(A2, x); __m128 m3 = _mm_mul_ps(A3, x); __m128 sum_01 = _mm_hadd_ps(m0, m1); __m128 sum_23 = _mm_hadd_ps(m2, m3); __m128 result = _mm_hadd_ps(sum_01, sum_23); _mm_storeu_ps((float*)&out_y[i], result); } }Now the function is able to multiply the matrix
A
with an array of float4
vectors. To improve performance even further, ensure that all float4
values are allocated at a 16-byte aligned address. Then it is possible to replace all _mm_loadu_ps
loads and the _mm_storeu_ps
store with their aligned counterparts _mm_load_ps
and _mm_store_ps
.
You are noob. First of all: why do not use float [4][4] variables?...
ReplyDelete