This tutorial is based on a simple code example: adding two arrays and storing the result in a third array. Step by step, we introduce SSE intrinsics, loop unrolling and functor-based concepts that allow us to build a library supporting different operations.

We start with a simple function for adding two arrays:

```cpp
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```

This code simply sums the arrays element-wise. In a previous article about loop unrolling we explained how to make the loop more efficient by reducing the number of comparisons and conditional jumps:

```cpp
// round x down to the nearest multiple of s (s must be a power of two)
#define ROUND_DOWN(x, s) ((x) & ~((s) - 1))

void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop
    for (; i < ROUND_DOWN(N, 4); i += 4) {
        C[i+0] = A[i+0] + B[i+0];
        C[i+1] = A[i+1] + B[i+1];
        C[i+2] = A[i+2] + B[i+2];
        C[i+3] = A[i+3] + B[i+3];
    }
    // process remaining elements
    for (; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```

Modern processors have special SIMD units that can process more than one float value at a time. In the unrolled loop we can therefore use SSE intrinsics that process all four values simultaneously:

```cpp
#include <xmmintrin.h> // SSE intrinsics

void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop with SSE instructions
    for (; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a = _mm_loadu_ps(A + i);
        __m128 b = _mm_loadu_ps(B + i);
        __m128 c = _mm_add_ps(a, b);
        _mm_storeu_ps(C + i, c);
    }
    // process remaining elements
    for (; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```

Note that we use unaligned load and store instructions, since we cannot assume that the pointers `A`, `B` or `C` point to memory aligned on a 16-byte address, which is required for the much faster aligned versions `_mm_load_ps` and `_mm_store_ps`. Writing special implementations for aligned memory access is cumbersome, since there are 8 possible combinations of aligned and unaligned pointers in this function. Therefore, we present a template-based approach that allows us to parametrize the method at compile time. At run time the appropriate load/store function is used with no additional overhead (thanks to function inlining).
We define template functions `load_ps` and `store_ps` with an alignment parameter and specialize each:

```cpp
enum Alignment { Aligned, Unaligned };

template<Alignment type> inline __m128 load_ps(const float* p);

template<> inline __m128 load_ps<Aligned>(const float* p)   { return _mm_load_ps(p); }
template<> inline __m128 load_ps<Unaligned>(const float* p) { return _mm_loadu_ps(p); }

template<Alignment type> inline void store_ps(float* p, __m128 a);

template<> inline void store_ps<Aligned>(float* p, __m128 a)   { _mm_store_ps(p, a); }
template<> inline void store_ps<Unaligned>(float* p, __m128 a) { _mm_storeu_ps(p, a); }
```

Now we can rewrite our `add_vectors` function with three template parameters, each specifying the alignment of the corresponding pointer:
```cpp
template<Alignment stC, Alignment ldA, Alignment ldB>
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop with templated load/store
    for (; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = _mm_add_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements
    for (; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```

A call to this function with known alignment information may look like this:

```cpp
float* A = new float[100];                                   // unaligned memory
float* B = (float*)_aligned_malloc(100 * sizeof(float), 16); // 16-byte aligned memory
float  C[100];                                               // allocated on the stack, possibly unaligned
add_vectors<Unaligned, Aligned, Unaligned>(C, A, B, 100);
```

When developing a complete library of array functions (such as difference, product, etc.), it is desirable to reuse the code of `add_vectors` rather than rewrite it for every operator. By using *functors* we make the operation itself a parameter (`op`).
```cpp
class plus {
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(a, b); }
};

template<Alignment stC, Alignment ldA, Alignment ldB, class Op>
void process_vectors(float* C, const float* A, const float* B, size_t N, const Op& op)
{
    size_t i = 0;
    // unrolled loop with templated load/store
    for (; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = op.eval_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements one at a time
    for (; i < N; i++) {
        __m128 a = _mm_load_ss(A + i);
        __m128 b = _mm_load_ss(B + i);
        __m128 c = op.eval_ss(a, b);
        _mm_store_ss(C + i, c);
    }
}
```

The class `plus` serves as a functor representing the operation. Any functor with two arguments can be used; it must provide two evaluation methods: `eval_ps` for packed operations (4 elements) and `eval_ss` for operating on single elements.
Two vectors can now be added with the following call:

```cpp
float *A, *B, *C; // pointers to aligned/unaligned memory
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, plus());
```

The initial coding effort quickly pays off when we extend the library with further operators, such as `minus`:
```cpp
class minus {
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_sub_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_sub_ss(a, b); }
};

// ...

// perform C = A - B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, minus());
```

The functor concept has additional benefits. For example, we can define an operator that performs a weighted sum (a linear combination) of two values and use it with `process_vectors` without modification:
```cpp
_MM_ALIGN16 class scaled_plus {
public:
    scaled_plus(float scale1, float scale2)
        : s1_(_mm_set1_ps(scale1)), s2_(_mm_set1_ps(scale2)) {}
    __m128 eval_ps(__m128 a, __m128 b) const {
        return _mm_add_ps(_mm_mul_ps(a, s1_), _mm_mul_ps(b, s2_));
    }
    __m128 eval_ss(__m128 a, __m128 b) const {
        return _mm_add_ss(_mm_mul_ss(a, s1_), _mm_mul_ss(b, s2_));
    }
private:
    __m128 s1_;
    __m128 s2_;
};

// ...

// perform C = 0.3 * A + 0.7 * B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, scaled_plus(0.3f, 0.7f));
```

Note that the `_MM_ALIGN16` prefix tells the compiler that a `scaled_plus` object must be aligned on a 16-byte boundary, which is required for accessing `s1_` and `s2_`. When compiling this code in a release build there is no overhead for calling the functor object, since everything can be inlined.
In summary, we showed how to make the alignment of pointers a compile-time argument, ensuring optimal load and store operations on arrays of float values. By using functors, we can create a generic library that supports arbitrary mathematical operations and allows additional parameters to be passed.
