## Friday, December 30, 2011

### Template based vector library with SSE support

This article explains how to set up a vector math library that performs mathematical operations on arbitrarily sized float arrays, or vectors. It handles aligned and unaligned pointers with minimal code overhead but optimal runtime performance. This is achieved through template functions and compile-time arguments.
This tutorial is based on a simple code example: adding two arrays and storing the result in a third array. Step by step, we will introduce SSE intrinsics, loop unrolling, and functor-based concepts that allow us to build a library supporting different operations.
```
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    for (size_t i = 0; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
This code simply sums the arrays element-wise. In a previous article about loop unrolling we explained how to make the loop more efficient by reducing the number of comparisons and conditional jumps:
```
// round x down to a multiple of s (s must be a power of two)
#define ROUND_DOWN(x, s) ((x) & ~((s) - 1))

void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        C[i+0] = A[i+0] + B[i+0];
        C[i+1] = A[i+1] + B[i+1];
        C[i+2] = A[i+2] + B[i+2];
        C[i+3] = A[i+3] + B[i+3];
    }
    // process remaining elements
    for (; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
Modern processors have special SIMD units that can process more than one float value at a time. In the unrolled loop we can therefore use SSE intrinsics that process all four values simultaneously:
```
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop with SSE instructions
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        __m128 a = _mm_loadu_ps(A + i);
        __m128 b = _mm_loadu_ps(B + i);
        __m128 c = _mm_add_ps(a, b);
        _mm_storeu_ps(C + i, c);
    }
    // process remaining elements
    for (; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
Note that we use the unaligned load and store intrinsics, since we cannot assume that the pointers `A`, `B`, and `C` point to memory aligned on a 16-byte address, which is required for the much faster aligned versions `_mm_load_ps` and `_mm_store_ps`. Writing special implementations for aligned memory access is cumbersome, since there are 8 possible combinations of aligned and unaligned pointers in this function. Therefore, we present a template-based approach that allows us to parametrize the function at compile time. At run time the appropriate load/store function is used with no additional overhead (supported by function inlining). We define template functions `load_ps` and `store_ps` with an alignment parameter and specialize each:
```
enum Alignment
{
    Aligned,
    Unaligned
};

template<Alignment type> inline __m128 load_ps(const float* p);
template<> inline __m128 load_ps<Aligned>(const float* p) { return _mm_load_ps(p); }
template<> inline __m128 load_ps<Unaligned>(const float* p) { return _mm_loadu_ps(p); }

template<Alignment type> inline void store_ps(float* p, __m128 a);
template<> inline void store_ps<Aligned>(float* p, __m128 a) { return _mm_store_ps(p, a); }
template<> inline void store_ps<Unaligned>(float* p, __m128 a) { return _mm_storeu_ps(p, a); }
```
Now, we can rewrite our `add_vectors` function with three template parameters, each specifying the alignment of the corresponding pointer.
```
template<Alignment stC, Alignment ldA, Alignment ldB>
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = _mm_add_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements
    for (; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
A call to this function with known alignment information may look like this:
```
float* A = new float[100];                                   // unaligned memory from plain new
float* B = (float*)_aligned_malloc(100 * sizeof(float), 16); // 16-byte aligned memory
float C[100];                                                // allocated on the stack, probably unaligned
add_vectors<Unaligned, Unaligned, Aligned>(C, A, B, 100);
```
When developing a complete library of array functions (such as difference, product, etc.), it is desirable to reuse the code of `add_vectors` and not rewrite it for every operator. By using functors we make the operation itself a parameter (`op`).
```
class plus
{
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(a, b); }
};

template<Alignment stC, Alignment ldA, Alignment ldB, typename Op>
void process_vectors(float* C, const float* A, const float* B, size_t N, const Op& op)
{
    size_t i = 0;
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = op.eval_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements with scalar SSE operations
    for (; i < N; i++)
    {
        __m128 a = _mm_load_ss(A + i);
        __m128 b = _mm_load_ss(B + i);
        __m128 c = op.eval_ss(a, b);
        _mm_store_ss(C + i, c);
    }
}
```
The class `plus` is a functor that encapsulates an operation with two arguments. We require two evaluation methods: `eval_ps` for packed operations (4 elements at a time) and `eval_ss` for operating on single elements. Two vectors can now be added with the following call:
```
float *A, *B, *C; // pointers to aligned/unaligned memory
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, plus());
```
The initial coding effort quickly pays off when we extend the library with different operators, such as `minus`:
```
class minus
{
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_sub_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_sub_ss(a, b); }
};

// ...

// perform C = A - B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, minus());
```
The functor concept has additional benefits. For example, we can define an operator that computes a weighted sum, or linear combination, of two values and use it with our `process_vectors` without modification:
```
_MM_ALIGN16 class scaled_plus
{
public:
    scaled_plus(float scale1, float scale2) : s1_(_mm_set1_ps(scale1)), s2_(_mm_set1_ps(scale2)) {}
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(_mm_mul_ps(a, s1_), _mm_mul_ps(b, s2_)); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(_mm_mul_ss(a, s1_), _mm_mul_ss(b, s2_)); }
private:
    __m128 s1_;
    __m128 s2_;
};

// ...

// perform C = 0.3 * A + 0.7 * B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, scaled_plus(0.3f, 0.7f));
```
Note that the `_MM_ALIGN16` prefix tells the compiler that a `scaled_plus` object must be aligned on a 16-byte boundary, which is required for accessing the `__m128` members `s1_` and `s2_`. When compiling this code in a release build, there is no overhead when calling the functor object, since everything can be inlined.
In summary, we showed how to make the alignment of pointers a compile-time argument to ensure optimal load and store operations on arrays of float values. By using functors, we are able to create a generic library that supports arbitrary mathematical operations and enables passing additional parameters.