This tutorial is based on a simple code example: adding two arrays and storing the result in a third array. Step by step, we introduce loop unrolling, SSE intrinsics and functor-based concepts that allow us to build a library of different operations.
We start with a simple function for adding two arrays:
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    for(size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

This code simply sums the arrays element-wise. In a previous article about loop unrolling we explained how to make the loop more efficient by reducing the number of comparisons and conditional jumps:
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop; ROUND_DOWN(x, s) rounds x down to a multiple of s,
    // e.g. #define ROUND_DOWN(x, s) ((x) & ~((s) - 1))
    for(; i < ROUND_DOWN(N, 4); i += 4) {
        C[i+0] = A[i+0] + B[i+0];
        C[i+1] = A[i+1] + B[i+1];
        C[i+2] = A[i+2] + B[i+2];
        C[i+3] = A[i+3] + B[i+3];
    }
    // process remaining elements
    for(; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

Modern processors have special SIMD units that can process more than one float value at a time. In the unrolled loop we can therefore use SSE intrinsics that process all four values simultaneously:
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop with SSE intrinsics
    for(; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a = _mm_loadu_ps(A + i);
        __m128 b = _mm_loadu_ps(B + i);
        __m128 c = _mm_add_ps(a, b);
        _mm_storeu_ps(C + i, c);
    }
    // process remaining elements
    for(; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

Note that we use unaligned load and store intrinsics, since we cannot assume that the pointers A, B or C point to memory aligned on a 16-byte address, which is required for the much faster aligned versions _mm_load_ps and _mm_store_ps. Writing special implementations for aligned memory access is cumbersome, since there are 8 possible combinations of aligned and unaligned pointers in this function. Therefore, we present a template-based approach that allows us to parameterize the function at compile time. At run time the appropriate load/store intrinsic is used with no additional overhead (thanks to function inlining).
We define template functions load_ps and store_ps with an alignment parameter and specialize each:
enum Alignment { Aligned, Unaligned };

template<Alignment type> inline __m128 load_ps(const float* p);

template<> inline __m128 load_ps<Aligned>(const float* p)   { return _mm_load_ps(p); }
template<> inline __m128 load_ps<Unaligned>(const float* p) { return _mm_loadu_ps(p); }

template<Alignment type> inline void store_ps(float* p, __m128 a);

template<> inline void store_ps<Aligned>(float* p, __m128 a)   { _mm_store_ps(p, a); }
template<> inline void store_ps<Unaligned>(float* p, __m128 a) { _mm_storeu_ps(p, a); }

Now we can rewrite our add_vectors function with three template parameters, each specifying the alignment of the corresponding pointer.
template<Alignment stC, Alignment ldA, Alignment ldB>
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop with templated loading/storing
    for(; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = _mm_add_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements
    for(; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

A call to this function with known alignment information may look like this:
float* A = new float[100];                                    // unaligned
float* B = (float*)_aligned_malloc(100 * sizeof(float), 16);  // 16-byte aligned memory
float C[100];                                                 // allocated on the stack, possibly unaligned
add_vectors<Unaligned, Aligned, Unaligned>(C, A, B, 100);

When developing a complete library of array functions (such as difference, product, etc.), it is desirable to reuse the code of add_vectors instead of rewriting it for every operator. By using functors, we make the operation itself a parameter (op).
class plus {
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(a, b); }
};

template<Alignment stC, Alignment ldA, Alignment ldB, typename Op>
void process_vectors(float* C, const float* A, const float* B, size_t N, const Op& op)
{
    size_t i = 0;
    // unrolled loop with templated loading/storing
    for(; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = op.eval_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements
    for(; i < N; i++) {
        __m128 a = _mm_load_ss(A + i);
        __m128 b = _mm_load_ss(B + i);
        __m128 c = op.eval_ss(a, b);
        _mm_store_ss(C + i, c);
    }
}

The class plus is used as a functor; any class implementing an operation on two arguments can take its place. We require two evaluation methods: eval_ps for packed operations (4 elements at a time) and eval_ss for operating on single elements.
Two vectors can now be added by the following call:
float *A, *B, *C; // pointers to aligned/unaligned memory (initialization omitted)
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, plus());

The initial coding effort quickly pays off when we extend the library with further operators, such as minus:
class minus {
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_sub_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_sub_ss(a, b); }
};

// ...

// perform C = A - B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, minus());

The functor concept has additional benefits. For example, we can define an operator which performs a weighted sum (a linear combination) of two values and use it with process_vectors without modifications:
_MM_ALIGN16 class scaled_plus {
public:
    scaled_plus(float scale1, float scale2)
        : s1_(_mm_set1_ps(scale1)), s2_(_mm_set1_ps(scale2)) {}
    __m128 eval_ps(__m128 a, __m128 b) const {
        return _mm_add_ps(_mm_mul_ps(a, s1_), _mm_mul_ps(b, s2_));
    }
    __m128 eval_ss(__m128 a, __m128 b) const {
        return _mm_add_ss(_mm_mul_ss(a, s1_), _mm_mul_ss(b, s2_));
    }
private:
    __m128 s1_;
    __m128 s2_;
};

// ...

// perform C = 0.3 * A + 0.7 * B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, scaled_plus(0.3f, 0.7f));

Note that the _MM_ALIGN16 prefix tells the compiler that a scaled_plus object is aligned on a 16-byte boundary, which is required for accessing s1_ and s2_. When compiling this code in a release build, there is no overhead for calling the functor object, since everything can be inlined.
In summary, we showed how to make the alignment of pointers a compile-time argument to ensure optimal load and store operations on arrays of float values. By using functors, we can create a generic library that supports arbitrary mathematical operations and allows passing additional parameters.