## Friday, December 30, 2011

### Template based vector library with SSE support

This article explains how to set up a vector math library that performs mathematical operations on arbitrarily sized float arrays, or vectors. It handles aligned and unaligned pointers with minimal code overhead but optimal runtime performance. This is achieved through template functions and compile-time arguments.
This tutorial is based on a simple code example: adding two arrays and storing the result in a third array. Step by step, we will introduce SSE intrinsics, loop unrolling, and functor-based concepts that allow us to build a library supporting different operations.
```
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    for (size_t i = 0; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
This code simply sums the arrays element-wise. In a previous article about loop unrolling we explained how to make the loop more efficient by reducing the number of comparisons and conditional jumps:
```
// round x down to a multiple of s (s must be a power of two)
#define ROUND_DOWN(x, s) ((x) & ~((s) - 1))

void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        C[i+0] = A[i+0] + B[i+0];
        C[i+1] = A[i+1] + B[i+1];
        C[i+2] = A[i+2] + B[i+2];
        C[i+3] = A[i+3] + B[i+3];
    }
    // process remaining elements
    for (; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
Modern processors have special SIMD units that can process more than one float value at a time. In the unrolled loop we can therefore use SSE intrinsics that process all four values simultaneously:
```
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    // unrolled loop with SSE instructions
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        __m128 a = _mm_loadu_ps(A + i);
        __m128 b = _mm_loadu_ps(B + i);
        __m128 c = _mm_add_ps(a, b);
        _mm_storeu_ps(C + i, c);
    }
    // process remaining elements
    for (; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
Note that we use the unaligned load and store intrinsics, since we cannot assume that the pointers `A`, `B`, and `C` point to memory aligned on a 16-byte address, which is required for the much faster aligned versions `_mm_load_ps` and `_mm_store_ps`. Writing special implementations for aligned memory access is cumbersome, since there are 8 possible combinations of aligned and unaligned pointers in this function. Therefore, we present a template-based approach that allows us to parametrize the function at compile time. At run time the appropriate load/store function is used with no additional overhead (supported by function inlining). We define template functions `load_ps` and `store_ps` with an alignment parameter and specialize each:
```
enum Alignment
{
    Aligned,
    Unaligned
};

template<Alignment type> inline __m128 load_ps(const float* p);
template<> inline __m128 load_ps<Aligned>(const float* p) { return _mm_load_ps(p); }
template<> inline __m128 load_ps<Unaligned>(const float* p) { return _mm_loadu_ps(p); }

template<Alignment type> inline void store_ps(float* p, __m128 a);
template<> inline void store_ps<Aligned>(float* p, __m128 a) { return _mm_store_ps(p, a); }
template<> inline void store_ps<Unaligned>(float* p, __m128 a) { return _mm_storeu_ps(p, a); }
```
Now, we can rewrite our `add_vectors` function with three template parameters, each specifying the alignment of the corresponding pointer.
```
template<Alignment stC, Alignment ldA, Alignment ldB>
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
    size_t i = 0;
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = _mm_add_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements
    for (; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
A call to this function with known alignment information may look like this:
```
float* A = new float[100];                                   // unaligned memory from plain new
float* B = (float*)_aligned_malloc(100 * sizeof(float), 16); // 16-byte aligned memory
float C[100];                                                // allocated on the stack, probably unaligned
add_vectors<Unaligned, Unaligned, Aligned>(C, A, B, 100);
```
When developing a complete library of array functions (such as difference, product, etc.), it is desirable to reuse the code of `add_vectors` and not rewrite it for every operator. By using functors we make the operation itself a parameter (`op`).
```
class plus
{
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(a, b); }
};

template<Alignment stC, Alignment ldA, Alignment ldB, typename Op>
void process_vectors(float* C, const float* A, const float* B, size_t N, const Op& op)
{
    size_t i = 0;
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        __m128 a = load_ps<ldA>(A + i);
        __m128 b = load_ps<ldB>(B + i);
        __m128 c = op.eval_ps(a, b);
        store_ps<stC>(C + i, c);
    }
    // process remaining elements with scalar SSE operations
    for (; i < N; i++)
    {
        __m128 a = _mm_load_ss(A + i);
        __m128 b = _mm_load_ss(B + i);
        __m128 c = op.eval_ss(a, b);
        _mm_store_ss(C + i, c);
    }
}
```
The class `plus` is a functor that encapsulates an operation with two arguments. We require two evaluation methods: `eval_ps` for packed operations (4 elements at a time) and `eval_ss` for operating on single elements. Two vectors can now be added with the following call:
```
float *A, *B, *C; // pointers to aligned/unaligned memory
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, plus());
```
The initial coding effort quickly pays off when we extend the library with different operators, such as `minus`:
```
class minus
{
public:
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_sub_ps(a, b); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_sub_ss(a, b); }
};

// ...

// perform C = A - B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, minus());
```
The functor concept has additional benefits. For example, we can define an operator that computes a weighted sum, or linear combination, of two values and use it with our `process_vectors` without modification:
```
_MM_ALIGN16 class scaled_plus
{
public:
    scaled_plus(float scale1, float scale2) : s1_(_mm_set1_ps(scale1)), s2_(_mm_set1_ps(scale2)) {}
    __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(_mm_mul_ps(a, s1_), _mm_mul_ps(b, s2_)); }
    __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(_mm_mul_ss(a, s1_), _mm_mul_ss(b, s2_)); }
private:
    __m128 s1_;
    __m128 s2_;
};

// ...

// perform C = 0.3 * A + 0.7 * B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, scaled_plus(0.3f, 0.7f));
```
Note that the `_MM_ALIGN16` prefix tells the compiler that a `scaled_plus` object must be aligned on a 16-byte boundary, which is required for accessing the `__m128` members `s1_` and `s2_`. When compiling this code in a release build, there is no overhead when calling the functor object, since everything can be inlined.
In summary, we showed how to make the alignment of pointers a compile-time argument to ensure optimal load and store operations on arrays of float values. By using functors, we are able to create a generic library that supports arbitrary mathematical operations and enables passing additional parameters.