std::vector
from the STL library. On modern CPUs you can take advantage of vectorized instructions that allow you to operate on multiple data elements at the same time. But how do you combine a std::vector
with SSE code? For example, you want to sum up each element of a float vector of arbitrary length (in C++ this corresponds to std::accumulate(v.begin(), v.end(), 0.0f)
). This article will show you how to access the elements of the vector using SSE intrinsics and accumulate all elements into a single value.First, you need to get a pointer to the data of the
std::vector
v
in order to load multiple elements using the 128-bit _mm_loadu_ps
intrinsic. Note that it is generally not possible to load the data using an aligned load since the std::vector
class does not allocate aligned memory. Now we us the loop-unrolling technique presented in an earlier article in order to sum up 4 elements in parallel using a single _mm_add_ps
instruction. When there are still elements left after all unrolled iterations of the loop, we add them up using the single-element _mm_add_ss
. Since we accumulate the float values in four bins, we need to add those up as well before we can return the total sum over all elements of the vector v
. In this example we use the SSE3 horizontal add intrinsic _mm_hadd_ps
.float accumulate(const std::vector<float>& v) { // copy the length of v and a pointer to the data onto the local stack const size_t N = v.size(); const float* p = (N > 0) ? &v.front() : NULL; __m128 mmSum = _mm_setzero_ps(); size_t i = 0; // unrolled loop that adds up 4 elements at a time for(; i < ROUND_DOWN(N, 4); i+=4) { mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i)); } // add up single values until all elements are covered for(; i < N; i++) { mmSum = _mm_add_ss(mmSum, _mm_load_ss(p + i)); } // add up the four float values from mmSum into a single value and return mmSum = _mm_hadd_ps(mmSum, mmSum); mmSum = _mm_hadd_ps(mmSum, mmSum); return _mm_cvtss_f32(mmSum); }To increase performance even further, we can add more
_mm_add_ps
calls to the body of the unrolled loop. In order to add up 16 floats in a single iteration, insert the snippet below after the size_t i = 0;
line: for(; i < ROUND_DOWN(N, 16); i+=16) { mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 0)); mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 4)); mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 8)); mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 12)); }While this code reduces the loop-overhead, it does
SSE
commands simultaneously by obeying a simple rule: do not read from the register you have just written. In other words, if you use two mutually data-independent instructions, both instructions can be either executed at the same time using instruction paring or at least allow to hide latencies.In our unrolled accumulation code we can for example break the dependency chain on the
mmSum
register by introducing intermediate registers that contain partial sums:for(; i < ROUND_DOWN(N, 16); i+=16) { __m128 v0 = _mm_loadu_ps(p + i + 0); __m128 v1 = _mm_loadu_ps(p + i + 4); __m128 s01 = _mm_add_ps(v0, v1); __m128 v2 = _mm_loadu_ps(p + i + 8); __m128 v3 = _mm_loadu_ps(p + i + 12); __m128 s23 = _mm_add_ps(v2, v3); mmSum = _mm_add_ps(mmSum, s01); mmSum = _mm_add_ps(mmSum, s23); }This code will allow the processor to load data from memory while adding up some values in parallel. Changing the naive unrolled loop to the one above increases the performance of the overall method by more than 40% compared to the naive unrolling.
Benchmark
In order to measure the performance of different accumulation techniques, we called the
accumulate
method one million times and let it add up a 10000 element vector. These are the timing results:Method | Time | Performance increase |
---|---|---|
Native C++ | 9372ms | N/A (timing base) |
Simple unrolled SSE | 2348ms | + 299% (3.9 times faster) |
4x unrolled SSE | 2337ms | + 301% (4.0 times faster) |
4x unrolled SSE w/ partial sums | 1633ms | + 474% (5.7 times faster) |
As you can see, it is possible to not only gain an expected performance increase of four due to adding up 4 floats with a single instruction, but achieve an even greater performance by instruction level parallelism.
That's a great position you folks have been carrying out there.
ReplyDeleteall vectors
this is awesome! thanks
ReplyDeleteCan you do (some of) the remainder at the beginning such that internal loop can use aligned loads?
ReplyDeletedo you mean, add up single float until the first float is aligned, then use aligned floats? why not. The number of iterations of the pre-alignment loop should be quite easy to compute.
ReplyDeleteDid you do any tests with compiler optimizations vs your manually crafted code?
ReplyDelete