Friday, April 15, 2011

How to process an STL vector using SSE code

In C++ it is very convenient to store array data in a std::vector from the standard library. On modern CPUs you can take advantage of vectorized instructions that operate on multiple data elements at the same time. But how do you combine a std::vector with SSE code? Suppose, for example, that you want to sum up all elements of a float vector of arbitrary length (in C++ this corresponds to std::accumulate(v.begin(), v.end(), 0.0f)). This article shows how to access the elements of the vector using SSE intrinsics and accumulate them into a single value.

First, you need to get a pointer to the data of the std::vector v in order to load multiple elements using the 128-bit _mm_loadu_ps intrinsic. Note that you generally cannot use an aligned load here, since std::vector makes no guarantee that its data is 16-byte aligned. Now we use the loop-unrolling technique presented in an earlier article to sum up 4 elements in parallel with a single _mm_add_ps instruction. Any elements left over after the unrolled iterations of the loop are added with the single-element _mm_add_ss. Since we accumulate the float values in four bins, those bins must be added up as well before we can return the total sum over all elements of the vector v. In this example we use the SSE3 horizontal-add intrinsic _mm_hadd_ps.
#include <pmmintrin.h> // SSE3, for _mm_hadd_ps
#include <vector>

// rounds x down to the nearest multiple of s (s must be a power of two)
#define ROUND_DOWN(x, s) ((x) & ~((s) - 1))

float accumulate(const std::vector<float>& v)
{
 // copy the length of v and a pointer to the data onto the local stack
 const size_t N = v.size();
 const float* p = (N > 0) ? &v.front() : NULL;

 __m128 mmSum = _mm_setzero_ps();
 size_t i = 0;

 // unrolled loop that adds up 4 elements at a time
 for(; i < ROUND_DOWN(N, 4); i+=4)
  mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i));

 // add up single values until all elements are covered
 for(; i < N; i++)
  mmSum = _mm_add_ss(mmSum, _mm_load_ss(p + i));

 // add up the four float values from mmSum into a single value and return
 mmSum = _mm_hadd_ps(mmSum, mmSum);
 mmSum = _mm_hadd_ps(mmSum, mmSum);
 return _mm_cvtss_f32(mmSum);
}
To increase performance even further, we can add more _mm_add_ps calls to the body of the unrolled loop. To add up 16 floats in a single iteration, insert the snippet below after the size_t i = 0; line:
for(; i < ROUND_DOWN(N, 16); i+=16)
{
  mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 0));
  mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 4));
  mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 8));
  mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i + 12));
}
While this code reduces the loop overhead, it does not yet maximize performance. Modern CPUs have multiple execution units and can execute several instructions at the same time. It is therefore possible to run multiple SSE instructions simultaneously by obeying a simple rule: do not read from the register you have just written to. In other words, if two consecutive instructions are mutually data-independent, the CPU can either execute them at the same time (instruction pairing) or at least overlap them to hide their latencies.

In our unrolled accumulation code we can for example break the dependency chain on the mmSum register by introducing intermediate registers that contain partial sums:
for(; i < ROUND_DOWN(N, 16); i+=16)
{
  __m128 v0 = _mm_loadu_ps(p + i + 0);
  __m128 v1 = _mm_loadu_ps(p + i + 4);
  __m128 s01 = _mm_add_ps(v0, v1);

  __m128 v2 = _mm_loadu_ps(p + i + 8);
  __m128 v3 = _mm_loadu_ps(p + i + 12);
  __m128 s23 = _mm_add_ps(v2, v3);

  mmSum = _mm_add_ps(mmSum, s01);
  mmSum = _mm_add_ps(mmSum, s23);
}
This code allows the processor to load data from memory while adding up other values in parallel. Changing the naive unrolled loop to the version above increases the performance of the overall method by more than 40%.

In order to measure the performance of different accumulation techniques, we called the accumulate method one million times and let it add up a 10000 element vector. These are the timing results:

 Method                           | Time    | Performance increase
 Native C++                       | 9372 ms | N/A (timing base)
 Simple unrolled SSE              | 2348 ms | +299% (3.9 times faster)
 4x unrolled SSE                  | 2337 ms | +301% (4.0 times faster)
 4x unrolled SSE w/ partial sums  | 1633 ms | +474% (5.7 times faster)

As you can see, it is possible not only to achieve the expected fourfold speedup from adding up 4 floats with a single instruction, but to go beyond it by exploiting instruction-level parallelism.



  2. this is awesome! thanks

  3. Can you do (some of) the remainder at the beginning such that internal loop can use aligned loads?

  4. Do you mean: add up single floats until the pointer is 16-byte aligned, then use aligned loads? Why not. The number of iterations of the pre-alignment loop should be quite easy to compute.

  5. Did you do any tests with compiler optimizations vs your manually crafted code?