Friday, April 15, 2011

How to process an STL vector using SSE code

In C++ it's very convenient to store array data in a std::vector from the STL. On modern CPUs you can take advantage of vectorized instructions that allow you to operate on multiple data elements at the same time. But how do you combine a std::vector with SSE code? For example, you want to sum up all elements of a float vector of arbitrary length (in C++ this corresponds to std::accumulate(v.begin(), v.end(), 0.0f)). This article will show you how to access the elements of the vector using SSE intrinsics and accumulate all elements into a single value.
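As a rough illustration of the idea, here is a minimal sketch that sums a float vector four elements at a time. The function name sum_sse is made up for this example, and the sketch assumes unaligned loads (_mm_loadu_ps), since std::vector's default allocator does not guarantee 16-byte alignment. Elements that do not fill a whole SSE register are added with a scalar loop.

```cpp
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not from the article): sums all elements of v.
float sum_sse(const std::vector<float>& v) {
    const float* data = v.data();
    const std::size_t n = v.size();

    // Four partial sums, one per SSE lane.
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Unaligned load: the vector's storage may not be 16-byte aligned.
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }

    // Horizontal add of the four lanes into lane 0.
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);   // (a0+a1, a0+a1, a2+a3, a2+a3)
    shuf = _mm_movehl_ps(shuf, sums);      // move a2+a3 into lane 0
    sums = _mm_add_ss(sums, shuf);         // lane 0 = a0+a1+a2+a3
    float total = _mm_cvtss_f32(sums);

    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Because the additions happen in a different order than in std::accumulate, the result can differ slightly in the last bits for general floating-point input.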

3D Point Projection to 2D Image Using SSE

One of the most common operations in 3D applications is projecting 3D points onto a 2D image, most often your computer screen. For that purpose you normally use your graphics card, which is most efficient at this task. But there are times when you need to further process the 2D points on your CPU. This article will show you how to efficiently perform 3D point projection in C++ using SSE intrinsics.
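To make the idea concrete, here is a small sketch that projects four points at once. It assumes a simple pinhole model, u = f * x / z + cx and v = f * y / z + cy, and a structure-of-arrays layout with separate x, y and z arrays. The Camera struct and the function project4 are hypothetical names for this example, and the article itself may use a different camera model or data layout.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical pinhole camera: focal length f and principal point (cx, cy).
struct Camera {
    float f;
    float cx;
    float cy;
};

// Projects four 3D points, given as separate x, y and z arrays of four
// floats each, to four 2D points written to u and v.
// Assumes all z values are non-zero.
void project4(const float* x, const float* y, const float* z,
              const Camera& cam, float* u, float* v) {
    const __m128 vx = _mm_loadu_ps(x);
    const __m128 vy = _mm_loadu_ps(y);
    const __m128 vz = _mm_loadu_ps(z);

    // One division per point: scale = f / z.
    const __m128 scale = _mm_div_ps(_mm_set1_ps(cam.f), vz);

    // u = x * scale + cx, v = y * scale + cy
    const __m128 vu = _mm_add_ps(_mm_mul_ps(vx, scale), _mm_set1_ps(cam.cx));
    const __m128 vv = _mm_add_ps(_mm_mul_ps(vy, scale), _mm_set1_ps(cam.cy));

    _mm_storeu_ps(u, vu);
    _mm_storeu_ps(v, vv);
}
```

Computing f / z once and reusing it for both coordinates replaces two divisions per point with one division and two multiplications.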

Monday, April 11, 2011

Loading four consecutive 3D vectors into SSE registers

In the previous two posts, we showed you how to load one or two three-element float vectors using SSE intrinsics and a combination of 128-bit, 64-bit and 32-bit loads. When loading four 3D vectors that are tightly packed in memory, we need to load 12 float values in total, which is possible using only three 128-bit load operations. In order to perform mathematical operations with them, it's practical to have one vector per register. The following code shows you how to load 12 float values and distribute them to four SSE registers, so that each register contains an X, Y and Z value (the fourth component is unused).
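A minimal sketch of one way to do this is shown here. The function name load4_vec3 is made up for this example, the loads are assumed to be unaligned, and the exact shuffle sequence in the original post may differ. The fourth component of each output register holds a leftover value and should be ignored.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: p points to 12 tightly packed floats
// (x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3).
void load4_vec3(const float* p,
                __m128& v0, __m128& v1, __m128& v2, __m128& v3) {
    const __m128 a = _mm_loadu_ps(p);      // x0 y0 z0 x1
    const __m128 b = _mm_loadu_ps(p + 4);  // y1 z1 x2 y2
    const __m128 c = _mm_loadu_ps(p + 8);  // z2 x3 y3 z3

    // Vector 0 is already in place (fourth lane holds x1, unused).
    v0 = a;

    // Vector 1 spans a and b, so it takes two shuffles.
    const __m128 t = _mm_shuffle_ps(a, b, _MM_SHUFFLE(0, 0, 3, 3));  // x1 x1 y1 y1
    v1 = _mm_shuffle_ps(t, b, _MM_SHUFFLE(1, 1, 2, 0));              // x1 y1 z1 z1

    // Vector 2: low half from b, high half from c.
    v2 = _mm_shuffle_ps(b, c, _MM_SHUFFLE(0, 0, 3, 2));              // x2 y2 z2 z2

    // Vector 3: rotate c by one lane.
    v3 = _mm_shuffle_ps(c, c, _MM_SHUFFLE(0, 3, 2, 1));              // x3 y3 z3 z2
}
```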

Vector Cross Product using SSE Code

A common operation for two 3D vectors is the cross product:
|a.x|   |b.x|   | a.y * b.z - a.z * b.y |
|a.y| X |b.y| = | a.z * b.x - a.x * b.z |
|a.z|   |b.z|   | a.x * b.y - a.y * b.x |
Executing this operation using scalar instructions requires six multiplications and three subtractions. When using vectorized SSE code, the same operation can be performed using two multiplications, one subtraction and four shuffle operations:
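One way to write this with intrinsics is sketched here. The function name cross_sse is an assumption for this example, and both inputs are expected to hold X, Y and Z in the three lowest lanes. The sketch computes a.yzx * b.zxy - a.zxy * b.yzx, which matches the formula above row by row.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: cross product of two 3D vectors stored as
// (x, y, z, unused) in SSE registers.
inline __m128 cross_sse(__m128 a, __m128 b) {
    // Four shuffles: rotate the components of a and b.
    const __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 a_zxy = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 b_zxy = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 1, 0, 2));

    // Two multiplications and one subtraction:
    // (a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x)
    return _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy),
                      _mm_mul_ps(a_zxy, b_yzx));
}
```

For example, cross_sse(_mm_setr_ps(1, 0, 0, 0), _mm_setr_ps(0, 1, 0, 0)) yields the Z axis in the three lowest lanes.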

Tuesday, April 5, 2011

How to Unroll a For-Loop in C++

In this article we will show you how to manually unroll a loop in C++. Loop unrolling has performance advantages due to the reduced overhead of checking and advancing the loop counter at each iteration. Also, when using vectorized mathematical operations such as SSE instructions, it is possible to perform several iterations of the same loop in parallel. Let's first focus on the general concept of loop unrolling with a simple example:
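The sketch here illustrates the concept with a loop that scales an array in place, unrolled by a factor of four. The function name scale_unrolled and the unroll factor are choices made for this example, not necessarily those of the article. A second loop handles the elements left over when the length is not a multiple of four.

```cpp
#include <cstddef>

// Hypothetical example: multiplies every element of data by factor.
void scale_unrolled(float* data, std::size_t n, float factor) {
    std::size_t i = 0;

    // Main loop: four elements per iteration, so the loop condition
    // and the counter increment run only once per four elements.
    for (; i + 4 <= n; i += 4) {
        data[i]     *= factor;
        data[i + 1] *= factor;
        data[i + 2] *= factor;
        data[i + 3] *= factor;
    }

    // Remainder loop: the last 0 to 3 elements.
    for (; i < n; ++i) {
        data[i] *= factor;
    }
}
```

The four statements in the main loop are independent of each other, which is what later makes it possible to replace them with a single SSE multiplication.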