Friday, December 30, 2011

Template based vector library with SSE support

This article explains how to set up a vector math library that performs mathematical operations on arbitrarily sized float arrays or vectors. It handles aligned and unaligned pointers with minimal code overhead and without sacrificing runtime performance. This is achieved through template functions and compile-time arguments.
This tutorial is based on a simple code example: adding two arrays and storing the result in a third array. Step by step, we will introduce SSE intrinsics, loop unrolling and functor-based concepts that make it possible to build a library with different operations.
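
As a rough illustration of the idea, the following sketch adds two float arrays with a functor and a compile-time alignment flag. The names AddOp, load4, store4 and apply_binary are hypothetical and chosen for this example only; the library built in the article may be organized differently.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical functor: the same operation for four packed floats and for one float.
struct AddOp {
    static __m128 apply(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
    static float  apply(float a, float b)   { return a + b; }
};

// Aligned is a compile-time argument, so the compiler can resolve the branch
// when it instantiates the template.
template <bool Aligned>
inline __m128 load4(const float* p) {
    return Aligned ? _mm_load_ps(p) : _mm_loadu_ps(p);
}

template <bool Aligned>
inline void store4(float* p, __m128 v) {
    if (Aligned) _mm_store_ps(p, v);
    else         _mm_storeu_ps(p, v);
}

// Computes out[i] = Op(a[i], b[i]) for i in [0, n).
// With Aligned = true, all three pointers must be 16-byte aligned.
template <typename Op, bool Aligned>
void apply_binary(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)  // four elements per iteration
        store4<Aligned>(out + i, Op::apply(load4<Aligned>(a + i), load4<Aligned>(b + i)));
    for (; i < n; ++i)          // scalar tail for the remaining elements
        out[i] = Op::apply(a[i], b[i]);
}
```

A call such as apply_binary<AddOp, false>(a, b, c, n) works for any pointers; other operations only need another functor with the same two apply overloads.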

Wednesday, December 28, 2011

Simple vector3 class with SSE support

In this post we show how to write a simple class that represents a 3D vector and uses SSE operations for fast calculations. The class stores three float values (x, y and z) and implements the basic vector operations: addition, subtraction, multiplication, division, cross product, dot product and length calculation. It uses 128-bit aligned memory, which makes it possible to use SSE intrinsics directly. In addition, it overloads the new and delete operators for arrays, which makes it possible to create multiple instances of vector3.
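
A minimal sketch of such a class could look as follows. It covers only a few of the operators, and the member names (v, lane, dot) are assumptions made for this example, not necessarily those of the class in the post. The array new/delete overloads use _mm_malloc and _mm_free to request 16-byte aligned storage.

```cpp
#include <cstddef>
#include <new>
#include <xmmintrin.h>  // SSE intrinsics; GCC and Clang also declare _mm_malloc/_mm_free here

struct vector3 {
    __m128 v;  // lanes: x, y, z, unused

    vector3() : v(_mm_setzero_ps()) {}
    vector3(float x, float y, float z) : v(_mm_set_ps(0.0f, z, y, x)) {}
    explicit vector3(__m128 m) : v(m) {}

    vector3 operator+(const vector3& o) const { return vector3(_mm_add_ps(v, o.v)); }
    vector3 operator-(const vector3& o) const { return vector3(_mm_sub_ps(v, o.v)); }
    vector3 operator*(float s) const { return vector3(_mm_mul_ps(v, _mm_set1_ps(s))); }

    // Dot product: multiply lane-wise, then add the first three lanes.
    float dot(const vector3& o) const {
        float t[4];
        _mm_storeu_ps(t, _mm_mul_ps(v, o.v));
        return t[0] + t[1] + t[2];
    }

    // Reads one lane (0 = x, 1 = y, 2 = z).
    float lane(int i) const {
        float t[4];
        _mm_storeu_ps(t, v);
        return t[i];
    }
    float x() const { return lane(0); }
    float y() const { return lane(1); }
    float z() const { return lane(2); }

    // Array allocation with 16-byte alignment.
    static void* operator new[](std::size_t bytes) {
        void* p = _mm_malloc(bytes, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete[](void* p) { _mm_free(p); }
};
```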

Thursday, July 28, 2011

Fast 3D-vector matrix transformation using SSE4

In this blog post we show how to efficiently transform multiple 3D vectors using an affine transformation matrix. Each vector has three coordinates (x, y and z) and the matrix consists of three rows of four elements each (3 for rotation/scale + 1 for translation). In order to multiply a 3-element vector with a 3x4 matrix, we add an additional 1 at the end of the 3D vector:
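
The sketch below illustrates the principle for a single vector. It is not the SSE4 code of the post: to stay within plain SSE, each row is multiplied with (x, y, z, 1) and the four products are summed with shuffles instead of a dot-product instruction. The types float3 and matrix3x4 and the function names are assumptions for this example.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct float3 { float x, y, z; };

// Hypothetical layout: three rows of four floats (3 rotation/scale + 1 translation).
struct matrix3x4 { float m[3][4]; };

// Adds the four lanes of v and returns the sum.
inline float hsum(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);                              // (v2, v3, v2, v3)
    __m128 s  = _mm_add_ps(v, hi);                                // lane 0: v0+v2, lane 1: v1+v3
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));    // broadcast lane 1
    return _mm_cvtss_f32(_mm_add_ss(s, s1));                      // v0+v2+v1+v3
}

inline float3 transform_point(const matrix3x4& M, const float3& p) {
    const __m128 v = _mm_set_ps(1.0f, p.z, p.y, p.x);  // (x, y, z, 1)
    float3 r;
    r.x = hsum(_mm_mul_ps(_mm_loadu_ps(M.m[0]), v));
    r.y = hsum(_mm_mul_ps(_mm_loadu_ps(M.m[1]), v));
    r.z = hsum(_mm_mul_ps(_mm_loadu_ps(M.m[2]), v));
    return r;
}
```

Because the fourth vector component is 1, the fourth matrix column is simply added to the result, which is the translation.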

Friday, June 17, 2011

Intel Architecture Code Analyzer

Intel provides a great tool for static analysis of compiled C++ code at the assembly level. It is called the Intel Architecture Code Analyzer and allows you to analyze how instructions are executed on an Intel CPU (including instruction pairing and the critical path).

Thursday, June 16, 2011

Bilinear Pixel Interpolation using SSE

Bilinear pixel interpolation is a common operation in image processing applications (resizing, distorting, etc.) as well as in computer graphics (texturing, etc.). It allows accessing pixels at non-integer coordinates of the underlying image by building a weighted sum over the four neighboring pixels of the specified image position. On GPUs this operation is implemented in hardware. However, some algorithms cannot be ported to the GPU easily, and a CPU implementation of bilinear interpolation is needed.

This article will show how to efficiently implement such an operation in C++ using SSE2 instructions for 8-bit RGBA images. First, we will show how to perform bilinear interpolation using pure C++ code and then present an enhanced example where we utilize SSE intrinsics.
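
For orientation, a plain C++ version of the weighted sum might look like the sketch below. It assumes a tightly packed, row-major image with 4 bytes per pixel and leaves bounds checking to the caller; the function name and signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Samples an 8-bit RGBA image at the non-integer position (x, y).
// Assumption: 0 <= x < width - 1 and 0 <= y < height - 1, so all four
// neighbors lie inside the image.
inline void bilinear_rgba(const std::uint8_t* img, int width,
                          float x, float y, std::uint8_t out[4]) {
    const int px = static_cast<int>(x);  // top-left neighbor
    const int py = static_cast<int>(y);
    const float fx = x - px;             // fractional offsets in [0, 1)
    const float fy = y - py;

    const std::uint8_t* p00 = img + 4 * (static_cast<std::size_t>(py) * width + px);
    const std::uint8_t* p10 = p00 + 4;                                  // right
    const std::uint8_t* p01 = p00 + 4 * static_cast<std::size_t>(width); // below
    const std::uint8_t* p11 = p01 + 4;                                  // below right

    // The four weights sum to 1.
    const float w00 = (1.0f - fx) * (1.0f - fy);
    const float w10 = fx * (1.0f - fy);
    const float w01 = (1.0f - fx) * fy;
    const float w11 = fx * fy;

    for (int c = 0; c < 4; ++c) {  // R, G, B, A
        const float v = p00[c] * w00 + p10[c] * w10 + p01[c] * w01 + p11[c] * w11;
        out[c] = static_cast<std::uint8_t>(v + 0.5f);  // round to nearest
    }
}
```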

Friday, April 15, 2011

How to process a STL vector using SSE code

In C++ it's very convenient to store array data using std::vector from the STL. On modern CPUs you can take advantage of vectorized instructions that allow you to operate on multiple data elements at the same time. But how do you combine a std::vector with SSE code? For example, you want to sum up all elements of a float vector of arbitrary length (in C++ this corresponds to std::accumulate(v.begin(), v.end(), 0.0f)). This article will show you how to access the elements of the vector using SSE intrinsics and accumulate all elements into a single value.
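
One way to do this is sketched below. The function name sum_sse is made up for this example, and unaligned loads are used because std::vector with the default allocator does not promise 16-byte alignment; the article itself may treat alignment differently.

```cpp
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics

inline float sum_sse(const std::vector<float>& v) {
    const float* p = v.data();
    const std::size_t n = v.size();

    __m128 acc = _mm_setzero_ps();  // four partial sums
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));

    // Combine the four partial sums into one value.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    // Scalar tail for lengths that are not a multiple of four.
    for (; i < n; ++i)
        total += p[i];
    return total;
}
```

Note that the elements are added in a different order than with std::accumulate, so the result can differ in the last bits for general floating point data.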

3D Point Projection to 2D Image Using SSE

The most common operation in 3D applications is projecting 3D points onto a 2D image, which most often is your computer screen. For that purpose you normally use your graphics card, which is most efficient at this task. But there are times when you need to further process the 2D points on your CPU. This article will show you how to efficiently perform 3D point projection in C++ using SSE intrinsics.

Monday, April 11, 2011

Loading four consecutive 3D vectors into SSE registers

In the previous two posts, we showed you how to load one or two three-element float vectors using SSE intrinsics and a combination of 128-bit, 64-bit and 32-bit loads. When loading four 3D vectors that are tightly packed in memory, we need to load 12 float values in total, which is possible using only three 128-bit load operations. In order to perform mathematical operations with them, it's practical to have one vector per register. The following code shows you how to load 12 float values and distribute them to four SSE registers, so that each register contains an X, Y and Z value (the fourth component is unused).
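
A possible shuffle sequence is sketched here. It is one of several ways to rearrange the data and is not necessarily the sequence used in the post; the function name is hypothetical and unaligned loads are used so that the sketch works for any pointer.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Loads x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 from p and writes one vector
// per register to out[0..3]. The fourth lane of each register holds a
// leftover value and must be treated as unused.
inline void load_four_float3(const float* p, __m128 out[4]) {
    const __m128 a = _mm_loadu_ps(p);      // x0 y0 z0 x1
    const __m128 b = _mm_loadu_ps(p + 4);  // y1 z1 x2 y2
    const __m128 c = _mm_loadu_ps(p + 8);  // z2 x3 y3 z3

    out[0] = a;                                                    // x0 y0 z0 (x1)

    const __m128 t = _mm_shuffle_ps(a, b, _MM_SHUFFLE(0, 0, 3, 3)); // x1 x1 y1 y1
    out[1] = _mm_shuffle_ps(t, b, _MM_SHUFFLE(1, 1, 2, 0));        // x1 y1 z1 (z1)

    out[2] = _mm_shuffle_ps(b, c, _MM_SHUFFLE(1, 0, 3, 2));        // x2 y2 z2 (x3)
    out[3] = _mm_shuffle_ps(c, c, _MM_SHUFFLE(0, 3, 2, 1));        // x3 y3 z3 (z2)
}
```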

Vector Cross Product using SSE Code

A common operation for two 3D vectors is the cross product:
|a.x|   |b.x|   | a.y * b.z - a.z * b.y |
|a.y| X |b.y| = | a.z * b.x - a.x * b.z |
|a.z|   |b.z|   | a.x * b.y - a.y * b.x |
Executing this operation using scalar instructions requires six multiplications and three subtractions. When using vectorized SSE code, the same operation can be performed using two multiplications, one subtraction and four shuffle operations:
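
A common formulation with exactly these operation counts is sketched below (the function name cross is chosen for this example). Both inputs hold x, y, z in lanes 0 to 2; the fourth lane of the result evaluates to a.w * b.w - a.w * b.w.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

inline __m128 cross(__m128 a, __m128 b) {
    // Four shuffles: rotate the components of both inputs.
    const __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));  // a.y a.z a.x
    const __m128 a_zxy = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 1, 0, 2));  // a.z a.x a.y
    const __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));  // b.y b.z b.x
    const __m128 b_zxy = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 1, 0, 2));  // b.z b.x b.y

    // Two multiplications and one subtraction:
    // (a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x)
    return _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy), _mm_mul_ps(a_zxy, b_yzx));
}
```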

Tuesday, April 5, 2011

How to Unroll a For-Loop in C++

In this article we will show you how to manually unroll a loop in C++. Loop unrolling has performance advantages due to the reduced overhead of checking and advancing the loop counter at each iteration. Also, when using vectorized mathematical operations such as SSE instructions, it is possible to perform some iterations of the same loop in parallel. Let's first focus on the general concept of loop unrolling with a simple example:
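
The following sketch adds two arrays with the loop body unrolled four times. The function name and the unroll factor of four are choices made for this example.

```cpp
#include <cstddef>

inline void add_unrolled(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;

    // Main loop: the counter is checked and advanced once per four elements.
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }

    // Remainder loop: handles the last n % 4 elements.
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The four statements in the main loop are independent of each other, which is what makes it possible to replace them with a single SSE addition.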

Monday, March 28, 2011

Changing the sign of float values using SSE code

The IEEE 754 floating point format defines the memory layout for the C++ float datatype. It consists of a 1-bit sign, an 8-bit exponent and 23 bits that store the fractional part of the value.
float x = [sign (1 bit) | exponent (8 bit) | fraction (23 bit)]
We can use this knowledge about the memory-layout in order to change the sign of floating point values without the need for floating point arithmetic.
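
For example, flipping or clearing the sign bit can be done with bitwise SSE operations, as in this sketch. The function names are hypothetical; the constant -0.0f is used as a mask because only its sign bit is set.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Flips the sign bit of all four floats: x becomes -x.
inline __m128 negate(__m128 v) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  // bit pattern 0x80000000 in each lane
    return _mm_xor_ps(v, sign_mask);
}

// Clears the sign bit of all four floats: x becomes |x|.
inline __m128 absolute(__m128 v) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_andnot_ps(sign_mask, v);           // (~sign_mask) & v
}
```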

Sunday, March 27, 2011

Loading two consecutive 3D vectors into SSE registers

In the previous post we showed how to load a single float3 vector into an SSE register. This required either one 64-bit and one 32-bit load, or three 32-bit loads if the data is not aligned on an 8-byte address. This blog entry will show how to use more efficient loads when loading two float3 values into separate SSE registers.
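
One plausible combination, shown here only as a sketch, is a 128-bit load for the first four floats followed by a 64-bit load for the remaining two. The post may use a different sequence; the function name is hypothetical and the unaligned 128-bit load is chosen so that the sketch works for any pointer.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Loads x0 y0 z0 x1 y1 z1 from p into two registers.
// The fourth lane of each result is a leftover value and must be treated as unused.
inline void load_two_float3(const float* p, __m128& v0, __m128& v1) {
    const __m128 a = _mm_loadu_ps(p);                               // x0 y0 z0 x1
    const __m128 b = _mm_loadl_pi(_mm_setzero_ps(),
                                  reinterpret_cast<const __m64*>(p + 4));  // y1 z1 0 0

    v0 = a;                                                         // x0 y0 z0 (x1)
    const __m128 t = _mm_shuffle_ps(a, b, _MM_SHUFFLE(0, 0, 3, 3)); // x1 x1 y1 y1
    v1 = _mm_shuffle_ps(t, b, _MM_SHUFFLE(1, 1, 2, 0));             // x1 y1 z1 (z1)
}
```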

Saturday, March 26, 2011

Loading a 3D vector into an SSE register

In this blog entry I will show you how to load a three-element float vector into an SSE register using C++ intrinsics. This tutorial is based on the float3 datatype, which holds three floats (see Common Datatypes). An SSE register is able to store four float values and there are methods to load one, two or all four values, but not three. Therefore we need to split the load into two parts and combine them.
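
The split into a two-element and a one-element load can be sketched like this. The float3 struct below is a stand-in with three consecutive floats, assumed to match the datatype referenced above, and the function name is made up for the example.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct float3 { float x, y, z; };  // assumed layout: three consecutive floats

inline __m128 load_float3(const float3& v) {
    // 64-bit load of x and y into the lower half: (x, y, 0, 0)
    const __m128 xy = _mm_loadl_pi(_mm_setzero_ps(),
                                   reinterpret_cast<const __m64*>(&v.x));
    // 32-bit load of z into the lowest lane: (z, 0, 0, 0)
    const __m128 z = _mm_load_ss(&v.z);
    // Combine the lower halves of both registers: (x, y, z, 0)
    return _mm_movelh_ps(xy, z);
}
```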

Friday, March 25, 2011

Matrix vector multiplication using SSE3

A common operation is the matrix-vector product. In the following example we multiply a 4-by-4 matrix and a 4-element vector. For the multiplication of a 3x4 matrix with a 3D vector, see the post "Fast 3D-vector matrix transformation using SSE4".

Using standard C++ code, the matrix-vector product requires 16 multiplications and 12 add operations:
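
A scalar version along these lines makes the operation count visible. The row-major layout of m and the function name are assumptions for this sketch.

```cpp
// Multiplies a 4x4 matrix (row-major, 16 floats) with a 4-element vector.
// Each of the four rows costs 4 multiplications and 3 additions,
// which gives 16 multiplications and 12 additions in total.
inline void mat4_mul_vec4(const float m[16], const float v[4], float out[4]) {
    for (int r = 0; r < 4; ++r) {
        out[r] = m[4 * r + 0] * v[0]
               + m[4 * r + 1] * v[1]
               + m[4 * r + 2] * v[2]
               + m[4 * r + 3] * v[3];
    }
}
```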

Fast iteration over STL vector elements

The STL class std::vector is a great container for managing dynamic arrays. Unfortunately, it introduces a slight overhead when iterating through it using an index. For example, the following code is not very efficient:
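
The sketch below contrasts an index-based loop with a pointer-based one. It is an assumed reconstruction of the kind of code meant here, not the listing from the post, and how much the two differ in speed depends on the compiler and its optimization settings.

```cpp
#include <cstddef>
#include <vector>

// Index-based: every iteration calls v.size() and goes through operator[].
inline float sum_indexed(const std::vector<float>& v) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    return sum;
}

// Pointer-based: the begin and end addresses are fetched once before the loop.
inline float sum_pointer(const std::vector<float>& v) {
    float sum = 0.0f;
    if (v.empty())
        return sum;
    const float* p = &v[0];
    const float* const end = p + v.size();
    for (; p != end; ++p)
        sum += *p;
    return sum;
}
```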

Welcome to Fast C++

Have you ever wondered how to make code in C++ efficient and fast? I have seen many programs that are supposed to run fast but contain common pitfalls that make them slower than necessary. Using just some simple tricks and basic knowledge about high performance programming, you can make your current C++ program a lot faster. So stay tuned for useful tips and tricks to make your C++ program lightning fast!