## Sunday, March 10, 2013

### Efficient Processing of Arrays using SSE/SIMD and C++ Functors

This post shows how to write clean SIMD/SSE code that is easy to debug and maintain, yet still compiles to optimal code with zero overhead. It builds upon earlier posts such as "Template based vector library" and "How to unroll a loop in C++". Here we present a pattern that is based entirely on functors and resembles array processing in the Standard Template Library (STL).
Given one or more arrays of values (such as `float`, `double`, or `int`), we want to process them element-wise. For example, we want to scale the values of an input array and add them to the output array. This is exactly the behavior of the `?AXPY` routine of the BLAS linear algebra library:
```
Y = A*X + Y
```
Note that this particular function is just a running example (we use the single-precision data type `float`, so our function of interest is called `saxpy`). The underlying concept applies to any element-wise array function.
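To make the idea concrete, here is a minimal sketch of what such a functor-based `saxpy` could look like. The names `SaxpyFunctor` and `transform_n` are our own placeholders for this illustration, and we assume unaligned loads and stores so that any pointer works. The functor provides the same operation twice: once for a single `float` and once for four packed floats in an `__m128`.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical functor: y = a*x + y for one float or for four packed floats.
struct SaxpyFunctor {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
    __m128 operator()(__m128 x, __m128 y) const {
        return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a), x), y);
    }
};

// Hypothetical driver in the spirit of std::transform: y[i] = f(x[i], y[i]).
template <typename F>
void transform_n(const float* x, float* y, std::size_t n, F f) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    std::size_t i = 0;
    // Main loop: four elements per iteration using the packed overload.
    for (; i + 4 <= n; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);
        const __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, f(vx, vy));
    }
    // Tail loop: the remaining 0..3 elements using the scalar overload.
    for (; i < n; ++i) y[i] = f(x[i], y[i]);
}

inline void saxpy(float a, const float* x, float* y, std::size_t n) {
    transform_n(x, y, n, SaxpyFunctor{a});
}
```

Because the functor is passed as a template argument, the compiler can inline both overloads into the loops, and the scalar overload doubles as a reference implementation when debugging.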

## Wednesday, February 15, 2012

### Calculating the Length of a 3D Vector using SSE

This article explains how to calculate the length of a single 3D float vector stored in an SSE register. The length or norm of a vector is defined as the square root of the dot product of the vector with itself:
```
|v| = length3(v) = sqrt(v.x^2 + v.y^2 + v.z^2)
```
A single SSE register can hold a 3D vector (the highest 32 bits are unused). In a previous article we showed how to load a `struct` containing 3 float values into an SSE register. You may as well use the `_mm_setr_ps(x, y, z, 0)` intrinsic. SSE4.1 introduced the `DPPS` instruction (accessible via the `_mm_dp_ps` intrinsic), which calculates the dot product of up to four float values. We will now use this intrinsic to calculate the length of a 3D vector with a minimal number of instructions.
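As a rough sketch, `length3` could be written as follows. We assume the vector sits in lanes 0 to 2 of the register. The `#else` branch is our own addition for builds without SSE4.1 enabled and is not necessarily the approach of the full article.

```cpp
#include <xmmintrin.h>   // SSE
#ifdef __SSE4_1__
#include <smmintrin.h>   // SSE4.1: _mm_dp_ps
#endif

// Length of the 3D vector held in lanes 0..2 of v (lane 3 is ignored).
inline float length3(__m128 v) {
#ifdef __SSE4_1__
    // Mask 0x71: multiply lanes 0-2 (high nibble 0111),
    // write the sum to lane 0 only (low nibble 0001).
    __m128 dot = _mm_dp_ps(v, v, 0x71);
#else
    // Fallback without DPPS: square all lanes, then add lanes 1 and 2 onto lane 0.
    const __m128 sq = _mm_mul_ps(v, v);
    __m128 dot = _mm_add_ss(sq, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 1, 1, 1)));
    dot = _mm_add_ss(dot, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 2, 2, 2)));
#endif
    // Scalar square root of lane 0, then move it to a float.
    return _mm_cvtss_f32(_mm_sqrt_ss(dot));
}
```

With SSE4.1 enabled (for example `-msse4.1` on GCC or Clang), the whole function reduces to a dot product followed by a scalar square root.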

## Friday, December 30, 2011

### Template based vector library with SSE support

This article explains how to set up a vector math library that performs mathematical operations on arbitrarily sized float arrays or vectors. It can handle aligned and unaligned pointers with minimal code overhead but optimal runtime performance. This is achieved through template functions and compile-time arguments.
This tutorial is based on a simple code example: adding two arrays and storing the result in a third array. Step by step, we will introduce SSE intrinsics, loop unrolling and functor-based concepts that allow us to build a library with different operations.
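The following sketch illustrates the compile-time alignment idea on the array addition example. The helper names (`load4`, `store4`, `add_arrays`) are placeholders we made up for this illustration, and loop unrolling is left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helpers: the load/store flavor is selected at compile time.
template <bool Aligned> inline __m128 load4(const float* p) { return _mm_loadu_ps(p); }
template <> inline __m128 load4<true>(const float* p) { return _mm_load_ps(p); }

template <bool Aligned> inline void store4(float* p, __m128 v) { _mm_storeu_ps(p, v); }
template <> inline void store4<true>(float* p, __m128 v) { _mm_store_ps(p, v); }

inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// dst[i] = a[i] + b[i]. Aligned = true requires 16-byte aligned pointers.
template <bool Aligned>
void add_arrays(const float* a, const float* b, float* dst, std::size_t n) {
    assert(!Aligned || (is_aligned16(a) && is_aligned16(b) && is_aligned16(dst)));
    std::size_t i = 0;
    // Four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const __m128 sum = _mm_add_ps(load4<Aligned>(a + i), load4<Aligned>(b + i));
        store4<Aligned>(dst + i, sum);
    }
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) dst[i] = a[i] + b[i];
}
```

A caller picks `add_arrays<true>` when all pointers are known to be 16-byte aligned and `add_arrays<false>` otherwise. Since the choice is a template argument, no runtime branch ends up inside the loop.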

## Wednesday, December 28, 2011

### Simple vector3 class with SSE support

In this post we show how to write a simple class that represents a 3D vector and uses SSE operations for fast calculations. The class stores three float values (x, y and z) and implements all basic vector operations such as add, subtract, multiply, divide, cross product, dot product and length calculation. It uses aligned 128-bit memory, which makes it possible to use SSE intrinsics directly. In addition, it overloads the `new` and `delete` operators for arrays, which makes it possible to create multiple instances of `vector3` at once.
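A condensed sketch of such a class might look like this. The member names and the selection of operators are our own assumptions rather than the full class from the post, and the array allocation uses `_mm_malloc` with a 16-byte alignment.

```cpp
#include <cstddef>
#include <new>
#include <xmmintrin.h>  // SSE intrinsics, _mm_malloc/_mm_free

// Illustrative sketch only: x, y, z live in lanes 0..2, lane 3 stays 0.
class vector3 {
public:
    vector3() : v_(_mm_setzero_ps()) {}
    vector3(float x, float y, float z) : v_(_mm_setr_ps(x, y, z, 0.0f)) {}
    explicit vector3(__m128 v) : v_(v) {}

    float x() const { return lane(0); }
    float y() const { return lane(1); }
    float z() const { return lane(2); }

    vector3 operator+(const vector3& o) const { return vector3(_mm_add_ps(v_, o.v_)); }
    vector3 operator-(const vector3& o) const { return vector3(_mm_sub_ps(v_, o.v_)); }
    vector3 operator*(float s) const { return vector3(_mm_mul_ps(v_, _mm_set1_ps(s))); }

    float dot(const vector3& o) const {
        alignas(16) float t[4];
        _mm_store_ps(t, _mm_mul_ps(v_, o.v_));
        return t[0] + t[1] + t[2];
    }
    float length() const { return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(dot(*this)))); }

    vector3 cross(const vector3& o) const {
        // Rotate lanes to (y, z, x), combine, then rotate the result back.
        const __m128 a_yzx = _mm_shuffle_ps(v_, v_, _MM_SHUFFLE(3, 0, 2, 1));
        const __m128 b_yzx = _mm_shuffle_ps(o.v_, o.v_, _MM_SHUFFLE(3, 0, 2, 1));
        const __m128 c = _mm_sub_ps(_mm_mul_ps(v_, b_yzx), _mm_mul_ps(a_yzx, o.v_));
        return vector3(_mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1)));
    }

    // Array allocation with guaranteed 16-byte alignment.
    static void* operator new[](std::size_t bytes) {
        void* p = _mm_malloc(bytes, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete[](void* p) { _mm_free(p); }

private:
    float lane(int i) const {
        alignas(16) float t[4];
        _mm_store_ps(t, v_);
        return t[i];
    }
    __m128 v_;
};
```

Keeping the unused fourth lane at zero means that additions, subtractions and scaling never produce garbage there, so the register can be passed on to further SSE code without masking.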

## Thursday, July 28, 2011

### Fast 3D-vector matrix transformation using SSE4

In this blog post we show how to efficiently transform multiple 3D vectors using an affine transformation matrix. Each vector has three coordinates (x, y and z) and the matrix consists of three rows with 4 elements each (3 for rotation/scale + 1 for translation). In order to multiply a 3-element vector with a 3x4 matrix, we append an additional 1 to the end of the 3D vector, which turns it into (x, y, z, 1).
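With the extra 1 in place, each output coordinate is simply the 4-element dot product of one matrix row with (x, y, z, 1). The sketch below shows one possible way to express this. The `matrix3x4` struct, the function names and the packed x, y, z input layout are assumptions for this illustration, and the `#else` branch is our own fallback for builds without SSE4.1.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // SSE
#ifdef __SSE4_1__
#include <smmintrin.h>   // SSE4.1: _mm_dp_ps
#endif

// Hypothetical layout: each row holds (m0, m1, m2, translation).
struct matrix3x4 { __m128 row[3]; };

// Dot product of all four lanes, returned as a scalar.
inline float dot4(__m128 a, __m128 b) {
#ifdef __SSE4_1__
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));  // use 4 lanes, result in lane 0
#else
    const __m128 m = _mm_mul_ps(a, b);
    __m128 s = _mm_add_ps(m, _mm_movehl_ps(m, m));  // lane0 = m0+m2, lane1 = m1+m3
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(s);
#endif
}

// Transforms `count` points stored as packed x, y, z floats from `in` to `out`.
inline void transform_points(const matrix3x4& m, const float* in, float* out,
                             std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const float* p = in + 3 * i;
        // Append the 1 so that the translation column is applied as well.
        const __m128 v = _mm_setr_ps(p[0], p[1], p[2], 1.0f);
        for (int r = 0; r < 3; ++r) out[3 * i + r] = dot4(m.row[r], v);
    }
}
```

This version favors readability. Each result is moved to a scalar and written separately, which is likely not the fastest possible arrangement.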

## Friday, June 17, 2011

### Intel Architecture Code Analyzer

Intel provides a great tool for static analysis of C++ code at the assembly level. It is called the Intel Architecture Code Analyzer and allows you to analyze how instructions are executed on an Intel CPU (including instruction pairing and the critical path).

## Thursday, June 16, 2011

### Bilinear Pixel Interpolation using SSE

Bilinear pixel interpolation is a common operation in image processing applications (resizing, distorting, etc.) as well as in computer graphics (texturing, etc.). It allows accessing pixels at non-integer coordinates of the underlying image by building a weighted sum over the four neighboring pixels of the specified image position. On GPUs this operation is implemented in hardware. However, some algorithms cannot be ported to the GPU easily, and a CPU implementation of bilinear interpolation is needed.

This article will show how to efficiently implement such an operation in C++ using SSE2 instructions for 8-bit RGBA images. First, we will show how to perform bilinear interpolation using pure C++ code and then present an enhanced example where we utilize SSE intrinsics.
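As a starting point, the pure C++ variant could look roughly like the sketch below. The `Pixel` struct, the function signature and the clamping at the right and bottom border are our own assumptions for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 8-bit RGBA pixel.
struct Pixel { std::uint8_t r, g, b, a; };

// Samples a row-major image at (x, y).
// Requires 0 <= x <= width - 1 and 0 <= y <= height - 1.
inline Pixel bilinear(const Pixel* img, int width, int height, float x, float y) {
    assert(x >= 0.0f && y >= 0.0f && x <= width - 1 && y <= height - 1);
    const int x0 = static_cast<int>(x);
    const int y0 = static_cast<int>(y);
    // Clamp the right/bottom neighbor so we never read outside the image.
    const int x1 = (x0 + 1 < width) ? x0 + 1 : x0;
    const int y1 = (y0 + 1 < height) ? y0 + 1 : y0;

    // Fractional offsets determine the weights of the four neighbors.
    const float fx = x - x0;
    const float fy = y - y0;
    const float w00 = (1.0f - fx) * (1.0f - fy);
    const float w10 = fx * (1.0f - fy);
    const float w01 = (1.0f - fx) * fy;
    const float w11 = fx * fy;

    const Pixel& p00 = img[y0 * width + x0];
    const Pixel& p10 = img[y0 * width + x1];
    const Pixel& p01 = img[y1 * width + x0];
    const Pixel& p11 = img[y1 * width + x1];

    // Weighted sum per channel, rounded to the nearest integer.
    auto mix = [=](std::uint8_t a, std::uint8_t b, std::uint8_t c, std::uint8_t d) {
        return static_cast<std::uint8_t>(a * w00 + b * w10 + c * w01 + d * w11 + 0.5f);
    };
    return Pixel{mix(p00.r, p10.r, p01.r, p11.r), mix(p00.g, p10.g, p01.g, p11.g),
                 mix(p00.b, p10.b, p01.b, p11.b), mix(p00.a, p10.a, p01.a, p11.a)};
}
```

The four channels of a pixel go through exactly the same arithmetic, which is what makes this routine a natural candidate for SSE.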