## Friday, April 15, 2011

### 3D Point Projection to 2D Image Using SSE

The most common operation in 3D applications is to project 3D points to a 2D image, most often that is your computer screen. For that purpose you normally use your graphics card which is most efficient at this task. But there are times, when you need to further process 2D points on your CPU. This article will show you how to efficiently perform 3D point projection in C++ using SSE intrinsics.

First, let's take a look at the math: a 3D projection can be performed by multiplying a 3D vector `x` in homogeneous coordinates with a 3x4 projection matrix `P`. This gives you a transformed 3D point `t` in the camera coordinate system.
```| P00 P01 P02 P03 |   | x.x |   | t.x |
| P10 P11 P12 P13 | * | x.y | = | t.y |
| P20 P21 P22 P23 |   | x.z |   | t.z |
|  1  |
```
Finally, you need to map `t` to the 2D image by dividing by it's depth (z-value).
```i.x = t.x / t.z    i.y = t.y / t.z
```
Now let's start talking `SSE` code (SSE3 to be specific) and project a single 3D point. The following code consists of (1) loading a 3 element vector and setting it's 4th component to 1, (2) loading the P matrix, (3) matrix multiplication and (4) point projection. See Common Datatypes for a definition of `float2`, `float3` and `float4`.
```static const __m128 ONES = _mm_set1_ps(1.0f);

// Project a single 3D point x using the 3x4 projection matrix P
void project(float2* out, const float4* P, const float3* x)
{
// (1) Load [X,Y,Z,1] into xyz1
__m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)x); // X Y
__m128 z =  _mm_load_ss(&x->z);    // Z 0 0 0
__m128 z1 = _mm_unpacklo_ps(z, ONES); // Z 1
__m128 xyz1 = _mm_movelh_ps(xy, z1);  // X Y Z 1

// (2) Load the P matrix
__m128 P0 = _mm_loadu_ps((const float*)(P + 0));
__m128 P1 = _mm_loadu_ps((const float*)(P + 1));
__m128 P2 = _mm_loadu_ps((const float*)(P + 2));

// (3) multiply with the P matrix
__m128 tx = _mm_mul_ps(P0, xyz1);
__m128 ty = _mm_mul_ps(P1, xyz1);
__m128 tz = _mm_mul_ps(P2, xyz1);

// use horizontal adds to sum up the X, Y and Z components of t
// txyzz contains [t.x, t.y, t.z, t.z]

// (4) now divide the X and Y components of t by the Z component

// move the Z component from high to low
tzz = _mm_movehl_ps(txyzz, txyzz);

// divide by the Z component of t
__m128 ixy = _mm_div_ps(txyzz, tzz);

// Store the result
_mm_storel_pi((__m64*)out, ixy);
}
```
This code is close to optimal. However there are still some unused SSE resources: the division is by far the most time consuming operation and we only divide 2 out of four possible numbers in parallel. We can optimize the code even further if we project 2 3D points at the same time. The following code loads 2 3D points that lie in consecutive memory and projects them using the same projection matrix P.
```// project 2 3D points
void project2(float2* out, const float4* P, const float3* x)
{
__m128 xyzu = _mm_loadu_ps(&x.x);  // X0 Y0 Z0 X1
__m128 vw11 = _mm_loadl_pi(ONES, (__m64*)&x.y); // Y1 Z1 1 1
__m128 zu1v = _mm_shuffle_ps(xyzu, vw11, _MM_SHUFFLE(0, 2, 3, 2)); // Z0 X1 1 Y1
__m128 xyz1 = _mm_shuffle_ps(xyzu, zu1v, _MM_SHUFFLE(2, 0, 1, 0)); // X0 Y0 Z0 1
__m128 uvw1 = _mm_shuffle_ps(zu1v, vw11, _MM_SHUFFLE(3, 1, 3, 1)); // X1 Y1 Z1 1

__m128 P0 = _mm_loadu_ps((const float*)(P + 0));
__m128 P1 = _mm_loadu_ps((const float*)(P + 1));
__m128 P2 = _mm_loadu_ps((const float*)(P + 2));

// multiply xyz1 with the P matrix
__m128 tx = _mm_mul_ps(P0, xyz1);
__m128 ty = _mm_mul_ps(P1, xyz1);
__m128 tz = _mm_mul_ps(P2, xyz1);

// multiply uvw1 with the P matrix
__m128 tu = _mm_mul_ps(P0, uvw1);
__m128 tv = _mm_mul_ps(P1, uvw1);
__m128 tw = _mm_mul_ps(P2, uvw1);

// use horizontal adds to sum up the X, Y components

// use horizontal adds to sum up the Z components
__m128 tzwzw = _mm_shuffle_ps(tzw, tzw, _MM_SHUFFLE(2, 3, 0, 1));
This code optimally uses all 4 floats that are contained within each SSE register and is able to load and store all values from memory using 64-bit and 128-bit operands. By combining the code from `project` and `project2` it is possible to apply loop unrolling such as described here in order to project an arbitrary number of 3D points.