Friday, April 15, 2011

3D Point Projection to 2D Image Using SSE

The most common operation in 3D applications is to project 3D points to a 2D image, most often that is your computer screen. For that purpose you normally use your graphics card which is most efficient at this task. But there are times, when you need to further process 2D points on your CPU. This article will show you how to efficiently perform 3D point projection in C++ using SSE intrinsics.

First, let's take a look at the math: a 3D projection can be performed by multiplying a 3D vector x in homogeneous coordinates with a 3x4 projection matrix P. This gives you a transformed 3D point t in the camera coordinate system.
| P00 P01 P02 P03 |   | x.x |   | t.x |
| P10 P11 P12 P13 | * | x.y | = | t.y |
| P20 P21 P22 P23 |   | x.z |   | t.z |
                      |  1  |
Finally, you need to map t to the 2D image by dividing by it's depth (z-value).
i.x = t.x / t.z    i.y = t.y / t.z
Now let's start talking SSE code (SSE3 to be specific) and project a single 3D point. The following code consists of (1) loading a 3 element vector and setting it's 4th component to 1, (2) loading the P matrix, (3) matrix multiplication and (4) point projection. See Common Datatypes for a definition of float2, float3 and float4.
static const __m128 ONES = _mm_set1_ps(1.0f);

// Project a single 3D point x using the 3x4 projection matrix P
void project(float2* out, const float4* P, const float3* x)
{ 
 // (1) Load [X,Y,Z,1] into xyz1
 __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)x); // X Y
 __m128 z =  _mm_load_ss(&x->z);    // Z 0 0 0
 __m128 z1 = _mm_unpacklo_ps(z, ONES); // Z 1
 __m128 xyz1 = _mm_movelh_ps(xy, z1);  // X Y Z 1

 // (2) Load the P matrix
 __m128 P0 = _mm_loadu_ps((const float*)(P + 0));
 __m128 P1 = _mm_loadu_ps((const float*)(P + 1));
 __m128 P2 = _mm_loadu_ps((const float*)(P + 2));

 // (3) multiply with the P matrix
 __m128 tx = _mm_mul_ps(P0, xyz1);
 __m128 ty = _mm_mul_ps(P1, xyz1);
 __m128 tz = _mm_mul_ps(P2, xyz1);

 // use horizontal adds to sum up the X, Y and Z components of t
 // txyzz contains [t.x, t.y, t.z, t.z]
 __m128 txy =   _mm_hadd_ps(tx, ty);
 __m128 tzz =   _mm_hadd_ps(tz, tz);
 __m128 txyzz = _mm_hadd_ps(txy, tzz); 

 // (4) now divide the X and Y components of t by the Z component
 
 // move the Z component from high to low
 tzz = _mm_movehl_ps(txyzz, txyzz); 
 
 // divide by the Z component of t
 __m128 ixy = _mm_div_ps(txyzz, tzz);

 // Store the result
 _mm_storel_pi((__m64*)out, ixy);
}
This code is close to optimal. However there are still some unused SSE resources: the division is by far the most time consuming operation and we only divide 2 out of four possible numbers in parallel. We can optimize the code even further if we project 2 3D points at the same time. The following code loads 2 3D points that lie in consecutive memory and projects them using the same projection matrix P.
// project 2 3D points
void project2(float2* out, const float4* P, const float3* x)
{ 
 // Load [X0,Y0,Z0,1] into xyz1
 // Load [X1,Y1,Z1,1] into uvw1
 __m128 xyzu = _mm_loadu_ps(&x[0].x);  // X0 Y0 Z0 X1 
 __m128 vw11 = _mm_loadl_pi(ONES, (__m64*)&x[1].y); // Y1 Z1 1 1  
 __m128 zu1v = _mm_shuffle_ps(xyzu, vw11, _MM_SHUFFLE(0, 2, 3, 2)); // Z0 X1 1 Y1
 __m128 xyz1 = _mm_shuffle_ps(xyzu, zu1v, _MM_SHUFFLE(2, 0, 1, 0)); // X0 Y0 Z0 1
 __m128 uvw1 = _mm_shuffle_ps(zu1v, vw11, _MM_SHUFFLE(3, 1, 3, 1)); // X1 Y1 Z1 1
 
 // Load the P matrix
 __m128 P0 = _mm_loadu_ps((const float*)(P + 0));
 __m128 P1 = _mm_loadu_ps((const float*)(P + 1));
 __m128 P2 = _mm_loadu_ps((const float*)(P + 2));

 // multiply xyz1 with the P matrix
 __m128 tx = _mm_mul_ps(P0, xyz1);
 __m128 ty = _mm_mul_ps(P1, xyz1);
 __m128 tz = _mm_mul_ps(P2, xyz1);

 // multiply uvw1 with the P matrix
 __m128 tu = _mm_mul_ps(P0, uvw1);
 __m128 tv = _mm_mul_ps(P1, uvw1);
 __m128 tw = _mm_mul_ps(P2, uvw1);

 // use horizontal adds to sum up the X, Y components
 __m128 txy =   _mm_hadd_ps(tx, ty);
 __m128 tuv =   _mm_hadd_ps(tu, tv);
 __m128 txyuv = _mm_hadd_ps(txy, tuv);

 // use horizontal adds to sum up the Z components
 __m128 tzw =   _mm_hadd_ps(tz, tw);
 __m128 tzwzw = _mm_shuffle_ps(tzw, tzw, _MM_SHUFFLE(2, 3, 0, 1));
 __m128 tzzww = _mm_add_ps(tzw, tzwzw);
 
 // now divide the X and Y components of t by the Z component 
 __m128 ixyuv = _mm_div_ps(txyuv, tzzww);

 // Store the result
 _mm_storeu_ps((float*)out, ixyuv);
}
This code optimally uses all 4 floats that are contained within each SSE register and is able to load and store all values from memory using 64-bit and 128-bit operands. By combining the code from project and project2 it is possible to apply loop unrolling such as described here in order to project an arbitrary number of 3D points.

No comments:

Post a Comment