First, let's take a look at the math: a 3D projection can be performed by multiplying a 3D vector

`x`

in homogeneous coordinates with a 3x4 projection matrix `P`

. This gives you a transformed 3D point `t`

in the camera coordinate system.| P00 P01 P02 P03 | | x.x | | t.x | | P10 P11 P12 P13 | * | x.y | = | t.y | | P20 P21 P22 P23 | | x.z | | t.z | | 1 |Finally, you need to map

`t`

to the 2D image by dividing by it's depth (z-value).i.x = t.x / t.z i.y = t.y / t.zNow let's start talking

`SSE`

code (SSE3 to be specific) and project a single 3D point. The following code consists of (1) loading a 3 element vector and setting it's 4th component to 1, (2) loading the P matrix, (3) matrix multiplication and (4) point projection. See Common Datatypes for a definition of `float2`

, `float3`

and `float4`

.static const __m128 ONES = _mm_set1_ps(1.0f); // Project a single 3D point x using the 3x4 projection matrix P void project(float2* out, const float4* P, const float3* x) { // (1) Load [X,Y,Z,1] into xyz1 __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)x); // X Y __m128 z = _mm_load_ss(&x->z); // Z 0 0 0 __m128 z1 = _mm_unpacklo_ps(z, ONES); // Z 1 __m128 xyz1 = _mm_movelh_ps(xy, z1); // X Y Z 1 // (2) Load the P matrix __m128 P0 = _mm_loadu_ps((const float*)(P + 0)); __m128 P1 = _mm_loadu_ps((const float*)(P + 1)); __m128 P2 = _mm_loadu_ps((const float*)(P + 2)); // (3) multiply with the P matrix __m128 tx = _mm_mul_ps(P0, xyz1); __m128 ty = _mm_mul_ps(P1, xyz1); __m128 tz = _mm_mul_ps(P2, xyz1); // use horizontal adds to sum up the X, Y and Z components of t // txyzz contains [t.x, t.y, t.z, t.z] __m128 txy = _mm_hadd_ps(tx, ty); __m128 tzz = _mm_hadd_ps(tz, tz); __m128 txyzz = _mm_hadd_ps(txy, tzz); // (4) now divide the X and Y components of t by the Z component // move the Z component from high to low tzz = _mm_movehl_ps(txyzz, txyzz); // divide by the Z component of t __m128 ixy = _mm_div_ps(txyzz, tzz); // Store the result _mm_storel_pi((__m64*)out, ixy); }This code is close to optimal. However there are still some unused SSE resources: the division is by far the most time consuming operation and we only divide 2 out of four possible numbers in parallel. We can optimize the code even further if we project 2 3D points at the same time. The following code loads 2 3D points that lie in consecutive memory and projects them using the same projection matrix P.

// project 2 3D points void project2(float2* out, const float4* P, const float3* x) { // Load [X0,Y0,Z0,1] into xyz1 // Load [X1,Y1,Z1,1] into uvw1 __m128 xyzu = _mm_loadu_ps(&x[0].x); // X0 Y0 Z0 X1 __m128 vw11 = _mm_loadl_pi(ONES, (__m64*)&x[1].y); // Y1 Z1 1 1 __m128 zu1v = _mm_shuffle_ps(xyzu, vw11, _MM_SHUFFLE(0, 2, 3, 2)); // Z0 X1 1 Y1 __m128 xyz1 = _mm_shuffle_ps(xyzu, zu1v, _MM_SHUFFLE(2, 0, 1, 0)); // X0 Y0 Z0 1 __m128 uvw1 = _mm_shuffle_ps(zu1v, vw11, _MM_SHUFFLE(3, 1, 3, 1)); // X1 Y1 Z1 1 // Load the P matrix __m128 P0 = _mm_loadu_ps((const float*)(P + 0)); __m128 P1 = _mm_loadu_ps((const float*)(P + 1)); __m128 P2 = _mm_loadu_ps((const float*)(P + 2)); // multiply xyz1 with the P matrix __m128 tx = _mm_mul_ps(P0, xyz1); __m128 ty = _mm_mul_ps(P1, xyz1); __m128 tz = _mm_mul_ps(P2, xyz1); // multiply uvw1 with the P matrix __m128 tu = _mm_mul_ps(P0, uvw1); __m128 tv = _mm_mul_ps(P1, uvw1); __m128 tw = _mm_mul_ps(P2, uvw1); // use horizontal adds to sum up the X, Y components __m128 txy = _mm_hadd_ps(tx, ty); __m128 tuv = _mm_hadd_ps(tu, tv); __m128 txyuv = _mm_hadd_ps(txy, tuv); // use horizontal adds to sum up the Z components __m128 tzw = _mm_hadd_ps(tz, tw); __m128 tzwzw = _mm_shuffle_ps(tzw, tzw, _MM_SHUFFLE(2, 3, 0, 1)); __m128 tzzww = _mm_add_ps(tzw, tzwzw); // now divide the X and Y components of t by the Z component __m128 ixyuv = _mm_div_ps(txyuv, tzzww); // Store the result _mm_storeu_ps((float*)out, ixyuv); }This code optimally uses all 4 floats that are contained within each SSE register and is able to load and store all values from memory using 64-bit and 128-bit operands. By combining the code from

`project`

and `project2`

it is possible to apply loop unrolling such as described here in order to project an arbitrary number of 3D points.
## No comments:

## Post a Comment