First, let's take a look at the math: a 3D projection can be performed by multiplying a 3D vector
x
in homogeneous coordinates with a 3x4 projection matrix P
. This gives you a transformed 3D point t
in the camera coordinate system.| P00 P01 P02 P03 | | x.x | | t.x | | P10 P11 P12 P13 | * | x.y | = | t.y | | P20 P21 P22 P23 | | x.z | | t.z | | 1 |Finally, you need to map
t
to the 2D image by dividing by it's depth (z-value).i.x = t.x / t.z i.y = t.y / t.zNow let's start talking
SSE
code (SSE3 to be specific) and project a single 3D point. The following code consists of (1) loading a 3 element vector and setting it's 4th component to 1, (2) loading the P matrix, (3) matrix multiplication and (4) point projection. See Common Datatypes for a definition of float2
, float3
and float4
.static const __m128 ONES = _mm_set1_ps(1.0f); // Project a single 3D point x using the 3x4 projection matrix P void project(float2* out, const float4* P, const float3* x) { // (1) Load [X,Y,Z,1] into xyz1 __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)x); // X Y __m128 z = _mm_load_ss(&x->z); // Z 0 0 0 __m128 z1 = _mm_unpacklo_ps(z, ONES); // Z 1 __m128 xyz1 = _mm_movelh_ps(xy, z1); // X Y Z 1 // (2) Load the P matrix __m128 P0 = _mm_loadu_ps((const float*)(P + 0)); __m128 P1 = _mm_loadu_ps((const float*)(P + 1)); __m128 P2 = _mm_loadu_ps((const float*)(P + 2)); // (3) multiply with the P matrix __m128 tx = _mm_mul_ps(P0, xyz1); __m128 ty = _mm_mul_ps(P1, xyz1); __m128 tz = _mm_mul_ps(P2, xyz1); // use horizontal adds to sum up the X, Y and Z components of t // txyzz contains [t.x, t.y, t.z, t.z] __m128 txy = _mm_hadd_ps(tx, ty); __m128 tzz = _mm_hadd_ps(tz, tz); __m128 txyzz = _mm_hadd_ps(txy, tzz); // (4) now divide the X and Y components of t by the Z component // move the Z component from high to low tzz = _mm_movehl_ps(txyzz, txyzz); // divide by the Z component of t __m128 ixy = _mm_div_ps(txyzz, tzz); // Store the result _mm_storel_pi((__m64*)out, ixy); }This code is close to optimal. However there are still some unused SSE resources: the division is by far the most time consuming operation and we only divide 2 out of four possible numbers in parallel. We can optimize the code even further if we project 2 3D points at the same time. The following code loads 2 3D points that lie in consecutive memory and projects them using the same projection matrix P.
// project 2 3D points void project2(float2* out, const float4* P, const float3* x) { // Load [X0,Y0,Z0,1] into xyz1 // Load [X1,Y1,Z1,1] into uvw1 __m128 xyzu = _mm_loadu_ps(&x[0].x); // X0 Y0 Z0 X1 __m128 vw11 = _mm_loadl_pi(ONES, (__m64*)&x[1].y); // Y1 Z1 1 1 __m128 zu1v = _mm_shuffle_ps(xyzu, vw11, _MM_SHUFFLE(0, 2, 3, 2)); // Z0 X1 1 Y1 __m128 xyz1 = _mm_shuffle_ps(xyzu, zu1v, _MM_SHUFFLE(2, 0, 1, 0)); // X0 Y0 Z0 1 __m128 uvw1 = _mm_shuffle_ps(zu1v, vw11, _MM_SHUFFLE(3, 1, 3, 1)); // X1 Y1 Z1 1 // Load the P matrix __m128 P0 = _mm_loadu_ps((const float*)(P + 0)); __m128 P1 = _mm_loadu_ps((const float*)(P + 1)); __m128 P2 = _mm_loadu_ps((const float*)(P + 2)); // multiply xyz1 with the P matrix __m128 tx = _mm_mul_ps(P0, xyz1); __m128 ty = _mm_mul_ps(P1, xyz1); __m128 tz = _mm_mul_ps(P2, xyz1); // multiply uvw1 with the P matrix __m128 tu = _mm_mul_ps(P0, uvw1); __m128 tv = _mm_mul_ps(P1, uvw1); __m128 tw = _mm_mul_ps(P2, uvw1); // use horizontal adds to sum up the X, Y components __m128 txy = _mm_hadd_ps(tx, ty); __m128 tuv = _mm_hadd_ps(tu, tv); __m128 txyuv = _mm_hadd_ps(txy, tuv); // use horizontal adds to sum up the Z components __m128 tzw = _mm_hadd_ps(tz, tw); __m128 tzwzw = _mm_shuffle_ps(tzw, tzw, _MM_SHUFFLE(2, 3, 0, 1)); __m128 tzzww = _mm_add_ps(tzw, tzwzw); // now divide the X and Y components of t by the Z component __m128 ixyuv = _mm_div_ps(txyuv, tzzww); // Store the result _mm_storeu_ps((float*)out, ixyuv); }This code optimally uses all 4 floats that are contained within each SSE register and is able to load and store all values from memory using 64-bit and 128-bit operands. By combining the code from
project
and project2
it is possible to apply loop unrolling such as described here in order to project an arbitrary number of 3D points.
No comments:
Post a Comment