Thursday, June 16, 2011

Bilinear Pixel Interpolation using SSE

Bilinear pixel interpolation is a common operation in image processing applications (resizing, distorting, etc.) as well as in computer graphics (texturing, etc.). It allows accessing pixels at non-integer coordinates of the underlying image by building a weighted sum over all neighbors of the specified image position. On GPUs this operation is implemented in hardware. However, some algorithms can not be ported to the GPU easily and a CPU implementation of the Bilinear interpolation is needed.

This article will show how to efficiently implement such an operation in C++ using SSE2 instructions for 8-bit RGBA images. First, we will show how to perform bilinear interpolation using pure C++ code and then present an enhanced example where we utilize SSE intrinsics.

There are fivesteps involved in interpolation:
  1. Determining the position of the pixel
    px=frac(x) py=frac(y)
  2. Loading the four neighboring pixels
    P = [p1, p2, p3, p4]
  3. Calculation of the weights for each pixel
    W = [(1-px)*(1-py), px*(1-py), (1-px)*py, px*py]
  4. Calculating the weighted sum of pixels
    N = dot(P, W)
  5. Returning the resulting bilinear interpolated pixel N

For simplicity and readability, we use a image-structure called Image that will contain a pointer to an array of Pixel objects. The overloaded operators in the Pixel class will not be needed for the SSE implementation. The float4 structure is defined in the Common Datatypes page.
struct Pixel
{
 Pixel() :r(0), g(0), b(0), a(0) {}
 Pixel(int rgba) { *((int*)this) = rgba; }
 Pixel(unsigned char r, unsigned char g, unsigned char b, unsigned char a) : r(r), g(g), b(b), a(a) {}

 unsigned char r, g, b, a;
};

struct Image
{
 int width;
 int height; 
 Pixel* data;
};

The bilinear interpolation is performed by the following set of functions. We calculate the bilinear weights using floating point numbers but perform the actual weighting of pixel values with fixed-point arithmetic. This allows faster execution times.
inline Pixel GetPixel(const Image* img, float x, float y)
{
 int px = (int)x; // floor of x
 int py = (int)y; // floor of y
 const int stride = img->width;
 const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel

 // load the four neighboring pixels
 const Pixel& p1 = p0[0 + 0 * stride];
 const Pixel& p2 = p0[1 + 0 * stride];
 const Pixel& p3 = p0[0 + 1 * stride];
 const Pixel& p4 = p0[1 + 1 * stride];

 // Calculate the weights for each pixel
 float fx = x - px;
 float fy = y - py;
 float fx1 = 1.0f - fx;
 float fy1 = 1.0f - fy;
 
 int w1 = fx1 * fy1 * 256.0f;
 int w2 = fx  * fy1 * 256.0f;
 int w3 = fx1 * fy  * 256.0f;
 int w4 = fx  * fy  * 256.0f;

 // Calculate the weighted sum of pixels (for each color channel)
 int outr = p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4;
 int outg = p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4;
 int outb = p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4;
 int outa = p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4;

 return Pixel(outr >> 8, outg >> 8, outb >> 8, outa >> 8);
}

Now, let's take a look at the SSE2 version of the same code:
// constant values that will be needed
static const __m128 CONST_1111 = _mm_set1_ps(1);
static const __m128 CONST_256 = _mm_set1_ps(256);

inline __m128 CalcWeights(float x, float y)
{
 __m128 ssx = _mm_set_ss(x);
 __m128 ssy = _mm_set_ss(y);
 __m128 psXY = _mm_unpacklo_ps(ssx, ssy);      // 0 0 y x

 //__m128 psXYfloor = _mm_floor_ps(psXY); // use this line for if you have SSE4
 __m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvtps_epi32(psXY));
 __m128 psXYfrac = _mm_sub_ps(psXY, psXYfloor); // = frac(psXY)
 
 __m128 psXYfrac1 = _mm_sub_ps(CONST_1111, psXYfrac); // ? ? (1-y) (1-x)
 __m128 w_x = _mm_unpacklo_ps(psXYfrac1, psXYfrac);   // ? ?     x (1-x)
        w_x = _mm_movelh_ps(w_x, w_x);      // x (1-x) x (1-x)
 __m128 w_y = _mm_shuffle_ps(psXYfrac1, psXYfrac, _MM_SHUFFLE(1, 1, 1, 1)); // y y (1-y) (1-y)

 // complete weight vector
 return _mm_mul_ps(w_x, w_y);
}

inline Pixel GetPixelSSE(const Image* img, float x, float y)
{
 const int stride = img->width;
 const Pixel* p0 = img->data + (int)x + (int)y * stride; // pointer to first pixel

 // Load the data (2 pixels in one load)
 __m128i p12 = _mm_loadl_epi64((const __m128i*)&p0[0 * stride]); 
 __m128i p34 = _mm_loadl_epi64((const __m128i*)&p0[1 * stride]); 

 __m128 weight = CalcWeights(x, y);

 // extend to 16bit
 p12 = _mm_unpacklo_epi8(p12, _mm_setzero_si128());
 p34 = _mm_unpacklo_epi8(p34, _mm_setzero_si128());

 // convert floating point weights to 16bit integer
 weight = _mm_mul_ps(weight, CONST_256); 
 __m128i weighti = _mm_cvtps_epi32(weight); // w4 w3 w2 w1
         weighti = _mm_packs_epi32(weighti, _mm_setzero_si128()); // 32->16bit

 // prepare the weights
 __m128i w12 = _mm_shufflelo_epi16(weighti, _MM_SHUFFLE(1, 1, 0, 0));
 __m128i w34 = _mm_shufflelo_epi16(weighti, _MM_SHUFFLE(3, 3, 2, 2));
 w12 = _mm_unpacklo_epi16(w12, w12); // w2 w2 w2 w2 w1 w1 w1 w1
 w34 = _mm_unpacklo_epi16(w34, w34); // w4 w4 w4 w4 w3 w3 w3 w3
 
 // multiply each pixel with its weight (2 pixel per SSE mul)
 __m128i L12 = _mm_mullo_epi16(p12, w12);
 __m128i L34 = _mm_mullo_epi16(p34, w34);

 // sum the results
 __m128i L1234 = _mm_add_epi16(L12, L34); 
 __m128i Lhi = _mm_shuffle_epi32(L1234, _MM_SHUFFLE(3, 2, 3, 2));
 __m128i L = _mm_add_epi16(L1234, Lhi);
  
 // convert back to 8bit
 __m128i L8 = _mm_srli_epi16(L, 8); // divide by 256
 L8 = _mm_packus_epi16(L8, _mm_setzero_si128());
 
 // return
 return _mm_cvtsi128_si32(L8);
}
Note that in SSE2 there is no code to perform the floor function (truncate towards zero) of a floating point number. We therefore use a conversion to integer and back to float. In order to make this work, we need to ensure that during conversion, floating point numbers are truncated. This can be achieved by calling
_MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
before making any call to the GetPixelSSE function. If you have a processor capable of the SSE4 instruction set, you can use the _mm_floor_ps intrinsic instead (then you will not need to set the rounding mode explicitly).

Our measurements have shown that the SSE2 code is up to 55% faster compared to the native C++ implementation and another +10% faster when using the SSE4 _mm_floor_ps intrinsic.

Further speed improvements can be achieved by transforming the four RGBA values into RRRR, GGGG, BBBB and AAAA groups (convert Array of Structures (AoS) to Structure of Arrays (SoA)). Such storage allows the use of multiply-add _mm_madd_epi16 and horizontal-add _mm_hadd_epi32 (since SSSE3) instructions:
inline Pixel GetPixelSSE3(const Image<Pixel>* img, float x, float y)
{
 const int stride = img->width;
 const Pixel* p0 = img->data + (int)x + (int)y * stride; // pointer to first pixel

 // Load the data (2 pixels in one load)
 __m128i p12 = _mm_loadl_epi64((const __m128i*)&p0[0 * stride]); 
 __m128i p34 = _mm_loadl_epi64((const __m128i*)&p0[1 * stride]); 

 __m128 weight = CalcWeights(x, y);

 // convert RGBA RGBA RGBA RGAB to RRRR GGGG BBBB AAAA (AoS to SoA)
 __m128i p1234 = _mm_unpacklo_epi8(p12, p34);
 __m128i p34xx = _mm_unpackhi_epi64(p1234, _mm_setzero_si128());
 __m128i p1234_8bit = _mm_unpacklo_epi8(p1234, p34xx);

 // extend to 16bit 
 __m128i pRG = _mm_unpacklo_epi8(p1234_8bit, _mm_setzero_si128());
 __m128i pBA = _mm_unpackhi_epi8(p1234_8bit, _mm_setzero_si128());
 
 // convert weights to integer
 weight = _mm_mul_ps(weight, CONST_256); 
 __m128i weighti = _mm_cvtps_epi32(weight); // w4 w3 w2 w1
         weighti = _mm_packs_epi32(weighti, weighti); // 32->2x16bit

 //outRG = [w1*R1 + w2*R2 | w3*R3 + w4*R4 | w1*G1 + w2*G2 | w3*G3 + w4*G4]
 __m128i outRG = _mm_madd_epi16(pRG, weighti);
 //outBA = [w1*B1 + w2*B2 | w3*B3 + w4*B4 | w1*A1 + w2*A2 | w3*A3 + w4*A4]
 __m128i outBA = _mm_madd_epi16(pBA, weighti);

 // horizontal add that will produce the output values (in 32bit)
 __m128i out = _mm_hadd_epi32(outRG, outBA);
 out = _mm_srli_epi32(out, 8); // divide by 256
 
 // convert 32bit->8bit
 out = _mm_packus_epi32(out, _mm_setzero_si128());
 out = _mm_packus_epi16(out, _mm_setzero_si128());

 // return
 return _mm_cvtsi128_si32(out);
}
Now, the weights do not need to be shuffled anymore, instead the RGBA values are reorganized. However, the benefit is only a performance increase of about +3% compared to the former version. On the positive side, the number of code lines is reduced.

21 comments:

  1. Hi,
    your standard GetPixel() method contains 6 float -> int and 2 int -> float conversions which are probably very slow.
    The experienced SSE speedup is likely due to the fact that SSE has got "native" vector conversion functions.
    Have you tried writing and profiling a pure fixed-point GetPixel() method (fixed point arguments, too!), avoiding float<->int conversions?

    Regards,
    Rafael

    ReplyDelete
  2. @Anonymous, you mean something like this?

    [code]
    inline Pixel GetPixel(const Image* img, float x, float y)
    {
    long Fx = (long)(x * 65536.0); // convert to Fixed16
    long Fy = (long)(y * 65536.0); // convert to Fixed16
    long px = Fx & 0xFFFF0000; // floor function
    long py = Fy & 0xFFFF0000; // floor function

    const int stride = img->width;
    const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel

    // load the four neighboring pixels
    const Pixel& p1 = p0[0 + 0 * stride];
    const Pixel& p2 = p0[1 + 0 * stride];
    const Pixel& p3 = p0[0 + 1 * stride];
    const Pixel& p4 = p0[1 + 1 * stride];

    // Calculate the weights for each pixel
    long fx = Fx & 0x0000FFFF; // frac function
    long fy = Fy & 0x0000FFFF; // frac function
    long fx1 = 0x00010000 - fx; // 1 - fx
    long fy1 = 0x00010000 - fy; // 1 - fy;

    long w1 = (fx1 * fy1) << 8;
    long w2 = (fx * fy1) << 8;
    long w3 = (fx1 * fy ) << 8;
    long w4 = (fx * fy ) << 8;

    // Calculate the weighted sum of pixels (for each color channel)
    int outr = (int)((p1.r * w1 + p2 * w2 + p3.r * w3 + p4.r * w4) >> 16);
    int outg = (int)((p1.g * w1 + p2 * w2 + p3.g * w3 + p4.g * w4) >> 16);
    int outb = (int)((p1.b * w1 + p2 * w2 + p3.b * w3 + p4.b * w4) >> 16);
    int outa = (int)((p1.a * w1 + p2 * w2 + p3.a * w3 + p4.a * w4) >> 16);

    return Pixel(outr >> 8, outg >> 8, outb >> 8, outa >> 8);
    }
    [/code]

    -Jesse

    ReplyDelete
  3. I guess I haven’t read such unique material anywhere else online.

    vector free

    ReplyDelete
  4. I ran GetPixel, GetPixel_SSE, GetPixel_SSE3, and GetPixel_fixedpoint on my system. GetPixel_fixedpoint is about as fast as GetPixelSSE and much faster than GetPixel. GetPixel_SSE3 is the fastest.

    I tested these on an image to make sure the results were correct. However, I had to modify Jesse's code above to get the fixedpoint working. Here is my code if anyone cares.

    [code]
    inline Pixel GetPixel_fixedpoint(const int* data, float x, float y, const int stride) {
    const unsigned int shift = 8; // shift can have values 8 to 16
    const unsigned int fixed = 1<>shift;
    unsigned int py = (Fy & -fixed)>>shift;

    const int* p0 = &data[px + py * stride]; // pointer to first pixel

    const Pixel& p1 = p0[0 + 0 * stride];
    const Pixel& p2 = p0[1 + 0 * stride];
    const Pixel& p3 = p0[0 + 1 * stride];
    const Pixel& p4 = p0[1 + 1 * stride];

    unsigned int fx = Fx & (fixed-1);
    unsigned int fy = Fy & (fixed-1);
    unsigned int fx1 = fixed - fx;
    unsigned int fy1 = fixed - fy;

    unsigned int w1 = (fx1 * fy1) >> shift;
    unsigned int w2 = (fx * fy1) >> shift;
    unsigned int w3 = (fx1 * fy ) >> shift;
    unsigned int w4 = (fx * fy ) >> shift;

    // Calculate the weighted sum of pixels (for each color channel)
    int outr = (int)((p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4) >> shift);
    int outg = (int)((p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4) >> shift);
    int outb = (int)((p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4) >> shift);
    int outa = (int)((p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4) >> shift);

    return Pixel(outr, outg, outb, outa);
    }
    [/code]

    ReplyDelete
    Replies
    1. const unsigned int fixed = 1<>shift

      is that an error or what does that operator (<>) do?

      Delete
    2. Yeah, plus - what is px defined as?

      Delete
  5. Sorry, when I posted the code I added a bug. Here is the correction
    unsigned int Fx = (unsigned int) (x * fixed); // convert to Fixed
    unsigned int Fy = (unsigned int) (y * fixed); // convert to Fixed

    ReplyDelete
  6. One point to add, GetPixel_fixel was faster than GetPixel in MSVC2010. However, when I compile with MSVC2012 GetPixel is faster. GetPixelSSE(3) is still the fastest by far though.

    ReplyDelete
  7. A few posts here have been useful to me while working on a software rasterizer, so I thought I'd give back a bit...

    Here is an SSE bilinear sampler which processes 4 outputs at a time (intended for rasterizing 2x2 quads)

    Input is GL style normalized UV coords, for 4 lanes:
    U1U2U3U4
    V1V2V3V4

    Output is 4 vectors in SoA format
    R1R2R3R4
    B1B2B3B4
    etc

    This example is setup for sampling from a 512x512 texture (RGBA as floats) which wraps at the edges, but it should be trivial to adapt it to something else.

    (Wrapped in a struct so it can be passed to a template)

    transpose4x4() is the same as _mm__MM_TRANSPOSE4_PS() in the Intel intrinsic reference.

    [code]
    struct ts_512i {

    FloatingPointPixel * __restrict texdata;

    ts_512i( FloatingPointPixel * const ptr ) :texdata(ptr) {}

    __forceinline void fetch_texel( __m128 *px, const __m128i& x, const __m128i& y )
    {
    static const __m128i texmod = _mm_set1_epi32( 512-1 );

    __m128i tx = _mm_and_si128( x, texmod );
    __m128i ty = _mm_sub_epi32( texmod, _mm_and_si128( y, texmod ) );

    __m128i offset = _mm_or_si128( _mm_slli_epi32( ty, 9 ), tx );

    px[0] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[0] ] );
    px[1] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[1] ] );
    px[2] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[2] ] );
    px[3] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[3] ] );
    }

    __forceinline void sample( __m128* out, __m128* uv )
    {
    static const __m128 texsize = _mm_set1_ps( 512 );
    static const __m128 one = _mm_set1_ps( 1.0f );

    // scale normalized coords to texture coords
    __m128 up = _mm_mul_ps( uv[0], texsize );
    __m128 vp = _mm_mul_ps( uv[1], texsize );

    // floor() and offsets to grab adjacent texels
    __m128i tx0 = _mm_cvttps_epi32(up);
    __m128i ty0 = _mm_cvttps_epi32(vp);
    __m128i tx1 = _mm_add_epi32(tx0,_mm_set1_epi32(1));
    __m128i ty1 = _mm_add_epi32(ty0,_mm_set1_epi32(1));

    // get fractional parts
    __m128 fx = _mm_sub_ps(up,_mm_cvtepi32_ps(tx0));
    __m128 fy = _mm_sub_ps(vp,_mm_cvtepi32_ps(ty0));
    __m128 fx1 = _mm_sub_ps( one, fx );
    __m128 fy1 = _mm_sub_ps( one, fy );

    // calculate weights
    __m128 w1 = _mm_mul_ps( fx1, fy1 );
    __m128 w2 = _mm_mul_ps( fx, fy1 );
    __m128 w3 = _mm_mul_ps( fx1, fy );
    __m128 w4 = _mm_mul_ps( fx, fy );

    // temporary load AoS data, then transpose to make SoA rrrr/gggg/bbbb/aaaa
    __m128 p1[4]; fetch_texel( p1, tx0,ty0); transpose4x4( p1[0],p1[1],p1[2],p1[3] );
    __m128 p2[4]; fetch_texel( p2, tx1,ty0); transpose4x4( p2[0],p2[1],p2[2],p2[3] );
    __m128 p3[4]; fetch_texel( p3, tx0,ty1); transpose4x4( p3[0],p3[1],p3[2],p3[3] );
    __m128 p4[4]; fetch_texel( p4, tx1,ty1); transpose4x4( p4[0],p4[1],p4[2],p4[3] );

    // sum data from each texels using the weights
    out[0] = _mm_add_ps(_mm_add_ps(_mm_add_ps(
    _mm_mul_ps(p1[0],w1),_mm_mul_ps(p2[0],w2)),_mm_mul_ps(p3[0],w3)),_mm_mul_ps(p4[0],w4));
    out[1] = _mm_add_ps(_mm_add_ps(_mm_add_ps(
    _mm_mul_ps(p1[1],w1),_mm_mul_ps(p2[1],w2)),_mm_mul_ps(p3[1],w3)),_mm_mul_ps(p4[1],w4));
    out[2] = _mm_add_ps(_mm_add_ps(_mm_add_ps(
    _mm_mul_ps(p1[2],w1),_mm_mul_ps(p2[2],w2)),_mm_mul_ps(p3[2],w3)),_mm_mul_ps(p4[2],w4));
    out[3] = _mm_add_ps(_mm_add_ps(_mm_add_ps(
    _mm_mul_ps(p1[3],w1),_mm_mul_ps(p2[3],w2)),_mm_mul_ps(p3[3],w3)),_mm_mul_ps(p4[3],w4));
    }

    };

    [/code]

    ReplyDelete
  8. awesome, thanks for the code!

    ReplyDelete
  9. In GetPixel, when high precision is desired, you may want rounding instead of truncation:
    int w1 = fx1 * fy1 * 256.0f;
    =>
    int w1 = fx1 * fy1 * 256.0f +0.5f;

    return Pixel(outr >> 8, outg >> 8, outb >> 8, outa >> 8);
    =>
    return Pixel((outr + 128) >> 8, (outg +128) >> 8, (outb + 128) >> 8, (outa + 128) >> 8);

    ReplyDelete
  10. One important correction:
    __m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvtps_epi32(psXY));
    =>
    __m128 psXYfloor = _mm_cvttepi32_ps(_mm_cvtps_epi32(psXY));

    _mm_cvttepi32_ps does truncation, while _mm_cvtepi32_ps does rounding, the latter would lead to negative weights.

    ReplyDelete
    Replies
    1. Should be:
      __m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvtps_epi32(psXY));
      =>
      __m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvttps_epi32(psXY));

      Delete
    2. You are right. _mm_cvttps_epi32() shoud be used to take the place of _mm_cvtps_epi32(). _mm_cvtps_epi32() did causes some wrong pixel values in my experiment and _mm_cvttps_epi32() did not.

      Delete
  11. Nice, very helpful.

    ~Victor

    ReplyDelete
  12. I tried the original GetPixel function above, but there is a problem with the code. When converting from float to int in the w1, w2, w3, w4 variables, the truncation can decrease the interpolated pixel way too much. Try for example if p1 = 971, p2=p3= 972, p4 = 973, with fx = .751 and fy = .406 (some numbers I was using). The interpolated result becomes 964! That's way lower than the surrounding pixels. Using floating point versions of the w1, w2, w3, w4 variables solves this and gives the correct answer of 972.

    ReplyDelete
  13. I have a very naive question but I want to know how pixels at non integer locations are stored (as all array indexes are discrete). Please do reply to my question.

    ReplyDelete
    Replies
    1. I do not completely understand your question. Pixels are always discrete and if you access a non-integer coordinate, you use interpolation. This gives you a new pixel value and you have to store that pixel at a discrete address again.

      Delete
  14. hi every one;

    I am new on SIMD and I would like to use this code but I don't know how to write a main to use the function, i mean principally ho to load an image because actually i am working on opencv and i don't know whether there is a way to relate all these stuff together.

    ReplyDelete
  15. Hi every one.
    i want to use bilinear interpolation in my project, the one implemented in SSE2 but i am working in opencv, i really don't know how to relate these, i mean for example how do i load an image to use the function in SSE, please, i need help of how to write a test to use this function

    ReplyDelete
  16. Isn't this code missing a check for x == img->width - 1 ?

    ReplyDelete