This article will show how to efficiently implement such an operation in C++ using SSE2 instructions for 8-bit RGBA images. First, we will show how to perform bilinear interpolation using pure C++ code and then present an enhanced example where we utilize SSE intrinsics.

There are fivesteps involved in interpolation:

- Determining the position of the pixel

`px=frac(x) py=frac(y)`

- Loading the four neighboring pixels

`P = [p1, p2, p3, p4]`

- Calculation of the weights for each pixel

`W = [(1-px)*(1-py), px*(1-py), (1-px)*py, px*py]`

- Calculating the weighted sum of pixels

`N = dot(P, W)`

- Returning the resulting bilinear interpolated pixel
`N`

For simplicity and readability, we use a image-structure called

`Image`

that will contain a pointer to an array of `Pixel`

objects. The overloaded operators in the `Pixel`

class will not be needed for the SSE implementation. The `float4`

structure is defined in the Common Datatypes page.struct Pixel { Pixel() :r(0), g(0), b(0), a(0) {} Pixel(int rgba) { *((int*)this) = rgba; } Pixel(unsigned char r, unsigned char g, unsigned char b, unsigned char a) : r(r), g(g), b(b), a(a) {} unsigned char r, g, b, a; }; struct Image { int width; int height; Pixel* data; };

The bilinear interpolation is performed by the following set of functions. We calculate the bilinear weights using floating point numbers but perform the actual weighting of pixel values with fixed-point arithmetic. This allows faster execution times.

inline Pixel GetPixel(const Image* img, float x, float y) { int px = (int)x; // floor of x int py = (int)y; // floor of y const int stride = img->width; const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel // load the four neighboring pixels const Pixel& p1 = p0[0 + 0 * stride]; const Pixel& p2 = p0[1 + 0 * stride]; const Pixel& p3 = p0[0 + 1 * stride]; const Pixel& p4 = p0[1 + 1 * stride]; // Calculate the weights for each pixel float fx = x - px; float fy = y - py; float fx1 = 1.0f - fx; float fy1 = 1.0f - fy; int w1 = fx1 * fy1 * 256.0f; int w2 = fx * fy1 * 256.0f; int w3 = fx1 * fy * 256.0f; int w4 = fx * fy * 256.0f; // Calculate the weighted sum of pixels (for each color channel) int outr = p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4; int outg = p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4; int outb = p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4; int outa = p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4; return Pixel(outr >> 8, outg >> 8, outb >> 8, outa >> 8); }

Now, let's take a look at the SSE2 version of the same code:

// constant values that will be needed static const __m128 CONST_1111 = _mm_set1_ps(1); static const __m128 CONST_256 = _mm_set1_ps(256); inline __m128 CalcWeights(float x, float y) { __m128 ssx = _mm_set_ss(x); __m128 ssy = _mm_set_ss(y); __m128 psXY = _mm_unpacklo_ps(ssx, ssy); // 0 0 y x //__m128 psXYfloor = _mm_floor_ps(psXY); // use this line for if you have SSE4 __m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvtps_epi32(psXY)); __m128 psXYfrac = _mm_sub_ps(psXY, psXYfloor); // = frac(psXY) __m128 psXYfrac1 = _mm_sub_ps(CONST_1111, psXYfrac); // ? ? (1-y) (1-x) __m128 w_x = _mm_unpacklo_ps(psXYfrac1, psXYfrac); // ? ? x (1-x) w_x = _mm_movelh_ps(w_x, w_x); // x (1-x) x (1-x) __m128 w_y = _mm_shuffle_ps(psXYfrac1, psXYfrac, _MM_SHUFFLE(1, 1, 1, 1)); // y y (1-y) (1-y) // complete weight vector return _mm_mul_ps(w_x, w_y); } inline Pixel GetPixelSSE(const Image* img, float x, float y) { const int stride = img->width; const Pixel* p0 = img->data + (int)x + (int)y * stride; // pointer to first pixel // Load the data (2 pixels in one load) __m128i p12 = _mm_loadl_epi64((const __m128i*)&p0[0 * stride]); __m128i p34 = _mm_loadl_epi64((const __m128i*)&p0[1 * stride]); __m128 weight = CalcWeights(x, y); // extend to 16bit p12 = _mm_unpacklo_epi8(p12, _mm_setzero_si128()); p34 = _mm_unpacklo_epi8(p34, _mm_setzero_si128()); // convert floating point weights to 16bit integer weight = _mm_mul_ps(weight, CONST_256); __m128i weighti = _mm_cvtps_epi32(weight); // w4 w3 w2 w1 weighti = _mm_packs_epi32(weighti, _mm_setzero_si128()); // 32->16bit // prepare the weights __m128i w12 = _mm_shufflelo_epi16(weighti, _MM_SHUFFLE(1, 1, 0, 0)); __m128i w34 = _mm_shufflelo_epi16(weighti, _MM_SHUFFLE(3, 3, 2, 2)); w12 = _mm_unpacklo_epi16(w12, w12); // w2 w2 w2 w2 w1 w1 w1 w1 w34 = _mm_unpacklo_epi16(w34, w34); // w4 w4 w4 w4 w3 w3 w3 w3 // multiply each pixel with its weight (2 pixel per SSE mul) __m128i L12 = _mm_mullo_epi16(p12, w12); __m128i L34 = _mm_mullo_epi16(p34, w34); // sum the results __m128i L1234 = _mm_add_epi16(L12, L34); __m128i Lhi = _mm_shuffle_epi32(L1234, _MM_SHUFFLE(3, 2, 3, 2)); __m128i L = _mm_add_epi16(L1234, Lhi); // convert back to 8bit __m128i L8 = _mm_srli_epi16(L, 8); // divide by 256 L8 = _mm_packus_epi16(L8, _mm_setzero_si128()); // return return _mm_cvtsi128_si32(L8); }Note that in SSE2 there is no code to perform the

`floor`

function (truncate towards zero) of a floating point number. We therefore use a conversion to integer and back to float. In order to make this work, we need to ensure that during conversion, floating point numbers are truncated. This can be achieved by calling_MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);before making any call to the

`GetPixelSSE`

function. If you have a processor capable of the SSE4 instruction set, you can use the `_mm_floor_ps`

intrinsic instead (then you will not need to set the rounding mode explicitly).Our measurements have shown that the SSE2 code is up to 55% faster compared to the native C++ implementation and another +10% faster when using the SSE4

`_mm_floor_ps`

intrinsic.Further speed improvements can be achieved by transforming the four RGBA values into RRRR, GGGG, BBBB and AAAA groups (convert Array of Structures (AoS) to Structure of Arrays (SoA)). Such storage allows the use of multiply-add

`_mm_madd_epi16`

and horizontal-add `_mm_hadd_epi32`

(since SSSE3) instructions:inline Pixel GetPixelSSE3(const Image<Pixel>* img, float x, float y) { const int stride = img->width; const Pixel* p0 = img->data + (int)x + (int)y * stride; // pointer to first pixel // Load the data (2 pixels in one load) __m128i p12 = _mm_loadl_epi64((const __m128i*)&p0[0 * stride]); __m128i p34 = _mm_loadl_epi64((const __m128i*)&p0[1 * stride]); __m128 weight = CalcWeights(x, y); // convert RGBA RGBA RGBA RGAB to RRRR GGGG BBBB AAAA (AoS to SoA) __m128i p1234 = _mm_unpacklo_epi8(p12, p34); __m128i p34xx = _mm_unpackhi_epi64(p1234, _mm_setzero_si128()); __m128i p1234_8bit = _mm_unpacklo_epi8(p1234, p34xx); // extend to 16bit __m128i pRG = _mm_unpacklo_epi8(p1234_8bit, _mm_setzero_si128()); __m128i pBA = _mm_unpackhi_epi8(p1234_8bit, _mm_setzero_si128()); // convert weights to integer weight = _mm_mul_ps(weight, CONST_256); __m128i weighti = _mm_cvtps_epi32(weight); // w4 w3 w2 w1 weighti = _mm_packs_epi32(weighti, weighti); // 32->2x16bit //outRG = [w1*R1 + w2*R2 | w3*R3 + w4*R4 | w1*G1 + w2*G2 | w3*G3 + w4*G4] __m128i outRG = _mm_madd_epi16(pRG, weighti); //outBA = [w1*B1 + w2*B2 | w3*B3 + w4*B4 | w1*A1 + w2*A2 | w3*A3 + w4*A4] __m128i outBA = _mm_madd_epi16(pBA, weighti); // horizontal add that will produce the output values (in 32bit) __m128i out = _mm_hadd_epi32(outRG, outBA); out = _mm_srli_epi32(out, 8); // divide by 256 // convert 32bit->8bit out = _mm_packus_epi32(out, _mm_setzero_si128()); out = _mm_packus_epi16(out, _mm_setzero_si128()); // return return _mm_cvtsi128_si32(out); }Now, the weights do not need to be shuffled anymore, instead the RGBA values are reorganized. However, the benefit is only a performance increase of about +3% compared to the former version. On the positive side, the number of code lines is reduced.

Hi,

ReplyDeleteyour standard GetPixel() method contains 6 float -> int and 2 int -> float conversions which are probably very slow.

The experienced SSE speedup is likely due to the fact that SSE has got "native" vector conversion functions.

Have you tried writing and profiling a pure fixed-point GetPixel() method (fixed point arguments, too!), avoiding float<->int conversions?

Regards,

Rafael

@Anonymous, you mean something like this?

ReplyDelete[code]

inline Pixel GetPixel(const Image* img, float x, float y)

{

long Fx = (long)(x * 65536.0); // convert to Fixed16

long Fy = (long)(y * 65536.0); // convert to Fixed16

long px = Fx & 0xFFFF0000; // floor function

long py = Fy & 0xFFFF0000; // floor function

const int stride = img->width;

const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel

// load the four neighboring pixels

const Pixel& p1 = p0[0 + 0 * stride];

const Pixel& p2 = p0[1 + 0 * stride];

const Pixel& p3 = p0[0 + 1 * stride];

const Pixel& p4 = p0[1 + 1 * stride];

// Calculate the weights for each pixel

long fx = Fx & 0x0000FFFF; // frac function

long fy = Fy & 0x0000FFFF; // frac function

long fx1 = 0x00010000 - fx; // 1 - fx

long fy1 = 0x00010000 - fy; // 1 - fy;

long w1 = (fx1 * fy1) << 8;

long w2 = (fx * fy1) << 8;

long w3 = (fx1 * fy ) << 8;

long w4 = (fx * fy ) << 8;

// Calculate the weighted sum of pixels (for each color channel)

int outr = (int)((p1.r * w1 + p2 * w2 + p3.r * w3 + p4.r * w4) >> 16);

int outg = (int)((p1.g * w1 + p2 * w2 + p3.g * w3 + p4.g * w4) >> 16);

int outb = (int)((p1.b * w1 + p2 * w2 + p3.b * w3 + p4.b * w4) >> 16);

int outa = (int)((p1.a * w1 + p2 * w2 + p3.a * w3 + p4.a * w4) >> 16);

return Pixel(outr >> 8, outg >> 8, outb >> 8, outa >> 8);

}

[/code]

-Jesse

I guess I haven’t read such unique material anywhere else online.

ReplyDeletevector free

I ran GetPixel, GetPixel_SSE, GetPixel_SSE3, and GetPixel_fixedpoint on my system. GetPixel_fixedpoint is about as fast as GetPixelSSE and much faster than GetPixel. GetPixel_SSE3 is the fastest.

ReplyDeleteI tested these on an image to make sure the results were correct. However, I had to modify Jesse's code above to get the fixedpoint working. Here is my code if anyone cares.

[code]

inline Pixel GetPixel_fixedpoint(const int* data, float x, float y, const int stride) {

const unsigned int shift = 8; // shift can have values 8 to 16

const unsigned int fixed = 1<>shift;

unsigned int py = (Fy & -fixed)>>shift;

const int* p0 = &data[px + py * stride]; // pointer to first pixel

const Pixel& p1 = p0[0 + 0 * stride];

const Pixel& p2 = p0[1 + 0 * stride];

const Pixel& p3 = p0[0 + 1 * stride];

const Pixel& p4 = p0[1 + 1 * stride];

unsigned int fx = Fx & (fixed-1);

unsigned int fy = Fy & (fixed-1);

unsigned int fx1 = fixed - fx;

unsigned int fy1 = fixed - fy;

unsigned int w1 = (fx1 * fy1) >> shift;

unsigned int w2 = (fx * fy1) >> shift;

unsigned int w3 = (fx1 * fy ) >> shift;

unsigned int w4 = (fx * fy ) >> shift;

// Calculate the weighted sum of pixels (for each color channel)

int outr = (int)((p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4) >> shift);

int outg = (int)((p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4) >> shift);

int outb = (int)((p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4) >> shift);

int outa = (int)((p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4) >> shift);

return Pixel(outr, outg, outb, outa);

}

[/code]

const unsigned int fixed = 1<>shift

Deleteis that an error or what does that operator (<>) do?

Yeah, plus - what is px defined as?

DeleteSorry, when I posted the code I added a bug. Here is the correction

ReplyDeleteunsigned int Fx = (unsigned int) (x * fixed); // convert to Fixed

unsigned int Fy = (unsigned int) (y * fixed); // convert to Fixed

One point to add, GetPixel_fixel was faster than GetPixel in MSVC2010. However, when I compile with MSVC2012 GetPixel is faster. GetPixelSSE(3) is still the fastest by far though.

ReplyDeleteA few posts here have been useful to me while working on a software rasterizer, so I thought I'd give back a bit...

ReplyDeleteHere is an SSE bilinear sampler which processes 4 outputs at a time (intended for rasterizing 2x2 quads)

Input is GL style normalized UV coords, for 4 lanes:

U1U2U3U4

V1V2V3V4

Output is 4 vectors in SoA format

R1R2R3R4

B1B2B3B4

etc

This example is setup for sampling from a 512x512 texture (RGBA as floats) which wraps at the edges, but it should be trivial to adapt it to something else.

(Wrapped in a struct so it can be passed to a template)

transpose4x4() is the same as _mm__MM_TRANSPOSE4_PS() in the Intel intrinsic reference.

[code]

struct ts_512i {

FloatingPointPixel * __restrict texdata;

ts_512i( FloatingPointPixel * const ptr ) :texdata(ptr) {}

__forceinline void fetch_texel( __m128 *px, const __m128i& x, const __m128i& y )

{

static const __m128i texmod = _mm_set1_epi32( 512-1 );

__m128i tx = _mm_and_si128( x, texmod );

__m128i ty = _mm_sub_epi32( texmod, _mm_and_si128( y, texmod ) );

__m128i offset = _mm_or_si128( _mm_slli_epi32( ty, 9 ), tx );

px[0] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[0] ] );

px[1] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[1] ] );

px[2] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[2] ] );

px[3] = _mm_load_ps( (float*)&texdata[ offset.m128i_i32[3] ] );

}

__forceinline void sample( __m128* out, __m128* uv )

{

static const __m128 texsize = _mm_set1_ps( 512 );

static const __m128 one = _mm_set1_ps( 1.0f );

// scale normalized coords to texture coords

__m128 up = _mm_mul_ps( uv[0], texsize );

__m128 vp = _mm_mul_ps( uv[1], texsize );

// floor() and offsets to grab adjacent texels

__m128i tx0 = _mm_cvttps_epi32(up);

__m128i ty0 = _mm_cvttps_epi32(vp);

__m128i tx1 = _mm_add_epi32(tx0,_mm_set1_epi32(1));

__m128i ty1 = _mm_add_epi32(ty0,_mm_set1_epi32(1));

// get fractional parts

__m128 fx = _mm_sub_ps(up,_mm_cvtepi32_ps(tx0));

__m128 fy = _mm_sub_ps(vp,_mm_cvtepi32_ps(ty0));

__m128 fx1 = _mm_sub_ps( one, fx );

__m128 fy1 = _mm_sub_ps( one, fy );

// calculate weights

__m128 w1 = _mm_mul_ps( fx1, fy1 );

__m128 w2 = _mm_mul_ps( fx, fy1 );

__m128 w3 = _mm_mul_ps( fx1, fy );

__m128 w4 = _mm_mul_ps( fx, fy );

// temporary load AoS data, then transpose to make SoA rrrr/gggg/bbbb/aaaa

__m128 p1[4]; fetch_texel( p1, tx0,ty0); transpose4x4( p1[0],p1[1],p1[2],p1[3] );

__m128 p2[4]; fetch_texel( p2, tx1,ty0); transpose4x4( p2[0],p2[1],p2[2],p2[3] );

__m128 p3[4]; fetch_texel( p3, tx0,ty1); transpose4x4( p3[0],p3[1],p3[2],p3[3] );

__m128 p4[4]; fetch_texel( p4, tx1,ty1); transpose4x4( p4[0],p4[1],p4[2],p4[3] );

// sum data from each texels using the weights

out[0] = _mm_add_ps(_mm_add_ps(_mm_add_ps(

_mm_mul_ps(p1[0],w1),_mm_mul_ps(p2[0],w2)),_mm_mul_ps(p3[0],w3)),_mm_mul_ps(p4[0],w4));

out[1] = _mm_add_ps(_mm_add_ps(_mm_add_ps(

_mm_mul_ps(p1[1],w1),_mm_mul_ps(p2[1],w2)),_mm_mul_ps(p3[1],w3)),_mm_mul_ps(p4[1],w4));

out[2] = _mm_add_ps(_mm_add_ps(_mm_add_ps(

_mm_mul_ps(p1[2],w1),_mm_mul_ps(p2[2],w2)),_mm_mul_ps(p3[2],w3)),_mm_mul_ps(p4[2],w4));

out[3] = _mm_add_ps(_mm_add_ps(_mm_add_ps(

_mm_mul_ps(p1[3],w1),_mm_mul_ps(p2[3],w2)),_mm_mul_ps(p3[3],w3)),_mm_mul_ps(p4[3],w4));

}

};

[/code]

awesome, thanks for the code!

ReplyDeleteIn GetPixel, when high precision is desired, you may want rounding instead of truncation:

ReplyDeleteint w1 = fx1 * fy1 * 256.0f;

=>

int w1 = fx1 * fy1 * 256.0f +0.5f;

return Pixel(outr >> 8, outg >> 8, outb >> 8, outa >> 8);

=>

return Pixel((outr + 128) >> 8, (outg +128) >> 8, (outb + 128) >> 8, (outa + 128) >> 8);

One important correction:

ReplyDelete__m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvtps_epi32(psXY));

=>

__m128 psXYfloor = _mm_cvttepi32_ps(_mm_cvtps_epi32(psXY));

_mm_cvttepi32_ps does truncation, while _mm_cvtepi32_ps does rounding, the latter would lead to negative weights.

Should be:

Delete__m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvtps_epi32(psXY));

=>

__m128 psXYfloor = _mm_cvtepi32_ps(_mm_cvttps_epi32(psXY));

You are right. _mm_cvttps_epi32() shoud be used to take the place of _mm_cvtps_epi32(). _mm_cvtps_epi32() did causes some wrong pixel values in my experiment and _mm_cvttps_epi32() did not.

DeleteNice, very helpful.

ReplyDelete~Victor

I tried the original GetPixel function above, but there is a problem with the code. When converting from float to int in the w1, w2, w3, w4 variables, the truncation can decrease the interpolated pixel way too much. Try for example if p1 = 971, p2=p3= 972, p4 = 973, with fx = .751 and fy = .406 (some numbers I was using). The interpolated result becomes 964! That's way lower than the surrounding pixels. Using floating point versions of the w1, w2, w3, w4 variables solves this and gives the correct answer of 972.

ReplyDeleteI have a very naive question but I want to know how pixels at non integer locations are stored (as all array indexes are discrete). Please do reply to my question.

ReplyDeleteI do not completely understand your question. Pixels are always discrete and if you access a non-integer coordinate, you use interpolation. This gives you a new pixel value and you have to store that pixel at a discrete address again.

Deletehi every one;

ReplyDeleteI am new on SIMD and I would like to use this code but I don't know how to write a main to use the function, i mean principally ho to load an image because actually i am working on opencv and i don't know whether there is a way to relate all these stuff together.

Hi every one.

ReplyDeletei want to use bilinear interpolation in my project, the one implemented in SSE2 but i am working in opencv, i really don't know how to relate these, i mean for example how do i load an image to use the function in SSE, please, i need help of how to write a test to use this function

Isn't this code missing a check for x == img->width - 1 ?

ReplyDelete