16-bit Alpha Blending Using an 8-bit Mask
Information in this document is provided in
connection with Intel products. No license, express or
implied, by estoppel or otherwise, to any intellectual
property rights is granted by this document. Except as
provided in Intel's Terms and Conditions of Sale for such
products, Intel assumes no liability whatsoever, and
Intel disclaims any express or implied warranty, relating
to sale and/or use of Intel products including liability
or warranties relating to fitness for a particular
purpose, merchantability, or infringement of any patent,
copyright or other intellectual property right. Intel
products are not intended for use in medical, life
saving, or life sustaining applications. Intel may make
changes to specifications and product descriptions at any
time, without notice. Copyright (c)
Intel Corporation 1997.
*Third-party brands and names are the
property of their respective owners.
This application note covers the optimization techniques
used to implement 16-bit alpha blending with a predefined 8-bit
mask using MMX™ Technology for Microsoft* DirectDraw.
16-bit alpha blending with an 8-bit mask is commonly used in image
manipulation and in anti-aliasing sprites for games. The 16-bit
alpha blending routine uses MMX Technology to calculate the alpha
blending result for four pixels per iteration using 16-bit
source, 16-bit destination, and 8-bit alpha bitmaps. This appnote
provides an overview of the alpha blending equation,
implementation and optimization techniques for 16-bit RGB alpha
blending on a Pentium® II Processor, performance results, and
functions for 565 and 555 color space alpha blending. The
functions provided in this appnote are optimized to work with
DirectDraw surfaces in system memory on Pentium II Processors.
For general coverage of alpha blending using MMX Technology,
please refer to the appnote Using MMX™ Instructions to Implement
Alpha Blending.
The technique for alpha blending uses the simple
mathematical function:
Co = Cs*A + (1-A)*Cd
Figure 1: Floating Point Alpha Blend Equation
Where the values for C are either the red, green,
or blue component of a pixel in the source and destination image.
The subscript s denotes the source color and d
denotes the destination color. In this equation the source pixels
are multiplied by an alpha factor A while the destination pixels
are multiplied by the inverse alpha value (1-A). The alpha value
A ranges between zero and one. Each color component (R, G, B)
must have the same dynamic range in the source and destination
bitmaps (e.g., five bits for red in both source and
destination). However, the dynamic ranges of different color
components within a pixel (e.g., five bits for red and six bits
for green) need not be the same.
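As a concrete illustration, the equation in Figure 1 can be applied per component in C. This is a sketch only; the function name and the 8-bit component range are assumptions for illustration, not part of this appnote:

```c
#include <assert.h>

/* Blend one color component using the floating-point equation
   Co = Cs*A + (1-A)*Cd, with A in [0,1].  The 8-bit component
   range here is an assumption for the example. */
static unsigned char blend_component(unsigned char cs, unsigned char cd, float a)
{
    return (unsigned char)(cs * a + (1.0f - a) * cd + 0.5f); /* round to nearest */
}
```

With a = 1 the source shows through unchanged, with a = 0 the destination does, which is exactly the property the early-out optimization in Section 2 exploits.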
This section describes the various aspects of writing the
16-bit alpha blending routines shown in the Appendix. These factors consist of the
makeup of function inputs, video hardware, pixel data storage
and arrangement, Pentium II Processor cache and instruction
execution, integer alpha blending calculations, MMX Technology
implementation, work reduction, and video surfaces.
The inputs for both alpha blending routines consist of the
memory location, pitch for a row of pixels, and coordinates for
the alpha, source, and destination bitmaps (shown in Figure 2).
These attributes can be obtained from the DirectDraw data
structure DDSURFACEDESC after a call to lock the DirectDraw
surface. For the functions to work properly, the alpha mask must
have a one-to-one mapping with the source surface. The alpha
blending is bounded by iDstW (width) and iDstH (height), which limit
the operation to a rectangular area of the alpha mask, source
bitmap, and destination bitmap.
void ablend_555(unsigned char *lpAlpha, unsigned int iAlpPitch,
                unsigned char *lpSrc, unsigned int iSrcX, unsigned int iSrcY,
                unsigned int iSrcPitch, unsigned char *lpDst,
                unsigned int iDstX, unsigned int iDstY,
                unsigned int iDstW, unsigned int iDstH,
                unsigned int iDstPitch)
Figure 2: 555 Alpha Blending Function Prototype
The makeup of the video hardware affects the display of the
resulting alpha-blended image. The image must conform to the bit
allocations of the video hardware so that the surface is
displayed properly. There are two methods of representing 16-bit
color in video hardware. Both methods, 555 and 565, describe bit
allocations for the RGB color components that make up a pixel. In
the case of 555, each color component is allocated five bits, with
the most significant bit left unused. For 565 the extra bit is
used by the green component. The arrangement of color components
(either RGB or BGR) depends on the implementation of the
video hardware.
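For illustration, packing reduced components into the two layouts (assuming red occupies the high bits, as in the routines in the Appendix) might look like the following sketch; the function names are hypothetical:

```c
#include <stdint.h>

/* Pack components into the two 16-bit layouts described above.
   Inputs are already reduced to their dynamic ranges
   (r, b: 0-31; g: 0-31 for 555 or 0-63 for 565). */
static uint16_t pack555(unsigned r, unsigned g, unsigned b)
{
    return (uint16_t)((r << 10) | (g << 5) | b);   /* 0RRRRRGGGGGBBBBB */
}
static uint16_t pack565(unsigned r, unsigned g, unsigned b)
{
    return (uint16_t)((r << 11) | (g << 5) | b);   /* RRRRRGGGGGGBBBBB */
}
```

Note that the maximum 555 pixel is 0x7FFF and the maximum 565 pixel is 0xFFFF, the fill values used in the performance tests of Section 3.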
In DirectDraw applications, the pixel format is video card
dependent and stored sequentially in RGB format. This applies to
the source and destination bitmaps which use 16-bit RGB color.
The alpha mask is stored in memory in eight bit format
independent of the video hardware. Knowledge of bitmap and mask
storage is useful when attempting to access the pixels for alpha
blending via surface pointers. In DirectDraw, surfaces are
allocated on four byte boundaries for each row of pixels. A
surface that does not fit in the four byte boundaries is padded
with extra bytes. For example, a DirectDraw surface 23 pixels
wide, 23 pixels high, and 16 bits deep has a span of 46 bytes per
row (23 pixels * 2 bytes). However, when DirectDraw allocates the
memory it creates a surface with a pitch of 48 bytes per row to
maintain four-byte alignment. Thus the width of some bitmaps will
not correspond to the amount of system or video memory actually
used by a row of pixels. The padding skews the pixel accesses for
each successive line. To correct the problem, the pitch is used to
advance the surface pointer to the next line of the image.
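The four-byte rounding rule can be sketched as follows (a hypothetical helper for illustration only; in practice the pitch should always be taken from the DDSURFACEDESC structure rather than computed):

```c
#include <assert.h>

/* DirectDraw rounds each row up to a 4-byte boundary; the resulting
   pitch, not width * bytes-per-pixel, must be used to step between
   rows of the surface. */
static unsigned row_pitch(unsigned width_pixels, unsigned bytes_per_pixel)
{
    unsigned span = width_pixels * bytes_per_pixel;
    return (span + 3) & ~3u;            /* round up to a multiple of 4 */
}
```

For the 23x23 16-bit example above, the 46-byte span rounds up to a 48-byte pitch.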
The two significant features of the Pentium II Processor
affecting the implementation of the alpha blending algorithm are
non-aligned cache line access and dynamic instruction execution.
A non-aligned cache line access occurs when an application
accesses data on a cache line that is not four byte aligned. For
example, when an application accesses a single byte at the
beginning of a cache line and then proceeds to access a word or
dword, a non-aligned access occurs. On a Pentium Processor with
MMX Technology this non-aligned cache line access causes a one
cycle penalty each time the data is read from the cache. The
Pentium II Processor allows non-aligned cache line accesses
without a penalty. Therefore, pre- and post-alignment of data
accesses is no longer necessary. The Pentium II Processor also
utilizes dynamic execution with instruction look-ahead to
determine which instructions to dispatch to its five execution
units, doing away with the need for pairing optimizations.
Assembly code written for the Pentium II Processor therefore no
longer requires pre/post data-alignment code or pairing
optimizations in MMX Technology code.
The equation for alpha blending discussed in Section 2.1
showed that the alpha value was a floating point number between
zero and one. However the data for the alpha mask, as well as the
source and destination bitmaps are represented in integer format.
Since converting each value to floating point for the equation
would cost too many cycles, the alpha blending algorithm
utilizes an integer scaling factor to calculate color components
entirely in integer format. An in-depth discussion of integer
scaling appears in James F. Blinn's article "Three Wrongs Make a
Right" in the November 1995 issue of IEEE Computer Graphics and
Applications. By applying the integer scaling method described in
the article, the 555 alpha blending algorithm reduces to the
simple series of shifts and multiplies shown in Figure 3. The
eight-bit alpha value A is scaled to a five-bit integer by
shifting right three bits.
C1 = Cs*A + (31-A)*Cd
Co = ((C1+16) + ((C1+16)>>5)) >> 5
Figure 3: 5-Bit Integer Alpha Blend Equation
In the case of 565 alpha blending, the green
component utilizes the six bit integer alpha blending equation
shown in Figure 4. The alpha value A is scaled to a six bit
integer by shifting right by two bits.
C1 = Cs*A + (63-A)*Cd
Co = ((C1+32) + ((C1+32)>>6)) >> 6
Figure 4: 6-Bit Integer Alpha Blend Equation
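Both figures follow the same pattern, so a scalar C sketch can be parameterized by the component's bit depth (the function name is illustrative; the MMX Technology routines in the Appendix unroll this arithmetic across four pixels at once):

```c
#include <assert.h>

/* Integer alpha blend for one color component with an n-bit dynamic
   range (n = 5 or 6), following Figures 3 and 4.  'a' has already
   been scaled to n bits, so its maximum is (1<<n)-1. */
static unsigned blend_int(unsigned cs, unsigned cd, unsigned a, unsigned n)
{
    unsigned max  = (1u << n) - 1;          /* 31 for n=5, 63 for n=6 */
    unsigned half = 1u << (n - 1);          /* 16 for n=5, 32 for n=6 */
    unsigned c1   = cs * a + (max - a) * cd;
    return ((c1 + half) + ((c1 + half) >> n)) >> n;
}
```

The `(c1 + half) >> n` correction term is what makes the divide-by-31 (or 63) come out exact using only shifts, per Blinn's article.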
Two optimizations improve the performance of the alpha
blending by reducing the number of instructions to be executed.
The first method, early-out testing, takes advantage of the fact
that the alpha blending equation displays destination pixels when
the alpha values are all zeros and source pixels when the alpha
values are all ones. By using a simple pair of dword comparisons
for runs of zeros and ones, the alpha blending equation can be
skipped. This method works well for long runs of ones or zeros in
the alpha mask (such as a large control panel alpha blended
against a background image). The second method, pre-processing
alpha masks, requires that each alpha mask be reduced to the
dynamic range of each color component. For the 555 alpha blend
routine the alpha mask can be preprocessed to a five-bit mask
value to avoid the in-function shift. For the 565 alpha blend
algorithm two preprocessed masks are required: one for five bits
(for red and blue calculations) and one for six bits (for green
calculations). This technique was not used in the alpha blending
functions below because it is easier to generate a single
eight-bit alpha mask for use in both the 565 and 555 alpha
blending routines.
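The early-out test on four packed alpha values can be sketched in C as follows (the helper and enum names are hypothetical; the Appendix performs the same two dword comparisons in assembly):

```c
#include <stdint.h>
#include <string.h>

/* Classify four packed 8-bit alpha values at once, mirroring the
   dword comparisons in the Appendix: 0x00000000 keeps the
   destination, 0xFFFFFFFF copies the source, and anything else
   must be blended. */
enum run_kind { RUN_DST, RUN_SRC, RUN_BLEND };

static enum run_kind classify_alpha_run(const unsigned char *alpha4)
{
    uint32_t v;
    memcpy(&v, alpha4, 4);               /* read four alphas as one dword */
    if (v == 0x00000000u) return RUN_DST;
    if (v == 0xFFFFFFFFu) return RUN_SRC;
    return RUN_BLEND;
}
```

One 32-bit compare thus stands in for four per-pixel tests, which is why the method pays off on long runs of solid or transparent mask.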
The functions provided in the Appendix
will operate on surfaces allocated in system or video memory.
However when applied to video surfaces (either as a source,
destination or both) there will be a significant performance
degradation. The degradation occurs when video memory is accessed
over the PCI bus from a PCI video card. Every request for a pixel
in video memory passes through the PCI bus, which is
significantly slower than the memory bus. The problem is
compounded by video memory being non-cacheable, so that every
time a pixel is reused (as when isolating the RGB components of
the pixel in the algorithm below) another request for the pixel
information is placed on the PCI bus. There are two solutions for
resolving this performance bottleneck. The first is to apply
alpha blending to surfaces allocated in system memory and then
copy the completed image into video memory; the PCI bus is then
used only for large block transfers to video memory. If the
surfaces must remain in video memory, then all requests for pixel
information on video surfaces should be cached. For example, the
first request for a pixel from a video surface should read
multiple bytes into a temporary storage location in system
memory, and subsequent accesses to the video surface should be
directed to that cache area.
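The second solution might be sketched as follows (an illustrative, hypothetical row cache, not code from this appnote; the 16-bit pixel read assumes a little-endian layout as on x86):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical caching scheme: the first access to a row of a video
   surface copies the whole row into a system-memory scratch buffer
   with one block read; later pixel accesses for that row are served
   from the cached copy instead of going back over the PCI bus. */
typedef struct {
    uint8_t  cache[4096];   /* scratch row in (fast) system memory */
    unsigned bytes;         /* valid bytes in the cache */
} row_cache;

static void row_cache_load(row_cache *rc, const uint8_t *video_row, unsigned bytes)
{
    rc->bytes = bytes;
    memcpy(rc->cache, video_row, bytes); /* one large read over the bus */
}

static uint16_t row_cache_pixel16(const row_cache *rc, unsigned x)
{
    uint16_t px;
    memcpy(&px, rc->cache + 2 * x, 2);   /* served from system memory */
    return px;
}
```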
The MMX Technology implementation takes advantage of work
reduction and integer calculations to create the resulting
alpha-blended pixel map. The SIMD instructions read four values
from the alpha stream to determine whether the corresponding run
of four pixels requires alpha blending. For pixel runs that
require alpha blending, the integer equation is applied to the
source and destination pixel streams. The MMX Technology data
flow for the green component in the five-bit alpha blending
algorithm is shown in Figure 5. It is an exact replication of the
integer equation shown in Figure 3.
Figure 5: MMX Technology Data Flow for Green Component Alpha
Blend
The following test results were obtained from a prototype
233 MHz Pentium II Processor using system memory blocks ranging
from 2K to 64K in size. Both source and destination surfaces were
allocated the same sized memory blocks. The source was zero
filled and the destination was filled with the maximum pixel
value (0x7FFF for 555 pixels and 0xFFFF for 565 pixels). The
eight-bit alpha values used in the calculations were randomly
generated. The two routines, 565 and 555 alpha blending,
generated very similar test results; therefore the results of
both were combined into the single MMX Technology versus scalar
assembly comparison shown in Figure 6. The MMX Technology
routines were between four and five times faster than the scalar
assembly routine. For small pixel runs the benefits of MMX
Technology are minor compared to scalar assembly because of the
data setup overhead for alpha blend values and pixel component
manipulation. With larger runs of pixels the benefits of work
reduction and SIMD show in the 8K to 64K blocks, which translate
to 16-bit pixel blocks of 64x64 to 256x128.
Figure 6: Alpha Blending Vs. Memory Block Size on a Pentium II
Processor
The 16-bit alpha blending routine with an 8-bit mask produces a
performance boost on the Pentium II Processor when the
optimizations discussed in Section 2.2
are implemented. This was shown in Section 3.0, where the routine
ran four to five times faster than the equivalent scalar assembly
routine. MMX™ Technology was the primary contributor to the speed
increase, processing four pixels per loop iteration. In addition,
the Pentium II Processor eased code implementation by dispensing
with the need for instruction pairing (through dynamic execution)
and aligned cache line accesses for surfaces in system memory.
The combined effect is an algorithm that is significantly faster
than its scalar assembly counterpart.
/*Alpha blending routine for 555 video hardware
lpAlpha=Pointer to location of 8-bit alpha mask
iAlpPitch=The span of a row in the alpha mask
lpSrc=Pointer to location of 16-bit 555 source bitmap
iSrcX=Starting X location of 16-bit source bitmap
iSrcY=Starting Y location of 16-bit source bitmap
iSrcPitch=The span of a row of pixels in the Source bitmap
lpDst=Pointer to location of 16-bit 555 destination bitmap
iDstX=Starting X location of 16-bit destination bitmap
iDstY=Starting Y location of 16-bit destination bitmap
iDstW=Width in pixels for alpha mask, source and destination bitmaps
iDstH=Height in pixels for alpha mask, source and destination bitmaps
iDstPitch=The span of a row of pixels in the destination bitmap*/
void ablend_555(unsigned char *lpAlpha,unsigned int iAlpPitch,
unsigned char *lpSrc,unsigned int iSrcX, unsigned int iSrcY,
unsigned int iSrcPitch, unsigned char *lpDst,
unsigned int iDstX, unsigned int iDstY,
unsigned int iDstW, unsigned int iDstH,
unsigned int iDstPitch)
{
//Mask for isolating the red,green, and blue components
static __int64 MASKR=0x001F001F001F001F;
static __int64 MASKG=0x03E003E003E003E0;
static __int64 MASKB=0x7C007C007C007C00;
//constants used by the integer alpha blending equation
static __int64 SIXTEEN=0x0010001000100010;
static __int64 FIVETWELVE=0x0200020002000200;
unsigned char *lpLinearDstBp=(iDstX<<1)+(iDstY*iDstPitch)+lpDst; //base pointer for linear destination
unsigned char *lpLinearSrcBp=(iSrcX<<1)+(iSrcY*iSrcPitch)+lpSrc; //base pointer for linear source
unsigned char *lpLinearAlpBp=iSrcX+(iSrcY*iAlpPitch)+lpAlpha; //base pointer for linear alpha
_asm{
mov esi,lpLinearSrcBp; //src
mov edi,lpLinearDstBp; //dst
mov eax,lpLinearAlpBp; //alpha
mov ecx,iDstH; //ecx=number of lines to copy
mov ebx,iDstW; //ebx=span width to copy
test esi,6; //check if source address is qword aligned
//since addr coming in is always word aligned(16bit)
jnz done; //if not qword aligned we don't do anything
primeloop:
movd mm1,[eax]; //mm1=00 00 00 00 a3 a2 a1 a0
pxor mm2,mm2; //mm2=0;
movq mm4,[esi]; //g1: mm4=src3 src2 src1 src0
punpcklbw mm1,mm2; //mm1=00a3 00a2 00a1 00a0
loopqword:
mov edx,[eax];
test ebx,0xFFFFFFFC; //check if only 3 pixels left
jz checkback; //3 or less pixels left
//early out tests
cmp edx,0xffffffff; //test for alpha value of 1
je copyback; //if 1's copy the source pixels to the destination
test edx,0xffffffff; //test for alpha value of 0
jz leavefront; //if so go to the next 4 pixels
//the alpha blend starts
//i=a*sr+(31-a)*dr;
//i=(i+16)+((i+16)>>5)>>5;
movq mm5,[edi]; //g2: mm5=dst3 dst2 dst1 dst0
psrlw mm1,3; //mm1=a?>>3 nuke out lower 3 bits
movq mm7,MASKG; //g3: mm7=green mask
movq mm0,mm1; //mm0=00a3 00a2 00a1 00a0
movq mm2,MASKR; //g4: mm2=31
pand mm4,mm7; //g5: mm4=sg3 sg2 sg1 sg0
psubsb mm2,mm0; //g6: mm2=31-a3 31-a2 31-a1 31-a0
pand mm5,mm7; //g7: mm5=dg3 dg2 dg1 dg0
movq mm3,[esi]; //b1: mm3=src3 src2 src1 src0
pmullw mm4,mm1; //g8: mm4=sg?*a?
movq mm7,MASKR; //b2: mm7=blue mask
pmullw mm5,mm2; //g9: mm5=dg?*(1-a?)
movq mm0,[edi]; //b3: mm0=dst3 dst2 dst1 dst0
pand mm3,mm7; //b4: mm3=sb3 sb2 sb1 sb0
pand mm0,mm7; //b5: mm0=db3 db2 db1 db0
pmullw mm3,mm1; //b6: mm3=sb?*a?
movq mm7,[esi]; //r1: mm7=src3 src2 src1 src0
paddw mm4,mm5; //g10: mm4=sg?*a?+dg?*(1-a?)
paddw mm4,FIVETWELVE; //g11: mm4=(mm4+512) green
pmullw mm0,mm2; //b7: mm0=db?*(1-a?)
pand mm7,MASKB; //r2: mm7=sr3 sr2 sr1 sr0
movq mm5,mm4; //g12: mm5=mm4 green
psrlw mm4,5; //g13: mm4=mm4>>5
paddw mm4,mm5; //g14: mm4=mm5+mm4 green
movq mm5,[edi]; //r3: mm5=dst3 dst2 dst1 dst0
paddw mm0,mm3; //b8: mm0=sb?*a?+db?*(1-a?)
paddw mm0,SIXTEEN; //b9: mm0=(mm0+16) blue
psrlw mm7,10; //r4: shift src red down to position 0
pand mm5,MASKB; //r5: mm5=dr3 dr2 dr1 dr0
psrlw mm4,5; //g15: mm4=0?g0 0?g0 0?g0 0?g0 green
movq mm3,mm0; //b10: mm3=mm0 blue
psrlw mm0,5; //b11: mm0=mm0>>5 blue
psrlw mm5,10; //r6: shift dst red down to position 0
paddw mm0,mm3; //b12: mm0=mm3+mm0 blue
psrlw mm0,5; //b13: mm0=000b 000b 000b 000b blue
pmullw mm7,mm1; //mm7=sr?*a?
pand mm4,MASKG; //g16: mm4=00g0 00g0 00g0 00g0 green
pmullw mm5,mm2; //r7: mm5=dr?*(31-a?)
por mm0,mm4; //mm0=00gb 00gb 00gb 00gb
add eax,4; //move to next 4 alphas
add esi,8; //move to next 4 pixels in src
add edi,8; //move to next 4 pixels in dst
movd mm1,[eax]; //mm1=00 00 00 00 a3 a2 a1 a0
paddw mm5,mm7; //r8: mm5=sr?*a?+dr?*(31-a?)
paddw mm5,SIXTEEN; //r9: mm5=(mm5+16) red
pxor mm2,mm2; //mm0=0;
movq mm7,mm5; //r10: mm7=mm5 red
psrlw mm5,5; //r11: mm5=mm5>>5 red
movq mm4,[esi]; //g1: mm4=src3 src2 src1 src0
paddw mm5,mm7; //r12: mm5=mm7+mm5 red
punpcklbw mm1,mm2; //mm1=00a3 00a2 00a1 00a0
psrlw mm5,5; //r13: mm5=mm5>>5 red
psllw mm5,10; //r14: mm5=mm5<<10 red
por mm0,mm5; //mm0=0rgb 0rgb 0rgb 0rgb
sub ebx,4; //polished off 4 pixels
movq [edi-8],mm0; //dst=0rgb 0rgb 0rgb 0rgb
jmp loopqword; //go back to start
copyback:
movq [edi],mm4; //copy source to destination
leavefront:
add edi,8; //advance destination by 4 pixels
add eax,4; //advance alpha by 4
add esi,8; //advance source by 4 pixels
sub ebx,4; //decrease pixel count by 4
jmp primeloop;
checkback:
test ebx,0xFF; //check if 0 pixels left
jz nextline; //done with this span
//backalign: //work out back end pixels
movq mm5,[edi]; //g2: mm5=dst3 dst2 dst1 dst0
psrlw mm1,3; //mm1=a?>>3 nuke out lower 3 bits
movq mm7,MASKG; //g3: mm7=green mask
movq mm0,mm1; //mm0=00a3 00a2 00a1 00a0
movq mm2,MASKR; //g4: mm2=31 mask
pand mm4,mm7; //g5: mm4=sg3 sg2 sg1 sg0
psubsb mm2,mm0; //g6: mm2=31-a3 31-a2 31-a1 31-a0
pand mm5,mm7; //g7: mm5=dg3 dg2 dg1 dg0
movq mm3,[esi]; //b1: mm3=src3 src2 src1 src0
pmullw mm4,mm1; //g8: mm4=sg?*a?
movq mm7,MASKR; //b2: mm7=blue mask
pmullw mm5,mm2; //g9: mm5=dg?*(1-a?)
movq mm0,[edi]; //b3: mm0=dst3 dst2 dst1 dst0
pand mm3,mm7; //b4: mm3=sb3 sb2 sb1 sb0
pand mm0,mm7; //b5: mm0=db3 db2 db1 db0
pmullw mm3,mm1; //b6: mm3=sb?*a?
movq mm7,[esi]; //r1: mm7=src3 src2 src1 src0
paddw mm4,mm5; //g10: mm4=sg?*a?+dg?*(1-a?)
paddw mm4,FIVETWELVE; //g11: mm4=(mm4+512) green
pmullw mm0,mm2; //b7: mm0=db?*(1-a?)
pand mm7,MASKB; //r2: mm7=sr3 sr2 sr1 sr0
movq mm5,mm4; //g12: mm5=mm4 green
psrlw mm4,5; //g13: mm4=mm4>>5
paddw mm4,mm5; //g14: mm4=mm4+mm5 green
movq mm5,[edi]; //r3: mm5=dst3 dst2 dst1 dst0
paddw mm0,mm3; //b8: mm0=sb?*a?+db?*(1-a?)
paddw mm0,SIXTEEN; //b9: mm0=(mm0+16) blue
psrlw mm7,10; //r4: shift src red down to position 0
pand mm5,MASKB; //r5: mm5=dr3 dr2 dr1 dr0
psrlw mm4,5; //g15: mm4=0?g0 0?g0 0?g0 0?g0 green
movq mm3,mm0; //b10: mm3=mm0 blue
psrlw mm0,5; //b11: mm0=mm0>>5 blue
psrlw mm5,10; //r6: shift dst red down to position 0
paddw mm0,mm3; //b12: mm0=mm0+mm3 blue
psrlw mm0,5; //b13: mm0=000b 000b 000b 000b blue
pmullw mm7,mm1; //mm7=sr?*a?
pand mm4,MASKG; //g16: mm4=00g0 00g0 00g0 00g0 green
pmullw mm5,mm2; //r7: mm5=dr?*(31-a?)
por mm0,mm4; //mm0=00gb 00gb 00gb 00gb
//stall
paddw mm5,mm7; //r8: mm5=sr?*a?+dr?*(31-a?)
paddw mm5,SIXTEEN; //r9: mm5=(mm5+16) red
movq mm7,mm5; //r10: mm7=mm5 red
psrlw mm5,5; //r11: mm5=mm5>>5 red
paddw mm5,mm7; //r12: mm5=mm5+mm7 red
psrlw mm5,5; //r13: mm5=mm5>>5 red
psllw mm5,10; //r14: mm5=mm5<<10 red
por mm0,mm5; //mm0=0rgb 0rgb 0rgb 0rgb
test ebx,2; //check if there are 2 pixels
jz oneendpixel; //goto one pixel if that's it
movd [edi],mm0; //dst=0000 0000 0rgb 0rgb
psrlq mm0,32; //mm0>>32
add edi,4; //edi=edi+4
sub ebx,2; //saved 2 pixels
jz nextline; //all done goto next line
oneendpixel: //work on last pixel
movd edx,mm0; //edx=0rgb
mov [edi],dx; //dst=0rgb
nextline: //goto next line
dec ecx; //nuke one line
jz done; //all done
mov eax,lpLinearAlpBp; //alpha
mov esi,lpLinearSrcBp; //src
mov edi,lpLinearDstBp; //dst
add eax,iAlpPitch; //inc alpha ptr by 1 line
add esi,iSrcPitch; //inc src ptr by 1 line
add edi,iDstPitch; //inc dst ptr by 1 line
mov lpLinearAlpBp,eax; //save new alpha base ptr
mov ebx,iDstW; //ebx=span width to copy
mov lpLinearSrcBp,esi; //save new src base ptr
mov lpLinearDstBp,edi; //save new dst base ptr
jmp primeloop; //start the next span
done:
emms
}
}
/*Alpha blending routine for 565 video hardware
lpAlpha=Pointer to location of 8-bit alpha mask
iAlpPitch=The span of a row in the alpha mask
lpSrc=Pointer to location of 16-bit 565 source bitmap
iSrcX=Starting X location of 16-bit source bitmap
iSrcY=Starting Y location of 16-bit source bitmap
iSrcPitch=The span of a row of pixels in the Source bitmap
lpDst=Pointer to location of 16-bit 565 destination bitmap
iDstX=Starting X location of 16-bit destination bitmap
iDstY=Starting Y location of 16-bit destination bitmap
iDstW=Width in pixels for alpha mask, source and destination bitmaps
iDstH=Height in pixels for alpha mask, source and destination bitmaps
iDstPitch=The span of a row of pixels in the destination bitmap*/
void ablend_565(unsigned char *lpAlpha,unsigned int iAlpPitch,
unsigned char *lpSrc,unsigned int iSrcX, unsigned int iSrcY,
unsigned int iSrcPitch, unsigned char *lpDst,
unsigned int iDstX, unsigned int iDstY,
unsigned int iDstW, unsigned int iDstH,
unsigned int iDstPitch)
{
//Mask for isolating the red,green, and blue components
static __int64 MASKB=0x001F001F001F001F;
static __int64 MASKG=0x07E007E007E007E0;
static __int64 MASKSHIFTG=0x03F003F003F003F0;
static __int64 MASKR=0xF800F800F800F800;
//constants used by the integer alpha blending equation
static __int64 SIXTEEN=0x0010001000100010;
static __int64 FIVETWELVE=0x0200020002000200;
static __int64 SIXONES=0x003F003F003F003F;
unsigned char *lpLinearDstBp=(iDstX<<1)+(iDstY*iDstPitch)+lpDst; //base pointer for linear destination
unsigned char *lpLinearSrcBp=(iSrcX<<1)+(iSrcY*iSrcPitch)+lpSrc; //base pointer for linear source
unsigned char *lpLinearAlpBp=iSrcX+(iSrcY*iAlpPitch)+lpAlpha; //base pointer for linear alpha
_asm{
mov esi,lpLinearSrcBp; //src
mov edi,lpLinearDstBp; //dst
mov eax,lpLinearAlpBp; //alpha
mov ecx,iDstH; //ecx=number of lines to copy
mov ebx,iDstW; //ebx=span width to copy
test esi,6; //check if source address is qword aligned
//since addr coming in is always word aligned(16bit)
jnz done; //if not qword aligned we don't do anything
primeloop:
movd mm1,[eax]; //mm1=00 00 00 00 a3 a2 a1 a0
pxor mm2,mm2; //mm2=0;
movq mm4,[esi]; //g1: mm4=src3 src2 src1 src0
punpcklbw mm1,mm2; //mm1=00a3 00a2 00a1 00a0
loopqword:
mov edx,[eax];
test ebx,0xFFFFFFFC; //check if only 3 pixels left
jz checkback; //3 or less pixels left
//early out tests
cmp edx,0xffffffff; //test for alpha value of 1
je copyback; //if 1's copy the source pixels to the destination
test edx,0xffffffff; //test for alpha value of 0
jz leavefront; //if so go to the next 4 pixels
//the alpha blend starts
//green
//i=a*sg+(63-a)*dg;
//i=(i+32)+((i+32)>>6)>>6;
//red
//i=a*sr+(31-a)*dr;
//i=(i+16)+((i+16)>>5)>>5;
movq mm5,[edi]; //g2: mm5=dst3 dst2 dst1 dst0
psrlw mm1,2; //mm1=a?>>2 nuke out lower 2 bits
movq mm7,MASKSHIFTG; //g3: mm7=1 bit shifted green mask
psrlw mm4,1; //g3a: move src green down by 1 so that we won't overflow
movq mm0,mm1; //mm0=00a3 00a2 00a1 00a0
psrlw mm5,1; //g3b: move dst green down by 1 so that we won't overflow
psrlw mm1,1; //mm1=a?>>1 nuke out lower 1 bits
pand mm4,mm7; //g5: mm4=sg3 sg2 sg1 sg0
movq mm2,SIXONES;//g4: mm2=63
pand mm5,mm7; //g7: mm5=dg3 dg2 dg1 dg0
movq mm3,[esi]; //b1: mm3=src3 src2 src1 src0
psubsb mm2,mm0; //g6: mm2=63-a3 63-a2 63-a1 63-a0
movq mm7,MASKB; //b2: mm7=BLUE MASK
pmullw mm4,mm0; //g8: mm4=sg?*a?
movq mm0,[edi]; //b3: mm0=dst3 dst2 dst1 dst0
pmullw mm5,mm2; //g9: mm5=dg?*(1-a?)
movq mm2,mm7; //b4: mm2=fiveones
pand mm3,mm7; //b4: mm3=sb3 sb2 sb1 sb0
pmullw mm3,mm1; //b6: mm3=sb?*a?
pand mm0,mm7; //b5: mm0=db3 db2 db1 db0
movq mm7,[esi]; //r1: mm7=src3 src2 src1 src0
paddw mm4,mm5; //g10: mm4=sg?*a?+dg?*(1-a?)
pand mm7,MASKR; //r2: mm7=sr3 sr2 sr1 sr0
psubsb mm2,mm1; //b5a: mm2=31-a3 31-a2 31-a1 31-a0
paddw mm4,FIVETWELVE; //g11: mm4=(mm4+512) green
pmullw mm0,mm2; //b7: mm0=db?*(1-a?)
movq mm5,mm4; //g12: mm5=mm4 green
psrlw mm7,11; //r4: shift src red down to position 0
psrlw mm4,6; //g13: mm4=mm4>>6
paddw mm4,mm5; //g14: mm4=mm4+mm5 green
paddw mm0,mm3; //b8: mm0=sb?*a?+db?*(1-a?)
movq mm5,[edi]; //r3: mm5=dst3 dst2 dst1 dst0
paddw mm0,SIXTEEN; //b9: mm0=(mm0+16) blue
pand mm5,MASKR; //r5: mm5=dr3 dr2 dr1 dr0
psrlw mm4,5; //g15: mm4=0?g0 0?g0 0?g0 0?g0 green
movq mm3,mm0; //b10: mm3=mm0 blue
psrlw mm0,5; //b11: mm0=mm0>>5 blue
psrlw mm5,11; //r6: shift dst red down to position 0
paddw mm0,mm3; //b12: mm0=mm3+mm0 blue
psrlw mm0,5; //b13: mm0=000b 000b 000b 000b blue
pmullw mm7,mm1; //mm7=sr?*a?
pand mm4,MASKG; //g16: mm4=00g0 00g0 00g0 00g0 green
pmullw mm5,mm2; //r7: mm5=dr?*(31-a?)
por mm0,mm4; //mm0=00gb 00gb 00gb 00gb
add eax,4; //move to next 4 alphas
add esi,8; //move to next 4 pixels in src
add edi,8; //move to next 4 pixels in dst
movd mm1,[eax]; //mm1=00 00 00 00 a3 a2 a1 a0
paddw mm5,mm7; //r8: mm5=sr?*a?+dr?*(31-a?)
paddw mm5,SIXTEEN; //r9: mm5=(mm5+16) red
pxor mm2,mm2; //mm2=0;
movq mm7,mm5; //r10: mm7=mm5 red
psrlw mm5,5; //r11: mm5=mm5>>5 red
movq mm4,[esi]; //g1: mm4=src3 src2 src1 src0
paddw mm5,mm7; //r12: mm5=mm7+mm5 red
punpcklbw mm1,mm2; //mm1=00a3 00a2 00a1 00a0
psrlw mm5,5; //r13: mm5=mm5>>5 red
psllw mm5,11; //r14: mm5=mm5<<11 red
por mm0,mm5; //mm0=0rgb 0rgb 0rgb 0rgb
sub ebx,4; //polished off 4 pixels
movq [edi-8],mm0; //dst=0rgb 0rgb 0rgb 0rgb
jmp loopqword; //go back to start
copyback:
movq [edi],mm4; //copy source to destination
leavefront:
add edi,8; //advance destination by 4 pixels
add eax,4; //advance alpha by 4
add esi,8; //advance source by 4 pixels
sub ebx,4; //decrease pixel count by 4
jmp primeloop;
checkback:
test ebx,0xFF; //check if 0 pixels left
jz nextline; //done with this span
//backalign: //work out back end pixels
movq mm5,[edi]; //g2: mm5=dst3 dst2 dst1 dst0
psrlw mm1,2; //mm1=a?>>2 nuke out lower 2 bits
movq mm7,MASKSHIFTG; //g3: mm7=shift 1 bit green mask
psrlw mm4,1; //g3a: move src green down by 1 so that we won't overflow
movq mm0,mm1; //mm0=00a3 00a2 00a1 00a0
psrlw mm5,1; //g3b: move dst green down by 1 so that we won't overflow
psrlw mm1,1; //mm1=a?>>1 nuke out lower 1 bits
pand mm4,mm7; //g5: mm4=sg3 sg2 sg1 sg0
movq mm2,SIXONES;//g4: mm2=63
pand mm5,mm7; //g7: mm5=dg3 dg2 dg1 dg0
movq mm3,[esi]; //b1: mm3=src3 src2 src1 src0
psubsb mm2,mm0; //g6: mm2=63-a3 63-a2 63-a1 63-a0
movq mm7,MASKB; //b2: mm7=BLUE MASK
pmullw mm4,mm0; //g8: mm4=sg?*a?
movq mm0,[edi]; //b3: mm0=dst3 dst2 dst1 dst0
pmullw mm5,mm2; //g9: mm5=dg?*(1-a?)
movq mm2,mm7; //b4: mm2=fiveones
pand mm3,mm7; //b4: mm3=sb3 sb2 sb1 sb0
pmullw mm3,mm1; //b6: mm3=sb?*a?
pand mm0,mm7; //b5: mm0=db3 db2 db1 db0
movq mm7,[esi]; //r1: mm7=src3 src2 src1 src0
paddw mm4,mm5; //g10: mm4=sg?*a?+dg?*(1-a?)
pand mm7,MASKR; //r2: mm7=sr3 sr2 sr1 sr0
psubsb mm2,mm1; //b5a: mm2=31-a3 31-a2 31-a1 31-a0
paddw mm4,FIVETWELVE; //g11: mm4=(i+512) green
pmullw mm0,mm2; //b7: mm0=db?*(1-a?)
movq mm5,mm4; //g12: mm5=(i+512) green
psrlw mm7,11; //r4: shift src red down to position 0
psrlw mm4,6; //g13: mm4=(i+512)>>6
paddw mm4,mm5; //g14: mm4=(i+512)+((i+512)>>6) green
paddw mm0,mm3; //b8: mm0=sb?*a?+db?*(1-a?)
movq mm5,[edi]; //r3: mm5=dst3 dst2 dst1 dst0
paddw mm0,SIXTEEN; //b9: mm0=(i+16) blue
pand mm5,MASKR; //r5: mm5=dr3 dr2 dr1 dr0
psrlw mm4,5; //g15: mm4=0?g0 0?g0 0?g0 0?g0 green
movq mm3,mm0; //b10: mm3=(i+16) blue
psrlw mm0,5; //b11: mm0=(i+16)>>5 blue
psrlw mm5,11; //r6: shift dst red down to position 0
paddw mm0,mm3; //b12: mm0=(i+16)+(i+16)>>5 blue
psrlw mm0,5; //b13: mm0=000b 000b 000b 000b blue
pmullw mm7,mm1; //mm7=sr?*a?
pand mm4,MASKG; //g16: mm4=00g0 00g0 00g0 00g0 green
pmullw mm5,mm2; //r7: mm5=dr?*(31-a?)
por mm0,mm4; //mm0=00gb 00gb 00gb 00gb
add eax,4; //move to next 4 alphas
//stall
paddw mm5,mm7; //r8: mm5=sr?*a?+dr?*(31-a?)
paddw mm5,SIXTEEN; //r9: mm5=(i+16) red
movq mm7,mm5; //r10: mm7=(i+16) red
psrlw mm5,5; //r11: mm5=(i+16)>>5 red
paddw mm5,mm7; //r12: mm5=(i+16)+((i+16)>>5) red
psrlw mm5,5; //r13: mm5=(i+16)+((i+16)>>5)>>5 red
psllw mm5,11; //r14: mm5=mm5<<11 red
por mm0,mm5; //mm0=0rgb 0rgb 0rgb 0rgb
test ebx,2; //check if there are 2 pixels
jz oneendpixel; //goto one pixel if that's it
movd [edi],mm0; //dst=0000 0000 0rgb 0rgb
psrlq mm0,32; //mm0>>32
add edi,4; //edi=edi+4
sub ebx,2; //saved 2 pixels
jz nextline; //all done goto next line
oneendpixel: //work on last pixel
movd edx,mm0; //edx=0rgb
mov [edi],dx; //dst=0rgb
nextline: //goto next line
dec ecx; //nuke one line
jz done; //all done
mov eax,lpLinearAlpBp; //alpha
mov esi,lpLinearSrcBp; //src
mov edi,lpLinearDstBp; //dst
add eax,iAlpPitch; //inc alpha ptr by 1 line
add esi,iSrcPitch; //inc src ptr by 1 line
add edi,iDstPitch; //inc dst ptr by 1 line
mov lpLinearAlpBp,eax; //save new alpha base ptr
mov ebx,iDstW; //ebx=span width to copy
mov lpLinearSrcBp,esi; //save new src base ptr
mov lpLinearDstBp,edi; //save new dst base ptr
jmp primeloop; //start the next span
done:
emms
}
}