SSE vs Altivec

Chris Clepper (one of the developers of GEM, a highly optimized Graphics Environment for Multimedia, a library for PD and Max/MSP) gives a clear explanation of why the loss of AltiVec on the new Intel chips worries developers of highly optimized applications:

“Altivec has the three and four operand PPC instructions which make possible fused instructions and the permute unit. Floating point multiplication is only available as the fused instruction in Altivec for example.”
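To make the fused-instruction point concrete, here is a minimal sketch in portable C rather than real AltiVec or SSE intrinsics. The `vec4f` type and the function names `madd4`, `mul4` and `mul_then_add4` are made up for illustration, and the scalar loops only model what each style of instruction does per lane; real fused hardware also rounds only once, which this model does not capture.

```c
/* Hypothetical 4-lane float vector standing in for a 128-bit SIMD register. */
typedef struct { float lane[4]; } vec4f;

/* AltiVec style: a four-operand multiply-add, d = a * b + c in every lane.
   The three inputs are left intact and the result goes to a fourth value. */
static vec4f madd4(vec4f a, vec4f b, vec4f c) {
    vec4f d;
    for (int i = 0; i < 4; ++i)
        d.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return d;
}

/* In this model a plain multiply is just a multiply-add with a zero addend,
   mirroring the quote's point that multiplication only exists in fused form. */
static vec4f mul4(vec4f a, vec4f b) {
    vec4f zero = {{0.0f, 0.0f, 0.0f, 0.0f}};
    return madd4(a, b, zero);
}

/* SSE style: two separate two-operand steps, each overwriting its
   destination, so the original value of a is lost along the way. */
static vec4f mul_then_add4(vec4f a, vec4f b, vec4f c) {
    for (int i = 0; i < 4; ++i)
        a.lane[i] *= b.lane[i];   /* first instruction: multiply */
    for (int i = 0; i < 4; ++i)
        a.lane[i] += c.lane[i];   /* second instruction: add */
    return a;
}
```

Both paths give the same numbers here; the difference the quote is pointing at is instruction count and register pressure, not the result.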

“On the first G4 and the G5 the permute unit is separate from the ALU. This means that a permute can be done at the same time as any other arithmetic instruction. This makes shuffling around data into more efficient structures on the fly fairly easy. The G4+ extends this to include the execution of any two of the simple integer, complex integer, floating point or permute units in a single cycle. This can help out integer codes quite a bit in long unrolled loops.”
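For readers who have not used a permute unit, the sketch below models the operation in plain C. It follows the behaviour of AltiVec's byte permute as I understand it (each control byte selects one byte out of the 32 bytes formed by two source vectors), but the `vec16b` type and the names `permute16` and `rgba_to_planar` are invented for this example and are not real intrinsics.

```c
#include <stdint.h>

/* Hypothetical 16-byte vector standing in for a 128-bit SIMD register. */
typedef struct { uint8_t byte[16]; } vec16b;

/* Model of a byte permute: each control byte picks one byte from the
   32-byte concatenation of a and b. Only the low 5 bits are used. */
static vec16b permute16(vec16b a, vec16b b, vec16b control) {
    vec16b out;
    for (int i = 0; i < 16; ++i) {
        unsigned idx = control.byte[i] & 0x1Fu;
        out.byte[i] = (idx < 16u) ? a.byte[idx] : b.byte[idx - 16u];
    }
    return out;
}

/* Example of "shuffling data into more efficient structures": four
   interleaved RGBA pixels become planar RRRR GGGG BBBB AAAA order. */
static vec16b rgba_to_planar(vec16b pixels) {
    static const vec16b control = {{
        0, 4, 8, 12,    /* the four R bytes */
        1, 5, 9, 13,    /* the four G bytes */
        2, 6, 10, 14,   /* the four B bytes */
        3, 7, 11, 15    /* the four A bytes */
    }};
    return permute16(pixels, pixels, control);
}
```

On hardware where the permute unit runs alongside the arithmetic units, a rearrangement like this can overlap with other work, which is the advantage described in the quote.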

“Altivec has instructions (apart from fused and permute) not found in SSE. One of those which I use in doing image and video compositing requires multiple SSE operations to achieve and it goes from one cycle to issue to 6 and the total latency from two cycles to no less than 12.”

“Those of us working on Altivec and SSE code have had some tips and tricks handed out by the Apple performance team. The methods for optimization between the G5 and new Intel chips are very different and reflect the design of the two ISAs. The Intel chips like to see as few instructions per loop pass as possible due to the lack of registers, and the reorder buffer is leaned on quite heavily to make the code efficient. Unrolling loops should only be done serially to allow the reorder function to do its thing, and may not be a win at all. On PPC the idea behind optimization is that efficiency goes up the more work done per loop pass. Making use of all of the registers, scheduling around stalls and reducing the relative time spent loading and storing makes for faster code. PPC is really geared towards very complex or at least lengthy amounts of computation per iteration and it does not perform optimally on short blitter loops or functions where setup and load/store overhead is more than the arithmetic.”
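The two loop styles in that last quote can be sketched in scalar C. This is only an illustration of the shapes involved, with hypothetical function names (`sat_add`, `add_tight`, `add_unrolled4`), using a saturating byte add as a stand-in for a compositing operation. Which version is actually faster depends on the compiler and the chip, and an optimizing compiler may well rewrite either one.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturating 8-bit add: results above 255 clamp to 255. */
static uint8_t sat_add(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Few instructions per pass: the shape the quote says the Intel chips
   prefer, leaving overlap between iterations to the reorder buffer. */
static void add_tight(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = sat_add(dst[i], src[i]);
}

/* More work per pass: the shape the quote recommends for PPC. Loads are
   grouped first, then arithmetic, then stores, using many registers. */
static void add_unrolled4(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint8_t d0 = dst[i], d1 = dst[i + 1], d2 = dst[i + 2], d3 = dst[i + 3];
        uint8_t s0 = src[i], s1 = src[i + 1], s2 = src[i + 2], s3 = src[i + 3];
        d0 = sat_add(d0, s0);
        d1 = sat_add(d1, s1);
        d2 = sat_add(d2, s2);
        d3 = sat_add(d3, s3);
        dst[i] = d0;
        dst[i + 1] = d1;
        dst[i + 2] = d2;
        dst[i + 3] = d3;
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i)
        dst[i] = sat_add(dst[i], src[i]);
}
```

Both functions produce identical output; the only difference is how much independent work is exposed in each pass through the loop.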

Link: Ars Technica Post

This obviously isn't the last word on the subject, but it gives an idea of the architectural issues developers have to think about when recoding their apps for SSE.
