MMX ops would potentially be useful in several places, but in particular I
expect to use them in the amplitude-modulation and mixing stage.
The problem with MMX is apparently in changing *back* to FP, not in changing
*to* MMX. The MMX registers are aliased onto the x87 FP registers, so the FP
tag word has to be reset (EMMS does this) before running FP code again, to
avoid errors.

The MMX thing would be a *billion* times more useful if:
 -There was an 8x8 multiply
 -There was an "add all elements together" operation


The mixing step for 4x16bit channels probably isn't worth doing (except perhaps
in that the modulation would be done there?)
For 8x8bit channels, it would be along the lines of:

 x 1 x 2 x 3 x 4   in MMX1
 x 5 x 6 x 7 x 8   in MMX2

 4x16 add into MMX1
  A   B   C   D
 1+5 2+6 3+7 4+8   in MMX1

 movq into MMX2
  A   B   C   D    in MMX2
 right-shift MMX2 by 32 bits
  x   x   A   B    in MMX2
 4x16 add MMX1 and MMX2 into MMX2
  x   x  A+C B+D   in MMX2
  Values should currently be about 2 bits higher than normal.
 Copy into memory
  x   x   E   F    in buffer
 [w1  w2  w3  w4]
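
To sanity-check the lane shuffling above, here is a rough portable C model of
the same sequence. It is only a sketch: a uint64_t stands in for an MMX
register, the helper names (paddw_model, pack4_model, mix8_partial) are made up
for this note, and it assumes unsigned 8-bit samples with zero padding bytes
(the x's) and the leftmost diagram lane in the most significant word.

```c
#include <stdint.h>

/* Lane-wise 16-bit add on a 64-bit word: a portable stand-in for PADDW.
 * (Hypothetical helper, not a real intrinsic.) */
uint64_t paddw_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        /* Each lane wraps on its own; no carry leaks into the next lane. */
        r |= (uint64_t)(uint16_t)(x + y) << (16 * lane);
    }
    return r;
}

/* Put four unsigned 8-bit samples into four 16-bit lanes, s[0] in the
 * highest lane, so the layout reads "x 1 x 2 x 3 x 4" with x == 0.
 * ASSUMPTION: unsigned samples; signed ones would need sign extension. */
uint64_t pack4_model(const uint8_t s[4])
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r = (r << 16) | s[i];
    return r;
}

/* Follows the diagram step by step and returns E = A+C and F = B+D. */
void mix8_partial(const uint8_t s[8], uint16_t *e, uint16_t *f)
{
    uint64_t mm1 = pack4_model(s);      /* x 1 x 2 x 3 x 4       */
    uint64_t mm2 = pack4_model(s + 4);  /* x 5 x 6 x 7 x 8       */

    mm1 = paddw_model(mm1, mm2);        /*  A   B   C   D        */
    mm2 = mm1;                          /* movq                  */
    mm2 >>= 32;                         /*  0   0   A   B        */
    mm2 = paddw_model(mm1, mm2);        /*  x   x  A+C B+D       */

    *e = (uint16_t)(mm2 >> 16);         /* w3 in the buffer      */
    *f = (uint16_t)mm2;                 /* w4 in the buffer      */
}
```

With samples 1..8 this gives E = 16 and F = 20, and E+F = 36 is the plain sum
of all eight samples, which is what the remaining scalar step has to produce
before scaling.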

Now do w3+w4. The result should be 3 bits too high, so divide by 8 (or
right-shift by 3). How do we deal with signedness there?
After dealing with signedness, we have the output value! :)
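
On the signedness question, a possible angle: a logical right shift fills with
zeros and so wrecks negative values, while an arithmetic shift keeps the sign
but rounds toward minus infinity, whereas division by 8 truncates toward zero.
The sketch below shows both roundings for the final scalar step. The function
names are made up for this note, and it assumes w3 and w4 are already signed
16-bit partial sums (so the padding bytes earlier would have to be sign
extension rather than zeros).

```c
#include <stdint.h>

/* Final step with plain division: truncates toward zero.
 * (Hypothetical helper name; assumes signed 16-bit partial sums.) */
int16_t scale_div8(int16_t w3, int16_t w4)
{
    int32_t sum = (int32_t)w3 + w4;   /* widen so the add cannot overflow */
    return (int16_t)(sum / 8);
}

/* Final step with floor rounding, i.e. what an arithmetic shift by 3 does.
 * Written with / and % because >> on a negative signed value is
 * implementation-defined in C. */
int16_t scale_floor8(int16_t w3, int16_t w4)
{
    int32_t sum = (int32_t)w3 + w4;
    int32_t q = sum / 8;
    if (sum % 8 < 0)                  /* truncation rounded up; step down */
        q -= 1;
    return (int16_t)q;
}
```

The two only differ for negative sums that are not a multiple of 8: for
w3 = -4, w4 = -5 the division gives -1 and the floor version gives -2. Which
one is right for the mixer probably depends on whether a small bias toward
negative output matters.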

 HOLD ON!!! ISTR there are pack/unpack ops (PUNPCK* and friends) that do
interleaving for bytes vs words, etc. Look into them. They might make the
A+C,B+D step easier to reach :)
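
For reference while looking into them, here is a small portable C model of
what the byte-interleaving unpacks do. This is a sketch with made-up function
names, not real intrinsics; byte 0 is the least significant byte of the 64-bit
value. PUNPCKLBW interleaves the low four bytes of destination and source, and
PUNPCKHBW does the same with the high four bytes. Unpacking against an
all-zero source is one way to widen eight packed 8-bit samples into two
registers of 4x16, which would at least produce the starting layout directly.

```c
#include <stdint.h>

/* Model of PUNPCKLBW: result bytes are d0 s0 d1 s1 d2 s2 d3 s3 (low to high),
 * taken from the low four bytes of dst and src.
 * (Hypothetical helper name, portable stand-in for the instruction.) */
uint64_t punpcklbw_model(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (dst >> (8 * i)) & 0xFF;
        uint64_t s = (src >> (8 * i)) & 0xFF;
        r |= d << (16 * i);        /* dst byte goes in the low half of lane i  */
        r |= s << (16 * i + 8);    /* src byte goes in the high half of lane i */
    }
    return r;
}

/* Model of PUNPCKHBW: same interleave, using the high four bytes. */
uint64_t punpckhbw_model(uint64_t dst, uint64_t src)
{
    return punpcklbw_model(dst >> 32, src >> 32);
}
```

For example, with samples 1..8 packed as 0x0807060504030201, unpacking against
zero gives 0x0004000300020001 for the low half and 0x0008000700060005 for the
high half, ready for a 4x16 add.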

http://www.tommesani.com/MMXPrimer.html
http://neumann.cem.msu.edu/docs/icc/c_ug/comm1011.htm
http://codeproject.com/cpp/mmxintro.asp
http://softpixel.com/~cwright/programming/simd/3dn.php
http://www.novalis.org/documents/mmx.html
