FFT, a stroll thru  WHOLY @#%$&@#$%!!!!!

So, on a tangent from another project, I decided to implement an FFT.
For years I'd known of FFTs and the simple magic they perform, but
easy as they supposedly are to implement, I'd NEVER DONE IT.
So, it's time!

Now, this would supposedly be easy,
a) download fft code
b) feed values into it
c) display the results

So, I set out to find code.
While looking for code it became clear that I needed to know more about what I was doing.
The results of an FFT depend on the sample rate of the data you put in, and
the size of the sample block. OK, fair enough. I sat down and wrote a program
that would allow me to define an input sample rate and block size, and would
tell me the meaning of the FFT results, or 'bins'.
Each 'bin' represents the amplitude of some frequency in the given signal.
From this I set some goal posts: I wanted to have 16 bins, such that 2 bins bordered 1 kHz.
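
The "meaning" of a bin boils down to one line of arithmetic: bin k sits at k * sample_rate / block_size. A minimal sketch of that in C follows. The function name bin_frequency and the 8000 Hz / 32 sample numbers in the note below are made up for illustration; they are not taken from the program described above.

```c
#include <assert.h>

/* Center frequency (in Hz) of FFT bin k, for a block of block_size samples
   taken at sample_rate Hz. For real-valued input only the lower half of the
   bins is distinct; the upper half mirrors it. */
double bin_frequency(unsigned k, double sample_rate, unsigned block_size)
{
    assert(block_size > 0 && k < block_size);
    return (double)k * sample_rate / (double)block_size;
}
```

With an assumed 8000 Hz sample rate and a 32 sample block, the bins are 250 Hz apart, and bin_frequency(4, 8000.0, 32) lands on 1000 Hz.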

So then I went back over the results of my search for code.
There are two main types of code for this, iterative and recursive.
Because I want to implement this on a tiny microcontroller with limited memory,
I chose to use an iterative version.

And thus started the problem...
Almost ALL the iterative code I could find looked exactly the same,
and it WAS NOT pretty. The code is full of what seem to be excessive variables
and loops.

-----------------------

// http://www.drdobbs.com/cpp/a-simple-and-efficient-fft-implementatio/199500857
void four1(double* data, unsigned long nn)
{
    unsigned long n, mmax, m, j, istep, i;
    double wtemp, wr, wpr, wpi, wi, theta;
    double tempr, tempi;

    // reverse-binary reindexing
    n = nn<<1;
    j=1;
    for (i=1; i<n; i+=2) {
        if (j>i) {
            swap(data[j-1], data[i-1]);
            swap(data[j], data[i]);
        }
        m = nn;
        while (m>=2 && j>m) {
            j -= m;
            m >>= 1;
        }
        j += m;
    };

    // here begins the Danielson-Lanczos section
    mmax=2;
    while (n>mmax) {
        istep = mmax<<1;
        theta = -(2*M_PI/mmax);
        wtemp = sin(0.5*theta);
        wpr = -2.0*wtemp*wtemp;
        wpi = sin(theta);
        wr = 1.0;
        wi = 0.0;
        for (m=1; m < mmax; m += 2) {
            for (i=m; i <= n; i += istep) {
                j=i+mmax;
                tempr = wr*data[j-1] - wi*data[j];
                tempi = wr * data[j] + wi*data[j-1];

                data[j-1] = data[i-1] - tempr;
                data[j] = data[i] - tempi;
                data[i-1] += tempr;
                data[i] += tempi;
            }
            wtemp=wr;
            wr += wr*wpr - wi*wpi;
            wi += wi*wpr + wtemp*wpi;
        }
        mmax=istep;
    }
}

-----------------------

The code has two parts: the first part reorganizes the data,
the second part does the FFT calculations.

The code that does the "reverse-binary reindexing" is horrid, in implementation
and in practicality for my application.

a) There are WAY too many loops and variables going on there for a task like this.
b) Here is a good joke: it's moving the imaginary numbers around. Why is this funny?
  'Cause when you build the input array, you make all the imaginary numbers zero.
  So "why would anyone move them around?" you might ask. Well, this version has
  been modified from the source to be specific to doing a FORWARD FFT. The original
  code in the Numerical Recipes book has a bi-directional FFT function, and there
  is a variable that determines which direction the function works in. But not
  this version, it's just FORWARD.
  
  And WTF, Numerical Recipes, the code was almost "mentioned in passing"!?!?!
  
OK, so, I isolated that first section, and just looked at the indexes it swapped.

-------------

  n = nn<<1;
  j=1;
  for (i=1; i<n; i+=2) {
      if (j>i) {
        printf("Swap %02lX   %02lX\n", i-1, j-1);   // real parts
        printf("Swap %02lX   %02lX\n", j, i);       // imaginary parts
      }

      m = nn;
      while ((m>=2) && (j>m)) {
          j -= m;
          m >>= 1;
      }        

      j += m;                
   };

-------------
Giving it a low-order number (nn = 8) to see what it's up to:

Swap 02   08
Swap 09   03
Swap 06   0C
Swap 0D   07


OK, so, this is supposed to be the reversal of values based on a mirror image
of their offset in binary... error.
You have to stand on your head for it to make sense, but you have to be inside out
while doing it.

This doesn't make sense, not even when you look at it in binary.
This should swap 100 <-> 001 etc...
Looking at it in binary:

Swap 0b00000011  0b00001001 
Swap 0b00000010  0b00001000 
Swap 0b00000111  0b00001101 
Swap 0b00000110  0b00001100 

Still doesn't make sense, these are not mirror images of anything, till you start to
squint. (nn = 8), so we're working with 8 value pairs here, which is 3 bits...
The LSB selects whether you're on the real or the imaginary value. With that accounted for (LSB dropped):

Swap 0b0000001  0b0000100 
Swap 0b0000001  0b0000100 
Swap 0b0000011  0b0000110 
Swap 0b0000011  0b0000110

If we further ditch the extra left-hand bits...

Swap 001  100 
Swap 001  100 
Swap 011  110 
Swap 011  110

And hey, there we have it! That's what it's supposed to do, but UGH THE CODE.

Now, don't get me wrong, our modern computers were NOT optimized to
mirror bits like this; on most of them there isn't one nice instruction to pull it off.
In the long run, I'm not going to do the swapping at all.
I have an ADC that I'm pulling results from; what I'll do is build a state
machine that generates the indexes the ADC values get loaded into.
The data will be in the right order for the FFT in the first place.
So I set upon my own code to generate the swap map.

-------------------------------


#include <stdio.h>
#include <stdint.h>

uint8_t uniReverse(uint8_t i, uint8_t bits) {

  uint8_t r = 0;
  
  do {
    r <<=1;    // The first extra shift doesn't matter, 'cause we're going in with a value of 0
    if (i & 1) r++;
    i >>= 1;
  } while(--bits);
  return r;  

}


int main(void) {

  int i;
  
  // nn = 8
  #define BITS 3
  
  for(i = 0; i < (1<<BITS)/2; i++) {  
    if (i != uniReverse(i, BITS)) {
     printf("swap( &real[%02d],  &real[%02d]);\n", i, uniReverse(i, BITS));     
    }  
  }  

  return 0;
  
}


-------------------------------

This wasn't as pretty as I wanted, but it seemed to generate the correct results.
It works by shifting the input value to the right and the output value to the
left, setting the 1s place in the result every time there is a 1 in the input value.

Yeah, there's an artifact. The result of this is:

swap( &real[01],  &real[04]);
swap( &real[03],  &real[06]);
swap( &real[04],  &real[01]);

Notice how it reversed the 1 and 4 pair twice! Need to remove those.
Not a concern to me right now, I wanted to make sure I understood the general problem.
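
For the record, one way to get rid of the duplicate is to emit a swap only when the index is smaller than its mirror image. This is a sketch under that assumption, not the code that was actually used here; reverse_bits and build_swap_list are names invented for the example, and the caller is assumed to provide room for up to (1 << bits) / 2 pairs.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror the lowest 'bits' bits of i (same idea as uniReverse above). */
uint8_t reverse_bits(uint8_t i, uint8_t bits)
{
    uint8_t r = 0;
    assert(bits >= 1 && bits <= 8);
    while (bits--) {
        r = (uint8_t)((r << 1) | (i & 1));  /* shift result left, pull in input LSB */
        i >>= 1;
    }
    return r;
}

/* Fill pairs[][2] with every swap exactly once and return how many there are.
   A pair is recorded only when i < mirror(i), so (4,1) never follows (1,4)
   and self-mirrored indexes such as 2 (010) are skipped. */
int build_swap_list(uint8_t bits, uint8_t pairs[][2])
{
    int count = 0;
    unsigned i;
    for (i = 0; i < (1u << bits); i++) {
        uint8_t r = reverse_bits((uint8_t)i, bits);
        if (i < r) {
            pairs[count][0] = (uint8_t)i;
            pairs[count][1] = r;
            count++;
        }
    }
    return count;
}
```

For bits = 3 this leaves just the two swaps 1 <-> 4 and 3 <-> 6.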


More digging....
Turns out, the code is from Numerical Recipes in C (this took some digging).
The book said it was converted to C from the original N. M. Brenner source.
Now I needed to know what that original source looked like.
After a fair bit of digging, I found out the original source was
in a paper published in 1976. WOW.
The two questions that came to mind first are:
a) WTF were they running it on?
b) Where were they getting data to feed it?


Written in FORTRAN IV, published in 1976, but WRITTEN in 1967.
8-|
This just gets better. As it turns out, FORTRAN had only been around since 1957,
which brings up a good question. WTF were they running it on!?!?
(Slide rules can't be programmed in FORTRAN.)

So, I found the source code. It was a HORRID scan of a bad copy that had been PDF'd
and stored in an unroutable directory on an unlisted archive server, that you could only
type out if you understood what it was supposed to be doing!

(I'm the guy with the lyrics for "Louie, Louie" by the way (Don't mess with my curiosity!))

Here you are:

-----------------------------------------------------------------------------


      SUBROUTINE FOUR1(DATA,NN,ISIGN)
C     THE COOLEY-TUKEY FAST FOURIER TRANSFORM IN USASI BASIC FORTRAN
C     TRANSFORM(J) = SUM(DATA(I)*W**((I-1)*(J-1))), WHERE I AND J RUN
C     FROM 1 TO NN AND W = EXP(ISIGN*2*PI*SQRT(-1)/NN). DATA IS A ONE-
C     DIMENSIONAL COMPLEX ARRAY (I.E.: THE REAL AND IMAGINARY PARTS OF
C     THE DATA ARE LOCATED IMMEDIATELY ADJACENT IN STORAGE, SUCH AS
C     FORTRAN IV PLACES THEM) WHOSE LENGTH NN IS A POWER OF TWO. ISIGN
C     IS +1 OR -1, GIVING THE SIGN OF THE TRANSFORM, TRANSFORM VALUES
C     ARE RETURNED IN ARRAY DATA, REPLACING THE INPUT DATA. THE TIME IS
C     PROPORTIONAL TO N*LOG2(N), RATHER THAN THE USUAL N**2. WRITTEN BY
C     NORMAN BRENNER, JUNE 1967, THIS IS THE SHORTEST VERSION
C     OF FFT KNOWN TO THE AUTHOR, AND IS INTENDED MAINLY FOR
C     DEMONSTRATION. PROGRAMS FOUR2 AND FOURT ARE AVAILABLE THAT RUN
C     TWICE AS FAST AND OPERATE ON MULTIDIMENSIONAL ARRAYS WHOSE
C     DIMENSIONS ARE NOT RESTRICTED TO POWERS OF TWO. (LOOKING UP SINES
C     AND COSINES IN A TABLE WILL CUT RUNNING TIME OF FOUR1 BY A THIRD.)
C     SEE-- IEEE AUDIO TRANSACTIONS (JUNE 1967), SPECIAL ISSUE ON FFT.

      DIMENSION DATA(1)
      N=2*NN
      J=1
      DO 5 I=1,N,2
      IF(I-J)1,2,2
1     TEMPR=DATA(J)
      TEMPI=DATA(J+1)
      DATA(J)=DATA(I)
      DATA(J+1)=DATA(I+1)
      DATA(I)=TEMPR
      DATA(I+1)=TEMPI
2     M=N/2
3     IF(J-M)5,5,4
4     J=J-M
      M=M/2
      IF(M-2)5,3,3
5     J=J+M     

      MMAX=2
6     IF(MMAX-N)7,9,9
7     ISTEP=2*MMAX
      DO 8 M=1,MMAX,2
      THETA=3.1415926535*FLOAT(ISIGN*(M-1))/FLOAT(MMAX)
      WR=COS(THETA)
      WI=SIN(THETA)
      DO 8 I=M,N,ISTEP
      J=I+MMAX
      TEMPR=WR*DATA(J)-WI*DATA(J+1)
      TEMPI=WR*DATA(J+1)+WI*DATA(J)
      DATA(J)=DATA(I)-TEMPR
      DATA(J+1)=DATA(I+1)-TEMPI
      DATA(I)=DATA(I)+TEMPR
8     DATA(I+1)=DATA(I+1)+TEMPI
      MMAX=ISTEP
      GO TO 6
9     RETURN
      END


-----------------------------------------------------------------------------

IEEE is a huge paywall, and they CUT THAT ISSUE UP into tiny little untraceable
sub-documents. Moving on...

Lots of things stand out when you look at these two.

Like in the Numerical Recipes book, this function is bi-directional. HOWEVER,
in the paper they point out that when ISIGN = -1 it does an FFT,
and when ISIGN is +1 it does an IFFT. This is backwards in the Numerical Recipes book.
OOPS.

Some differences also stood out to me. Look at the indexes to DATA between the two codes:

--------------------------------------------------------

  DATA(J)=DATA(I)-TEMPR
  DATA(J+1)=DATA(I+1)-TEMPI
  DATA(I)=DATA(I)+TEMPR
  DATA(I+1)=DATA(I+1)+TEMPI

  -- vs --
    
  data[j-1] = data[i-1] - tempr;
  data[j] = data[i] - tempi;
  data[i-1] += tempr;
  data[i] += tempi; 
  
-------------------------------------------------------

Other things you learn: computer monitors were 40 and 80 columns, and as it turns out
the punchcards that FORTRAN used held 80 characters. I've known lots of people
to insist that a line of code should never be more than 80 characters. They need to give
up the FORTRAN days, we're not going back to punchcards!

Anyhow:

online code:
   data[j-1] = data[i-1] - tempr;
vs
Numerical Recipes:
   data[j] = data[i] - tempr;
vs
1967 FORTRAN: 
  DATA(J)=DATA(I)-TEMPR

Is the C code wrong?
Turns out no, it's a hack. In FORTRAN, arrays start at index 1; in C, we use 0.
The code in Numerical Recipes kept the 1-based indexing, and the online version
subtracts 1 from every index to compensate.
WTF?


There are 3 things going on in the main FFT code:
  - build the data indexes
  - build a "weighting" value
  - apply the value to the data at those indexes.
  
What REALLY bothered me is the fact that this should NOT be a 3 layer loop,
it REALLY seems like a 2 layer operation.

Also bothering me is the fact that the values in this are alternated (real, imaginary, real, ...). I have not
evaluated the math at this point, but it seems to me that a modern system would not have a problem
with this being two arrays.....
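
To make the "two arrays" idea concrete, here is a small sketch of going from the alternated layout to separate real and imaginary arrays. The function name deinterleave is invented for this example; the layout it assumes is the 0-based one from the online code, with element k stored at data[2*k] (real) and data[2*k + 1] (imaginary).

```c
#include <assert.h>
#include <stddef.h>

/* Split data[] = { re0, im0, re1, im1, ... } (nn complex values)
   into two separate arrays of nn values each. */
void deinterleave(const double *data, size_t nn, double *real, double *imag)
{
    size_t k;
    assert(data != NULL && real != NULL && imag != NULL);
    for (k = 0; k < nn; k++) {
        real[k] = data[2 * k];       /* even slots hold the real parts      */
        imag[k] = data[2 * k + 1];   /* odd slots hold the imaginary parts  */
    }
}
```

With two arrays, the index of a complex pair is just k, with no shifting or LSB games.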

So, the first task is to pare down the function to its indexing code.

----------------------------------------------------


 nn = 8;

 n = nn << 1;
 mmax = 2;
 while(mmax < n) {
   istep = mmax << 1;
   for( m = 1; m <= mmax; m+=2) {
     printf("         m= %02d, mmax= %02d \n", m,  mmax);
     for (i = m; i <= n; i+=istep) {
       j = i + mmax;
       printf("i= %02d, j= %02d \n", (i-1)/2, (j-1)/2);               
     }     
   }
     mmax = istep;  
 }

-----------------------------------------------------

The other thing I did was to reduce the indexes:
I only show the index of the complex pair.
To know what this should do, you want to look at the
butterfly diagrams. For an 8 element array it should be (1 indexed):

First, pair up offsets of 1:
   1-2 3-4 5-6 7-8

then offsets of 2:
   1-3 2-4 (3 is already done) 5-7 6-8

then offsets of 4:
   1-5 2-6 3-7 4-8


Note how there are the same number of operations for each pass.

And the output is (0 indexed):

         m= 01, mmax= 02, 
i= 00, j= 01 
i= 02, j= 03 
i= 04, j= 05 
i= 06, j= 07 
         m= 01, mmax= 04, 
i= 00, j= 02 
i= 04, j= 06 
         m= 03, mmax= 04, 
i= 01, j= 03 
i= 05, j= 07 
         m= 01, mmax= 08, 
i= 00, j= 04 
         m= 03, mmax= 08, 
i= 01, j= 05 
         m= 05, mmax= 08, 
i= 02, j= 06 
         m= 07, mmax= 08, 
i= 03, j= 07 

At this point, the focus is on i and j.
So I pondered this for a while; it really seemed it should NOT be 3 layers.

And so the solution hit me: binary rollover. How ironic, the theme of the whole thing.
... though, as it turns out, the rollover isn't quite binary.

---------------------------------------------------------

  for(p = 1; p < nn; p <<= 1 ) {                   
    for(i = 0, j = 0; j != (nn-1); i += (p << 1) ) {            // this will always iterate nn/2 times
      if (i >= nn) i -= (nn-1);                                 // don't use i %= (nn-1);  it's MUCH slower
      j = i + p;      
  
      printf("p= %02d, i= %02d, j= %02d\n", p,  i, j);
  
    }
  }

----------------------------------------------------------

there!, only 2 layers!

output is:

p= 01, i= 00, j= 01
p= 01, i= 02, j= 03
p= 01, i= 04, j= 05
p= 01, i= 06, j= 07
p= 02, i= 00, j= 02
p= 02, i= 04, j= 06
p= 02, i= 01, j= 03
p= 02, i= 05, j= 07
p= 04, i= 00, j= 04
p= 04, i= 01, j= 05
p= 04, i= 02, j= 06
p= 04, i= 03, j= 07

perfect! 
and notice how p = mmax/2.... 
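
Eyeballing two printouts only goes so far, so here is a sketch of a check that the 2 layer loop really visits the same pairs, in the same order, as the 3 layer original. The function names and the MAX_PAIRS limit are inventions for this example; it only covers power-of-two sizes from 2 to 64.

```c
#include <assert.h>

#define MAX_PAIRS 256   /* enough for nn up to 64: (nn/2) * log2(nn) = 192 */

/* Indexing of the original four1() loops, reduced to 0-based pair indexes. */
int pairs_three_layer(unsigned nn, unsigned out[][2])
{
    unsigned n = nn << 1, mmax, istep, m, i, j;
    int count = 0;
    for (mmax = 2; mmax < n; mmax = istep) {
        istep = mmax << 1;
        for (m = 1; m < mmax; m += 2) {
            for (i = m; i <= n; i += istep) {
                j = i + mmax;
                out[count][0] = (i - 1) / 2;
                out[count][1] = (j - 1) / 2;
                count++;
            }
        }
    }
    return count;
}

/* The 2 layer version: i wraps back into the array with an offset of 1. */
int pairs_two_layer(unsigned nn, unsigned out[][2])
{
    unsigned p, i, j;
    int count = 0;
    for (p = 1; p < nn; p <<= 1) {
        for (i = 0, j = 0; j != (nn - 1); i += (p << 1)) {
            if (i >= nn) i -= (nn - 1);
            j = i + p;
            out[count][0] = i;
            out[count][1] = j;
            count++;
        }
    }
    return count;
}

/* Returns 1 when both versions produce the identical list of pairs. */
int layers_agree(unsigned nn)
{
    static unsigned a[MAX_PAIRS][2], b[MAX_PAIRS][2];
    int k, ca, cb;
    assert(nn >= 2 && nn <= 64 && (nn & (nn - 1)) == 0);
    ca = pairs_three_layer(nn, a);
    cb = pairs_two_layer(nn, b);
    if (ca != cb) return 0;
    for (k = 0; k < ca; k++)
        if (a[k][0] != b[k][0] || a[k][1] != b[k][1]) return 0;
    return 1;
}
```

Calling layers_agree(8) compares exactly the two listings shown above; larger sizes exercise the wrap-around more heavily.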

The next step in rebuilding this code is to generate the weighting values.
This is a point where you notice a strange difference between the original FORTRAN and the C:


      while (n>mmax) {
        istep = mmax<<1;
        theta = -(2*M_PI/mmax);
        wtemp = sin(0.5*theta);
        wpr = -2.0*wtemp*wtemp;
        wpi = sin(theta);
        wr = 1.0;
        wi = 0.0;
        for (m=1; m < mmax; m += 2) {
            for (i=m; i <= n; i += istep) {
            .
            .
            .
            }
            wtemp=wr;
            wr += wr*wpr - wi*wpi;
            wi += wi*wpr + wtemp*wpi;
        }
        
        
vs 

      THETA=3.1415926535*FLOAT(ISIGN*(M-1))/FLOAT(MMAX)
      WR=COS(THETA)
      WI=SIN(THETA)

Are they the same???

So I put both to the test, and yes, the values in the inner loop for wr and wi came out the same.
So WTF?
Then I looked at it more. SOMEONE between the 6502 and the 80386 decided to optimize this code for
processors with no floating point unit, where a trig call is painfully slow: each wr/wi is computed from the
previous one. Apparently the 4 multiplies, 4 additions, and a copy were cheaper than a cos() and a sin() call (!?!?!)
OK, but we have floating point processors now!
You might comment that I was going to run this on a tiny microcontroller that doesn't have a floating point
processor, but when you look at the values that make WR and WI in the FORTRAN, you realize that, depending
on the array size, this code only uses a handful of values for theta. You could easily use a small lookup
table... a really small one.

So, I chose to dial this back to the original FORTRAN method with sin() and cos().

Now, there is another cleanup I want to make: I don't care about IFFT, just forward,
so my ISIGN is always -1.

  THETA = 3.1415926535*FLOAT(ISIGN*(M-1))/FLOAT(MMAX)
to
  theta = M_PI *(-1)*(m-1)/mmax;
to 
  theta = M_PI*(1-m)/mmax;
nice.

Now, one of the things my loop code is missing is the generation of mmax and m.
If you look at the values generated, mmax starts at 2 and doubles every pass (it's just twice my p), so easy.
m-1 (I'll call it n) starts at zero and goes up by two.


-----------------------------------------------------


  for(d = 1; d < nn; d <<= 1 ) {         
    n = -2;                                           // fraction initialization, this will immediately have 2 added to it by the if
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {            // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {                             // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n+=2;                         // don't use i %= (nn-1);  it's MUCH slower, otherwise, loop stuff
       
        theta = (M_PI*(-(float)n))  / (float)(d<<1);   // n stands in for (m-1), so (1-m) becomes -n
                               
      }          
      j = i + d;   // j is just an offset of i          
      
    }    
       
  }


-----------------------------------------------------
I added a check on j to the inner loop's 'if': because the 'if' is going to do all our theta calcs, we need to make sure it trips
on the first iteration.

But waaaaaaait a min...
(No, ignore me changing variable names, I'm trying to work out a good name for what they are doing.)
In the theta calc:

mmax (otherwise d<<1) is always even (2, 4, 8...)
and n isn't part of a loop, but it's also always even....

When you look at them they are always generating reducible fractions...

SO, how about we don't multiply d by two EVERY loop, and we alter n to suit?


-----------------------------------------------------


  for(d = 1; d < nn; d <<= 1 ) {         
    n = -1;                                           // fraction initialization, this will immediately have 1 added to it by the if
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {            // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {                             // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n+=1;                         // don't use i %= (nn-1);  it's MUCH slower, otherwise, loop stuff
       
        theta = (M_PI*(-(float)n))  / (float)(d);   // n = 0, 1, 2, ... so this is -n/d of pi
                               
      }          
      j = i + d;   // j is just an offset of i                
    }           
  }


-----------------------------------------------------


That helps, we just dropped a multiply (shift) out of the inner loop.

Buuuut waiiiit a sec....

-(float)n

n is just along for the ride, it can be whatever we want...

if we start from 0, and work down....

-----------------------------------------------------


  for(d = 1; d < nn; d <<= 1 ) {         
    n = 1;                                           // fraction initialization, this will immediately have 1 subtracted from it by the if
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {            // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {                             // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n--;                         // don't use i %= (nn-1);  it's MUCH slower, otherwise, loop stuff
       
        theta = (M_PI*(float)n)  / (float)(d);  
                               
      }          
      j = i + d;   // j is just an offset of i                
    }           
  }


-----------------------------------------------------

And hey, look, we also just shaved a subtract out of the inner loop!
At this point I checked the values being generated for theta to make sure they came out with the right values
at the right i / j indexes.

And yup, all good.

So far:


-----------------------------------------------------

 
  for(d = 1; d < nn; d <<= 1 ) {         
    n = 1;                                           // fraction initialization, this will immediately have 1 subtracted from it by the if
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {            // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {                             // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n--;                         // don't use i %= (nn-1);  it's MUCH slower, otherwise, loop stuff
       
        theta = (M_PI*(float)n)  / (float)d;  // work out n/d of pi for our angle
        wr = cos(theta);  // weighting values, cos doesn't care that we're rotating backwards
        wi = sin(theta);  // we're rotating backwards 'cause of this one.
        
         printf("      theta=Pi*%02d/%02d, theta= %f wr= %f, wi= %f \n",n, d, theta,  wr, wi);        
      }          
      j = i + d;   // j is just an offset of i     
       printf(" i= %02d, j= %02d\n",  i, j);
        
    }    
       
  }


-----------------------------------------------------


At this point, the only thing left to do is to put back the actual math;
there really isn't anything I can do to improve it.
What IS apparent is that the single array with alternated values is not
an optimization of any kind, it's there 'cause it's carried forward from FORTRAN.


-----------------------------------------------------


  for(d = 1; d < nn; d <<= 1 ) {         
    n = 1;                                           // fraction initialization, this will immediately have 1 subtracted from it by the if
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {            // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {                             // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n--;                         // don't use i %= (nn-1);  it's MUCH slower, otherwise, loop stuff
       
        theta = (M_PI*(float)n)  / (float)d;  // work out n/d of pi for our angle
        wr = cos(theta);  // weighting values, cos doesn't care that we're rotating backwards
        wi = sin(theta);  // we're rotating backwards 'cause of this one.
        
      }          
      j = i + d;   // j is just an offset of i     
  
      tempr     =  wr * real[j] - wi * imag[j];    // de facto standard complex multiply
      tempi     =  wr * imag[j] + wi * real[j];   
      real[j]   =  real[i] - tempr; // update real, crossed path calculations
      imag[j]   =  imag[i] - tempi; // update imaginary 
      real[i]   += tempr;            // update real, forward path calculations
      imag[i]   += tempi;            // update imaginary
      
    }    
       
  }


-----------------------------------------------------


So then, load this into a test suite and check against some known-good values.


-----------------------------------------------------


#include <stdio.h>
#include <math.h>


void swap( float * this, float * that) {
  float t;
  
  t = *this;
  *this = *that;
  *that = t;
}


int main(void) {

    float         real[255], imag[255];  
    int           nn;
    unsigned int  d, i, j;
    int           n;
    float         theta, wr, wi;
    float         tempr, tempi;
 
 
   nn = 8;
   
   for (i = 0; i < nn; i++){
     real[i] = 0;
     imag[i] = 0;
     }
      
   real[0]=1;
   real[1]=1;
   real[2]=1;
   real[3]=1;   
   
    for ( i = 0; i < nn; printf("%6.2f \n", real[i]), i++);   
   
   // imaginary is all zero, just swap around the reals!
   swap(&real[1], &real[4]);  // bit-reversal swaps for nn = 8
   swap(&real[3], &real[6]);    

  for(d = 1; d < nn; d <<= 1 ) {         
    n = 1;                                           // fraction initialization, this will immediately have 1 subtracted from it by the if
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {            // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {                             // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n--;                         // don't use i %= (nn-1);  it's MUCH slower, otherwise, loop stuff
       
        theta = (M_PI*(float)n)  / (float)d;  // work out n/d of pi for our angle
        wr = cos(theta);  // weighting values, cos doesn't care that we're rotating backwards
        wi = sin(theta);  // we're rotating backwards 'cause of this one.
        
      //  printf("      theta=Pi*%02d/%02d, theta= %f wr= %f, wi= %f \n",n, d, theta,  wr, wi);        
      }          
      j = i + d;   // j is just an offset of i     
     // printf(" i= %02d, j= %02d\n",  i, j);
  
      tempr     =  wr * real[j] - wi * imag[j];    // de facto standard complex multiply
      tempi     =  wr * imag[j] + wi * real[j];   
      real[j]   =  real[i] - tempr; // update real, crossed path calculations
      imag[j]   =  imag[i] - tempi; // update imaginary 
      real[i]   += tempr;            // update real, forward path calculations
      imag[i]   += tempi;            // update imaginary
      
    }           
  }

   printf("\n\n");
   
   for ( i = 0; i < nn; printf("%6.2f %6.2f\n", real[i], imag[i]), i++);  

return 0;

}


-----------------------------------------------------


results are:

  1.00 
  1.00 
  1.00 
  1.00 
  0.00 
  0.00 
  0.00 
  0.00 


  4.00   0.00
  1.00  -2.41
  0.00   0.00
  1.00  -0.41
  0.00   0.00
  1.00   0.41
  0.00   0.00
  1.00   2.41

  
  according to
  https://rosettacode.org/wiki/Fast_Fourier_transform


 0  4.000000  0.000000
 1  1.000000 -2.414214
 2  0.000000  0.000000
 3  1.000000 -0.414214
 4  0.000000  0.000000
 5  1.000000  0.414214
 6  0.000000  0.000000
 7  1.000000  2.414214


yay! it still works!


-----------------------------------------------------
Summaries so far:
  - Don't rearrange the zeros in the input array, 0 = 0, we can use either one.
  - Don't use 3 loop layers when 2 will do.
  - Don't try to optimize for integer processors; if you need to, do it later.
  - Don't multiplex the array, we're not limited to a single dataset anymore.
  - Optimize your equations, remove redundant calculations.
  - Modern day punchcards can handle more than 80 characters, write long lines if you want.


So the next challenge for me is to shoehorn it into an 8 bit integer system.

But first, a revisit to reverse-binary reindexing.


The original code for that reindexing is still bothering me:


-----------------------------------------------------

    nn = 32;

    // reverse-binary reindexing
    n = nn<<1;
    j=1;
    for (i=1; i<n; i+=2) {
        if (j>i) {
           
           printf("Swap %02d   %02d\n", (i-1)/2, (j-1)/2);
        }
        
        m = nn;
        while ((m>=2) && (j>m)) {
            j -= m;
            m >>= 1;
        }        
        
        j += m;                
    };


-----------------------------------------------------

It turns out, if you analyze it closely, m is just part of the lower block.
The entire lower block is actually a reverse-binary incrementer, and it's independent of i.
Variable m is serving as a bit mask: the code goes from left to right, clearing
set bits until it finds a bit that is clear (j is no longer bigger than the mask), and then sets
that bit. AKA, a rightwise bit ripple.
Why is it done like this? Well, remember, there were no bitwise operators in FORTRAN...


So, let's look at that reverse incrementer in sane terms (now that we're not using FORTRAN
and have bitwise operators).


-----------------------------------------------------

    // j comes in whatever value it wants...
    for( m = nn>>1; m; m >>= 1) { // walk thru bits rightwise
      if (j & m) // if bit is set
         j ^= m; // clear the bit and move along
      else {     // if the bit is clear
         j ^= m; // set the bit
         break;  // stop doing things!
      }     
    }

-----------------------------------------------------
 
 OK, the operation j ^= m is going to happen regardless...

-----------------------------------------------------
    for( m = nn>>1; m; m >>= 1) { // walk thru bits rightwise
      j ^= m;            // toggle the current bit;
      if (j & m) break;  // if the result from the toggle was a 1, then stop 
    }

-----------------------------------------------------


 While we're at it, toggle that bit as you check.
 Note the subtle change to that first shift: m now starts at nn instead of nn>>1.
 
-----------------------------------------------------

    for( m = nn; m; m >>= 1, j ^= m) { // walk thru bits rightwise
      if (j & m) break;  // if the result from the toggle was a 1, then stop 
    }

-----------------------------------------------------

 The if is just a stop condition, we might as well put that in the for loop:


-----------------------------------------------------

for( m = nn; !(j & m) && m; m >>= 1, j ^= m ) ; // reverse binary increment j

-----------------------------------------------------


OK!
See the cleanup we just did!? Granted, it's not going to take less time, but man is it cleaner.
At the same time, now it can be zero indexed and go up by one.

-----------------------------------------------------

    nn = 32; // 32 sample tables    

    for (i = 0, j = 0; i < nn; i ++) {
        
        if (j > i) {  // don't swap with indexes behind us, or equal to us.
           printf(" swap( &real[%d], &real[%d]); \n", i, j);
        }
        
        for( m = nn; !(j & m) && m; m >>= 1, j ^= m ) ; // reverse binary increment j

    };
    

-----------------------------------------------------
Note how I had this generate a fixed swap list; it's a tradeoff between memory and loop time.
Like I say, I'm not using code that swaps the values in the buffer, I'm building the buffer
in the right order in the first place.
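
Building the buffer in the right order could look something like the sketch below: the write index is stepped with the reverse-binary increment, so sample number i lands directly in its bit-reversed slot. Both function names are invented for this example, and the plain samples[] array is a stand-in for whatever the ADC actually delivers.

```c
#include <assert.h>

/* Reverse-binary increment of j, for indexes below nn (nn a power of two).
   Same one-liner as above, wrapped in a function. */
unsigned reverse_increment(unsigned j, unsigned nn)
{
    unsigned m;
    for (m = nn; !(j & m) && m; m >>= 1, j ^= m) ;
    return j;
}

/* Store samples[0..nn-1] into real[] at bit-reversed positions and zero imag[],
   so no swap pass is needed before the butterflies. */
void load_bit_reversed(const float *samples, float *real, float *imag, unsigned nn)
{
    unsigned i, j;
    assert(nn != 0 && (nn & (nn - 1)) == 0);   /* power of two only */
    for (i = 0, j = 0; i < nn; i++) {
        real[j] = samples[i];
        imag[j] = 0.0f;
        j = reverse_increment(j, nn);          /* next slot in bit-reversed order */
    }
}
```

For nn = 8 the write index runs 0, 4, 2, 6, 1, 5, 3, 7, which puts the data where the swaps would have put it.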


With that out of the way...

TRIG VALUES
-----------
The idea is to use a trig lookup table, but it's become apparent from the code that only
particular values of theta are used, values like 0/2 0/4 1/2 2/4 4/8...
and yes, a bunch of those reduce down to the same values...

So the focus becomes how to exploit this limited value set without incurring too much overhead.

Playing with things, ((float)(-n*(nn/2))/(float)d) always results in non-negative whole numbers,
zero indexed, that are unique to the value of theta being processed.
OK.

-n*(nn/2)/d

we can make n positive, np:    n*(nn/2)/d
OK, rearrange:                 n*nn/(2*d)
binaryness:                    n*nn/(d<<1)


nn is always a power of 2,
and it's fixed (in this case 2^5 = 32): (n<<5)/(d<<1)

(bitPos[x] here is the position of the single set bit in x, so bitPos[32] = 5)

printf("%d \n", (n<<5)>>(bitPos[d]+1));

printf("%d \n", n<<(5-bitPos[d]-1));

or otherwise

printf("%d \n", n <<(bitPos[nn]-bitPos[d]-1) );

Now that's an interesting relation.
Let's take a closer look at that (for nn = 32):

shift 4 left 
shift 3 left 
shift 3 left 
shift 2 left 
shift 2 left 
shift 2 left 
shift 2 left 
shift 1 left 
shift 1 left 
shift 1 left 
shift 1 left 
shift 1 left 
shift 1 left 
shift 1 left 
shift 1 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 
shift 0 left 

Let's look at what we need and what we don't need here.

----------------------------------------
for(d = 1; d < nn; d <<= 1 ) {         
    n = 1;                                           // fraction initialization, this will immediately have 1 subtracted from it by the if
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {            // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {                             // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n--;                         // don't use i %= (nn-1);  it's MUCH slower, otherwise, loop stuff
       
        theta = (M_PI*(float)n)  / (float)d;  // work out n/d of pi for our angle
        wr = cos(theta);  // weighting values, cos doesn't care that we're rotating backwards
        wi = sin(theta);  // we're rotating backwards 'cause of this one.
                
      }          
      j = i + d;   // j is just an offset of i     
  
.
.
.
      
    }           
  }

----------------------------------------

d is used to generate i, and j
i is one of our master index pointers
j is the other
n builds our theta angle by counting the rollovers of i

none of these already holds the shift count we want
BUT
d is just a power of two counter, its value is used twice: once in the increment of the for(i) loop,
and once to establish the offset of j.
both of these are 'intensely' used. we could make both usages 1<<d if d were changed to a linear count, but
that would have a significant impact on performance, especially as my target system can only shift a value
1 place at a time.

begrudgingly, I'll add another variable, b. (Back in my day, 1 letter was all you needed for a variable name!)
and, going backwards isn't pretty, so I'll define the table size by the number of BITS the address uses

-------------------------------------------------------
   #define BITSIZE 5
   nn = 1 << BITSIZE;

  for(d = 1, b = BITSIZE-1; d < nn; d <<= 1, b-- ) {
    n = -1;                                   // fraction initialization, the if immediately adds 1 to this
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {    // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {            // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n++;                   // don't use i %= (nn-1); it's MUCH slower, otherwise, loop stuff

     //   theta = (M_PI*(float)-n)  / (float)d;  // work out n/d of pi for our angle
     //   wr = cos(theta);  // weighting values, cos doesn't care that we're rotating backwards
     //   wi = sin(theta);  // we're rotating backwards because of this one.

        printf("theta table index is: %d\n",  n<<b );

      }
      j = i + d;   // j is just an offset of i
    }
  }


-------------------------------------------------------------

it's getting kinda crowded...
HOWEVER, this reduces our trig calcs to just a few values,
which are generated by this code (it does the work we don't want to do in real time)

-------------------------------------------------------------


#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

void buildTrigTables( uint8_t bitsize ) {

  unsigned int i, j, b, d, nn, half;
  int    n;                            // signed, because it starts at -1
  float  theta;
  float *sinTable;
  float *cosTable;

  nn   = 1 << bitsize;
  half = nn >> 1;                      // n<<b never goes past nn/2 - 1, so half a table is enough

  sinTable = malloc(half*sizeof(float));
  cosTable = malloc(half*sizeof(float));
  if ((sinTable == NULL) || (cosTable == NULL)) {   // always check malloc for errors, ALWAYS
    free(sinTable);
    free(cosTable);
    return;
  }

  for(d = 1, b = bitsize-1; d < nn; d <<= 1, b-- ) {
    n = -1;                                   // fraction initialization, the if immediately adds 1 to this
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {    // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {            // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n++;                   // don't use i %= (nn-1); it's MUCH slower, otherwise, loop stuff

        theta = (-M_PI*(float)n)  / (float)d;  // work out n/d of pi for our angle
        sinTable[n<<b] = sin(theta);
        cosTable[n<<b] = cos(theta);

      }
      j = i + d;   // j is just an offset of i
    }
  }

  printf("float sinTable[] = { ");
  for(i = 0; i < half; i++) printf( "%8.6f%c", sinTable[i], (i!=half-1)?',':'}');
  printf(";\n\n");

  printf("float cosTable[] = { ");
  for(i = 0; i < half; i++) printf( "%8.6f%c", cosTable[i], (i!=half-1)?',':'}');
  printf(";\n");

  free(sinTable);
  free(cosTable);

}


-------------------------------------------------------------

That is a quite non-optimized, slapped-together builder for the values; copy and paste, it was quick.
This only needs to be run once to build the trig table for your source: static values, yay.
For a 32 sample FFT, you only need 16 sin and 16 cos values ;)


But with that, we have a pretty optimal, modern implementation of the NMBrenner alg!

--------------------------------------------------------------

#include <avr/io.h>
#include <avr/interrupt.h>
#include <stdint.h>
#include "MAX7219.h"
#include "avrcommon.h"

#define OUTPUT  1
#define INPUT   0

uint8_t RBI_FSM32[] = { 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,8,9,10,11,12,13,14,15,4,5,6,7,2,3,1,0 };

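// sin and cos of -pi*k/16, scaled by 32. The float literals get truncated toward zero when stored as int8_t.
int8_t sinTable32[] = { -0.000000,-6.242890,-12.245871,-17.778248,-22.627417,-26.607027,-29.564144,-31.385128,-32.000000,-31.385128,-29.564146,-26.607029,-22.627417,-17.778246,-12.245872,-6.242890};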
int8_t cosTable32[] = { 32.000000,31.385128,29.564144,26.607027,22.627417,17.778248,12.245870,6.242891,-0.000001,-6.242890,-12.245869,-17.778246,-22.627417,-26.607029,-29.564144,-31.385130};

volatile int8_t real[32];
volatile int8_t imag[32];

volatile char ADCFlag;

volatile uint8_t  IRQIDX ;


void AnalogInit (void);
void timerInit(void) ;

#define BITSIZE 5
#define nn (1<<BITSIZE)

int main( void ) {
 
 int a;
 
    int8_t     d, i, j, b, n;
    
    int8_t        wr, wi;
    int       tempr, tempi;

    // set up directions 
  DDRB = (OUTPUT << PB0 | OUTPUT << PB1 | INPUT << PB2 | INPUT << PB3 | INPUT << PB4 |OUTPUT << PB5 | INPUT << PB6 | INPUT << PB7);
  DDRD = (INPUT << PD0 | INPUT << PD1 | OUTPUT << PD2 |OUTPUT << PD3 |OUTPUT << PD4 |OUTPUT << PD5 |OUTPUT << PD6 |OUTPUT << PD7);        
  DDRC = (INPUT << PC0 | INPUT << PC1 | INPUT << PC2 |INPUT << PC3 |INPUT << PC4 |INPUT << PC5 |INPUT << PC6 ); 
  
  IRQIDX = 0;
  
  AnalogInit();
  timerInit();
  max7219Init( );
  
  sei();
 
   ADCFlag = 0;
   while(ADCFlag == 0);
   ADCFlag = 0;
 
  while(1) {
   // wait for buffer
   
   while(ADCFlag == 0);    
   
   for ( i = 0; i < nn; i++) imag[i] = 0;   // the input is real only, so clear the imaginary part
      
   for(d = 1, b = BITSIZE-1; d < nn; d <<= 1, b-- ) {         
    n = -1;                                   // fraction initialization, the if immediately adds 1 to this
    for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) {    // this will always iterate nn/2 times
      if ((i >= nn) || (j == 0)) {            // do calcs and wrap the i index back into the array with an offset
        i -= (nn-1);   n++;                   // don't use i %= (nn-1); it's MUCH slower, otherwise, loop stuff
        wr = cosTable32[n<<b];
        wi = sinTable32[n<<b];
      }          
      j = i + d;   // j is just an offset of i     
  
      tempr     =  (wr * real[j])/32 - (wi * imag[j])/32;    // defacto standard complex multiply
      tempi     =  (wr * imag[j])/32 + (wi * real[j])/32;   
      real[j]   =  real[i] - tempr; // update real, crossed path calculations
      imag[j]   =  imag[i] - tempi; // update imaginary 
      real[i]   += tempr;            // update real, forward path calculations
      imag[i]   += tempi;            // update imaginary      
      
    }           
  }
   
   // process data
   for ( i = 0; i < 8; i++) {    
     a = ((int)ABS(real[i+3])+(int)ABS(imag[i+3]))>>5; // |re|+|im| as a cheap magnitude, I'm avoiding a square root, thanks.
     
     d = (1<<a)-1;
     send16(max7219MakePacket(cmdDIG0+(7-i), d));    
   }
   
   // start new buffer
   ADCFlag = 0;
   TIFR1 |= (1<<TOV1); // restart adc
    
  }
}

// 7112Hz
void timerInit(void) {

// Fast pwm mode, overflows @ OCR1A 
  TCCR1A |= (1<<WGM10)|(1<<WGM11);
  // set prescaler to /1 and FAST PWM mode
  TCCR1B |= (1<<CS10)|(1<<WGM12)|(1<<WGM13);
  
  // about 7111.11 hz
  OCR1A = 2250;

}


void AnalogInit (void) {  

  // auto trigger on timer 1 overflow, how cool is that!
  ADCSRB = 1 << ADTS2 |
           1 << ADTS1 ;

  // Activate ADC with Prescaler of 128, yielding max of 9.6ksps
  ADCSRA =  1 << ADEN  |
            0 << ADSC  | 
            1 << ADATE | /* auto start when timer1 overflows */
            0 << ADIF  |
            1 << ADIE  | /* enable interrupt */
            1 << ADPS2 |
            1 << ADPS1 |
            1 << ADPS0 ;
                        
  ADMUX = (1<<REFS0)|(1<<ADLAR);     // channel 0, left adjusted so ADCH holds the top 8 bits (the shift right by 2 is done for us)
  
}


ISR(ADC_vect) { 
  
  static uint8_t i;
  
  SetBit(5, PINB);
  
  if (!ADCFlag) { 
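    real[IRQIDX] = ADCH;  // save value

    IRQIDX = RBI_FSM32[IRQIDX];  // step to the next slot in bit-reversed order

    if (IRQIDX == 0) {    // the table has wrapped back to 0, so all 32 samples are in
      ADCFlag = 1;
    }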
  
    TIFR1 |= (1<<TOV1);
  } 
  return;
}


---------------------------------------------------------------

So, a summary?

Brenner took source from two others, which may have come from other sources...
It was passed along thru the ages. I could see features not only from the obvious source, FORTRAN,
but some from BASIC.
The code was originally constrained to a language that only had 1 dimensional data arrays and no binary functions,
and people have tried to optimize it at various times for speed...

But nobody seems to have actually understood what any of the code was doing.
That was my result here: I not only ported the concept to C, but I learned WAAAY
more than I ever wanted to know about how an FFT works.

Questions left un-explored:

a)
  "PROGRAMS FOUR2 AND FOURT ARE AVAILABLE THAT RUN
   TWICE AS FAST AND OPERATE ON MULTIDIMENSIONAL ARRAYS WHOSE
   DIMENSIONS ARE NOT RESTRICTED TO POWERS OF TWO."
   
Yes, I have the source for those, not typed out, as they are LONG.
Maybe that's why people went with this implementation?

b)
  It's possible to re-arrange the constants of an FFT and then maul the
  butterfly into the form of a mesh network (like a neural net)
  But there would be a LOT of labour translating the equations.
  The advantage should be that the whole FFT could be done in
  possibly just 1 or 2 passes of a loop. ??


 -- end --