FFT, a stroll thru WHOLY @#%$&@#$%!!!!! So as the path of a tangent of a project, I decided to implement a FFT. For years I'd known of FFTs and the simple magic they performed. but easy as they supposedly were to implement, I'd NEVER DONE IT. so, its time! Now, this would supposedly be easy, a) download fft code b) feed values into it c) display the results so, I set out to find code. Amongst looking for code it became clear that I needed to know more about what I was doing The results of an FFT depend on the sample rate of the data you put in, and the size of the sample block. ok, fair enough. I sat down and wrote a program that would allow me to define an input sample rate and block size, and would tell me the meaning of the FFT results or 'bins'. Each 'bin' represent the amplitude of some frequency in the given signal. From this, set some goal posts, I wanted to have 16 bins, such that 2 bins bordered 1khz. So then I went back over the results of my search for code. There are two main types of code for this, itterative and recursive. Because I want to implement this on a tiny microcontroller with limited memory I chose to use an itterative version. and thus started the problem... Almost ALL the itterative code I could find looked exactly the same and it WAS NOT pretty. The code is full of what seem to be excessive variables and loops. ----------------------- // http://www.drdobbs.com/cpp/a-simple-and-efficient-fft-implementatio/199500857 void four1(double* data, unsigned long nn) { unsigned long n, mmax, m, j, istep, i; double wtemp, wr, wpr, wpi, wi, theta; double tempr, tempi; // reverse-binary reindexing n = nn<<1; j=1; for (i=1; ii) { swap(data[j-1], data[i-1]); swap(data[j], data[i]); } m = nn; while (m>=2 && j>m) { j -= m; m >>= 1; } j += m; }; // here begins the Danielson-Lanczos section mmax=2; while (n>mmax) { istep = mmax<<1; theta = -(2*M_PI/mmax); wtemp = sin(0.5*theta); wpr = -2.0*wtemp*wtemp; wpi = sin(theta); wr = 1.0; wi = 0.0; for (m=1; m < mmax; m += 2) { for (i=m; i <= n; i += istep) { j=i+mmax; tempr = wr*data[j-1] - wi*data[j]; tempi = wr * data[j] + wi*data[j-1]; data[j-1] = data[i-1] - tempr; data[j] = data[i] - tempi; data[i-1] += tempr; data[i] += tempi; } wtemp=wr; wr += wr*wpr - wi*wpi; wi += wi*wpr + wtemp*wpi; } mmax=istep; } } ----------------------- The code has two parts, the first part re-organizes the data, the second part does the FFT calculations. The code that does the "binary re-indexing" is horrid, in implementation and in practicality for my application. a) There are WAY too many loops and variables going on there, for a task like this. b) Here is a good joke, its moving the imaginary numbers around, why is this funny? cause, when you build the input array, you make all the imaginary numbers zero. so "why would anyone move them around" you might ask?, well, This version has been modified from the source to be specific to doing FORWARD FFT, the origional code in the Numerical methods book has a bi-directional FFT function, and there is a variable that can determine which direction the function works in. But not this version, its just FORWARD. And WTF Numerical Methods, the code was almost a "mentioned in passing"!?!?! OK, so, I isolated that first section, and just looked at the indexes it swapped. ------------- n = nn<<1; j=1; for (i=1; ii) { printf("Swap %02X %02X\n", i-1, j-1); printf("Swap %02X %02X\n", j, i); } m = nn; while ((m>=2) && (j>m)) { j -= m; m >>= 1; } j += m; }; ------------- Giving it a low-order number, (nn = 8) to see what its up to. Swap 02 08 Swap 09 03 Swap 06 0C Swap 0D 07 ok, so, this is supposed to be the reversal of values based on a mirror image of their offset in binary... error. You have to stand on your head for it to make sense, but you have to be inside out while doing it. This dosn't make sense, not even when you look at it in binary. This should swap 100 <-> 001 etc... looking at it in binary: Swap 0b00000011 0b00001001 Swap 0b00000010 0b00001000 Swap 0b00000111 0b00001101 Swap 0b00000110 0b00001100 still dosn't make sense, these are not mirror images of anything, till you start to squint. (nn = 8) so were working with 8 value pairs here, which is 3 bits... The LSB selects if your on the real or imaginary value, that accounted for Swap 0b0000001 0b0000100 Swap 0b0000001 0b0000100 Swap 0b0000011 0b0000110 Swap 0b0000011 0b0000110 if we further ditch the extra lefthand bits... Swap 001 100 Swap 001 100 Swap 011 110 Swap 011 110 and hey, there we have it! thats what its supposed to do, but UGH THE CODE. now, dont get me wrong, our modern computers were NOT optimized to mirror bits like this, there isn't one nice instruction to pull it off. In the long run, I'm not going to. I have an ADC that I'm pulling results from, what I'll do is build a state machine that generates the indexes that the ADC loads the values in. The data will be in the right order for the FFT in the first place. I set upon my own code to generate the swap map. ------------------------------- uint8_t uniReverse(uint8_t i, uint8_t bits) { uint8_t r = 0; do { r <<=1; // The first extra shift dosn't matter, cause were going in with a value of 0 if (i & 1) r++; i >>= 1; } while(--bits); return r; } int main(void) { int i; // nn = 8 #define BITS 3 for(i = 0; i < (1<= nn) i -= (nn-1); // dont use i %= (nn-1); its MUCH slower j = i + p; printf("p= %02d, i= %02d, j= %02d\n", p, i, j); } } ---------------------------------------------------------- there!, only 2 layers! output is: p= 01, i= 00, j= 01 p= 01, i= 02, j= 03 p= 01, i= 04, j= 05 p= 01, i= 06, j= 07 p= 02, i= 00, j= 02 p= 02, i= 04, j= 06 p= 02, i= 01, j= 03 p= 02, i= 05, j= 07 p= 04, i= 00, j= 04 p= 04, i= 01, j= 05 p= 04, i= 02, j= 06 p= 04, i= 03, j= 07 perfect! and notice how p = mmax/2.... The next step to rebuilding this code, is to generate the weighing values This is a point where you notice a strange difference between the origional FORTRAN and the C while (n>mmax) { istep = mmax<<1; theta = -(2*M_PI/mmax); wtemp = sin(0.5*theta); wpr = -2.0*wtemp*wtemp; wpi = sin(theta); wr = 1.0; wi = 0.0; for (m=1; m < mmax; m += 2) { . . . } wtemp=wr; wr += wr*wpr - wi*wpi; wi += wi*wpr + wtemp*wpi; vs THETA=3.1415926535*FLOAT(ISIGN*(M-1))/FLOAT(MMAX) WR=COS(THETA) WI=SIN(THETA) Are they the same??? so I put both to the test, and yes, the values in the inner loop for wr and wi came out the same so WTF? then I looked at it more, SOMEONE between the 6502 and the 80386 decided to optimize this code for integer math, apparently the 6 multiplies, 4 additions, and copy were cheaper than a cos() call (!?!?!) ok, but we have floating point processors now! You might comment that I was going to run this on an tiny microcontroller that dosn't have a floating point processor, but, when you look at the values that make WR and WI in the fortran, you realize that, depending on the array size, this code only uses a handfull of values for theta, you could easily use a small lookup table... a really small one. so, I chose to dial this back to the origional FORTRAN method with sin() and cos() now, there is another cleanup I want to make, I dont care about IFFT, just forward. so my ISIGN is always -1. THETA = 3.1415926535*FLOAT(ISIGN*(M-1))/FLOAT(MMAX) to theta = M_PI *(-1)*(m-1)/mmax; to theta = M_PI*(1-m)/mmax; nice. now, one of the things my loop code is missing is the generation of mmax and m. if you look at the values generated, mmax starts at 2, and goes up by two, so easy. n starts at zero and goes up by two. ----------------------------------------------------- for(d = 1; d < nn; d <<= 1 ) { n = -2; // fraction initialization, this will immediatly have 2 added to it by the if for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) { // this will always itterate nn/2 times if ((i >= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n+=2; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (M_PI*(1.0-(float)n)) / (float)(d<<1); } j = i + d; // j is just an offset of i } } ----------------------------------------------------- I added a check to the inner loop for j, becasue the 'if' is going to do all our theta calcs, we need to make sure it trips on the first itteration. but waaaaaaait a min... (no, ignore me changing variable names, I'm trying to work out a good name for what they are doing) in the theta calc mmax (otherwise d) is always even (2, 4, 8...) and n, isn't part of a loop, but its also always even.... when you look at them they are always generating reducable fractions... SO, how about we dont multiply d (otherwise mmax) by two EVERY loop, and we alter n to suit? ----------------------------------------------------- for(d = 1; d < nn; d <<= 1 ) { n = -1; // fraction initialization, this will immediatly have 1 added to it by the if for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) { // this will always itterate nn/2 times if ((i >= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n+=1; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (M_PI*(1.0-(float)n)) / (float)(d); } j = i + d; // j is just an offset of i } } ----------------------------------------------------- that helps, we just dropped a multiply(shift) out of the inner loop buuuut waiiiit, a sec.... 1.0-(float)n n is just along for the ride, it can be whatever we want... if we start from 0, and work down.... ----------------------------------------------------- for(d = 1; d < nn; d <<= 1 ) { n = 1; // fraction initialization, this will immediatly have 1 added to it by the if for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) { // this will always itterate nn/2 times if ((i >= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n--; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (M_PI*(float)n) / (float)(d); } j = i + d; // j is just an offset of i } } ----------------------------------------------------- and hey, look, we also just shaved a subtract out of the inner loop! At this point I checked the values being generated for theta to make sure they happened with the right values and the right I / J indexes and yup. all good so far: ----------------------------------------------------- for(d = 1; d < nn; d <<= 1 ) { n = 1; // fraction initialization, this will immediatly have 1 added to it by the if for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) { // this will always itterate nn/2 times if ((i >= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n--; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (M_PI*(float)n) / (float)d; // work out n/d of pi for our angle wr = cos(theta); // weighing values, cos dosn't care that were rotating backwards wi = sin(theta); // were rotating backwards cause of this one. printf(" theta=Pi*%02d/%02d, theta= %f wr= %f, wi= %f \n",n, d, theta, wr, wi); } j = i + d; // j is just an offset of i printf(" i= %02d, j= %02d\n", i, j); } } ----------------------------------------------------- at this point, the only thing left to do is to put back the actual math, there really isn't any thing I can do to improve it. what IS apparent, is that the single array - alternated values, is not an optimization of any kind, its there cause its carried forward from FORTRAN ----------------------------------------------------- for(d = 1; d < nn; d <<= 1 ) { n = 1; // fraction initialization, this will immediatly have 1 added to it by the if for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) { // this will always itterate nn/2 times if ((i >= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n--; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (M_PI*(float)n) / (float)d; // work out n/d of pi for our angle wr = cos(theta); // weighing values, cos dosn't care that were rotating backwards wi = sin(theta); // were rotating backwards cause of this one. } j = i + d; // j is just an offset of i tempr = wr * real[j] - wi * imag[j]; // defacto standard complex multiply tempi = wr * imag[j] + wi * real[j]; real[j] = real[i] - tempr; // update real, crossed path calculations imag[j] = imag[i] - tempi; // update imaginary real[i] += tempr; // update real, forward path calculations imag[i] += tempi; // update imaginary } } ----------------------------------------------------- so then, load this into a test-suite and check against some known-good values. ----------------------------------------------------- #include #include #include "binaryString.h" void swap( float * this, float * that) { float t; t = *this; *this = *that; *that = t; } int main(void) { float real[255], imag[255]; int nn; unsigned int d, i, j; int n; float theta, wr, wi; float tempr, tempi; nn = 8; for (i = 0; i < nn; i++){ real[i] = 0; imag[i] = 0; } real[0]=1; real[1]=1; real[2]=1; real[3]=1; for ( i = 0; i < nn; printf("%6.2f \n", real[i]), i++); // imaginary is all zero, just swap around reals!. swap(&real[1], &real[4]); //8 swap(&real[3], &real[6]); for(d = 1; d < nn; d <<= 1 ) { n = 1; // fraction initialization, this will immediatly have 1 added to it by the if for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) { // this will always itterate nn/2 times if ((i >= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n--; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (M_PI*(float)n) / (float)d; // work out n/d of pi for our angle wr = cos(theta); // weighing values, cos dosn't care that were rotating backwards wi = sin(theta); // were rotating backwards cause of this one. // printf(" theta=Pi*%02d/%02d, theta= %f wr= %f, wi= %f \n",n, d, theta, wr, wi); } j = i + d; // j is just an offset of i // printf(" i= %02d, j= %02d\n", i, j); tempr = wr * real[j] - wi * imag[j]; // defacto standard complex multiply tempi = wr * imag[j] + wi * real[j]; real[j] = real[i] - tempr; // update real, crossed path calculations imag[j] = imag[i] - tempi; // update imaginary real[i] += tempr; // update real, forward path calculations imag[i] += tempi; // update imaginary } } printf("\n\n"); for ( i = 0; i < nn; printf("%6.2f %6.2f\n", real[i], imag[i]), i++); return 0; } ----------------------------------------------------- results are: 1.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 4.00 0.00 1.00 -2.41 0.00 0.00 1.00 -0.41 0.00 0.00 1.00 0.41 0.00 0.00 1.00 2.41 according to https://rosettacode.org/wiki/Fast_Fourier_transform 0 4.000000 0.000000 1 1.000000 -2.414214 2 0.000000 0.000000 3 1.000000 -0.414214 4 0.000000 0.000000 5 1.000000 0.414214 6 0.000000 0.000000 7 1.000000 2.414214 yay! it still works! ----------------------------------------------------- summaries so far: - Don't rearrange the zeros in the input array, 0 = 0, we can use either one. - Don't use 3 loop layers when 2 will do - Don't try to optimize for integer processors, if you need to, do it later. - Don't multiplex the array, were not limited to a single dataset anymore. - Optimize your equations, remove redundant calculations. - Modern day punchcards can handle more than 80 characters, write long lines if you want. so the next challange for me is to shoe-horn it into an 8 bit integer system but first, a revisit to reverse binary reindexing the origional code for that reindexing is still bothering me: ----------------------------------------------------- nn = 32; // reverse-binary reindexing n = nn<<1; j=1; for (i=1; ii) { printf("Swap %02d %02d\n", (i-1)/2, (j-1)/2); } m = nn; while ((m>=2) && (j>m)) { j -= m; m >>= 1; } j += m; }; ----------------------------------------------------- It turns out, if you analize it closely, m is just part of the lower block. The entire lower block actaully is a reverse-binary incrementer, and its independent of i, Variable m is serving as a bit mask, the code is going from left to right, looking for a bit that causes j to get smaller than the mask (by clearing them), and then setting the next bit down. aka, a rightwise bit ripple. Why is it done liek this? well, remember, there were no binary operators in FORTRAN... so, lets look at that reverse incrementer in sane terms (now that were not using FORTRAN and have bitwise operators.) ----------------------------------------------------- // j comes in whatever value it wants... for( m = nn>>1; m; m >>= 1) { // walk thru bits rightwise if (j & m) // if bit is set j ^= m; // clear the bit and move along else { // if the bit is clear j ^= m; // set the bit break; // stop doing things! } } ----------------------------------------------------- ok, the operation j ^= m is going to happen regardless... ----------------------------------------------------- for( m = nn>>1; m; m >>= 1) { // walk thru bits rightwise j ^= m; // toggle the current bit; if (j & m) break; // if the result from the toggle was a 1, then stop } ----------------------------------------------------- while were at it, toggle that bit as you check subtle change of that first shift ----------------------------------------------------- for( m = nn; m; m >>= 1, j ^= m) { // walk thru bits rightwise if (j & m) break; // if the result from the toggle was a 1, then stop } ----------------------------------------------------- the if is just a stop condition, we might as well put that in the for loop, ----------------------------------------------------- for( m = nn; !(j & m) && m; m >>= 1, j ^= m ) ; // reverse binary increment j ----------------------------------------------------- ok! see the cleanup we just did!?, granted its not going to take less time, but man is it cleaner. At the same time, now it can be zero indexed and go up by one. ----------------------------------------------------- nn = 32; // 32 sample tables for (i = 0, j = 0; i < nn; i ++) { if (j > i) { // dont swap with indexes behind us, or equal to us. printf(" swap( &real[%d], &real[%d]); \n", i, j); } for( m = nn; !(j & m) && m; m >>= 1, j ^= m ) ; // reverse binary increment j }; ----------------------------------------------------- Note how I had this generate a fixed swap list, its a tradeoff between memory and loop time. Like I say, I'm not using code that swaps the values in the buffer, I'm building the buffer in the right order in the first place. with that out of the way... TRIG VALUES ----------- The idea is to use a trig lookup table, but its become apparent from the code that only particular values of theta are used, values like 0/2 0/4 1/2 2/4 4/8... and yes, a bunch of those reduce down to the same values... so the focus becomes, how to exploit this limited value set without incurreing too much overhead. playing with things, ((float)(-n*(nn/2))/(float)d) always results in positive numbers, zero indexed, that are unique to the value of theta being processed. ok -n*(nn/2) /d we can make n positive, np. : n*(nn/2)/d ok, rearange: n*nn/2*d binaryness: n*nn/(d<<1) nn is always a power of 2, and its fixed (in this case 5): (n<<5)/(d<<1) printf("%d \n", (n<<5)>>bitPos[d]); printf("%d \n", n<<(5-bitPos[d]); or otherwise printf("%d \n", n <<(bitPos[nn]-bitPos[d]-1) ); now thats an interesting relation. lets take a closer look at that. (for nn = 32) shift 4 left shift 3 left shift 3 left shift 2 left shift 2 left shift 2 left shift 2 left shift 1 left shift 1 left shift 1 left shift 1 left shift 1 left shift 1 left shift 1 left shift 1 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left shift 0 left lets look at what we need and what we dont need here. ---------------------------------------- for(d = 1; d < nn; d <<= 1 ) { n = 1; // fraction initialization, this will immediatly have 1 added to it by the if for(i = (nn-1), j = 0; j != (nn-1); i += (d << 1) ) { // this will always itterate nn/2 times if ((i >= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n--; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (M_PI*(float)n) / (float)d; // work out n/d of pi for our angle wr = cos(theta); // weighing values, cos dosn't care that were rotating backwards wi = sin(theta); // were rotating backwards cause of this one. } j = i + d; // j is just an offset of i . . . } } ---------------------------------------- d is used to generate i, and j i is one of our master index pointers j is the other n builds our theta angle in i by counting the rollovers none that do what we want already BUT d is just a power of two counter, its value is used twice, once in the increment of the for(i) loop, and once to establish the offset of j both of these are 'intensly' used, we could make both usages 1<= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n++; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff // theta = (M_PI*(float)-n) / (float)d; // work out n/d of pi for our angle // wr = cos(theta); // weighing values, cos dosn't care that were rotating backwards // wi = sin(theta); // were rotating backwards cause of this one. printf("theta table index is: %d", n<= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n++; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff theta = (-M_PI*(float)n) / (float)d; // work out n/d of pi for our angle sinTable[n< #include #include #include "MAX7219.h" #include "avrcommon.h" #define OUTPUT 1 #define INPUT 0 uint8_t RBI_FSM32[] = { 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,8,9,10,11,12,13,14,15,4,5,6,7,2,3,1,0 }; int8_t sinTable32[] = { -0.000000,-6.242890,-12.245871,-17.778248,-22.627417,-26.607027,-29.564144,-31.385128,-32.000000,-31.385128,-29.564146,-26.607029,-22.627417,-17.778246,-12.245872,-6.242890}; int8_t cosTable32[] = { 32.000000,31.385128,29.564144,26.607027,22.627417,17.778248,12.245870,6.242891,-0.000001,-6.242890,-12.245869,-17.778246,-22.627417,-26.607029,-29.564144,-31.385130}; volatile int8_t real[32]; volatile int8_t imag[32]; volatile char ADCFlag; volatile uint8_t IRQIDX ; void AnalogInit (void); void timerInit(void) ; #define BITSIZE 5 #define nn (1<= nn) || (j == 0)) { // do calcs and wrap the i index back into the array with an offset i -= (nn-1); n++; // dont use i %= (nn-1); its MUCH slower, otherwise, loop stuff wr = cosTable32[n<>5; // I'mavoiding a square root, thanks. d = (1<