Research and Publications
Computing In-Place FFTs with SIMD Lane Slicing
We present an approach for implementing in-place FFTs on cores fitted with SIMD units and non-temporal load-store units. Loading the input samples with SIMD instructions decimates them in time across the SIMD lanes. A classic FFT implementation is then extended to operate on SIMD data rather than scalar data, so that it computes the sub-transforms concurrently. This makes efficient use of the SIMD arithmetic and memory access instructions while requiring little SIMD lane shuffling. A final FFT stage then recombines the sub-transform results in place to produce the output. We illustrate this approach on a Cooley-Tukey radix-4 decimation-in-frequency FFT implementation, which also integrates two further techniques: the collapsing of the two inner loops, as in the TI C6x DSP_fft32x32 code, which enables software pipelining, and the Burrus technique for using bit-reversal in high-radix FFT implementations. Performance is evaluated on the Kalray KV3 core, which implements a 64-bit vector-scalar VLIW architecture with level-1 cache bypass load instructions.
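The decomposition described in the abstract can be sketched in a few lines of plain Python. This is an illustration of the underlying arithmetic only, not the authors' implementation: it is neither in-place nor radix-4, it uses a naive DFT for the sub-transforms, and the names `dft`, `lane_sliced_fft`, and `lanes` are hypothetical. It assumes that a SIMD load of `lanes` consecutive samples puts sample `lanes*v + j` in lane `j` of vector `v`, so that each lane sees a time-decimated copy of the input.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used as the reference and as the per-lane sub-transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def lane_sliced_fft(x, lanes=4):
    """Compute the DFT of x by slicing the input across `lanes` emulated SIMD lanes."""
    n = len(x)
    if n % lanes != 0:
        raise ValueError("input length must be a multiple of the lane count")
    m = n // lanes  # length of each per-lane sub-transform

    # Emulated SIMD loads: vector v holds x[lanes*v .. lanes*v + lanes - 1],
    # so lane j of successive vectors carries x[j], x[j + lanes], x[j + 2*lanes], ...
    vectors = [x[lanes * v: lanes * v + lanes] for v in range(m)]

    # One sub-transform per lane. Real SIMD code would run a single FFT whose
    # "scalars" are whole vectors, advancing all lanes at once.
    sub = [dft([vec[j] for vec in vectors]) for j in range(lanes)]

    # Final stage: recombine the lane results with twiddle factors.
    # X[k + m*q] = sum_j exp(-2*pi*i * j * (k + m*q) / n) * sub[j][k]
    out = [0j] * n
    for k in range(m):
        for q in range(lanes):
            idx = k + m * q
            out[idx] = sum(
                cmath.exp(-2j * cmath.pi * j * idx / n) * sub[j][k]
                for j in range(lanes)
            )
    return out
```

Calling `lane_sliced_fft(samples, lanes=4)` on a list whose length is a multiple of 4 should agree with `dft(samples)` up to floating-point rounding. The final loop is the only place where values from different lanes are mixed, which is why the scheme needs little lane shuffling before the last stage.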