
A Survey on Pipelined FFT Hardware Architectures

Published: 07/06/2021

TL;DR Summary

This paper surveys 50 years of advancements in pipelined FFT hardware architectures, focusing on designs for complex input data and power-of-two sizes. It provides an educational overview and classifies architectures into serial and parallel for better understanding.

Abstract

The field of pipelined FFT hardware architectures has been studied during the last 50 years. This paper is a survey that includes the main advances in the field related to architectures for complex input data and power-of-two FFT sizes. Furthermore, the paper is intended to be educational, so that the reader can learn how the architectures work. Finally, the paper divides the architectures into serial and parallel. This classification puts together those architectures that are conceived for a similar purpose and, therefore, are comparable.

In-depth Analysis

1. Bibliographic Information

1.1. Title

A Survey on Pipelined FFT Hardware Architectures

1.2. Authors

Mario Garrido

1.3. Journal/Conference

Published online in a Springer Nature journal. Given the context of the paper, it is likely a peer-reviewed academic journal specializing in signal processing, circuits, or computer architecture. Springer Nature is a highly reputable publisher in academic research.

1.4. Publication Year

Published online: 6 July 2021 (Received: 26 June 2020 / Revised: 3 February 2021 / Accepted: 3 March 2021). The effective publication year is 2021.

1.5. Abstract

The paper presents a comprehensive survey of pipelined Fast Fourier Transform (FFT) hardware architectures, covering developments over the past 50 years. It focuses on architectures designed for complex input data and power-of-two FFT sizes. The survey aims to be educational, providing insights into the operational principles of these architectures. For clarity and comparability, the architectures are categorized into two main groups: serial and parallel.


2. Executive Summary

2.1. Background & Motivation

The field of pipelined Fast Fourier Transform (FFT) hardware architectures has been actively researched for over 50 years, with the first architectures appearing in 1970. The core problem the paper aims to address is the need to systematically collect, present, and explain the main advancements in this domain. This problem is important because FFT is a fundamental operation in many signal processing applications, and efficient hardware implementations are crucial for real-time performance, low power consumption, and area optimization.

Prior research has generated a diverse array of architectural designs, making it challenging for newcomers to grasp the underlying principles and compare different approaches. The specific challenges include understanding the mathematical fundamentals that govern these architectures and recognizing how various design choices (e.g., radix, parallelization, data flow management) lead to different trade-offs in terms of hardware resources (adders, rotators, memory) and performance (latency, throughput).

The paper's entry point is to provide an educational survey that not only catalogs these architectures but also explains how they work to serve as an introduction to the field. It focuses on specific, common constraints: architectures for power-of-two FFT sizes and complex input data, while explicitly scoping out other types (e.g., iterative, fully parallel, real-valued, variable-length, non-power-of-two sizes).

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Comprehensive Survey: It collects and summarizes the main advances in pipelined FFT hardware architectures over 50 years, providing a valuable historical perspective and a detailed overview of the state-of-the-art.

  • Educational Approach: The paper is structured to explain the working principles of each architecture, serving as an introductory guide for readers new to the field.

  • Focused Scope: It specifically targets pipelined FFT architectures for power-of-two sizes and complex input data, which are prevalent in many applications.

  • Novel Classification System: It classifies architectures into serial (Single-path Delay Feedback (SDF), Single-path Delay Commutator (SDC), Single-stream Feedforward (SFF), and Serial Commutator (SC)) and parallel (Multi-path Delay Feedback (MDF), Multi-path Delay Commutator (MDC), and Multi-path Serial Commutator (MSC)). This classification is based on similar purpose and comparability, aiding in a structured understanding.

  • Chronology of Advancements: A timeline is provided, highlighting the introduction and significant modifications of various architecture types, tracing their evolution.

  • Detailed Architectural Descriptions: Each architectural type is described with explanations of its components, data flow, timing, and hardware complexity (adders, rotators, memory).

  • Comparative Analysis: The paper concludes with a detailed comparison table summarizing the hardware resources and performance metrics for both serial and parallel architectures, allowing for easy assessment of their respective trade-offs.

    The key conclusions and findings include:

  • There are distinct trade-offs between hardware resources (complex adders, rotators, memory) and performance (latency, throughput) among different pipelined FFT architectures.

  • SDF architectures generally minimize memory and latency but can have higher adder counts (except for deep feedback variations).

  • SDC architectures, especially later versions, minimize adders and rotators but have higher memory and latency than other serial types.

  • SFF architectures achieve low adder counts and latency but require more memory.

  • SC architectures represent a balance, achieving low adders and rotators with near-optimal memory and latency.

  • Parallel architectures (MDF, MDC, MSC) offer higher throughput by processing multiple data per cycle, with M2DF, MDC, and MSC often achieving the lowest adder and rotator counts, and good memory usage.

  • The choice of radix (e.g., radix-2, radix-4, radix-$2^k$) significantly impacts rotator complexity and butterfly utilization.

  • Optimization of rotators remains a key area of research, often achieved by exploring different data orders to minimize their number and complexity.

    These findings solve the problem of providing a clear, structured, and comparative overview of a complex and evolving field, enabling researchers and engineers to make informed decisions when designing FFT hardware.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the architectures presented in this paper, a reader needs to be familiar with several fundamental concepts in digital signal processing and hardware design.

3.1.1. Discrete Fourier Transform (DFT)

The Discrete Fourier Transform (DFT) is a mathematical transform that converts a finite sequence of equally-spaced samples of a function into a sequence of coefficients of a finite combination of complex sinusoids, ordered by their frequencies. It decomposes a sequence of values into its constituent frequencies. The $N$-point DFT of an input sequence $x[n]$ is defined as: $ X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}, \quad k = 0, 1, \dots, N-1 $ where:

  • $X[k]$ is the output at frequency $k$.
  • $x[n]$ is the input sequence.
  • $N$ is the number of points in the DFT.
  • $W_N^{nk}$ are the twiddle factors.
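As a concrete reference, the definition above can be evaluated directly. The following Python sketch (the function name `dft` is illustrative, not from the paper) computes the sum term by term:

```python
import cmath

def dft(x):
    """Direct DFT, term by term from the definition above:
    X[k] = sum_n x[n] * exp(-j*2*pi*n*k/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

# A constant input concentrates all the energy in bin k = 0:
X = dft([1, 1, 1, 1])
# X[0] == 4 (up to rounding), all other bins ~ 0
```

This direct evaluation requires $N$ complex multiply-accumulate operations per output, which is what the FFT algorithms discussed next avoid.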

3.1.2. Twiddle Factors ($W_N^{nk}$)

Twiddle factors are complex exponential terms central to the DFT calculation. They are defined as: $ W_N^{nk} = e^{-j \frac{2\pi}{N} nk} $ These factors essentially represent rotations in the complex plane.

3.1.3. Fast Fourier Transform (FFT)

The Fast Fourier Transform (FFT) is an efficient algorithm to compute the Discrete Fourier Transform (DFT) and its inverse. The most widely used FFT algorithm was proposed by Cooley and Tukey. It dramatically reduces the number of operations required from $\mathcal{O}(N^2)$ for the direct DFT calculation to $\mathcal{O}(N \log_2 N)$ for the FFT. This efficiency is achieved by exploiting the symmetries and periodicity of the twiddle factors to decompose the DFT into smaller DFTs.
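The decomposition into smaller DFTs can be illustrated with a short recursive radix-2 sketch in Python (illustrative only; the architectures surveyed in the paper implement this flow graph in pipelined hardware form):

```python
import cmath

def fft(x):
    """Recursive radix-2 DIT Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    # One twiddle-factor rotation W_N^k per butterfly, applied to the odd half.
    W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]
    return ([even[k] + W[k] * odd[k] for k in range(N // 2)] +
            [even[k] - W[k] * odd[k] for k in range(N // 2)])

# An impulse at n = 1 yields the twiddle factors themselves:
X = fft([0, 1, 0, 0])
# X ~= [1, -1j, -1, 1j]
```

Each recursion level corresponds to one stage of the flow graphs shown below; a pipelined architecture maps each level to a hardware stage.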

3.1.4. Flow Graphs and Radix

FFT algorithms are often represented visually using flow graphs. A flow graph illustrates the computational steps, data dependencies, and data flow from input to output.

  • Stages: An $N$-point FFT typically consists of $n = \log_2 N$ stages for a radix-2 algorithm. Each stage involves butterflies and rotations.
  • Radix: The term radix refers to the base of the FFT decomposition.
    • Radix-2: The most common radix, where the basic computational unit (butterfly) combines two input samples to produce two output samples.
    • Radix-4: Combines four input samples to produce four output samples, often leading to fewer stages ($\log_4 N$) but larger butterflies.
    • Radix-$2^k$: A generalization where $k$ determines how rotations are distributed among the FFT stages, allowing hybrid approaches that balance butterfly and rotator complexity. For example, radix-$2^2$ often minimizes the number of non-trivial rotators.

3.1.5. Butterfly

A butterfly is the fundamental computational unit in FFT algorithms.

  • Radix-2 Butterfly: Takes two complex input samples, $A$ and $B$, multiplies $B$ by a twiddle factor $W_N^\phi$, and then computes $A + B \cdot W_N^\phi$ and $A - B \cdot W_N^\phi$. It involves one complex multiplication and two complex additions/subtractions. The following figure (Figure 3 from the original paper) shows the radix-2 butterfly:

    Figure 3 Radix-2 butterfly. The diagram shows the structure of the radix-2 butterfly unit, with input signals x[0] and x[1] and output signals X[0] and X[1].

  • Radix-4 Butterfly: Takes four complex input samples and performs a more complex set of operations involving multiple twiddle factor multiplications and additions/subtractions. The following figure (Figure 4 from the original paper) shows the Radix-4 butterfly:

    Figure 4 Radix-4 butterfly. The diagram shows the internal connections of the radix-4 butterfly, with input data x[0], x[1], x[2], and x[3] and, after the corresponding computations, outputs X[0], X[1], X[2], and X[3].
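Numerically, the radix-2 butterfly described above reduces to one rotation followed by a sum and a difference. A minimal Python sketch (the function name is illustrative, not from the paper):

```python
def butterfly_r2(a, b, w):
    """Radix-2 butterfly: one complex multiplication and two additions,
    computing A + B*W and A - B*W as described above."""
    t = w * b                  # the single twiddle-factor rotation
    return a + t, a - t
```

With `w = 1` (a trivial rotation) the butterfly degenerates into a plain sum and difference, which is why trivial rotations need no multiplier.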

3.1.6. Decimation in Frequency (DIF) and Decimation in Time (DIT)

These are two common ways to decompose the FFT algorithm.

  • Decimation in Frequency (DIF): The output sequence is divided (decimated) into smaller subsequences. The following figure (Figure 1 from the original paper) shows the flow graph of a 16-point radix-2 DIF FFT:

    Figure 1 Flow graph of a 16-point radix-2 DIF FFT. The flow graph shows the processing from the input data x[n] to the output data X[k], illustrating through the stage-by-stage connections the index correspondences and the structure and data flow of the DIF algorithm.

  • Decimation in Time (DIT): The input sequence is divided (decimated) into smaller subsequences. The following figure (Figure 2 from the original paper) shows the flow graph of a 16-point radix-2 DIT FFT:

    Figure 2 Flow graph of a 16-point radix-2 DIT FFT. The flow graph shows how the input signals are processed across the four stages, with the data exchanges and transformations at each stage leading to the final output sequence.

DIF and DIT algorithms for power-of-two FFT sizes primarily differ in the placement of twiddle factors (rotations) within the flow graph, while the butterfly structure remains largely similar.

3.1.7. Index Bit $b_{n-s}$

For an $N$-point FFT where $N = 2^n$, $n$ is the number of stages. Data are often indexed using their binary representation $I = b_{n-1} \ldots b_1 b_0$. The bit $b_{n-s}$ is crucial because, at stage $s$ of the FFT, the pairs of data processed by a butterfly differ only in this specific bit. This insight is fundamental for designing FFT architectures that correctly align data for butterfly operations.
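This pairing rule can be made concrete with a small Python sketch (the helper name `butterfly_pairs` is an assumption for illustration):

```python
def butterfly_pairs(N, s):
    """Index pairs processed together by the butterflies at stage s of an
    N-point radix-2 FFT: each pair differs only in bit b_{n-s}."""
    n = N.bit_length() - 1        # N = 2**n
    bit = 1 << (n - s)            # weight of bit b_{n-s}
    return [(i, i | bit) for i in range(N) if not i & bit]

# Stage 1 of a 16-point FFT pairs indices at distance 8 (bit b_3):
# (0, 8), (1, 9), ..., (7, 15); stage 4 pairs neighbors (0, 1), (2, 3), ...
```

The pair distance $2^{n-s}$ shrinks by half at every stage, which is exactly the buffer-length pattern seen in the serial architectures below.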

3.1.8. Rotations and Trivial Rotations

Rotations refer to the complex multiplications by twiddle factors. The term $\phi$ in $W_N^\phi$ determines the angle of rotation.

  • If $\phi = 0$, $W_N^0 = e^{-j0} = 1$. This is a trivial rotation ($0^\circ$), meaning no actual multiplication is needed, just passing the data.
  • If $\phi \in \{0, N/4, N/2, 3N/4\}$, the twiddle factors are $1, -j, -1, j$ respectively, corresponding to rotations of $0^\circ, 270^\circ, 180^\circ, 90^\circ$. These are called trivial rotations because they can be implemented simply by interchanging real and imaginary components and/or changing signs, without requiring a full complex multiplier. Architectures often try to maximize the use of trivial rotations to reduce hardware complexity.
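A minimal Python sketch of such a multiplier-free trivial rotator, assuming the angle is given as a quarter-turn index `q` (an illustrative parameter, where `q` selects $W_N^{qN/4}$):

```python
def trivial_rotate(re, im, q):
    """Rotation by a twiddle factor from {1, -j, -1, j} (q = 0..3 selects
    W_N^{q*N/4}): only swaps and sign changes, no multiplier needed."""
    if q == 0:
        return re, im              # W^0 = 1
    if q == 1:
        return im, -re             # W^{N/4} = -j  (rotate by -90 degrees)
    if q == 2:
        return -re, -im            # W^{N/2} = -1  (rotate by 180 degrees)
    return -im, re                 # W^{3N/4} = j  (rotate by +90 degrees)
```

In hardware, each branch corresponds to wiring swaps and sign inversions, which is why these rotations cost essentially nothing compared to a general rotator.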

3.1.9. Pipelined Architecture

A pipelined architecture breaks down a computational task into a series of smaller, sequential subtasks (stages), allowing different parts of multiple computations to be processed concurrently. In pipelined FFTs, data flows continuously from one stage to the next, with each stage performing a part of the FFT computation. This allows for high throughput (processing data quickly) by overlapping operations. The general structure of a pipelined FFT is shown below:

Figure 7 16-point radix-2 SDF FFT architecture. The diagram shows the four stages of the architecture, labeled stage 1 to stage 4. Each stage is built around an R2 module, with feedback buffers of lengths 8, 4, 2, and 1.

Each stage $s$ calculates the computations of one stage of the FFT algorithm. Each stage has $P$ inputs and $P$ outputs, and data flow in a continuous stream at a rate of $P$ data per clock cycle.

3.2. Previous Works

The paper itself is a survey, building upon and categorizing a vast body of previous work. Section 3.2, "Chronology," explicitly details the evolution of pipelined FFT architectures.

  • 1970: The birth of pipelined FFTs with the first SDC FFT by O'Leary [68] and the first SDF FFT by Groginsky and Works [44], both radix-2.
  • 1973: Introduction of the first MDC FFTs by Gold and Bially [42], initially limited to a parallelization ($P$) equal to the radix ($r$).
  • 1974-1989: Expansion to higher radices for SDF (radix-4 by Despain [23]; radix-16, which was effectively radix-$2^4$, by Despain [24]) and SDC (radix-4 by Bi and Jones [7]).
  • 1983: A significant advancement in MDC FFTs by Johnston [50], allowing any $P$ for radix-2 and thereby decoupling parallelization from the radix.
  • 1984: The delayed appearance of MDF FFTs by Wold and Despain [82].
  • 1998: Introduction of radix-$2^2$ for SDF FFTs by He and Torkelson [45], offering a better balance between butterfly and rotator utilization than radix-2 or radix-4.
  • 2006: The deep feedback SDF FFT by Yang et al. [85] achieved 100% utilization of butterflies and rotators, though at the cost of increased memory.
  • 2008-2015: Resurgence of SDC FFTs. Chang [9] solved the 50% utilization issue by splitting data into high and low parts. Liu et al. [65] and Wang et al. [81] further refined the SDC by splitting real/imaginary components, with Wang et al. introducing a rotator optimization.
  • 2009-2013: Advancements in MDC FFTs with radix-$2^2$ by Garrido [26] and its extension to radix-$2^k$ by Garrido et al. [33].
  • 2016: Introduction of the M2DF FFT by Wang et al. [80], improving MDF butterfly utilization by creating two opposite data flows. The SC FFT was also introduced by Garrido et al. [35], using serial-serial permutation for 100% utilization and small memory.
  • 2018: The SFF FFT by Ingemarsson and Gustafsson [47], a serial architecture using a double memory for small numbers of butterflies, rotators, and multiplexers.
  • 2020: The MSC FFT by Hsu et al. [46], the parallel version of the SC FFT.

3.3. Technological Evolution

The evolution of FFT hardware architectures has been driven by the continuous demand for higher performance (speed, throughput) and efficiency (reduced area, power, memory). Initially, the focus was on simply realizing the FFT algorithm in a pipelined fashion to achieve continuous data processing. The first SDF and SDC architectures in 1970 laid this groundwork. Early MDC architectures introduced parallelism but were limited by coupling parallelization with radix size.

Subsequent developments introduced higher radices (e.g., radix-4, radix-2^k) to optimize the trade-off between `butterfly` and `rotator` complexity. The decoupling of `parallelization` from `radix` in `MDC` architectures was a key step, allowing more flexible designs. A major theme in the evolution has been improving hardware `utilization`. Early `SDC` and `SDF` architectures often had low `butterfly utilization` (e.g., 50% or 25%). Innovations like `deep feedback SDF` and modifications to `SDC` (splitting high/low or real/imaginary parts) aimed to achieve $100\%$ `utilization` of `butterflies` and `rotators`, leading to more efficient designs. More recent advancements have focused on novel data flow management and `shuffling circuits`, such as those in `SC` and `SFF` architectures, to further optimize `memory usage`, `rotator complexity`, and overall `resource efficiency`. The `M2DF` architecture represents an evolution of `MDF` by introducing sophisticated data flow management to reuse hardware components. The technological timeline shows a clear progression from basic `pipelined` implementations to increasingly optimized designs that leverage `radix` choices, `parallelism`, and clever data manipulation to minimize `hardware resources` while maximizing `throughput`. ## 3.4. Differentiation Analysis Compared to general approaches to `FFT implementation` (e.g., iterative, fully parallel, software-based), this paper differentiates by focusing exclusively on `pipelined hardware architectures`. Within this specific domain, its core innovations lie in its systematic classification and detailed comparative analysis. * **Classification:** The paper's division of `pipelined FFT architectures` into `serial` (SDF, SDC, SFF, SC) and `parallel` (MDF, MDC, MSC) is a key differentiation. 
This structured approach allows for a direct comparison of architectures designed for similar throughput requirements, making the analysis more meaningful than a generic comparison across all `FFT hardware`. * **Educational Depth:** Unlike many surveys that primarily list architectures, this paper aims to explain *how they work*, including detailed descriptions of data flow, timing, and internal structures (e.g., `buffers`, `shuffling circuits`). This pedagogical focus sets it apart. * **Focus on $b_{n-s}$:** The paper highlights the critical role of the `index bit`b_{n-s} in every architecture's design. This provides a unified conceptual framework for understanding how different architectures manage data to feed butterflies correctly.

  • Detailed Resource Comparison: The comprehensive tables comparing complex adders, complex rotators, complex data memory, latency, and throughput across various radix and parallelization configurations for each architectural type offer a granular level of detail often missing in broader overviews. This allows for precise trade-off analysis.

  • Emphasis on Rotator Optimization: The paper explicitly discusses rotator complexity as a significant area of current research, detailing how different radices and data orders impact rotator count and implementation (e.g., trivial rotators, shift-and-add).

    In essence, while the individual FFT architectures themselves are prior work, this paper's innovation lies in its rigorous, educational, and structured approach to surveying and comparing them, providing a clear roadmap for understanding their design principles and performance characteristics.

4. Methodology

The paper's methodology is to systematically describe and compare the main pipelined FFT hardware architectures. It categorizes them into serial and parallel and then delves into the specifics of each type, explaining their operational principles, internal structures, data flow, timing, and hardware resource complexities. The explanations are reinforced with diagrams and timing charts from the original publications.

4.1. General Pipelined FFT Architecture

A pipelined FFT generally consists of $n$ stages connected in series. Data flow continuously from stage 1 to stage $n$. Each stage $s$ calculates all the computations corresponding to one stage of the FFT algorithm. Each stage has $P$ inputs and $P$ outputs, processing $P$ data per clock cycle in a continuous flow.

4.2. Serial Pipelined FFT Architectures

Serial pipelined FFT architectures process a continuous flow of one datum per clock cycle ($P = 1$).

4.2.1. The SDF FFT Architecture

The Single-path Delay Feedback (SDF) FFT architecture is characterized by its use of feedback loops containing memory buffers.

4.2.1.1. Radix-2 SDF FFT

A 16-point radix-2 SDF FFT architecture consists of $n$ stages. Each stage includes a radix-2 butterfly (R2), a rotator (represented by $\otimes$) or a trivial rotator (diamond-shaped), and a buffer. The length of the buffer at stage $s$ is $L = 2^{n-s}$.

The following figure (Figure 7 from the original paper) shows a 16-point radix-2 SDF FFT architecture:

Figure 7 16-point radix-2 SDF FFT architecture. The diagram shows the four stages of the architecture, labeled stage 1 to stage 4. Each stage is built around an R2 module, with feedback buffers of lengths 8, 4, 2, and 1.

The internal structure of a stage in a radix-2 SDF FFT is shown below. Data arrive in natural order (top to bottom of the flow graph). For each stage $s$, pairs of data that differ in bit $b_{n-s}$ arrive with a time difference of $2^{n-s}$ clock cycles. The buffer of length $L = 2^{n-s}$ delays one datum so that the two data required for the butterfly operation arrive simultaneously at its input. After computation, one output continues to the next stage, and the other is stored back into the buffer. The following figure (Figure 8 from the original paper) shows a stage of a radix-2 SDF FFT architecture:

Figure 8 Stage of a radix-2 SDF FFT architecture. (a) Conventional drawing. (b) Alternative drawing. The figure shows the same stage in the two drawing styles, with the input and output flows marked in both.
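The fill/feedback behavior of such a stage can be modeled behaviorally in Python (a simplified sketch that omits the rotator; all names are illustrative, not from the paper):

```python
def sdf_stage(x, L):
    """Behavioral sketch of one radix-2 SDF stage with a feedback buffer of
    length L (rotator omitted). The stream alternates between L 'fill' cycles,
    which load the buffer, and L 'butterfly' cycles, as described above."""
    y = []
    buf = [0] * L                     # the feedback shift register, initially empty
    for t, d in enumerate(x):
        old = buf.pop(0)              # datum delayed by L clock cycles
        if (t // L) % 2 == 0:
            y.append(old)             # drain the previous block's differences
            buf.append(d)             # store the incoming datum
        else:
            y.append(old + d)         # butterfly sum A + B goes to the next stage
            buf.append(old - d)       # difference A - B is fed back into the buffer
    return y

# First stage of a 4-point FFT (L = 2), with 2 flush cycles appended;
# after the L-cycle latency the outputs are x0+x2, x1+x3, x0-x2, x1-x3.
out = sdf_stage([1, 2, 3, 4, 0, 0], 2)
# out == [0, 0, 4, 6, -2, -2]
```

The model reproduces the timing described in the text: the butterfly adders are active only during half of the cycles, which is the 50% utilization discussed next.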

The butterfly processes data only when part A (from the buffer output) and part B (from the stage input) are available, which occurs 50% of the time, leading to 50% butterfly utilization. The other 50% of the time is used for data flow through the buffer. The following figure (Figure 9 from the original paper) shows the timing of a radix-2 SDF FFT architecture:

Figure 9 Timing of a radix-2 SDF FFT architecture.

Hardware Complexity for Radix-2 SDF:

  • Complex Adders (in butterflies): $2 \log_2 N$ (one radix-2 butterfly per stage, each having 2 complex adders).
  • Non-trivial Rotators: $\log_2 N - 2$ (rotators in all stages except the last two, which are trivial).
  • Complex Data Memory: $N - 1$ (sum of the buffer lengths $L = 2^{n-s}$ for $s = 1 \dots n$).
  • Throughput: 1 datum/cycle.
  • Latency: $N - 1$ cycles.
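These counts can be tabulated for any power-of-two $N$ with a short Python helper (names illustrative; the formulas are those listed above):

```python
import math

def sdf_r2_resources(N):
    """Resource and performance figures for an N-point radix-2 SDF FFT,
    following the formulas listed above."""
    n = int(math.log2(N))
    return {
        "complex_adders": 2 * n,       # one R2 butterfly (2 adders) per stage
        "nontrivial_rotators": n - 2,  # the last two stages are trivial
        "memory_words": N - 1,         # sum of 2^{n-s} for s = 1..n
        "latency_cycles": N - 1,
        "throughput": 1,               # data per clock cycle
    }

# For N = 16: 8 adders, 2 non-trivial rotators, 15 memory words, latency 15.
```

Such a helper makes the linear growth of memory and latency with $N$, versus the logarithmic growth of adders and rotators, easy to see at a glance.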

4.2.1.2. Radix-4 SDF FFT

A radix-4 SDF FFT architecture uses radix-4 butterflies. The number of stages is $n = \log_4 N$. Each stage uses one radix-4 butterfly and three buffers of length $L = 4^{n-s}$. The butterfly processes four data that are equally separated in time. The following figure (Figure 10 from the original paper) shows a 16-point radix-4 SDF FFT architecture:

Figure 10 16-point radix-4 SDF FFT architecture. The diagram shows the two-stage processing structure, with multiple buffers and R4 computation units.

The internal structure of a stage is similar to radix-2, but adapted for four inputs. The following figure (Figure 11 from the original paper) shows a stage of a radix-4 SDF FFT architecture:

Figure 11 Stage of a radix-4 SDF FFT architecture. The diagram shows the signal lines and logic units of the stage, illustrating how the input data flow, are combined at the different nodes, and are output.

The radix-4 butterfly has a utilization of 25%. While this is lower than radix-2, the reduction in the number of stages ($\log_4 N$ vs. $\log_2 N$) reduces the total number of rotators. The following figure (Figure 12 from the original paper) shows the timing of a radix-4 SDF FFT architecture:

Figure 12 Timing of a radix-4 SDF FFT architecture.

Hardware Complexity for Radix-4 SDF:

  • Complex Adders (in butterflies): $8 \log_4 N$ (one radix-4 butterfly per stage, each having 8 complex adders).
  • Non-trivial Rotators: $\log_4 N - 1$.
  • Complex Data Memory: $N - 1$.
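A quick way to compare the radix-2 and radix-4 SDF figures above for a given $N$ (Python sketch; names are illustrative):

```python
import math

def sdf_counts(N, radix):
    """Adder and non-trivial rotator counts for N-point SDF FFTs,
    per the radix-2 and radix-4 formulas listed above."""
    n2 = int(math.log2(N))
    if radix == 2:
        return {"adders": 2 * n2, "rotators": n2 - 2}
    if radix == 4:
        n4 = n2 // 2                  # number of radix-4 stages, log4(N)
        return {"adders": 8 * n4, "rotators": n4 - 1}
    raise ValueError("radix must be 2 or 4")

# For N = 256: radix-2 gives 16 adders / 6 rotators,
# while radix-4 gives 32 adders / 3 rotators.
```

The comparison makes the trade-off of the text explicit: radix-4 halves the rotator count at the price of more adders and lower butterfly utilization.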

4.2.1.3. Radix-$2^2$ SDF FFT

A radix-$2^2$ SDF FFT architecture operates similarly to a radix-2 SDF FFT in terms of data management and butterfly operations. The key difference lies in the rotators: radix-$2^2$ algorithms are designed such that odd stages only involve trivial rotations, which significantly reduces rotator complexity. The following figure (Figure 13 from the original paper) shows a 16-point radix-$2^2$ SDF FFT architecture:

Figure 13 16-point radix-$2^2$ SDF FFT architecture.

Hardware Complexity for Radix-$2^2$ SDF:

  • Complex Adders (in butterflies): $2 \log_2 N$ (same as radix-2).
  • Non-trivial Rotators: $\frac{1}{2} \log_2 N - 1$ (half the number of non-trivial rotators compared to radix-2).
  • Complex Data Memory: $N - 1$ (same as radix-2).

4.2.1.4. Deep Feedback SDF FFT

The deep feedback SDF FFT architecture improves the butterfly utilization to 100%. It uses a modified stage structure with three buffers and multiplexers to route data, allowing the radix-2 butterfly to be reused at different time instants for different parts of the computation. The following figure (Figure 14 from the original paper) shows a 16-point deep feedback radix-2 SDF FFT architecture:

Figure 14 16-point deep feedback radix-2 SDF FFT architecture. The diagram shows the two stages of the architecture: the first stage contains buffers of lengths 8, 4, and 4, and the second stage buffers of lengths 2, 1, and 1, together with two R2 modules.

A stage in this architecture includes three buffers, one with double the length of the others. Data flow through these buffers, and multiplexers route the correct data to the butterfly at specific clock cycles, ensuring no overlap in butterfly calculations.

The following figure (Figure 15 from the original paper) shows a stage of a deep feedback radix-2 SDF FFT architecture:

Figure 15 Stage of a deep feedback radix-2 SDF FFT architecture. The diagram shows the three buffers (Buffer 1, Buffer 2, and Buffer 3) and the associated connections and operators, illustrating how the data flow is organized to improve resource utilization and efficiency.

The timing diagram explicitly shows how the radix-2 butterfly is in use 100% of the time by interleaving calculations from different parts of the input stream. The following figure (Figure 16 from the original paper) shows the timing of a deep feedback radix-2 SDF FFT architecture:

Figure 16 Timing of a deep feedback radix-2 SDF FFT architecture.

Hardware Complexity for Deep Feedback SDF:

  • Complex Adders (in butterflies): $2 \log_4 N$ (fewer than the standard radix-2 SDF due to 100% utilization).
  • Non-trivial Rotators: $\log_4 N - 1$.
  • Complex Data Memory: $4(N-1)/3$ (increased memory compared to the standard SDF).

4.2.2. The SDC FFT Architecture

The Single-path Delay Commutator (SDC) FFT architecture also processes serial data but is conceptually based on a 2-parallel MDC FFT.

4.2.2.1. Radix-2 SDC FFT (Original)

The SDC FFT makes data parallel internally for processing. It receives serial data, splits them into two parallel streams (e.g., first half of the data to one path, second half to the other), processes them, and then serializes the output again. The underlying principle can be understood from a 2-parallel MDC FFT.
The following figure (Figure 17 from the original paper) shows a 16-point 2-parallel radix-2 MDC FFT architecture:

Figure 17 16-point 2-parallel radix-2 MDC FFT architecture. The diagram shows stages 1 to 4, including the handling of the input data and the computations, illustrating the computation flow and the parallel processing capability.

In this MDC setup, shuffling circuits (two buffers and two multiplexers) are used between stages to reorder data. For instance, at stage 1, the shuffling circuit exchanges the positions of data $(7,6,5,4)$ and $(11,10,9,8)$ by delaying one path. The data order ensures that data differing in bit $b_{n-s}$ arrive in parallel at the butterfly inputs. The following figure (Figure 18 from the original paper) shows the timing of a 16-point 2-parallel radix-2 MDC FFT architecture:

Figure 18 Timing of a 16-point 2-parallel radix-2 MDC FFT architecture, showing how the input data are reordered and computed across the stages to form the output sequence.

The SDC FFT architecture is derived from this MDC structure. The following figure (Figure 19 from the original paper) shows a 16-point radix-2 SDC FFT architecture:

Figure 19 16-point radix-2 SDC FFT architecture.

A consequence of this parallelization strategy is that the architecture only works 50% of the time, as the other 50% is spent receiving and outputting data in serial form.

Hardware Complexity for Original Radix-2 SDC:

  • Complex Adders (in butterflies): $2 \log_2 N$.
  • Non-trivial Rotators: $\log_2 N - 2$.
  • Complex Data Memory: $2N - 2$.

4.2.2.2. Radix-2 SDC FFT with Data Splitting (High/Low Parts)

To address the low 50% utilization, this improved SDC FFT architecture splits the data into high (H) and low (L) parts, allowing for 100% utilization. It includes pre-processing and post-processing stages and multiplexers in both branches. The following figure (Figure 20 from the original paper) shows a 16-point radix-2 SDC FFT architecture that divides the data in high and low parts:

Figure 20 16-point radix-2 SDC FFT architecture that divides the data in high and low parts. The diagram shows the computation units and signal flow of stages 0 to 4.

The timing diagram of Figure 18 is re-interpreted, assuming $H$ and $L$ refer to these split parts. The shuffling circuits adapt the input order for each stage. The butterfly and rotator complexity is reduced because they process half the data word length at each clock cycle.

Hardware Complexity for Radix-2 SDC (High/Low Split):

  • Complex Adders (in butterflies): $\log_2 N$ (effectively, as the adders operate on half the data word length).
  • Non-trivial Rotators: $\log_2 N - 2$ (after the half-word-length correction).
  • Complex Data Memory: $3N/2$ (after the half-word-length correction).

4.2.2.3. Radix-2 SDC FFT with Data Splitting (Real/Imaginary Parts)

An alternative to splitting high/low bits is splitting the input data into its real (R) and imaginary (I) parts. The timing remains the same as in Figure 18. Each rotator receives the real and imaginary components of the same datum in consecutive clock cycles through the same branch, which allows for a specific rotator circuit design.
The following figure (Figure 21 from the original paper) shows a 16-point radix-2 SDC FFT architecture that divides the data in real and imaginary parts:

Figure 21 16-point radix-2 SDC FFT architecture that divides the data in real and imaginary parts, showing the processing stages and the signal flow.

The rotator circuit places the real and imaginary components in parallel for rotation and then restores their order. The following figure (Figure 22 from the original paper) shows a rotator in an SDC FFT architecture that divides the data in real and imaginary parts:

Figure 22 Rotator in an SDC FFT architecture that divides the data in real and imaginary parts.

Hardware Complexity for Radix-2 SDC (Real/Imaginary Split):

  • Complex Adders (in butterflies): $\log_2 N$.
  • Non-trivial Rotators: $\log_2 N - 2$.
  • Complex Data Memory: Slightly larger than $3N/2$ (due to the registers for order adaptation).

4.2.2.4. Radix-2 SDC FFT with Optimized Real/Imaginary Rotator

A further optimization, noted in [81], is that only data flowing through the lower branch of the architecture need to be rotated when splitting real/imaginary parts. This allows for a simpler rotator design that processes the rotation over two consecutive clock cycles, halving its complexity. The following figure (Figure 23 from the original paper) shows a 16-point radix-2 SDC FFT architecture that divides the data in real and imaginary parts:

Figure 23 16-point radix-2 SDC FFT architecture that divides the data in real and imaginary parts, showing the processing stages and the signal flow.

The following figure (Figure 24 from the original paper) shows the rotator used in this optimized version:

Figure 24 Alternative rotator in a radix-2 SDC FFT architecture that divides the data in real and imaginary parts.

Hardware Complexity for Radix-2 SDC (Optimized Real/Imaginary Rotator):

  • Complex Adders (in butterflies): $\log_2 N$.
  • Non-trivial Rotators: $\frac{1}{2} \log_2 N - 1$.
  • Complex Data Memory: $3N/2$.

4.2.3. The SFF FFT Architecture

The Single-stream Feedforward (SFF) FFT architecture also receives data in natural order, similar to the SDF. Its key characteristic is that it calculates the addition and subtraction parts of the butterfly operation at different time instants, reusing the same adder/subtractor. The following figure (Figure 25 from the original paper) shows a 16-point radix-2 SFF FFT architecture:

Figure 25 16-point radix-2 SFF FFT architecture, showing the four stages and the processing units used in each.

Each stage in the SFF requires two buffers of length $L = 2^{n-s}$. This allows accessing data twice: first, from the input and the point between the buffers for the butterfly addition, and then from the point between the buffers and the output of the second buffer for the butterfly subtraction. The following figure (Figure 26 from the original paper) shows the timing of a radix-2 SFF FFT architecture:

Figure 26 Timing of a radix-2 SFF FFT architecture, showing how the data flow through the two buffers of length $L = 2^{n-s}$ and when the buffer outputs are used.

Rotators in the SFF FFT are similar to those in the SDF FFT and receive data from a single stream. Radix-$2^2$ can be applied to the SFF to make odd-stage rotators trivial, reducing rotator complexity.
Hardware Complexity for Radix-2 SFF:

  • Complex Adders (in butterflies): $\log_2 N$.
  • Non-trivial Rotators: $\log_2 N - 2$ (for radix-2) or $1/2 \cdot \log_2 N - 1$ (for radix-$2^2$).
  • Complex Data Memory: $2N - 2$.
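As a quick sanity check of the memory figure above, the total follows from summing the two buffers of length $L = 2^{n-s}$ at each of the $n$ stages. A minimal Python sketch (the function name is mine, not from the paper):

```python
def sff_memory(N: int) -> int:
    """Total complex data memory of a radix-2 SFF FFT of size N = 2**n:
    each stage s = 1..n uses two buffers of length L = 2**(n - s)."""
    n = N.bit_length() - 1  # n = log2(N), assuming N is a power of two
    return sum(2 * 2 ** (n - s) for s in range(1, n + 1))

# The geometric sum collapses to 2N - 2, matching the complexity listed above.
print(sff_memory(16))  # 30 = 2*16 - 2
```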

4.2.4. The SC FFT Architecture

The Serial Commutator (SC) FFT architecture uses specialized shuffling circuits for serial-to-serial permutations. The following figure (Figure 27 from the original paper) shows a 16-point radix-2 serial commutator FFT:

Figure 27 16-point radix-2 serial commutator (SC) FFT architecture. The figure shows the processing elements (PE1, PE2, PE3, PE4) and how the stages are connected; the architecture improves the efficiency of the FFT computation through several levels of processing.

At each stage of the SC FFT, data that differ in bit $b_{n-s}$ arrive in consecutive clock cycles. A delay of one clock cycle is applied to half of the data to align them for the butterfly. After the computation, another one-clock-cycle delay restores the serial flow. The following figure (Figure 28 from the original paper) shows the timing of a stage in the architecture:

Figure 28 Timing of a stage of the radix-2 SC FFT architecture. The figure shows the processing of the two steps A and B and their distribution along the time axis, for the case L = 1.

The internal structure of an SC FFT stage separates the real and imaginary parts of the serial input. The butterfly first adds/subtracts real parts, then imaginary parts in the next clock cycle, meaning it only needs one real adder and one real subtractor. Similarly, the rotator processes data over two consecutive clock cycles, effectively halving its complexity (e.g., two multipliers instead of four). The following figure (Figure 29 from the original paper) shows the internal structure of a stage in an SC FFT:

Figure 29 Internal structure of a stage of the radix-2 SC FFT architecture.

Hardware Complexity for Radix-2 SC:

  • Complex Adders (in butterflies): $\log_2 N$.
  • Non-trivial Rotators: $1/2 \cdot \log_2 N - 1$.
  • Complex Data Memory: $\approx N$.
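The two-cycle operation of the SC butterfly can be mimicked in software. The sketch below is my own illustration (not code from the survey) of why one real adder and one real subtractor suffice when the real parts and the imaginary parts of a data pair are processed in consecutive clock cycles:

```python
def sc_butterfly(a: complex, b: complex) -> tuple[complex, complex]:
    """Radix-2 butterfly computed over two 'clock cycles':
    cycle 0 handles the real parts, cycle 1 the imaginary parts,
    so only one real adder and one real subtractor are needed."""
    sums, diffs = [], []
    for pa, pb in ((a.real, b.real), (a.imag, b.imag)):  # one part pair per cycle
        sums.append(pa + pb)   # the single real adder
        diffs.append(pa - pb)  # the single real subtractor
    return complex(*sums), complex(*diffs)

print(sc_butterfly(1 + 2j, 3 - 1j))  # ((4+1j), (-2+3j))
```

The result matches the usual complex butterfly outputs a + b and a − b, just computed with half the arithmetic per cycle.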

4.3. Parallel Pipelined FFT Architectures

Parallel pipelined FFT architectures process $P$ data per clock cycle, where $P > 1$.

4.3.1. The MDF FFT Architecture

The Multi-path Delay Feedback (MDF) FFT architecture is the parallel version of the SDF FFT. It can be conceptualized as unfolding an SDF architecture to handle multiple parallel data streams.
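As a toy model of this unfolding, a serial stream can be deinterleaved into P parallel paths; for P = 2 this sends even indices to the upper path and odd indices to the lower path. This only illustrates the data distribution, not the full architecture, and the function name is hypothetical:

```python
def to_parallel_paths(x: list, P: int) -> list[list]:
    """Deinterleave a serial stream into P parallel paths:
    path k receives the samples whose index is congruent to k mod P."""
    return [x[k::P] for k in range(P)]

paths = to_parallel_paths(list(range(8)), 2)
print(paths)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```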

4.3.1.1. P-parallel Radix-2 MDF FFT

In an MDF FFT, the data order is managed across parallel paths. For a 2-parallel MDF, even-indexed data flow through an upper path and odd-indexed data through a lower path. This parallelization reduces the length of the buffers in the early stages by a factor of $P$ with respect to the SDF. Later stages combine data from the parallel streams as required by bit $b_{n-s}$. The following figure (Figure 30 from the original paper) shows a 16-point 2-parallel radix-2 MDF FFT architecture:

Figure 30 16-point 2-parallel radix-2 MDF FFT architecture. The figure shows the processing flow, functional units, and connections of the four stages, with several R2 units operating in parallel.

The timing diagram shows how data is ordered in parallel paths across stages. The following figure (Figure 31 from the original paper) shows the timing of a 16-point 2-parallel radix-2 MDF FFT architecture:

Figure 31 Timing of a 16-point 2-parallel radix-2 MDF FFT architecture. The figure is a table that lists, for stages 1 to 4, the data sequences in the two parallel paths.

A 4-parallel radix-2 MDF FFT further extends this parallelization. The following figure (Figure 32 from the original paper) shows a 16-point 4-parallel radix-2 MDF FFT architecture:

Figure 32 16-point 4-parallel radix-2 MDF FFT architecture. The figure shows how the input data are processed through the stages by the cascaded processing units to produce the output sequence.

The timing diagram for a 4-parallel MDF illustrates how the first stages operate much like an SDF but with buffer lengths scaled by $P$, while later stages explicitly combine data from the parallel streams. The following figure (Figure 33 from the original paper) shows the timing of a 16-point 4-parallel radix-2 MDF FFT architecture:

Figure 33 Timing of a 16-point 4-parallel radix-2 MDF FFT architecture. The figure is divided into four stages and lists the operation sequence and data flow of each stage (stages 1 to 3 on the left, stage 4 on the right).

Radix-$2^2$ can be applied to MDF FFTs to make the rotators at odd stages trivial, halving the rotator complexity.

General Hardware Complexity for P-parallel Radix-2 MDF:

  • Complex Adders (in butterflies): $2P \log_2 N - P \log_2 P$.
  • Non-trivial Rotators: $P \log_2 N - P/2 \cdot \log_2 P - P - [1]^*$, where $[1]^*$ applies only for $P=2$.
  • Complex Data Memory: $N - P$.
  • Throughput: $P$ data/cycle.
  • Latency: $N/P$ cycles.

4.3.1.2. The M2DF FFT Architecture

The M2DF FFT architecture (Mixed-Decimation MDF) increases the utilization of the butterflies and rotators to 100% by creating two data flows in opposite directions. These flows do not overlap in time, which allows the hardware components to be shared. One of the data flows needs to enter the pipeline in bit-reversed order. The following figure (Figure 34 from the original paper) shows a 16-point 2-parallel radix-2 M2DF FFT architecture:

![Figure 34 16-point 2-parallel radix-2 M2DF FFT architecture. The boxes labeled with BR are circuits to calculate the bit reversal [31].](/files/papers/694529123856bdce7b4e2fb4/images/31.jpg)

*The figure shows the 16-point 2-parallel radix-2 M2DF FFT architecture; the boxes labeled BR are the circuits that calculate the bit reversal.*

The timing diagram of the M2DF shows how the upper flow (stages 1, 2, 3, BR, 4) and the lower flow (BR, 3, 2, 1, 4) operate without overlapping at any common butterfly or rotator stage, thus achieving high utilization. The following figure (Figure 35 from the original paper) shows the timing of a 16-point 2-parallel radix-2 M2DF FFT architecture:

![Figure 35 Timing of a 16-point 2-parallel radix-2 M2DF FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/32.jpg)

Hardware Complexity for 2-parallel Radix-2 M2DF:

  • Complex Adders (in butterflies): $2 \log_2 N$.
  • Non-trivial Rotators: $\log_2 N - 2$.
  • Complex Data Memory: $\approx 2N$ (includes the buffers at the stages and the bit reversal circuits).

4.3.2. The MDC FFT Architecture

The Multi-path Delay Commutator (MDC) FFT architecture is a parallel extension of the SDC concept. It uses shuffling circuits to reorder the data among the parallel paths.

4.3.2.1. P-parallel Radix-2 MDC FFT

In a 4-parallel radix-2 MDC FFT, shuffling circuits are employed at each stage to ensure that data differing in bit $b_{n-s}$ arrive at the inputs of the butterflies. The following figure (Figure 36 from the original paper) shows a 16-point 4-parallel radix-2 MDC FFT architecture:

![Figure 36 16-point 4-parallel radix-2 MDC FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/33.jpg)

*The figure shows the 16-point 4-parallel radix-2 MDC FFT architecture, divided into four stages. Each stage uses several R2 modules and multipliers to process the complex input data.*

The timing diagram illustrates the data order in the four parallel paths at each stage. The following figure (Figure 37 from the original paper) shows the timing of a 16-point 4-parallel radix-2 MDC FFT architecture:

![Figure 37 Timing of a 16-point 4-parallel radix-2 MDC FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/34.jpg)

*The figure shows, for each of the four stages, the order of the data in the parallel paths.*

In an 8-parallel radix-2 MDC FFT, shuffling circuits may not be needed at the early stages if bit $b_{n-s}$ can be handled by reorganizing the parallel streams directly. The following figure (Figure 38 from the original paper) shows a 16-point 8-parallel radix-2 MDC FFT architecture:

![Figure 38 16-point 8-parallel radix-2 MDC FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/36.jpg)

*The figure shows the 16-point 8-parallel radix-2 MDC FFT architecture and the signal flow from Stage 1 to Stage 4, where R2 denotes a radix-2 butterfly.*

The timing diagram for the 8-parallel case shows the parallel data flow. The following figure (Figure 39 from the original paper) shows the timing of a 16-point 8-parallel radix-2 MDC FFT architecture:

![Figure 39 Timing of a 16-point 8-parallel radix-2 MDC FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/35.jpg)

General Hardware Complexity for P-parallel Radix-2 MDC:

  • Complex Adders (in butterflies): $P \log_2 N$.
  • Non-trivial Rotators: $P/2 \cdot (\log_2 N - 2)$.
  • Complex Data Memory: $N - P$.
  • Throughput: $P$ data/cycle.
  • Latency: $N/P$ cycles.

4.3.2.2. P-parallel Radix-$2^2$ MDC FFT

Using radix-$2^2$ in MDC FFTs means that the odd stages only require trivial rotators, which significantly reduces the overall rotator complexity. The following figure (Figure 40 from the original paper) shows a 16-point 4-parallel radix-$2^2$ MDC FFT architecture:

Figure 40 16-point 4-parallel radix-$2^2$ MDC FFT architecture. The figure shows the four stages of the architecture.

The timing diagram is similar, but the rotator placement is optimized. The following figure (Figure 41 from the original paper) shows the timing of a 16-point 4-parallel radix-$2^2$ MDC FFT architecture:

Figure 41 Timing of a 16-point 4-parallel radix-$2^2$ MDC FFT architecture. The figure shows, for each of the four stages, the order in which the data arrive at the parallel paths.

An 8-parallel version also benefits from the radix-$2^2$ rotator optimization. The following figure (Figure 42 from the original paper) shows a 16-point 8-parallel radix-$2^2$ MDC FFT architecture:

![Figure 42 16-point 8-parallel radix-2^2 MDC FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/39.jpg)

*The figure shows the 16-point 8-parallel radix-$2^2$ MDC FFT architecture, divided into four stages; each stage contains R2 units and multipliers that process the data flow.*

The following figure (Figure 43 from the original paper) shows the timing of a 16-point 8-parallel radix-$2^2$ MDC FFT architecture:

![Figure 43 Timing of a 16-point 8-parallel radix-2^2 MDC FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/40.jpg)

General Hardware Complexity for P-parallel Radix-$2^2$ MDC:

  • Complex Adders (in butterflies): $P \log_2 N$.
  • Non-trivial Rotators: $3P/8 \cdot (\log_2 N - 2)$.
  • Complex Data Memory: $N - P$.

4.3.3. The MSC FFT Architecture

The Multi-path Serial Commutator (MSC) FFT architecture is the parallel version of the SC FFT. It combines SC-like stages and MDC-like stages. The following figure (Figure 44 from the original paper) shows a 16-point 4-parallel radix-2 MSC FFT architecture:

![Figure 44 16-point 4-parallel radix-2 MSC FFT architecture.](/files/papers/694529123856bdce7b4e2fb4/images/41.jpg)

*The figure shows the 16-point 4-parallel radix-2 MSC FFT architecture, with its processing elements (PE) and the associated addition and multiplication operations.*

In the MSC, some stages process the data as in an SC FFT (the samples processed together in the butterflies arrive in consecutive clock cycles), while others process the data MDC-style (the data arrive simultaneously). In the SC-like stages, only half the butterflies and rotators are needed, due to the serial processing of the real and imaginary parts. The timing diagram shows this hybrid nature. The following figure (Figure 45 from the original paper) shows the timing of a 16-point 4-parallel radix-2 MSC FFT architecture:

![Figure 45 Timing of a 16-point 4-parallel radix-2 MSC FFT architecture.](images/45.jpg)

Hardware Complexity for Radix-2 MSC:

  • Complex Adders (in butterflies): $P \log_2 N$.
  • Non-trivial Rotators: $P/2 \cdot (\log_2 N - 2)$.
  • Complex Data Memory: $\approx N - P$.

4.4. Rotations in FFT Architectures

Understanding rotations is crucial for FFT architecture design.

4.4.1. Obtaining Rotation Coefficients

Rotation coefficients ($W_N^\phi$) are determined from the FFT flow graph and the timing diagram of the architecture. The timing diagram provides the data indices ($I$) at any given time and stage, and the flow graph specifies the rotations for each stage and index. By mapping indices from the timing diagram to the flow graph, the exact values of $\phi$ can be found.

For example, consider the 4-parallel radix-2 MDC FFT (Figure 36), its timing (Figure 37), and a radix-2 DIF FFT flow graph (Figure 1). At stage 2, data with indices $(12, 13, 14, 15)$ flow through the lowest path. From Figure 1, the rotations for these indices at stage 2 are $\phi = (0, 2, 4, 6)$. For $N=16$ and $\phi=4$, the twiddle factor is

$$W_{N}^\phi = e^{-j \frac{2\pi}{N} \phi} = e^{-j \frac{2\pi}{16} 4} = e^{-j \frac{\pi}{2}} = -j$$

This means that the rotator at the lowest path of stage 2 must rotate the datum with index 14 by $-j$.

For a mathematical derivation in the case of a radix-2 DIF FFT, $\phi$ can be obtained as

$$\phi_s(I) \equiv b_{n-s} \cdot 2^{s-1} \cdot [b_{n-s-1} \ldots b_0],$$

where $I = b_{n-1} \ldots b_1 b_0$ is the binary representation of the index. For stage $s=2$ and $N=16$ ($n=4$):

$$\phi_2(I) \equiv b_2 \cdot 2^{1} \cdot [b_1 b_0]$$

Using this for $I = (12, 13, 14, 15)$ (binary: 1100, 1101, 1110, 1111):

  • For $I=12$ ($1100_2$): $b_2=1$, $[b_1b_0]=00 \Rightarrow \phi_2(12) = 1 \cdot 2^1 \cdot 0 = 0$.
  • For $I=13$ ($1101_2$): $b_2=1$, $[b_1b_0]=01 \Rightarrow \phi_2(13) = 1 \cdot 2^1 \cdot 1 = 2$.
  • For $I=14$ ($1110_2$): $b_2=1$, $[b_1b_0]=10 \Rightarrow \phi_2(14) = 1 \cdot 2^1 \cdot 2 = 4$.
  • For $I=15$ ($1111_2$): $b_2=1$, $[b_1b_0]=11 \Rightarrow \phi_2(15) = 1 \cdot 2^1 \cdot 3 = 6$.

These values, $\phi = (0, 2, 4, 6)$, match those obtained from the flow graph.

4.4.2. Implementation of Rotators

The implementation of a rotator depends on the angles it needs to process.

  • Trivial Rotator: If all the angles are trivial ($0°$, $90°$, $180°$, $270°$), the rotator is implemented using simple sign changes and real/imaginary component swaps.
  • General Rotator: For arbitrary angles, the rotator is typically implemented using a complex multiplier or the CORDIC algorithm [78].
  • Low-Complexity Rotators: If the number of different angles is small, shift-and-add operations can be used to implement the rotators, reducing their complexity [39].

4.4.3. Storage of Rotations

  • ROM: For general rotators, the twiddle factors are often stored in a read-only memory (ROM).
  • Memoryless CORDIC: Alternatively, CORDIC can be implemented without memory [30]. This is beneficial for large FFTs, where the ROM size would be prohibitive [51].
  • Control Signal Generation: For shift-and-add rotators, the control signals can be generated on the fly without ROM storage [39].

4.4.4. Optimization Focus

Current research in FFT architectures often prioritizes optimizing the amount and complexity of the rotators. This involves exploring different data orders to minimize the number of rotators and simplify their design, as seen in recent works [34, 37, 38].

5. Experimental Setup

As a survey, this article does not present new experimental results from a methodology of its own. Instead, it aggregates and compares the characteristics of existing FFT hardware architectures based on reported hardware resources and performance metrics. Therefore, sections like "Datasets" and "Baselines" in the traditional experimental sense are not directly applicable to the paper's own research; they instead describe the characteristics by which the surveyed papers are compared.

5.1. Datasets

The concept of "datasets" is not applicable to this survey paper. The paper reviews FFT hardware architectures, which are general computational engines, not models trained or evaluated on specific data. The designs are analyzed for their efficiency in performing the mathematical FFT operation for a given size $N$, regardless of the specific input data $x[n]$. The paper focuses on architectures for complex input data and power-of-two FFT sizes, which are generic requirements for many signal processing applications.

5.2. Evaluation Metrics

The paper evaluates and compares pipelined FFT hardware architectures based on several key metrics related to hardware resource utilization and performance. These metrics are standard in hardware design for FFT processors.

  1. Complex Adders (in butterflies)

    • Conceptual Definition: This metric quantifies the total number of complex addition/subtraction units required within the butterfly stages of the FFT architecture. A complex adder performs two real additions, one for the real parts and one for the imaginary parts. The goal is to minimize this count, as adders are fundamental computational units.
    • Mathematical Formula: Not a single formula, but a count based on the number of butterflies and their radix. A radix-2 butterfly involves two complex adders; a radix-4 butterfly involves more. The paper reports the total number of complex adders for the entire FFT.
    • Symbol Explanation: $n$ represents $\log_2 N$. The counts are functions of $n$ and the parallelization factor $P$.
  2. Complex Rotators

    • Conceptual Definition: This metric counts the number of non-trivial complex rotators (complex multipliers) required in the FFT architecture. Rotators implement the multiplication by the twiddle factors. Trivial rotators (multiplications by $\pm 1$, $\pm j$) are excluded from this count because they can be implemented with simple sign changes and swaps rather than full complex multipliers. Minimizing complex rotators is critical, as they are generally the most resource-intensive components.
    • Mathematical Formula: Not a single formula, but a count based on the number of stages, the radix, and specific architectural optimizations (e.g., radix-$2^2$ reduces the number of rotators).
    • Symbol Explanation: $n$ represents $\log_2 N$. The counts are functions of $n$ and the parallelization factor $P$.

  3. Complex Data Memory

    • Conceptual Definition: This metric represents the total amount of memory (e.g., buffers, FIFOs, registers) required to store intermediate data within the pipelined architecture. This memory is crucial for delaying and shuffling the data to ensure correct butterfly input alignment. Minimizing memory is important for reducing hardware area and power consumption.
    • Mathematical Formula: Not a single formula, but typically the sum of the lengths of all internal buffers (or delays).
    • Symbol Explanation: $N$ is the FFT size, $P$ is the parallelization factor. The memory is often expressed in terms of $N$ and $P$.
  4. Latency (cycles)

    • Conceptual Definition: Latency is the number of clock cycles from the moment the first input datum of an FFT enters the pipeline until its first output appears. Lower latency is desirable for real-time applications.
    • Mathematical Formula: Varies significantly by architecture, often a function of $N$ and $P$.
    • Symbol Explanation: $N$ is the FFT size, $P$ is the parallelization factor.
  5. Throughput (data/cyc.)

    • Conceptual Definition: Throughput indicates how many data samples the FFT architecture can accept and produce per clock cycle once the pipeline is full and operating steadily.
    • Mathematical Formula: For serial architectures, 1 data/cycle; for parallel architectures, $P$ data/cycle.
    • Symbol Explanation: $P$ is the parallelization factor.
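Putting the last two metrics together, a small helper (hypothetical, for illustration only) evaluates the throughput and the approximate latency of a P-parallel pipelined FFT; the tables in the survey report N − 1 cycles for the serial designs and N/P for the parallel ones, so N/P is used as the common approximation here:

```python
def pipeline_metrics(N: int, P: int) -> dict:
    """Throughput in data/cycle and approximate latency in cycles of a
    P-parallel pipelined FFT (P = 1 corresponds to a serial architecture)."""
    return {"throughput": P, "latency_cycles": N // P}

print(pipeline_metrics(1024, 1))  # {'throughput': 1, 'latency_cycles': 1024}
print(pipeline_metrics(1024, 4))  # {'throughput': 4, 'latency_cycles': 256}
```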

5.3. Baselines

The paper itself serves as a comprehensive comparison of pipelined FFT hardware architectures. Each architectural type and its variations (e.g., SDF, SDC, SFF, SC, MDF, MDC, MSC, and their radix implementations such as radix-2, radix-4, radix-$2^2$, and radix-$2^k$) therefore acts as a baseline against which the others are implicitly or explicitly compared. The comparison tables (Tables 3 and 4) highlight the trade-offs between these approaches across the defined evaluation metrics. The purpose is not to declare a single "best" architecture, but to identify the optimal choice under different design constraints (e.g., lowest memory, fewest rotators, highest throughput).

6. Results & Analysis

The paper's results are presented through two detailed comparison tables, one for serial pipelined FFT architectures and one for parallel pipelined FFT architectures. These tables summarize the hardware resources and performance characteristics of various designs, enabling a direct comparison of their trade-offs.

6.1. Core Results Analysis

The core results validate that different pipelined FFT architectures offer distinct trade-offs in terms of complex adders, complex rotators, complex data memory, latency, and throughput. There is no single "best" architecture; the optimal choice depends on the specific design constraints of an application.

6.1.1. Serial Pipelined FFT Architectures

The following are the results from Table 3 of the original paper:

| PIPELINED ARCHITECTURE | Complex Adders | Complex Rotators | Complex Data Mem. | Latency (cycles) | Th. (data/cyc.) |
|---|---|---|---|---|---|
| SDF, radix-2 [44] | 2n | n − 2 | N − 1 | N − 1 | 1 |
| SDF, radix-4 [23] | 4n | n/2 − 1 | N − 1 | N − 1 | 1 |
| SDF, radix-$2^2$ [45] | 2n | n/2 − 1 | N − 1 | N − 1 | 1 |
| SDF, radix-$2^3$ [2, 52] | 2n | 2n/3 − 1 | N − 1 | N − 1 | 1 |
| SDF, radix-$2^4$ [24] | 2n | n/2 − 1 | N − 1 | N − 1 | 1 |
| SDF, Deep Feedback [85] | n | n/2 − 1 | 4(N − 1)/3 | N − 1 | 1 |
| SDC, radix-2 [68] | 2n | n − 2 | 2N − 2 | N − 1 | 1 |
| SDC, radix-2, HL [9] | n | n − 2 | 3N/2 | 3N/2 | 1 |
| SDC, radix-2, RI [65] | n | n − 2 | 3N/2 | 3N/2 | 1 |
| SDC, radix-2, RI+rot [81] | n | n/2 − 1 | 3N/2 | 3N/2 | 1 |
| SFF, radix-2 [47] | n | n − 2 | 2N − 2 | N − 1 | 1 |
| SFF, radix-$2^2$ [47] | n | n/2 − 1 | 2N − 2 | N − 1 | 1 |
| SC, radix-2 [35] | n | n/2 − 1 | ≈ N | ≈ N | 1 |
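To make the trade-offs concrete, the adder, rotator, and memory counts of the table can be evaluated for a specific size. The sketch below (my own, with n = log2 N) instantiates three representative rows for N = 1024:

```python
import math

def table3_row(N: int, arch: str) -> dict:
    """Selected rows of Table 3 expressed as functions of n = log2(N).
    The SC memory is listed as 'approximately N' in the table; N is used here."""
    n = int(math.log2(N))
    rows = {
        "SDF, radix-2":   {"adders": 2 * n, "rotators": n - 2,      "memory": N - 1},
        "SDF, radix-2^2": {"adders": 2 * n, "rotators": n // 2 - 1, "memory": N - 1},
        "SC, radix-2":    {"adders": n,     "rotators": n // 2 - 1, "memory": N},
    }
    return rows[arch]

print(table3_row(1024, "SDF, radix-2"))  # {'adders': 20, 'rotators': 8, 'memory': 1023}
print(table3_row(1024, "SC, radix-2"))   # {'adders': 10, 'rotators': 4, 'memory': 1024}
```

For N = 1024, the SC FFT needs half the adders and half the rotators of the radix-2 SDF, at essentially the same memory, which is the balance discussed below.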
  • SDF Architectures:

    • Memory and Latency: Generally excel, with minimum memory usage (N − 1) and low latency (N − 1 cycles).
    • Rotators: Radix-2 SDF has n − 2 rotators, while radix-4, radix-$2^2$, radix-$2^3$, and radix-$2^4$ reduce this to n/2 − 1 or 2n/3 − 1, which shows the benefit of higher or mixed radices for the rotator count.
    • Adders: Typically a higher adder count (2n) than the other architectures.
    • Deep Feedback SDF: A notable exception; it reduces the adders to n by achieving 100% butterfly utilization, at the cost of slightly more memory (4(N − 1)/3).
  • SDC Architectures:

    • Adders and Rotators: The later versions (HL, RI, RI+rot) reduce the adders to n and the rotators to n/2 − 1 (RI+rot version). The original SDC had 2n adders and n − 2 rotators.
    • Memory and Latency: Higher memory usage (2N − 2 or 3N/2) and latency (N − 1 or 3N/2 cycles) than the SDF. This is a significant disadvantage of the SDC, particularly in its original 50% utilization version.
  • SFF Architectures:

    • Adders and Latency: Minimize the adders (n) and achieve low latency (N − 1).
    • Rotators: The radix-$2^2$ SFF achieves n/2 − 1 rotators.
    • Memory: The drawback is a larger data memory (2N − 2).
  • SC Architectures:

    • Overall Efficiency: Achieve a strong balance, minimizing both adders (n) and rotators (n/2 − 1) while offering almost optimum memory (≈ N) and latency (≈ N). This makes SC FFTs highly competitive.

6.1.2. Parallel Pipelined FFT Architectures

The following are the results from Table 4 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <th>PIPELINED ARCHITECTURE</th> <th>Complex Adders</th> <th>Complex Rotators</th> <th>Complex Data Mem.</th> <th>Latency (cycles)</th> <th>Th. (data/cyc.)</th> </tr>
</thead>
<tbody>
<tr> <td colspan="6">2-PARALLEL ARCHITECTURES</td> </tr>
<tr> <td>MDF, radix-2</td> <td>4n − 2</td> <td>2n − 4</td> <td>N − 2</td> <td>N/2</td> <td>2</td> </tr>
<tr> <td>MDF, radix-2<sup>4</sup> [57]</td> <td>4n − 2</td> <td>n − 2</td> <td>2N</td> <td>N</td> <td>2</td> </tr>
<tr> <td>M2DF, radix-2<sup>4</sup> [80]</td> <td>2n</td> <td>n − 2</td> <td>N − 2</td> <td>N/2</td> <td>2</td> </tr>
<tr> <td>MDC, radix-2 [45]</td> <td>2n</td> <td>n − 2</td> <td>N − 2</td> <td>N/2</td> <td>2</td> </tr>
<tr> <td>MDC, radix-2<sup>4</sup> [33]</td> <td>2n</td> <td>n − 2</td> <td>N − 2</td> <td>N/2</td> <td>2</td> </tr>
<tr> <td>MSC, radix-2 [46]</td> <td>2n</td> <td>n − 2</td> <td>≈ N − 2</td> <td>N/2</td> <td>2</td> </tr>
<tr> <td colspan="6">4-PARALLEL ARCHITECTURES</td> </tr>
<tr> <td>MDF, radix-2</td> <td>8n − 8</td> <td>4n − 4</td> <td>N − 4</td> <td>N/4</td> <td>4</td> </tr>
<tr> <td>MDF, radix-2<sup>4</sup> [17]</td> <td>8n − 8</td> <td>2n − 4</td> <td>N − 4</td> <td>N/4</td> <td>4</td> </tr>
<tr> <td>M2DF, radix-2<sup>4</sup> [80]</td> <td>4n</td> <td>2n − 5</td> <td>2N</td> <td>N/2</td> <td>4</td> </tr>
<tr> <td>MDC, radix-2 [50]</td> <td>4n</td> <td>2n − 4</td> <td>N − 4</td> <td>N/4</td> <td>4</td> </tr>
<tr> <td>MDC, radix-2<sup>2</sup> [33]</td> <td>4n</td> <td>3n/2 − 3</td> <td>N − 4</td> <td>N/4</td> <td>4</td> </tr>
<tr> <td>MDC, radix-2<sup>3</sup> [33]</td> <td>4n</td> <td>2n − 4</td> <td>N − 4</td> <td>N/4</td> <td>4</td> </tr>
<tr> <td>MDC, radix-2<sup>4</sup> [33]</td> <td>4n</td> <td>7n/4 − 4</td> <td>N − 4</td> <td>N/4</td> <td>4</td> </tr>
<tr> <td>MSC, radix-2 [46]</td> <td>4n</td> <td>2n − 4</td> <td>≈ N − 4</td> <td>N/4</td> <td>4</td> </tr>
<tr> <td colspan="6">8-PARALLEL ARCHITECTURES</td> </tr>
<tr> <td>MDF, radix-2</td> <td>16n − 24</td> <td>8n − 20</td> <td>N − 8</td> <td>N/8</td> <td>8</td> </tr>
<tr> <td>MDF, radix-2<sup>4</sup> [76]</td> <td>16n − 24</td> <td>4n − 8</td> <td>N − 8</td> <td>N/8</td> <td>8</td> </tr>
<tr> <td>M2DF, radix-2<sup>4</sup> [80]</td> <td>8n</td> <td>4n − 11</td> <td>2N</td> <td>N/4</td> <td>8</td> </tr>
<tr> <td>MDC, radix-2 [50]</td> <td>8n</td> <td>4n − 8</td> <td>N − 8</td> <td>N/8</td> <td>8</td> </tr>
<tr> <td>MDC, radix-2<sup>2</sup> [33]</td> <td>8n</td> <td>3n − 6</td> <td>N − 8</td> <td>N/8</td> <td>8</td> </tr>
<tr> <td>MDC, radix-2<sup>3</sup> [33]</td> <td>8n</td> <td>3n − 7</td> <td>N − 8</td> <td>N/8</td> <td>8</td> </tr>
<tr> <td>MDC, radix-2<sup>4</sup> [33]</td> <td>8n</td> <td>7n/2 − 8</td> <td>N − 8</td> <td>N/8</td> <td>8</td> </tr>
<tr> <td>MSC, radix-2 [46]</td> <td>8n</td> <td>4n − 8</td> <td>≈ N − 8</td> <td>N/8</td> <td>8</td> </tr>
</tbody>
</table></div>

  • General Trends:

    • All parallel architectures achieve a throughput equal to their parallelization factor P (2, 4, or 8 data/cycle) and a much lower latency (on the order of N/P cycles) than the serial architectures.
    • The MDF, MDC, and MSC architectures generally require small memory (N − P or ≈ N − P), with the M2DF being an exception at ≈ 2N.
  • 2-Parallel Architectures:

    • The M2DF, MDC, and MSC architectures use the lowest number of adders (2n) and rotators (n − 2).
    • The MDC and MSC also have very small memory (N − 2).
    • For the MDF, radix-2^4 (with 2N memory) reduces the rotators to n − 2 compared to the radix-2 MDF (2n − 4 rotators), but the M2DF achieves a similar rotator count with 2n adders and N − 2 memory.
  • 4-Parallel Architectures:

    • M2DF, MDC, and MSC consistently achieve the smallest number of adders (4n).
    • Rotator count varies significantly with radix:
      • MDC radix-$2^2$ ($3n/2 - 3$ rotators) achieves the smallest number of rotators. MDC radix-$2^4$ has $7n/4 - 4$ rotators, while MDC radix-2 and MSC radix-2 have $2n - 4$ rotators. While radix-$2^4$ might not always reduce the *count* of rotators, it often reduces their *complexity* by using a larger number of simpler $W_{16}$ rotators instead of larger ones, as discussed in [34].
      • `MDF`, `MDC`, and `MSC` all maintain small memory ($N-4$) and low latency ($N/4$).

**8-Parallel Architectures:**

      • Similar to the 4-parallel case, `M2DF`, `MDC`, and `MSC` achieve the lowest adder count ($8n$).
      • The radix-$2^3$ `MDC FFT` architecture ($3n - 7$ rotators) achieves the smallest number of rotators. MDC radix-$2^2$ has $3n - 6$ rotators, and MDF radix-$2^4$ has $4n - 8$ rotators.
      • All these architectures maintain small data memory ($N-8$) and the lowest latency ($N/8$).

## 6.2. Data Presentation (Tables)

The two tables above fully transcribe the comparison data for serial and parallel architectures, respectively, as presented in the original paper.

## 6.3. Ablation Studies / Parameter Analysis

Being a survey, this paper does not conduct its own ablation studies or parameter analyses. Instead, it aggregates the results of prior works, in which the individual papers performed such studies to justify their specific architectural choices and optimizations (e.g., the impact of choosing radix-$2^2$ over radix-2 on rotator count). The comparison tables effectively present the outcomes of these design choices through the complexity figures listed for the different radices and parallelization levels. For instance, the difference in rotator count between radix-2 SDF ($n-2$) and radix-$2^2$ SDF ($n/2-1$) implicitly shows the effect of the radix-$2^2$ optimization.

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary

This survey comprehensively documents the major advancements in pipelined FFT hardware architectures over the past five decades, specifically for complex input data and power-of-two FFT sizes. It establishes a clear classification system, dividing architectures into serial (`SDF`, `SDC`, `SFF`, `SC`) and parallel (`MDF`, `MDC`, `MSC`) types. The paper explains in detail how each architecture functions, including its underlying principles, data-flow mechanisms, and component-level designs. A crucial aspect highlighted is the constant trade-off among key hardware resources (complex adders, complex rotators, and complex data memory) and performance metrics such as latency and throughput. The survey concludes by systematically comparing these architectures, showing that `SC FFTs` achieve a near-optimal balance for serial processing, while `M2DF`, `MDC`, and `MSC` architectures generally offer superior efficiency for parallel implementations, with the choice of radix further affecting rotator complexity.

## 7.2. Limitations & Future Work

The paper implicitly outlines its limitations by explicitly stating its scope: it focuses solely on pipelined FFT architectures for complex input data and power-of-two FFT sizes. Other important categories of FFT architectures are therefore outside its scope:

  • Iterative FFTs and fully parallel FFTs (beyond the pipelined context).
  • Real-valued FFTs.
  • Variable-length architectures.
  • FFTs with natural input/output order (though some of the architectures discussed achieve this).
  • Non-power-of-two FFT sizes.
  • It also intentionally avoids detailed discussion of FFT algorithms themselves, data-management specifics, the implementation of rotators, and accuracy analysis, providing only the necessary concepts from these sub-fields.

Regarding future work, the paper points to an ongoing research direction:

  • **Optimization of Rotators:** While FFT architectures have in many cases achieved the minimum number of adders, the smallest memory, and the lowest latency, "the amount and complexity of rotators in FFT architectures still need to be optimized." Future research will likely continue to focus on reducing the number and complexity of rotators by exploring different data orders and advanced rotator-allocation strategies [34, 37, 38]. The novelty of the `MSC FFT` also suggests it is an area needing further exploration.

## 7.3. Personal Insights & Critique

This paper serves as an excellent, rigorous, and thorough reference for anyone delving into pipelined FFT hardware architectures. Its commitment to explaining *how* the architectures work, rather than merely listing them, is commendable and truly makes it an educational resource. One of the most inspiring aspects is its clear demonstration of how a single mathematical operation (the FFT) can lead to such a diverse array of hardware designs, each optimizing for different constraints. The constant innovation over 50 years, often driven by subtle insights into data flow ($b_{n-s}$) and resource utilization (butterfly and rotator efficiency), is fascinating. The evolution from $50\%$ to $100\%$ utilization and the strategic use of radix-$2^k$ algorithms to balance adder and rotator complexity are particularly insightful.
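To make these trade-offs concrete, the complexity formulas quoted in the comparison can be tabulated for a given transform size. The following is a minimal Python sketch using the formulas as listed above, with $n = \log_2 N$; the function name `fft_resources` and the entry labels are illustrative, and applicability constraints (e.g., radix-$2^2$ assuming $n$ even) are glossed over:

```python
import math

def fft_resources(N):
    """Tabulate rotator/memory/latency figures for a few representative
    pipelined FFT architectures, using the formulas quoted in the survey's
    comparison (n = log2(N)). Labels and selection are illustrative."""
    n = int(math.log2(N))
    return {
        "SDF radix-2":         {"rotators": n - 2},
        "SDF radix-2^2":       {"rotators": n // 2 - 1},
        "4-par MDC radix-2":   {"rotators": 2 * n - 4,
                                "memory": N - 4, "latency": N // 4},
        "4-par MDC radix-2^2": {"rotators": 3 * n // 2 - 3,
                                "memory": N - 4, "latency": N // 4},
        "4-par MDC radix-2^4": {"rotators": 7 * n // 4 - 4,
                                "memory": N - 4, "latency": N // 4},
        "8-par MDC radix-2^2": {"rotators": 3 * n - 6,
                                "memory": N - 8, "latency": N // 8},
        "8-par MDC radix-2^3": {"rotators": 3 * n - 7,
                                "memory": N - 8, "latency": N // 8},
        "8-par MDF radix-2^4": {"rotators": 4 * n - 8,
                                "memory": N - 8, "latency": N // 8},
    }

res = fft_resources(256)  # N = 256, so n = 8
```

For $N = 256$ ($n = 8$), for example, the 4-parallel MDC radix-$2^2$ needs 9 rotators versus 12 for radix-2, with the same $N-4$ memory and $N/4$ latency, which illustrates why the radix choice matters even when the memory and latency figures are identical.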

The method of categorizing architectures into serial and parallel and then providing detailed comparison tables is a highly effective way to structure such a complex survey. It directly addresses the "comparability" challenge.

Potential areas for improvement or further detail for a true novice:

  • More Step-by-Step Numerical Examples: While the flow graphs and timing diagrams are excellent, a true beginner might benefit from a very simple, small-$N$ example (e.g., $N=4$ or $N=8$) walked through numerically for one SDF and one SDC stage, to concretely see the values flowing and how $b_{n-s}$ affects actual data pairing. The abstract mentions the paper is "educational," and such an example could bridge the gap between diagrams and concrete data.

  • Deeper Explanation of Complex Adder/Rotator Costs: For a beginner, explicitly breaking down what a "complex adder" or "complex rotator" entails in terms of real operations (e.g., how many real adders/multipliers make up one complex unit) would clarify the resource costs even more.

  • Visualizing the $b_{n-s}$ Impact: While the paper states its importance, a small visual aid (e.g., highlighting $b_{n-s}$ in the binary indices on a flow graph or timing diagram to show pairing) could reinforce this concept even more effectively for a novice.

    The paper's methods and conclusions are highly applicable across domains requiring high-speed FFT computation, such as telecommunications (OFDM, MIMO), radar signal processing, medical imaging, and digital audio processing. Understanding these trade-offs is crucial for hardware architects selecting or designing the most appropriate FFT core for given power, area, and performance budgets. For instance, a battery-powered device might prioritize low adder and rotator counts with moderate memory, while a high-throughput data-center application might favor parallel architectures with $100\%$ utilization even at a slightly increased memory cost. The paper effectively provides the blueprint for such informed decisions.
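Along the lines of the step-by-step example suggested above, the behavior of a radix-2 DIF SDF pipeline can be sketched functionally (not cycle-accurately) in a few lines of Python. The stage model, helper names, and the final bit-reversal step below are illustrative and not taken from the paper. Each stage with feedback-buffer length $L$ maps every block of $2L$ samples $[a; b]$ to $[a+b; (a-b)\cdot W]$, which is what the SDF butterfly and its feedback buffer compute over time:

```python
import cmath

def sdf_stage(stream, L):
    # Functional model of one radix-2 DIF SDF stage with a feedback
    # buffer of length L: each block of 2L samples [a; b] becomes
    # [a + b; (a - b) * W], with W the stage twiddle factors.
    out = []
    M = 2 * L  # sub-FFT size handled by this stage
    for blk in range(0, len(stream), M):
        a = stream[blk:blk + L]
        b = stream[blk + L:blk + M]
        out += [ai + bi for ai, bi in zip(a, b)]
        out += [(ai - bi) * cmath.exp(-2j * cmath.pi * i / M)
                for i, (ai, bi) in enumerate(zip(a, b))]
    return out

def sdf_fft(x):
    # Cascade of log2(N) SDF stages with buffer lengths N/2, N/4, ..., 1.
    # As in the radix-2 SDF architecture, the output is bit-reversed.
    y, L = list(map(complex, x)), len(x) // 2
    while L >= 1:
        y = sdf_stage(y, L)
        L //= 2
    return y

def bit_reverse(y):
    # Undo the bit-reversed output order to recover natural order.
    N = len(y)
    n = N.bit_length() - 1
    return [y[int(format(i, f'0{n}b')[::-1], 2)] for i in range(N)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = bit_reverse(sdf_fft(x))
# Reference: naive O(N^2) DFT for comparison.
dft = [sum(xj * cmath.exp(-2j * cmath.pi * j * k / 8)
           for j, xj in enumerate(x)) for k in range(8)]
assert max(abs(a - b) for a, b in zip(y, dft)) < 1e-9
```

Tracing `sdf_stage` for one block makes the pairing concrete: with $L=4$ and inputs $1..8$, the butterfly pairs sample $i$ with sample $i+4$, exactly the distance-$N/2$ pairing that the $b_{n-s}$ bit of the sample index selects in the first stage.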
