
MscaleFNO: Multi-scale Fourier Neural Operator Learning for Oscillatory Function Spaces

Published: 12/28/2024

TL;DR Summary

This paper introduces MscaleFNO, a multi-scale Fourier neural operator that reduces spectral bias in learning mappings between highly oscillatory functions. It shows significant performance improvements in high-frequency wave scattering problems by employing parallel scaled FNOs.

Abstract

In this paper, a multi-scale Fourier neural operator (MscaleFNO) is proposed to reduce the spectral bias of the FNO in learning the mapping between highly oscillatory functions, with application to the nonlinear mapping between the coefficient of the Helmholtz equation and its solution. The MscaleFNO consists of a series of parallel normal FNOs with scaled input of the function and the spatial variable, and their outputs are shown to be able to capture various high-frequency components of the mapping's image. Numerical results demonstrate the substantial improvement of the MscaleFNO for the problem of wave scattering in the high-frequency regime over the normal FNO with a similar number of network parameters.

In-depth Reading

1. Bibliographic Information

1.1. Title

MscaleFNO: Multi-scale Fourier Neural Operator Learning for Oscillatory Function Spaces

1.2. Authors

  • Zhilin You¹

  • Zhenli Xu¹

  • Wei Cai²*

    ¹ School of Mathematical Sciences, MOE-LSC and CMA-Shanghai, Shanghai Jiao Tong University, Shanghai, China
    ² Department of Mathematics, Southern Methodist University, Dallas, TX, USA

1.3. Journal/Conference

This paper is published on arXiv, a preprint server. As such, it has not yet undergone formal peer review for publication in a journal or conference. arXiv is a widely recognized platform for disseminating early research in physics, mathematics, computer science, and related fields.

1.4. Publication Year

December 2024. The arXiv submission is dated December 28, 2024, while the paper's footer indicates December 31, 2024.

1.5. Abstract

This paper introduces the multi-scale Fourier neural operator (MscaleFNO), an approach designed to mitigate the spectral bias inherent in standard Fourier Neural Operators (FNOs). This bias often hinders FNOs from effectively learning mappings between highly oscillatory functions. The MscaleFNO achieves this by employing a series of parallel, standard FNOs, each receiving scaled inputs of both the function and its spatial variable. The outputs from these parallel FNOs are shown to effectively capture various high-frequency components of the mapping's target image. The paper demonstrates the utility of MscaleFNO through its application to the nonlinear mapping between the coefficient of the Helmholtz equation and its solution, particularly in the high-frequency wave scattering regime. Numerical experiments reveal a substantial improvement in performance compared to a standard FNO with a similar number of network parameters.

  • Original Source Link: https://arxiv.org/abs/2412.20183v1

  • PDF Link: https://arxiv.org/pdf/2412.20183v1.pdf

    The paper is currently available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the spectral bias of Deep Neural Networks (DNNs), specifically as it applies to operator learning. DNNs have a well-documented tendency to prioritize learning low-frequency components of functions during training, often struggling to accurately represent or capture high-frequency content. This becomes a significant challenge in operator learning tasks where the input and output functions are highly oscillatory, meaning they exhibit rapid changes and intricate patterns.

This problem is particularly critical in various scientific and engineering domains, such as wave scattering phenomena, where the relationship between physical properties (e.g., material coefficients) and system responses (e.g., wave fields) involves complex, high-frequency oscillations. Traditional numerical methods for solving partial differential equations (PDEs) like the Helmholtz equation are computationally intensive, requiring repeated simulations for each new parameter configuration. Operator learning offers a promising alternative by learning a universal mapping, but the spectral bias of standard models like the Fourier Neural Operator (FNO) limits their effectiveness in these high-frequency regimes.

The paper's entry point is to adapt the successful multi-scale approach, previously applied to DNNs for function approximation (MscaleDNN), to the more complex domain of operator learning with FNOs. The innovative idea is to address the spectral bias in operator learning by simultaneously scaling both the spatial coordinates and the input function itself across multiple parallel FNO branches.

2.2. Main Contributions / Findings

The primary contributions and key findings of this paper are:

  1. Proposal of MscaleFNO Architecture: The paper introduces the Multi-scale Fourier Neural Operator (MscaleFNO). This novel architecture extends the multi-scale DNN (MscaleDNN) concept to operator learning, specifically tailored for FNOs. It comprises multiple parallel FNO sub-networks, each processing inputs that are scaled differently in both spatial coordinates and function values.

  2. Mitigation of Spectral Bias in Operator Learning: The MscaleFNO is explicitly designed to reduce the spectral bias of FNOs, enabling them to effectively learn mappings between highly oscillatory functions. This is achieved by allowing different sub-networks to specialize in capturing frequency components at their respective scales.

  3. Enhanced High-Frequency Representation: The MscaleFNO demonstrates a superior capability in capturing high-frequency components of the mapping's image. Numerical results, particularly Discrete Fourier Transform (DFT) analyses, confirm that MscaleFNO accurately reconstructs the full spectrum of oscillatory solutions, unlike standard FNOs which tend to smooth out high-frequency details.

  4. Application to Helmholtz Equation: The proposed MscaleFNO is successfully applied to the challenging problem of mapping the coefficient of the Helmholtz equation to its solution. This is a crucial application in wave scattering, especially in high-frequency regimes.

  5. Substantial Performance Improvement: Numerical experiments show that MscaleFNO significantly outperforms standard FNOs in accuracy for highly oscillatory problems, often by an order of magnitude or more, even when both models have a similar number of parameters. This improvement is consistent across various test cases, including single-frequency, multi-frequency, and varying domain lengths, and demonstrates robust generalization to unseen data distributions.

  6. Hierarchical Frequency Decomposition: The analysis of individual sub-network contributions within MscaleFNO reveals a systematic hierarchical frequency decomposition. Sub-networks with smaller scaling factors capture low-frequency patterns, while those with larger scaling factors effectively extract high-frequency details, collectively achieving a comprehensive spectral representation.

    In essence, the paper provides a practical and effective solution to a critical limitation of neural operators when dealing with oscillatory phenomena, thereby broadening their applicability in scientific computing.

3. Prerequisite Knowledge & Related Work

This section provides the necessary background for understanding the MscaleFNO paper.

3.1. Foundational Concepts

3.1.1. Operator Learning

Operator learning is a subfield of machine learning that focuses on learning mappings between infinite-dimensional function spaces. Unlike traditional machine learning which learns mappings between finite-dimensional vectors (e.g., classifying an image, predicting a stock price), operator learning aims to learn a functional relationship that maps an entire input function to an entire output function. For example, in physics, this could involve learning the mapping from a material's conductivity field (an input function) to the resulting temperature distribution (an output function) without needing to solve the underlying Partial Differential Equation (PDE) every time. This is particularly powerful because it allows for generalization to new functions, not just new discrete data points.

3.1.2. Neural Networks (DNNs)

Deep Neural Networks (DNNs) are computational models inspired by the structure and function of the human brain. They consist of multiple layers of interconnected nodes (neurons), organized into an input layer, several hidden layers, and an output layer. Each connection between neurons has an associated weight, and each neuron has a bias. During training, the network learns to adjust these weights and biases to transform input data into desired output data. DNNs are highly versatile and have achieved state-of-the-art performance in various tasks like image recognition, natural language processing, and pattern recognition. The Fourier Neural Operator and MscaleFNO are specialized types of DNNs.

3.1.3. Spectral Bias

Spectral bias, also known as the frequency principle, refers to the empirical observation that DNNs tend to learn low-frequency components of a function faster and more efficiently than high-frequency components during training. In other words, a DNN will approximate the smooth, overall shape of a target function much more quickly than its rapid oscillations or fine details. This is a fundamental limitation when dealing with highly oscillatory functions, where high-frequency content carries crucial information. This bias is attributed to the internal mechanisms of DNNs, such as optimization algorithms (e.g., gradient descent) and architecture choices.

3.1.4. Fourier Transform and Discrete Fourier Transform (DFT) / Fast Fourier Transform (FFT)

The Fourier Transform is a mathematical operation that decomposes a function into its constituent frequencies. It transforms a function from its original domain (e.g., time or space) to the frequency domain, representing it as a sum of sine and cosine waves of different amplitudes and frequencies.

  • For a continuous function $f(x)$, its Fourier transform is $\hat{f}(k) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i k x}\, dx$.

  • The inverse Fourier transform reconstructs the original function: $f(x) = \int_{-\infty}^{\infty} \hat{f}(k)\, e^{2\pi i k x}\, dk$.

    The Discrete Fourier Transform (DFT) is the discrete version of the Fourier transform, applied to a sequence of sampled data points, and the Fast Fourier Transform (FFT) is an efficient algorithm for computing it. In neural operators like the FNO, the FFT is crucial for moving efficiently between the spatial and frequency domains, enabling computations directly in the frequency domain.
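To make the FFT concrete, here is a small NumPy illustration (not from the paper) that samples a two-frequency signal and reads off its dominant modes:

```python
import numpy as np

# Toy illustration: sample a two-frequency signal and read off its dominant
# modes with the FFT, the same primitive the FNO relies on internally.
n = 1000
x = np.linspace(0.0, 1.0, n, endpoint=False)
f = np.sin(2 * np.pi * 3 * x) + 0.3 * np.sin(2 * np.pi * 40 * x)   # low + high frequency

f_hat = np.fft.rfft(f)                  # DFT computed via the FFT
amplitude = 2.0 * np.abs(f_hat) / n     # per-mode amplitude estimate
peaks = np.argsort(amplitude)[-2:]      # indices of the two strongest modes
print(sorted(peaks.tolist()))           # -> [3, 40], the two driving frequencies
```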

3.1.5. Helmholtz Equation

The Helmholtz equation is a linear elliptic partial differential equation that arises in the study of physical phenomena involving wave propagation in space. It is often derived from the wave equation by applying the Fourier transform in time, reducing a time-dependent problem to a time-independent one. In 1D, the Helmholtz equation is typically written as $ \Delta u + k^2 u = f $ where $\Delta$ is the Laplacian operator (e.g., $\frac{d^2}{dx^2}$ in 1D), $u$ is the wave field, $k$ is the wave number (related to frequency), and $f$ is a source term. The paper uses $a^2(\pmb{x})$ for the squared wave number, which can be spatially varying: $ \Delta u + a^2(\pmb{x}) u = f(\pmb{x}) $ This equation describes wave scattering (e.g., acoustic, electromagnetic), where $a^2(\pmb{x})$ represents material properties (such as conductivity) that influence wave propagation. Solutions to the Helmholtz equation can be highly oscillatory, especially for large wave numbers (high-frequency regimes), making it a challenging problem for standard DNNs with spectral bias.

3.2. Previous Works

3.2.1. Fourier Neural Operator (FNO)

The Fourier Neural Operator (FNO) [8] is a prominent neural operator architecture. Unlike neural operators that learn directly in the spatial domain or use general kernel integration, the FNO leverages the Fourier transform to learn mappings in the frequency domain, motivated by the idea that many physical systems exhibit simpler dynamics in spectral space. The core idea is to replace the kernel integration of a general neural operator with a convolution, which can be computed efficiently using the FFT. A general neural operator iterates as $ v_t(\pmb{x}) = \sigma \Big( W_t v_{t-1}(\pmb{x}) + \big( \mathcal{K}(a ; \phi_t) v_{t-1} \big)(\pmb{x}) \Big) $ where $\mathcal{K}$ is an integral operator: $ (\mathcal{K}(a ; \phi_t) v_{t-1})(\pmb{x}) = \int_D k(\pmb{x}, \pmb{y}, a(\pmb{x}), a(\pmb{y}) ; \phi_t)\, v_{t-1}(\pmb{y})\, d\pmb{y} $ The FNO simplifies this by imposing translation invariance on the kernel and removing its dependence on $a$, turning the integral into a convolution: $ (\mathcal{K}(a ; \phi_t) v_{t-1})(\pmb{x}) = \int_D k_{\phi_t}(\pmb{x} - \pmb{y})\, v_{t-1}(\pmb{y})\, d\pmb{y} $ By the Convolution Theorem, this can be computed in the Fourier domain as $ (\mathcal{K}(a ; \phi_t) v_{t-1})(\pmb{x}) = \mathcal{F}^{-1} \big( R_t \cdot \mathcal{F}(v_{t-1}) \big)(\pmb{x}) $ where $R_t = \mathcal{F}(k_{\phi_t})$ is the Fourier transform of the kernel. The FNO learns the Fourier-domain weights $R_t$ directly and applies a truncation mechanism that keeps only a limited number of low-frequency modes ($k_{\mathrm{max}}$), which is a source of its spectral bias.

3.2.2. DeepONet

DeepONet [11, 5] is another popular neural operator architecture. It is based on the universal approximation theorem for operators, which states that any continuous nonlinear operator can be approximated by a neural network. DeepONet consists of two sub-networks: a branch network that encodes the input function at various sensor locations, and a trunk network that encodes the coordinates of the output. The outputs of these two networks are then combined (typically by a dot product) to produce the approximation of the operator's output.
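For illustration only (the paper provides no code), a minimal PyTorch sketch of the branch/trunk combination could look as follows; all layer sizes and names are assumptions:

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal DeepONet sketch: the branch net encodes the input function at
    m sensor locations, the trunk net encodes the query coordinate, and their
    dot product gives the operator output (sizes are illustrative)."""
    def __init__(self, m_sensors=100, p=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m_sensors, 128), nn.ReLU(),
                                    nn.Linear(128, p))
        self.trunk = nn.Sequential(nn.Linear(1, 128), nn.ReLU(),
                                   nn.Linear(128, p))

    def forward(self, a_sensors, y):
        # a_sensors: (batch, m_sensors) function samples; y: (n_query, 1) coordinates
        b = self.branch(a_sensors)   # (batch, p)
        t = self.trunk(y)            # (n_query, p)
        return b @ t.T               # (batch, n_query) values of G(a)(y)
```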

3.2.3. U-Net

U-Net [14] is a convolutional neural network (CNN) architecture originally developed for biomedical image segmentation. It has a U-shaped structure, consisting of a downsampling path (encoder) to capture context and an upsampling path (decoder) to enable precise localization. Skip connections between the encoder and decoder paths help retain fine-grained details lost during downsampling. While not strictly an operator learning model in the same sense as FNO or DeepONet, U-Net-like architectures are often used for learning mappings between functions (e.g., image-to-image translation, solving PDEs on grids) and can be considered a form of neural operator in practice.

3.2.4. Multi-scale Deep Neural Network (MscaleDNN)

The MscaleDNN [10, 17] is a method proposed to address the spectral bias of DNNs in function approximation. The core idea is to decompose a target function into multiple frequency bands and use separate DNNs to learn each band. Each sub-network operates on a scaled version of the input variable, effectively shifting the high-frequency components of the original function to lower frequencies that the DNN can learn more easily. The process involves:

  1. Frequency Partitioning: Dividing the frequency domain of the target function $f(\pmb{x})$ into non-overlapping bands $A_i$. This conceptually decomposes $f(\pmb{x})$ into a sum of functions $f_i(\pmb{x})$, where each $f_i(\pmb{x})$ contains frequencies only within $A_i$: $ f(\pmb{x}) = \sum_{i=1}^M f_i(\pmb{x}) $
  2. Radial Scaling: For each $f_i(\pmb{x})$, its Fourier transform $\hat{f}_i(\pmb{k})$ is scaled radially: $\hat{f}_i^{(\mathrm{scale})}(\pmb{k}) = \hat{f}_i(\alpha_i \pmb{k})$. If $\alpha_i > 1$, the frequencies in $f_i(\pmb{x})$ are compressed, effectively transforming a high-frequency component into a lower-frequency one that a standard DNN can learn more easily.
  3. Sub-network Learning: Each scaled function $f_i^{(\mathrm{scale})}(\pmb{x})$ is approximated by a separate DNN $f_{\theta_i}(\pmb{x})$ (equivalently, $f_{\theta_i}(\alpha_i \pmb{x})$ in the original spatial variable).
  4. Weighted Summation: The final approximation of $f(\pmb{x})$ is a weighted sum of the outputs from these sub-networks (a minimal sketch follows this list): $ f(\pmb{x}) \sim \sum_{i=1}^M \alpha_i^n f_{\theta_i}(\alpha_i \pmb{x}) $ The MscaleDNN has shown significant improvements in approximating oscillatory functions.
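Following these steps, a minimal PyTorch sketch of an MscaleDNN for 1-D function approximation might look as follows; the scales, widths, and activations are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class MscaleDNN(nn.Module):
    """Sketch of an MscaleDNN: each sub-network sees a scaled copy of the input x,
    and the final output is a weighted sum of the sub-network outputs."""
    def __init__(self, scales=(1.0, 2.0, 4.0, 8.0), width=64):
        super().__init__()
        self.scales = scales
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, width), nn.Tanh(),
                          nn.Linear(width, width), nn.Tanh(),
                          nn.Linear(width, 1))
            for _ in scales])
        self.weights = nn.Parameter(torch.ones(len(scales)))  # combination weights

    def forward(self, x):                       # x: (batch, 1)
        outputs = [net(alpha * x) for alpha, net in zip(self.scales, self.subnets)]
        return sum(w * out for w, out in zip(self.weights, outputs))
```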

3.2.5. Other Spectral Bias Mitigation Techniques

Other works have also sought to reduce spectral bias:

  • Phase Shift DNN [4]: Introduces phase shifts to DNN layers to improve their ability to learn high-frequency oscillations.
  • Hierarchical Attention Neural Operator [9]: Uses an attention mechanism across different scales to address the spectral bias in operator learning.
  • Diffusion Models with Neural Operators [12]: Integrates diffusion models to enhance the spectral representation capabilities of neural operators for complex phenomena like turbulent flows.

3.3. Technological Evolution

The evolution of PDE solvers has progressed from traditional numerical methods (finite element method, finite difference method) that are computationally expensive for parameterized problems, to data-driven operator learning methods. Early neural operator models like DeepONet provided a general framework for learning function-to-function mappings. FNO then introduced the powerful concept of learning directly in the frequency domain, leveraging the efficiency of FFT. However, FNO inherited the spectral bias common to DNNs, limiting its performance on highly oscillatory problems. The MscaleFNO represents a step forward by addressing this spectral bias specifically within the FNO framework, allowing neural operators to tackle more challenging high-frequency wave phenomena which are ubiquitous in scientific computing. It builds upon the success of multi-scale approaches developed for function approximation (like MscaleDNN) and adapts them to the more complex domain of operator learning.

3.4. Differentiation Analysis

The MscaleFNO differentiates itself from previous work by:

  1. Extending MscaleDNN to Operator Learning: While the MscaleDNN successfully mitigates spectral bias for function approximation (mapping a point $\pmb{x}$ to a value $f(\pmb{x})$), the MscaleFNO extends this concept to operator learning (mapping an input function $a(\pmb{x})$ to an output function $u(\pmb{x})$). This is a more complex task, as it involves learning mappings between entire function spaces.

  2. Addressing Dual High-Frequency Variations: MscaleFNO specifically targets high-frequency variations in two critical aspects:

    • Spatial coordinates ($\pmb{x}$): Similar to MscaleDNN, it scales the spatial input to capture fine spatial details.
    • Input function ($a(\pmb{x})$): Crucially, it also scales the values of the input function $a(\pmb{x})$, acknowledging that the operator's response might itself be highly oscillatory with respect to changes in the input function's amplitude, not just its spatial distribution. This is essential for problems like the Helmholtz equation, where the wave number (related to $a(\pmb{x})$) dictates the oscillation frequency of the solution.
  3. Integration with FNO: By integrating the multi-scale approach directly into the FNO architecture, MscaleFNO leverages the spectral domain processing capabilities of FNO while simultaneously overcoming its spectral bias. This allows for efficient learning of Fourier modes across different frequency bands.

  4. Improved Performance in High-Frequency Regimes: Compared to a normal FNO (which has inherent spectral bias), MscaleFNO significantly improves accuracy in high-frequency wave scattering problems, as demonstrated by the Helmholtz equation examples. Other neural operators or DNNs might struggle with these regimes without explicit multi-scale or spectral bias mitigation strategies.

    In summary, MscaleFNO fills a critical gap by providing a multi-scale solution to the spectral bias problem for Fourier Neural Operators, making them more effective for complex highly oscillatory function mappings in scientific computing.

4. Methodology

The MscaleFNO combines the principles of Fourier Neural Operators (FNOs) with a multi-scale approach, inspired by MscaleDNN, to improve the learning of mappings involving highly oscillatory functions. This section details the architecture and underlying concepts.

4.1. Principles

The core principle of MscaleFNO is to overcome the spectral bias of standard FNOs by explicitly decomposing the learning problem into multiple frequency scales. Instead of a single FNO trying to learn all frequency components simultaneously (and preferentially learning the low ones), MscaleFNO employs a parallel ensemble of FNOs. Each sub-FNO processes a scaled version of the input, allowing it to specialize in a particular frequency band. By scaling both the spatial variable $\pmb{x}$ and the input function $a(\pmb{x})$, the MscaleFNO ensures that both the spatial oscillations and the amplitude-dependent oscillations of the operator's output are captured across different scales. The outputs of these specialized sub-FNOs are then combined via a weighted sum to produce the final, comprehensive solution. This parallel architecture ensures that high-frequency components, which are typically challenging for DNNs to learn, are effectively "downshifted" into a learnable frequency range for at least one of the FNO sub-networks.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. General Operator Learning Framework

The paper considers learning a (nonlinear) mapping $G$ between infinite-dimensional function spaces. Let $D \subset \mathbb{R}^d$ be a bounded, open domain. The input function is $a \in \mathcal{A}(D; \mathbb{R}^{d_a})$ and the output function is $u \in \mathcal{U}(D; \mathbb{R}^{d_u})$, where $\mathcal{A}$ and $\mathcal{U}$ are spaces of vector-valued functions. The goal is to learn an approximation $G_\theta$ such that $G_\theta(a) \approx G(a) = u$.

Given $N$ observations $\mathbb{S} = \{a_j, u_j\}_{j=1}^N$, the objective is to solve the optimization problem $ \operatorname*{min}_{\theta \in \mathbb{R}^{d_p}} \frac{1}{|\mathbb{S}|} \sum_{(a, u) \in \mathbb{S}} L \bigl( G_\theta(a), u \bigr) $ where $\theta$ denotes the finite-dimensional parameters of the approximation $G_\theta$. The loss function $L$ is the relative $L_2$ loss: $ L \big( G_\theta(a), u \big) := \frac{\| G_\theta(a) - u \|_{L^2}}{\| u \|_{L^2}} $ Here, $\| \cdot \|_{L^2}$ denotes the $L_2$ norm for functions. For numerical computation, functions are discretized at $n$ points $D_n = \{\pmb{x}_1, \dots, \pmb{x}_n\}$, and the discrete loss function becomes $ L \big( G_\theta(a), u \big) := \frac{ \sqrt{ \sum_{i=1}^n \left( G_\theta(a)(\pmb{x}_i) - u(\pmb{x}_i) \right)^2 } }{ \sqrt{ \sum_{i=1}^n u(\pmb{x}_i)^2 } } $ This loss measures the relative difference between the predicted and true outputs, normalized by the magnitude of the true output.
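A minimal sketch of this discrete relative $L_2$ loss, assuming batched predictions sampled on a shared grid:

```python
import torch

def relative_l2_loss(pred, true):
    """Relative L2 error ||pred - true||_2 / ||true||_2, averaged over a batch.

    pred, true: tensors of shape (batch, n) sampled at the grid points x_i.
    """
    num = torch.linalg.norm(pred - true, dim=-1)
    den = torch.linalg.norm(true, dim=-1)
    return (num / den).mean()
```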

The general neural operator framework (as presented in [2, 7]) is an iterative architecture: $ v_0(\pmb{x}) = P(a)(\pmb{x}), \qquad v_t(\pmb{x}) = \sigma \Big( W_t v_{t-1}(\pmb{x}) + \big( \mathcal{K}(a ; \phi_t) v_{t-1} \big)(\pmb{x}) \Big), \qquad u(\pmb{x}) = Q(v_T)(\pmb{x}) $ Here:

  • $P$: A linear lifting operator (typically a fully connected network) that maps the input function $a(\pmb{x}) \in \mathbb{R}^{d_a}$ to a higher-dimensional feature space $v(\pmb{x}) \in \mathbb{R}^{d_v}$; $v_0$ is the initial lifted representation.
  • $v_t$: The feature representation at iteration $t$.
  • $W_t$: A linear local transform $\mathbb{R}^{d_v} \to \mathbb{R}^{d_v}$, representing local operations applied to the feature vector at each point $\pmb{x}$.
  • $\mathcal{K}(a ; \phi_t)$: An integral operator $\mathbb{R}^{d_v} \to \mathbb{R}^{d_v}$ that captures global interactions across the domain $D$: $ (\mathcal{K}(a ; \phi_t) v_{t-1})(\pmb{x}) := \int_D k(\pmb{x}, \pmb{y}, a(\pmb{x}), a(\pmb{y}) ; \phi_t)\, v_{t-1}(\pmb{y})\, d\pmb{y} $ where $k$ is a kernel function parameterized by $\phi_t$.
  • $\sigma$: A nonlinear activation function (e.g., GELU, ReLU).
  • $Q$: A projection operator (typically a neural network) that maps the final high-dimensional feature $v_T$ back to the desired output function $u(\pmb{x}) \in \mathbb{R}^{d_u}$.

4.2.2. Fourier Neural Operator (FNO)

The FNO enhances the general neural operator framework by specializing the integral operator $\mathcal{K}$. It enforces translation invariance on the kernel and removes its dependence on the input function $a$, simplifying the kernel to $k_{\phi_t}(\pmb{x} - \pmb{y})$ and making the integral operation a convolution: $ \big( \mathcal{K}(a ; \phi_t) v_{t-1} \big)(\pmb{x}) = \int_D k_{\phi_t}(\pmb{x} - \pmb{y})\, v_{t-1}(\pmb{y})\, d\pmb{y} $ According to the Convolution Theorem, convolutions in the spatial domain can be computed efficiently as element-wise products in the Fourier domain. Let $\mathcal{F}$ denote the Fourier transform and $\mathcal{F}^{-1}$ its inverse. If $R_t := \mathcal{F}(k_{\phi_t})$ is the Fourier transform of the kernel, then $ \big( \mathcal{K}(a ; \phi_t) v_{t-1} \big)(\pmb{x}) = \mathcal{F}^{-1} \big( R_t \cdot \mathcal{F}(v_{t-1}) \big)(\pmb{x}) $ In practical FNO implementations, the spatial coordinates $\pmb{x}$ are often included as an additional input feature to capture position-dependent characteristics, represented by the identity mapping $\mathrm{id}_D(\pmb{x}) = \pmb{x}$. The FNO thus learns a mapping $G : \{ \mathrm{id}_D \} \times \mathcal{A}(D ; \mathbb{R}^{d_a}) \to \mathcal{U}(D ; \mathbb{R}^{d_u})$, i.e., $u(\pmb{x}) = G[\pmb{x}, a(\pmb{x})](\pmb{x})$.

The complete FNO framework is then $ v_0(\pmb{x}) = P\big(\pmb{x}, a(\pmb{x})\big)(\pmb{x}), \qquad v_t(\pmb{x}) = \sigma \Big( W_t v_{t-1}(\pmb{x}) + \mathcal{F}^{-1} \big( R_t \cdot \mathcal{F}(v_{t-1}) \big)(\pmb{x}) \Big), \qquad u(\pmb{x}) = Q\left(v_T\right)(\pmb{x}) $ Here, $P: \mathbb{R}^{d_a} \times \mathbb{R}^d \to \mathbb{R}^{d_v}$ is an enhanced linear lifting operator that takes both the spatial coordinate $\pmb{x}$ and the input function $a(\pmb{x})$ as input; $W_t$ and $Q$ remain as defined before. The entire process can be written compactly as $ u(\pmb{x}) = \mathrm{FNO}_\theta \left[ \pmb{x}, a(\pmb{x}) \right] (\pmb{x}) $ The architecture is illustrated in Figure 1.

The FNO architecture consists of an initial lifting layer, several Fourier layers (each containing a linear transformation and a Fourier integral operation), and a final projection layer.

Figure 1: The FNO architecture.

4.2.2.1. Truncation Mechanism of $R_t$

For computational efficiency, the FNO truncates the Fourier spectrum. When the Fast Fourier Transform (FFT) is applied to $v_t$, it produces Fourier modes. Instead of retaining all modes, only the $k_{\mathrm{max}}$ lowest-frequency modes are kept, where $k_{\mathrm{max}}$ is typically much smaller than the total number of discretization points $n$. The parameter tensor $R_t \in \mathbb{C}^{d_v \times d_v \times k_{\mathrm{max}}}$ acts as a learned filter in the Fourier domain, preserving these $k_{\mathrm{max}}$ modes and effectively nullifying higher frequencies. Mathematically, this operation is $ \left( R_t \cdot \mathcal{F}(v_{t-1}) \right)_{k, l} = \sum_{i=1}^{d_v} (R_t)_{k, l, i}\, \mathcal{F}(v_{t-1})_{k, i} $ where $k = 1, \dots, k_{\mathrm{max}}$ indexes the selected Fourier modes and $i = 1, \dots, d_v$ indexes the channel dimension. This truncation is a primary source of the spectral bias in the FNO, as it inherently limits the model's ability to represent high-frequency components.
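A minimal PyTorch sketch of this truncated Fourier-domain multiplication (an illustrative re-implementation, not the authors' code), where `channels` plays the role of $d_v$ and `k_max` of $k_{\mathrm{max}}$:

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Sketch of the FNO Fourier integral operator with mode truncation (R_t)."""
    def __init__(self, channels, k_max):
        super().__init__()
        self.k_max = k_max
        scale = 1.0 / (channels * channels)
        # Complex weights R_t of shape (channels, channels, k_max)
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, k_max, dtype=torch.cfloat))

    def forward(self, v):
        # v: (batch, channels, n) real-valued features on a uniform grid
        v_hat = torch.fft.rfft(v, dim=-1)                      # (batch, channels, n//2 + 1)
        out_hat = torch.zeros_like(v_hat)
        k = min(self.k_max, v_hat.size(-1))
        # Keep only the k_max lowest modes; higher modes are zeroed out
        out_hat[..., :k] = torch.einsum("bik,iok->bok",
                                        v_hat[..., :k], self.weight[..., :k])
        return torch.fft.irfft(out_hat, n=v.size(-1), dim=-1)  # back to physical space
```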

4.2.3. FNO Network Parameters

The total number of parameters in a standard FNO is meticulously calculated to facilitate comparison.

  • Lifting Layer ($P$): Implemented as a linear transformation $ P(\pmb{x}, a(\pmb{x})) = A_p^{\pmb{x}} \pmb{x} + A_p^a a(\pmb{x}) + b_p $ where $A_p^{\pmb{x}} \in \mathbb{R}^{d \times d_v}$, $A_p^a \in \mathbb{R}^{d_a \times d_v}$, and $b_p \in \mathbb{R}^{d_v}$ are parameters. The number of parameters for $P$ is $ \mathrm{Num}_p = d \times d_v + d_a \times d_v + d_v = (d + d_a + 1)d_v $
  • Fourier Layer (one layer, repeated $T$ times):
    • Local Linear Transform ($W_t$): $W_t(v) = A_w v + b_w$, where $A_w \in \mathbb{R}^{d_v \times d_v}$ and $b_w \in \mathbb{R}^{d_v}$. Number of parameters for $W_t$: $ \mathrm{Num}_w = d_v \times d_v + d_v = d_v^2 + d_v $
    • Fourier Integral Operation ($R_t$): The parameter tensor is $R_t \in \mathbb{C}^{d_v \times d_v \times k_{\mathrm{max}}}$. Since complex numbers have real and imaginary parts, the number of real-valued parameters would be double; the paper, however, counts $d_v \times d_v \times k_{\mathrm{max}}$ parameters for $R_t$, implying that each complex entry is counted as a single parameter (or a specific implementation detail). We follow the paper's formula directly. Number of parameters for $R_t$: $ \mathrm{Num}_f = d_v \times d_v \times k_{\mathrm{max}} $
  • Projection Layer ($Q$): Implemented as a two-layer fully connected network with GELU activation and $2d_v$ hidden channels: $ w_T = \mathrm{GELU}(A_m v_T + b_m), \quad Q(v_T) = A_q w_T + b_q $ where $A_m \in \mathbb{R}^{d_v \times 2d_v}$, $A_q \in \mathbb{R}^{2d_v \times d_u}$, $b_m \in \mathbb{R}^{2d_v}$, and $b_q \in \mathbb{R}^{d_u}$. Number of parameters for $Q$: $ \mathrm{Num}_q = 2d_v^2 + 2d_v \times d_u + 2d_v + d_u $
  • MLP after Fourier Integral (optional): Some FNO implementations include an MLP (multi-layer perceptron) after the Fourier integral operation, which modifies the update for $v_t$ to $ v_t(\pmb{x}) = \sigma \bigg( W_t v_{t-1}(\pmb{x}) + M \Big( \mathcal{F}^{-1} \big( R_t \cdot \mathcal{F}(v_{t-1}) \big)(\pmb{x}) \Big) \bigg) $ The MLP $M$ is a two-layer fully connected network with GELU activation and hidden dimension $d_v$: $ M(v) = A_2 \big( \mathrm{GELU}(A_1 v + b_1) \big) + b_2, \quad \forall v \in \mathbb{R}^{d_v} $ where $A_1, A_2 \in \mathbb{R}^{d_v \times d_v}$ are weight matrices and $b_1, b_2 \in \mathbb{R}^{d_v}$ are bias vectors. Number of parameters for the MLP layer $M$: $ \mathrm{Num}_m = 2d_v^2 + 2d_v $ The total number of parameters for an FNO with an MLP after the Fourier integral operations is $ \mathrm{Num} = (d + d_a + 1)d_v + T \big[ (k_{\mathrm{max}} + 3)d_v^2 + 3d_v \big] + \big[ 2d_v^2 + (2d_u + 2)d_v + d_u \big] $ Here:
    • $d_v$: channel dimension (number of internal feature channels).
    • $k_{\mathrm{max}}$: number of retained Fourier modes.
    • $T$: number of Fourier layers.
    • $d$: spatial dimension of the domain (e.g., $d=1$ for 1-D problems).
    • $d_a$: dimension of the input function $a$.
    • $d_u$: dimension of the output function $u$.
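This formula can be transcribed into a short helper; the two calls below reproduce the parameter counts reported later in Section 5.3:

```python
def fno_param_count(d_v, k_max, T, d=1, d_a=1, d_u=1):
    """Parameter count of an FNO with an MLP after each Fourier integral,
    following the formula above (1-D defaults: d = d_a = d_u = 1)."""
    lifting = (d + d_a + 1) * d_v
    fourier = T * ((k_max + 3) * d_v**2 + 3 * d_v)
    projection = 2 * d_v**2 + (2 * d_u + 2) * d_v + d_u
    return lifting + fourier + projection

print(fno_param_count(d_v=48, k_max=500, T=4))              # 4641169 (normal FNO)
print(8 * fno_param_count(d_v=16, k_max=500, T=4) + 2 * 8)  # 4127128 (MscaleFNO, N = 8)
```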

4.2.4. Multi-scale Fourier Neural Operator (MscaleFNO)

The MscaleFNO extends the multi-scale concept to operator learning within the FNO framework. It is specifically designed to address high-frequency variations in both the spatial coordinate $\pmb{x}$ and the input function $a(\pmb{x})$.

The MscaleFNO architecture consists of $N$ parallel sub-networks, each of which is a complete FNO as described above. Each sub-network receives a scaled version of the input, and the final output is a weighted sum of the sub-network outputs. The architecture is illustrated in Figure 2.

Figure 2: The MscaleFNO architecture: multiple parallel FNO sub-networks process scaled copies of the input function and spatial variable, and their outputs are summed to form $u(x)$.

The mathematical expression for the MscaleFNO model is (a minimal sketch follows this list): $ u\left(\pmb{x}\right) = \sum_{i=1}^N \gamma_i\, \mathrm{FNO}_{\theta_i} \left[ c_i \pmb{x},\, c_i a(\pmb{x}) \right] (\pmb{x}) $ Here:

  • $N$: The total number of parallel sub-networks.

  • $\mathrm{FNO}_{\theta_i}$: An individual FNO sub-network with its own parameters $\theta_i$.

  • $c_i$: A scaling factor applied to both the spatial coordinate $\pmb{x}$ and the input function $a(\pmb{x})$ for the $i$-th sub-network. These $c_i$ values are trainable parameters; larger $c_i$ values enable the sub-network to capture higher-frequency components.

  • $\gamma_i$: A weight for combining the output of the $i$-th sub-network. These $\gamma_i$ values are also trainable parameters.

  • The linear transformations $P$ and $Q$, and the Fourier layers within each FNO sub-network, keep their original definitions from the normal FNO framework.

  • A specific detail is the use of the sine activation function $\sigma(x) = \sin(x)$ throughout all Fourier layers; this choice is often beneficial for learning oscillatory patterns.

    The total number of parameters for the MscaleFNO is $ \mathrm{Num}^{(\mathrm{mscale})} = N \times \mathrm{Num} + 2N $ where $\mathrm{Num}$ is the parameter count of a single FNO sub-network (e.g., from Equation (24)), and $2N$ accounts for the $N$ trainable scaling factors $c_i$ and the $N$ trainable combination weights $\gamma_i$. This assumes that each of the $N$ sub-networks has the same internal architecture (channel dimension $d_v$ and Fourier modes $k_{\mathrm{max}}$).
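A minimal sketch of this multi-scale combination; the stand-in branches below are simple pointwise convolution stacks to keep the example self-contained, whereas in the actual model each branch is a full FNO sub-network:

```python
import torch
import torch.nn as nn

class MscaleFNO(nn.Module):
    """Weighted sum of N parallel branches applied to scaled inputs (c_i x, c_i a(x)).

    Each branch maps a (batch, 2, n) channel stack to (batch, 1, n); both the
    scales c_i and the combination weights gamma_i are trainable.
    """
    def __init__(self, branches, init_scales):
        super().__init__()
        assert len(branches) == len(init_scales)
        self.branches = nn.ModuleList(branches)
        self.c = nn.Parameter(torch.tensor(init_scales, dtype=torch.float32))
        self.gamma = nn.Parameter(torch.ones(len(branches)))

    def forward(self, x, a):
        # x, a: (batch, n) grid coordinates and sampled input function
        out = 0.0
        for c_i, g_i, branch in zip(self.c, self.gamma, self.branches):
            scaled = torch.stack([c_i * x, c_i * a], dim=1)   # (batch, 2, n)
            out = out + g_i * branch(scaled)
        return out.squeeze(1)                                  # (batch, n)

# Illustrative stand-in branches (the paper uses FNO sub-networks here):
branches = [nn.Sequential(nn.Conv1d(2, 16, 1), nn.GELU(), nn.Conv1d(16, 1, 1))
            for _ in range(8)]
model = MscaleFNO(branches, init_scales=[1, 10, 20, 40, 60, 80, 100, 120])
u_pred = model(torch.linspace(-1, 1, 1001).repeat(4, 1), torch.randn(4, 1001))
```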

4.2.5. Mapping between Conductivity and Solution in Helmholtz Equation

The paper applies MscaleFNO to the Helmholtz equation, which models wave scattering. The specific form considered is $ \Delta u + a^2(\pmb{x})\, u = f(\pmb{x}), \qquad \pmb{x} \in \Omega \subset \mathbb{R}^d $ with boundary condition $ u |_{\partial \Omega} = g(\pmb{x}), \quad \pmb{x} \in \partial \Omega $ Here:

  • $u$: The scattering field (the solution).

  • $a^2(\pmb{x})$: The square of the wave number, representing the conductivity or material properties of the scatterer; it is compactly supported (non-zero only in a finite region) inside the domain $\Omega$.

  • $f(\pmb{x})$: The incident wave or forcing term.

  • $\Omega$: The computational domain.

  • $g(\pmb{x})$: The boundary condition.

    For a homogeneous Dirichlet boundary condition ($g(\pmb{x}) = 0$), the solution $u(\pmb{x})$ can be expressed using a Green's function $G(\pmb{x}, \pmb{x}')$: $ u(\pmb{x}) = \int_\Omega G(\pmb{x}, \pmb{x}') f(\pmb{x}')\, d\pmb{x}' $ The Green's function satisfies $ \Delta G(\pmb{x}, \pmb{x}') + a^2 G(\pmb{x}, \pmb{x}') = - \delta(\pmb{x}, \pmb{x}'), \quad G(\pmb{x}, \pmb{x}') |_{\pmb{x} \in \partial \Omega} = 0 $ where $\delta(\pmb{x}, \pmb{x}')$ is the Dirac delta function. The Green's function can be decomposed into a free-space Green's function $G_0(\pmb{x}, \pmb{x}')$ and a smooth function $h(\pmb{x}, \pmb{x}')$: $ G(\pmb{x}, \pmb{x}') = G_0(\pmb{x}, \pmb{x}') + h(\pmb{x}, \pmb{x}') $ The free-space Green's function $G_0$ in different dimensions is $ G_0(x, x') = \frac{i}{2a} e^{ia |x - x'|}, \quad x, x' \in \mathbb{R}, \qquad G_0(\pmb{x}, \pmb{x}') = \frac{i}{4} H_0^{(2)}(a |\pmb{x} - \pmb{x}'|), \quad \pmb{x}, \pmb{x}' \in \mathbb{R}^2, \qquad G_0(\pmb{x}, \pmb{x}') = \frac{e^{ia |\pmb{x} - \pmb{x}'|}}{4\pi |\pmb{x} - \pmb{x}'|}, \quad \pmb{x}, \pmb{x}' \in \mathbb{R}^3, $ where $H_0^{(2)}$ is the Hankel function of the second kind of order zero. The smooth function $h(\pmb{x}, \pmb{x}')$ satisfies a homogeneous Helmholtz equation with a non-homogeneous Dirichlet boundary condition: $ \Delta h(\pmb{x}, \pmb{x}') + a^2 h(\pmb{x}, \pmb{x}') = 0, \quad h(\pmb{x}, \pmb{x}') = - G_0(\pmb{x}, \pmb{x}'), \ \pmb{x} \in \partial \Omega $ The task for the MscaleFNO is to learn the mapping from the spatially varying wave number perturbation (or conductivity) $a(\pmb{x})$ (or $\omega(\pmb{x})$ in later examples) to the scattering field $u(\pmb{x})$.

5. Experimental Setup

The paper presents a series of numerical experiments to compare the performance of the proposed MscaleFNO with a normal FNO (standard FNO). The primary goal is to demonstrate MscaleFNO's superior ability to learn mappings involving highly oscillatory functions, particularly in high-frequency regimes.

5.1. Datasets

All experiments use 1-D functions, i.e., $d_a = d_u = d = 1$. The datasets are synthetically generated from specific functional forms or from Helmholtz equation solutions.

5.1.1. Example 4.1: Single-frequency nonlinear mapping

  • Target Mapping: Learning the operator $G$ for $u(x) = \sin(20a(x))$, where $x \in [-1, 1]$. This mapping has a high-frequency dependence on the values of $a(x)$.

  • Input Function $a(x)$ Generation: Generated as a normalized sum of sine functions (a generation sketch follows Figure 3): $ a(x) = \frac{ \sum_{n=0}^{50} a_n \sin(n \pi x) }{ \operatorname*{max}_x \big\{ \sum_{n=0}^{50} a_n \sin(n \pi x) \big\} } $ where $a_n \sim \mathrm{rand}(-1, 1)$ are randomly sampled coefficients, so that $a(x)$ is normalized.

  • Dataset Size: 2,000 samples in total, split into 1,000 for training, 500 for validation, and 500 for testing.

  • Grid Resolution: Exact solutions are computed on a 1001-point grid.

  • Example Data Profile:

    Figure 3: The profile of the input function $a(x)$ (left) and its DFT (right). The left panel shows the spatial variation of $a(x)$, while the right panel illustrates its Discrete Fourier Transform (DFT), indicating the frequency components present.
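A minimal sketch of how such input-output pairs could be generated (the normalization is read here as division by the maximum absolute value; the random seed is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 1001)           # 1001-point grid as in the paper

def sample_pair(n_terms=50):
    """Draw one (a(x), u(x)) pair for the single-frequency mapping u = sin(20 a)."""
    coeffs = rng.uniform(-1.0, 1.0, n_terms + 1)
    a = sum(c * np.sin(n * np.pi * x) for n, c in enumerate(coeffs))
    a = a / np.max(np.abs(a))               # normalize so max |a(x)| = 1
    return a, np.sin(20.0 * a)

train = [sample_pair() for _ in range(1000)]   # training split of the 2,000 samples
```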

5.1.2. Example 4.2: Multiple-frequency nonlinear mapping

  • Target Mapping: Learning the operator $G$ for $u(x)$ given as a mixture of sine and cosine functions with multiple frequencies, exhibiting a multi-frequency dependence on $a(x)$: $ u(x) = \sum_{m=1}^M \big[ A_m \sin \big( m a(x) \big) + B_m \cos \big( m a(x) \big) \big], \quad x \in [-1, 1] $ where $A_m, B_m \in [-1, 1]$ are fixed, randomly generated numbers. The parameter $M$ (number of frequency terms) is varied over $M \in \{10, 20, 40, 80, 100, 200\}$.

  • Input Function $a(x)$ Generation: Generated similarly to Example 4.1, but with both sine and cosine terms: $ a(x) = \frac{ \sum_{n=0}^{10} [a_n \sin(n \pi x) + b_n \cos(n \pi x)] }{ \operatorname*{max}_x \{ \sum_{n=0}^{10} [a_n \sin(n \pi x) + b_n \cos(n \pi x)] \} } $ where $a_n, b_n \sim \mathrm{rand}(-1, 1)$.

  • Dataset Size & Grid: Same as Example 4.1 (2,000 samples, 1001-point grid).

  • Example Data Profile:

    Figure 7: The profile of the input function $a(x)$ (left) and the DFT of $a(x)$ (right). This figure presents a representative input function $a(x)$ and its frequency components.

    Figure 8: DFT of a representative exact solution $u(x)$ for different $M$. This figure shows how the spectral content of the solution $u(x)$ becomes increasingly complex and extends to higher frequencies as $M$ increases.

5.1.3. Example 4.3: Helmholtz equation ($L=1$)

  • Target Mapping: Learning the operator from the variable wave number perturbation $\omega(x)$ to the solution $u(x)$ of the 1-D Helmholtz equation $ \begin{cases} u'' + (\lambda^2 + c\, \omega(x))\, u = f(x), & x \in [-L, L], \\ u(-L) = u(L) = 0, \end{cases} $ with $L=1$. The parameters are $\lambda=2$ and $c=0.9\lambda^2 = 3.6$.

  • Forcing Term $f(x)$: $ f(x) = \sum_{k=0}^{10} (\lambda^2 - \mu_k^2) \sin(\mu_k x), \quad \mu_k = 300 + 35k $

  • Perturbation $\omega(x)$ Generation: $ \omega(x) = \frac{ \sum_{n=0}^{500} [a_n \sin(n \pi x) + b_n \cos(n \pi x)] }{ \operatorname*{max}_x \left\{ \sum_{n=0}^{500} [a_n \sin(n \pi x) + b_n \cos(n \pi x)] \right\} } $ where $a_n, b_n \sim \mathrm{rand}(-1, 1)$.

  • Dataset Size & Grid: 1,000 samples (800 train, 100 validation, 100 test). High-resolution numerical solutions are computed on an 8001-point grid (accuracy $O(10^{-4})$) and then downsampled to a 1001-point grid (a finite-difference sketch follows Figure 13).

  • Example Data Profile:

    Figure 12: The profile of the input function $\omega(x)$ (left) and the DFT of $\omega(x)$ (right).

    Figure 13: The profile of the exact solution $u(x)$ obtained from the $\omega(x)$ in Fig. 12 (left) and the DFT of $u(x)$ (right). These figures show the spatial and spectral characteristics of the input perturbation and the resulting solution of the Helmholtz equation.
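To illustrate how such reference solutions could be produced, here is a simple second-order finite-difference sketch for the boundary value problem above; the authors' actual solver is not specified beyond the grid sizes, so this is only an assumption-laden illustration:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

def solve_helmholtz_1d(omega, f, lam=2.0, c=3.6, L=1.0):
    """Solve u'' + (lam^2 + c*omega(x)) u = f(x) on [-L, L] with u(-L) = u(L) = 0.

    omega, f: samples on a uniform grid that includes both endpoints.
    Central differences on the interior points; a sketch, not the paper's solver.
    """
    n = omega.size
    h = 2.0 * L / (n - 1)
    main = -2.0 / h**2 + lam**2 + c * omega[1:-1]      # interior diagonal
    off = np.full(n - 3, 1.0 / h**2)                   # sub-/super-diagonals
    A = diags([off, main, off], offsets=[-1, 0, 1], format="csc")
    u = np.zeros(n)
    u[1:-1] = spsolve(A, f[1:-1])
    return u

# Example 4.3 forcing term and a placeholder perturbation on the 8001-point grid
x = np.linspace(-1.0, 1.0, 8001)
mu = 300 + 35 * np.arange(11)
f = sum((2.0**2 - m**2) * np.sin(m * x) for m in mu)
omega = np.cos(3 * np.pi * x)             # placeholder; the paper draws omega randomly
u = solve_helmholtz_1d(omega, f)[::8]     # downsample 8001 -> 1001 points
```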

5.1.4. Example 4.4: Helmholtz equation (varying $L$)

  • Target Mapping: The same Helmholtz equation as Example 4.3, but with varying domain lengths $L \in \{2, 4, 8, 10\}$, corresponding to increasingly challenging high-frequency scattering problems.

  • Perturbation $\omega(x)$ Generation: A simpler form is used: $ \omega(x) = \frac{ \sum_{n=0}^{50} [a_n \sin(n \pi x) + b_n \cos(n \pi x)] }{ \operatorname*{max}_x \left\{ \sum_{n=0}^{50} [a_n \sin(n \pi x) + b_n \cos(n \pi x)] \right\} } $

  • Dataset Size & Grid: Dataset sizes (1,000 samples) and splits (800 train, 100 val, 100 test) are the same. High-resolution solutions are computed and then downsampled to maintain a consistent mesh size, resulting in varying numbers of grid points proportional to LL.

  • Example Data Profile:

    Figure 15: Characteristic solutions of the Helmholtz equation in spatial space for different domain lengths $L$. The solutions become more complex and oscillatory as the domain length $L$ increases, indicating higher frequency content.

5.1.5. Example 4.5: Helmholtz equation (generalization test)

  • Target Mapping: The same Helmholtz equation as Example 4.4, specifically at $L=10$.
  • Test Samples $\omega(x)$ Generation (unseen distribution): To evaluate generalization capability, test samples of $\omega(x)$ are drawn from a distribution distinct from the training data (a generation sketch follows this list): $ \eta(x) = \sum_{n=1}^{50} a_n \sin(k_n x^3) + b_n \cos(l_n x^2), \quad \omega(x) = \frac{\eta(x)}{\operatorname*{max}_x \left\{ \eta(x) \right\}} $ where $k_n \sim \mathrm{rand}(0, 30)$ and $l_n \sim \mathrm{rand}(40, 60)$. The forcing term $f$ is the same as in Equation (47).
  • Dataset Size & Grid: Training and validation data are from the distribution in Example 4.4. The test data is newly generated using the above formula.
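A minimal sketch of drawing one such out-of-distribution test perturbation (the grid resolution and random seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 10.0
x = np.linspace(-L, L, 20001)                     # illustrative resolution

a_n = rng.uniform(-1.0, 1.0, 50)
b_n = rng.uniform(-1.0, 1.0, 50)
k_n = rng.uniform(0.0, 30.0, 50)
l_n = rng.uniform(40.0, 60.0, 50)

eta = (a_n[:, None] * np.sin(k_n[:, None] * x**3)
       + b_n[:, None] * np.cos(l_n[:, None] * x**2)).sum(axis=0)
omega_test = eta / np.max(eta)                    # unseen-distribution test sample
```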

5.2. Evaluation Metrics

The primary evaluation metric used in the paper is the relative $L_2$ loss (or relative $L_2$ error) on the test set. It quantifies the normalized difference between the predicted and true solutions.

5.2.1. Relative $L_2$ Loss

  • Conceptual Definition: The relative $L_2$ loss (or error) measures the $L_2$ norm of the difference between the predicted function and the true function, normalized by the $L_2$ norm of the true function. It indicates how well the model's prediction matches the true output relative to the overall magnitude of the true output; a lower value indicates better accuracy.
  • Mathematical Formula: For discrete functions (as used in the numerical computations), $ L \big( G_\theta(a), u \big) = \frac{ \sqrt{ \sum_{i=1}^n \left( G_\theta(a)(\pmb{x}_i) - u(\pmb{x}_i) \right)^2 } }{ \sqrt{ \sum_{i=1}^n u(\pmb{x}_i)^2 } } $
  • Symbol Explanation:
    • $G_\theta(a)(\pmb{x}_i)$: The model's predicted value of the output function at the $i$-th discrete point $\pmb{x}_i$, given input $a$.
    • $u(\pmb{x}_i)$: The true value of the output function at the $i$-th discrete point $\pmb{x}_i$.
    • $n$: The total number of discrete points in the computational domain.
    • $\sum_{i=1}^n (\cdot)^2$: Sum of squared differences or values over all discrete points.
    • $\sqrt{\cdot}$: Square root, forming the $L_2$ (Euclidean) norm.

5.3. Baselines

The paper compares the MscaleFNO primarily against a normal FNO (standard FNO) configuration. To ensure a fair comparison, the normal FNO is designed to have a similar number of network parameters as the MscaleFNO.

5.3.1. General Model Configurations

  • Optimizer: Adam optimizer with a learning rate of 0.001.
  • Batch Size: 20 for all training processes.

5.3.2. Specific Model Architectures for 1-D Problems ($d_a=d_u=d=1$)

  • Normal FNO:
    • Channel dimension ($d_v$): 48
    • Number of Fourier modes ($k_{\mathrm{max}}$): 500
    • Number of Fourier layers ($T$): 1 for Examples 4.1 & 4.2; 4 for Examples 4.3, 4.4 & 4.5.
  • MscaleFNO:
    • Number of parallel sub-networks ($N$): 8
    • Each sub-network:
      • Channel dimension ($d_v$): 16 (note: $N \times d_v = 8 \times 16 = 128$ for MscaleFNO, compared to $d_v = 48$ for the normal FNO, indicating a different distribution of capacity even though the parameter counts are matched)
      • Number of Fourier modes ($k_{\mathrm{max}}$): 500
      • Number of Fourier layers ($T$): 1 for Examples 4.1 & 4.2; 4 for Examples 4.3, 4.4 & 4.5.
    • The initial scaling factors $c_i$ and combination weights $\gamma_i$ are trainable; specific initial $c_i$ values are given for each example.

Parameter Count Comparison:

  • For Examples 4.1 & 4.2: MscaleFNO (1,035,544 parameters) vs. normal FNO (1,164,001 parameters). MscaleFNO has slightly fewer parameters.
  • For Examples 4.3, 4.4 & 4.5: MscaleFNO (4,127,128 parameters) vs. normal FNO (4,641,169 parameters). MscaleFNO again has fewer parameters. The paper consistently ensures that MscaleFNO uses a similar or even slightly smaller parameter count than the normal FNO to attribute performance gains to the architectural innovation rather than merely increased model capacity.

6. Results & Analysis

This section details the experimental results, comparing the MscaleFNO against the normal FNO across various highly oscillatory function learning tasks and Helmholtz equation problems.

6.1. Core Results Analysis

6.1.1. Example 4.1: Learning a Single-Frequency Nonlinear Mapping

This example focuses on learning the mapping $u(x) = \sin(20a(x))$. The initial scaling factors for the MscaleFNO were set to $c = \{1, 10, 20, 40, 60, 80, 100, 120\}$.

  • Error Curves:

    Figure 4: Error curves of different models during the training process. The MscaleFNO (red curve) quickly converges to a relative testing error of $O(10^{-4})$ within 100 epochs. In stark contrast, the normal FNO (blue curve) stagnates at an error level of $O(1)$, failing to learn the mapping. This demonstrates an improvement of several orders of magnitude by MscaleFNO.

  • Predicted Solutions (Visual Comparison):

    Figure 5: Predicted solution by the normal FNO (left) and MscaleFNO (right), with zoomed-in inset for $x \in [-0.18, -0.12]$. Visually, the normal FNO produces a smoothed-out approximation, failing to capture the high-frequency oscillations present in the true solution. The MscaleFNO accurately reproduces the fine wave patterns, matching the exact solution almost perfectly.

  • DFT Analysis (Spectral Comparison):

    Figure 6: DFT of the predicted solution by the normal FNO (left) and MscaleFNO (right), with zoomed-in inset for modes $\in [0, 20]$. The DFT analysis quantitatively confirms the visual observations: the spectrum of the MscaleFNO prediction closely matches the high-frequency components of the true solution, preserving the energy at higher modes, whereas the normal FNO's DFT shows a significant decay in the high-frequency region, indicating its inability to learn these components. These results highlight MscaleFNO's superior capability in capturing high-frequency components.

6.1.2. Example 4.2: Learning a Multiple-Frequency Nonlinear Mapping

This example explores performance on increasingly complex multi-frequency solutions by varying $M$ from 10 to 200.

  • Error Curves under Varying $M$:

    Figure 9: Error curves of different models during the training process under different values of $M$ (Epoch = 900). For the normal FNO (blue curves), as $M$ increases (representing higher-frequency regimes), the relative testing error grows significantly, reaching approximately 0.2 at $M=200$, confirming the spectral bias. In contrast, the MscaleFNO (orange curves) consistently outperforms the normal FNO and maintains high accuracy, with relative errors around $10^{-2}$ even at $M=200$. The initial scaling factors for MscaleFNO were adjusted for $M=200$ to include higher values $\{1, 40, 80, 100, 120, 140, 180, 200\}$, demonstrating adaptability to the problem's frequency content.

  • Predicted Solutions for $M=200$:

    Figure 10: $M = 200$: Predicted solution by the normal FNO (left) and MscaleFNO (right), with zoomed-in inset for $x \in [-0.18, -0.12]$. As in the single-frequency case, the normal FNO struggles with the highly oscillatory patterns of the $M=200$ solution, producing a blurred output, while the MscaleFNO accurately captures the intricate fine details and oscillations.

  • Spectral Contributions of MscaleFNO Subnetworks:

    Figure 11: $M = 200$: Spectral contributions of MscaleFNO sub-networks corresponding to different initial scales. This figure provides a crucial insight into MscaleFNO's mechanism: it shows the DFT of the outputs of individual sub-networks. Sub-networks with smaller scaling factors $c_i$ capture low-frequency patterns, while those with larger $c_i$ values contribute significantly to reconstructing the high-frequency components. This demonstrates a hierarchical frequency decomposition, where each sub-network specializes in a different part of the spectrum, and their combined output reconstructs the full range of frequencies.

6.1.3. Example 4.3: Mapping for Helmholtz Equation ($L=1$)

This example applies MscaleFNO to a real-world physics problem, the 1-D Helmholtz equation with $L=1$.

  • Error Curves:

    Figure 14: Error curves of different models during the training process. Both models show rapid initial convergence. However, the normal FNO converges to a relative error of around $O(10^{-2})$, with limited further improvement. The MscaleFNO, with initial scales $\{1, 4, 8, 10, 12, 14, 18, 20\}$, continues to reduce its error throughout training, achieving a relative error of $O(10^{-3})$. This represents an order-of-magnitude improvement in accuracy for the Helmholtz equation problem.

6.1.4. Example 4.4: Mapping for Helmholtz Equation (Varying $L$)

This section tests MscaleFNO's robustness as the domain length $L$ increases, which corresponds to increasingly high-frequency scattering problems.

  • Error Curves under Varying $L$:

    Figure 16: Error curves under different values of $L$ (Epoch = 100). As $L$ increases from 2 to 10, the normal FNO (blue curves) exhibits progressively deteriorating performance, with the relative testing error rising to approximately 0.7 for $L=10$, vividly illustrating its struggle with higher-frequency regimes. In contrast, the MscaleFNO (orange curves) demonstrates robust performance across all $L$ values, consistently outperforming the normal FNO and maintaining a relative error better than $O(10^{-2})$.

  • Predicted Solutions for $L=10$:

    Figure 17: $L = 10$: Predicted solution by the normal FNO (left) and MscaleFNO (right), with zoomed-in inset for $x \in [-0.2, 0.2]$. For $L=10$, the solution is highly oscillatory. The normal FNO captures only the general wave patterns but misses the high-frequency details, whereas the MscaleFNO accurately reproduces the full solution structure, including the fine oscillations.

  • DFT Analysis for $L=10$:

    Figure 18: $L = 10$: DFT of the predicted solution by normal FNO (left) and MscaleFNO (right), with mode amplitudes shown and a zoomed-in inset for modes $\in [1000, 1100]$.

    Figure 18: $L = 10$: DFT of predicted solution by normal FNO (left) and MscaleFNO (right) with zoomed-in inset for modes $\in [1000, 1100]$. The Fourier analysis corroborates the spatial observations. The normal FNO preserves low-frequency components but shows significant distortion and decay at high frequencies. MscaleFNO accurately reconstructs the entire spectrum, as evidenced by the excellent match with the exact solution's DFT, especially in the zoomed-in high-frequency modes.
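
The zoomed-in inset over modes $[1000, 1100]$ is essentially a spectral-error check restricted to a high-frequency window. One illustrative way to quantify what the inset shows (an assumption here, not a metric defined in the paper) is to compare DFT amplitudes inside that window, provided both signals are sampled on the same grid, fine enough to contain mode 1100.

```python
import numpy as np

def spectral_error(u_pred, u_exact, mode_window=(1000, 1100)):
    """Relative mismatch of DFT amplitudes restricted to a window of Fourier modes."""
    lo, hi = mode_window
    amp_pred = np.abs(np.fft.rfft(np.asarray(u_pred)))[lo:hi]
    amp_exact = np.abs(np.fft.rfft(np.asarray(u_exact)))[lo:hi]
    return np.linalg.norm(amp_pred - amp_exact) / np.linalg.norm(amp_exact)
```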

6.1.5. Example 4.5: Generalization Capability (Unseen Data Distribution)

This test evaluates how well MscaleFNO performs on test samples generated from a different function form (unseen distribution) than the training data, for $L=10$.

  • Normal FNO on Unseen Data:

    Figure 19: $L = 10$: (a) Predicted solution of normal FNO against the exact solution, with zoomed-in inset for $x \in [-0.2, 0.2]$; (b) the DFT of $u(x)$ (amplitude versus mode), with zoomed-in inset for modes $\in [1000, 1100]$.

    Figure 19: $L = 10$: (a) Predicted solution of normal FNO against exact solution with zoomed-in inset for $x \in [-0.2, 0.2]$ and (b) the DFT of $u(x)$ with zoomed-in inset for modes $\in [1000, 1100]$. The normal FNO fails dramatically on the unseen test functions. Its predictions show significant errors in both solution amplitude and oscillation patterns in the spatial domain (a), and it fails to reconstruct the correct spectral content in the Fourier domain (b), particularly at high frequencies.

  • MscaleFNO on Unseen Data:

    Figure 20: $L = 10$: (a) Predicted solution of MscaleFNO against the exact solution, with zoomed-in inset for $x \in [-0.2, 0.2]$; (b) the DFT of $u(x)$ (amplitude versus mode), with zoomed-in inset for modes $\in [1000, 1100]$.

    Figure 20: $L = 10$: (a) Predicted solution of MscaleFNO against exact solution with zoomed-in inset for $x \in [-0.2, 0.2]$ and (b) the DFT of $u(x)$ with zoomed-in inset for modes $\in [1000, 1100]$. In contrast, MscaleFNO demonstrates robust prediction capability even with a change in the test function form. It accurately captures high-frequency oscillations across the interval and precisely predicts both low and high-frequency modes in the Fourier spectrum. This highlights MscaleFNO's excellent generalization ability beyond the training data distribution.

6.2. Data Presentation (Tables)

The paper does not include explicit result tables, but rather presents all comparative data through error curves and visual comparisons of predicted solutions and their Discrete Fourier Transforms (DFTs). The analysis above is derived from interpreting these figures.

6.3. Ablation Studies / Parameter Analysis

The paper implicitly conducts a form of ablation study by comparing MscaleFNO to normal FNO with similar parameter counts, isolating the effect of the multi-scale architecture. The analysis of Figure 11 for MscaleFNO's subnetwork contributions serves as a parameter analysis, showing how different scaling factors $c_i$ lead to specialization in different frequency ranges, justifying the multi-scale design. The adjustment of initial scaling factors for different $M$ values in Example 4.2 also shows some empirical parameter tuning for optimal performance.
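
Checking the "similar parameter count" premise of such a comparison is straightforward in PyTorch; the snippet below is a generic sketch, where `FNO1d` and `MscaleFNO1d` and their arguments are hypothetical constructors standing in for the actual implementations.

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical constructors and settings, for illustration only:
# fno    = FNO1d(width=128, modes=512)
# mscale = MscaleFNO1d(width=32, modes=512, scales=[1, 4, 8, 10, 12, 14, 18, 20])
# print(count_params(fno), count_params(mscale))  # a fair comparison keeps these close
```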

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully proposes the Multi-scale Fourier Neural Operator (MscaleFNO), a novel architecture designed to mitigate the inherent spectral bias of Fourier Neural Operators (FNOs) when learning mappings between highly oscillatory functions. By employing a parallel ensemble of FNO sub-networks, each operating on scaled inputs of both the spatial variable and the input function, MscaleFNO effectively decomposes the complex learning problem into different frequency scales. Numerical results on various nonlinear oscillatory mappings and, more critically, on the Helmholtz equation in high-frequency regimes, consistently demonstrate that MscaleFNO significantly outperforms the normal FNO. It achieves higher accuracy, captures fine oscillatory details, accurately reconstructs the full Fourier spectrum of solutions, and exhibits robust generalization capabilities to unseen data distributions, all while maintaining a comparable number of model parameters. The MscaleFNO thus presents a substantial advancement for operator learning in scientific computing problems involving high-frequency phenomena.
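
To make the architecture concrete, the following is a minimal PyTorch sketch of the idea described above, not the authors' implementation: each of the parallel branches is a standard FNO fed with the scaled pair $(c_i\,a(x),\, c_i\,x)$, and the branch outputs are combined with weights $\gamma_i$. Here both $c_i$ and $\gamma_i$ are left trainable, and `fno_factory` is a hypothetical callable returning a fresh 1-D FNO branch.

```python
import torch
import torch.nn as nn

class MscaleFNO1d(nn.Module):
    """Illustrative multi-scale wrapper: parallel FNO branches on scaled inputs,
    combined by trainable weights (a sketch, not the authors' code)."""

    def __init__(self, fno_factory, scales=(1.0, 4.0, 8.0, 16.0)):
        super().__init__()
        # One standard FNO branch per scale; fno_factory() -> nn.Module mapping
        # (batch, grid, 2) inputs [a(x), x] to (batch, grid, 1) outputs.
        self.branches = nn.ModuleList(fno_factory() for _ in scales)
        # Trainable scale factors c_i and combination weights gamma_i.
        self.scales = nn.Parameter(torch.tensor(scales, dtype=torch.float32))
        self.gammas = nn.Parameter(torch.full((len(scales),), 1.0 / len(scales)))

    def forward(self, a, x):
        # a, x: (batch, grid) samples of the coefficient a(x) and the coordinate x.
        out = 0.0
        for c, g, branch in zip(self.scales, self.gammas, self.branches):
            scaled = torch.stack((c * a, c * x), dim=-1)  # scaled function and coordinate
            out = out + g * branch(scaled).squeeze(-1)    # weighted sum of branch outputs
        return out
```

The sine activation mentioned in the insights below would live inside each branch; the structural point is simply that every branch sees the same data at a different scale and the weighted sum reassembles the full frequency content.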

7.2. Limitations & Future Work

The authors explicitly state that future work will include:

  • Applying MscaleFNO for higher dimensional Helmholtz equations: The current numerical demonstrations are primarily for 1-D problems. Extending MscaleFNO to 2-D and 3-D Helmholtz equations will be crucial to prove its broader applicability and scalability. This often involves increased computational complexity and data requirements.

  • Solving inverse medium problems in high-frequency wave scattering: Inverse problems, where the goal is to infer properties of a medium from observed wave measurements, are notoriously challenging, especially in high-frequency regimes due to their ill-posed nature and sensitivity to high-frequency components. Applying MscaleFNO to such problems would be a significant step.

    The paper does not explicitly list limitations of the MscaleFNO itself, but the nature of the future work suggests potential challenges with scaling to higher dimensions (e.g., computational cost, data size, complexity of Fourier transforms), and the difficulty of inverse problems.

7.3. Personal Insights & Critique

  • Innovative Extension: The MscaleFNO is an elegant and effective extension of the MscaleDNN concept to operator learning. The key insight, simultaneously scaling both the spatial coordinate $\pmb{x}$ and the input function $a(\pmb{x})$, is powerful. This allows the model to address spectral bias arising from both the spatial structure and the amplitude-dependent oscillations of the operator's response. This is a clear improvement over the traditional FNO for the specific problem domain of oscillatory functions.

  • Computational Cost vs. Parameter Count: While the paper emphasizes that MscaleFNO has a similar (or even slightly smaller) number of parameters compared to the normal FNO, it is important to consider the computational cost. Running $N$ parallel FNO sub-networks implies a higher computational cost during both training and inference, effectively $N$ forward/backward passes of a single FNO per sample (though each pass may be cheaper due to a smaller channel width $d_v$, and the branches can be parallelized). The balance between increased training time or inference latency and the accuracy gain is a practical consideration for real-world applications.

  • Learned Scales and Weights: The decision to make the scaling factors $c_i$ and combination weights $\gamma_i$ trainable is a strong point. It allows the model to adaptively find the optimal frequency decomposition and combination, rather than relying on heuristic choices. However, the sensitivity of the training process to the initialization of these scales, and whether they can sometimes converge to degenerate solutions (e.g., all scales becoming similar), could be an area for further investigation. The choice of the sine activation function is also sensible and likely contributes to learning oscillations.

  • Generalization Performance: The results from Example 4.5, demonstrating MscaleFNO's robust generalization to an unseen distribution of test functions, are particularly impressive. This suggests that the multi-scale architecture helps the model learn more fundamental, scale-invariant patterns of the underlying physical system, rather than simply memorizing the training data. This is a crucial attribute for neural operators intended for scientific discovery.

  • Potential Improvements/Future Directions not mentioned:

    • Adaptive Scaling: Instead of a fixed number $N$ of parallel FNOs with initially set scales, one could explore adaptive methods where the number of scales or the scale values are dynamically adjusted during training, perhaps guided by spectral analysis of the residual errors.

    • Efficiency for Higher Dimensions: For 2D/3D problems, the computational cost of FFT and the memory footprint can become substantial. Exploring more efficient spectral representations or hierarchical grids within the MscaleFNO framework might be necessary.

    • Theoretical Guarantees: While empirical results are strong, theoretical analysis of MscaleFNO's spectral representation capabilities and convergence properties would further solidify its foundation.

      Overall, MscaleFNO represents a significant and well-motivated step towards making neural operators more robust and accurate for oscillatory phenomena, opening doors for broader application in scientific machine learning, especially in areas like computational electromagnetics, acoustics, and seismology.
