This work provides a quantitative demonstration of the superior performance of Automatic Differentiation (AD) over Finite Difference (FD) methods when training neural networks to solve Partial Differential Equations (PDEs). While both AD and FD are used to compute derivatives within neural network-based PDE solvers such as Physics-Informed Neural Networks (PINNs), there is a long-standing debate about their relative efficacy. This paper specifically addresses the impact of the two differentiation approaches on training error and training speed, arguing that AD offers significant advantages on both counts.
A key innovation is the introduction of truncated entropy, a novel metric derived from the effective cut-off number of singular values. This metric characterizes the training dynamics: higher truncated entropy correlates with faster training and lower residual loss. The authors demonstrate, both theoretically and experimentally, that AD consistently exhibits higher truncated entropy than FD, thereby predicting its superior training performance.
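The summary above does not pin down the paper's exact formula, but one natural reading is that truncated entropy is a spectral (Shannon) entropy computed only over the singular values that survive an effective cut-off. The short Python sketch below follows that assumed reading; the function name `truncated_entropy`, the cut-off rule, and the normalization are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

def truncated_entropy(A, cutoff=1e-10):
    """Assumed form of a truncated-entropy-style metric (not the paper's exact
    definition): keep the singular values of A above `cutoff`, normalize them
    to a probability distribution, and return its Shannon entropy.  Higher
    values mean the retained spectrum is spread more evenly."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > cutoff]                 # effective cut-off of negligible singular values
    p = s / s.sum()                   # normalize the retained spectrum
    return float(-(p * np.log(p)).sum())

# Toy comparison: a flat spectrum scores higher than a rapidly decaying one.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
flat = Q @ np.diag(np.ones(50)) @ Q.T
decaying = Q @ np.diag(np.geomspace(1.0, 1e-12, 50)) @ Q.T
print(truncated_entropy(flat), truncated_entropy(decaying))
```

Under this reading, a matrix whose retained singular values are spread evenly scores higher than one whose spectrum is dominated by a few large values, which fits the correlation drawn between higher truncated entropy and faster training.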
The paper offers a detailed analytical and empirical investigation based on two neural network architectures: Random Feature Models (RFMs) and two-layer neural networks.
For RFMs, where PDE solving can be cast as a linear least-squares problem (\(Aa=f\)), the analysis focuses on the singular values of the system matrix \(A\). The findings reveal that:
1. The large singular values of the AD-derived matrix (\(A_{AD}\)) and the FD-derived matrix (\(A_{FD}\)) closely agree, consistent with FD being a numerical approximation of AD.
2. Crucially, the small singular values of \(A_{FD}\) are consistently larger than those of \(A_{AD}\). This discrepancy matters because solving \(Aa=f\) typically involves computing a pseudo-inverse of \(A\), and very small singular values introduce substantial computational error or numerical instability unless they are truncated. Because the small singular values of \(A_{AD}\) are genuinely negligible, truncation cleanly discards the corresponding directions, yielding a more accurate pseudo-inverse and lower training error (see the sketch after this list).
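To make the truncation step concrete, here is a minimal, self-contained sketch (not the paper's code) of solving \(Aa=f\) with a truncated-SVD pseudo-inverse; the threshold `rcond` and the synthetic spectra are arbitrary illustrative choices.

```python
import numpy as np

def truncated_pinv_solve(A, f, rcond=1e-8):
    """Least-squares solve of A a = f via a truncated-SVD pseudo-inverse.

    Singular values below rcond * max(singular values) are treated as zero,
    so the numerically unstable directions they span are discarded rather
    than inverted.  Which directions survive the cut is decided entirely by
    the small end of the spectrum -- exactly where A_AD and A_FD differ.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rcond * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]          # invert only the retained singular values
    return Vt.T @ (s_inv * (U.T @ f))

# Quick usage on a random over-determined system.
rng = np.random.default_rng(0)
A = rng.standard_normal((120, 100))
f = rng.standard_normal(120)
a = truncated_pinv_solve(A, f)
print("residual norm:", np.linalg.norm(A @ a - f))

# Two synthetic spectra that agree on their large singular values but differ
# at the small end (a caricature of the A_AD vs A_FD comparison): the number
# of directions surviving the truncation changes accordingly.
s_decaying = np.geomspace(1.0, 1e-14, 100)       # tail decays toward zero
s_lifted = np.maximum(s_decaying, 1e-6)          # same head, lifted tail
print((s_decaying > 1e-8).sum(), "vs", (s_lifted > 1e-8).sum())
```

NumPy's own `np.linalg.pinv` and `np.linalg.lstsq` apply the same kind of relative-threshold truncation through their `rcond` argument, so the question of which singular values survive the cut arises in standard least-squares solvers as well.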
For two-layer neural networks, where training involves non-convex optimization via gradient descent, the analysis shifts to the eigenvalues of the kernel matrix \(G\) that governs the gradient-descent dynamics. Similar to the RFM case, the authors find that:
1. Large eigenvalues of the AD-derived kernel (\(G_{AD}\)) and FD-derived kernel (\(G_{FD}\)) are comparable.
2. However, the small eigenvalues of \(G_{FD}\) are again larger than those of \(G_{AD}\): the FD kernel carries a greater number of small but non-negligible eigenvalues, and the residual components along those eigendirections decay slowly under gradient descent, ultimately resulting in slower training and a potentially higher final training error than with AD (see the sketch after this list).
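The mechanism behind this claim is the standard linearized picture of gradient descent: writing the residual in the eigenbasis of \(G\), the component along an eigendirection with eigenvalue \(\lambda\) shrinks by roughly a factor of \((1-\eta\lambda)\) per step with learning rate \(\eta\), so directions with small eigenvalues converge very slowly. The minimal Python sketch below illustrates this decay; the eigenvalues, learning rate, and step count are arbitrary illustrative numbers, not values from the paper.

```python
def remaining_fraction(lam, steps, lr=0.5):
    """Fraction of the initial residual left along an eigendirection of G with
    eigenvalue `lam` after `steps` gradient-descent steps, under the standard
    linearized dynamics r_{t+1} = (I - lr * G) r_t."""
    return (1.0 - lr * lam) ** steps

# Directions with large eigenvalues are fit almost immediately; directions
# with small eigenvalues barely move, so they dominate the late-stage loss.
for lam in [1.0, 1e-1, 1e-3, 1e-5]:
    print(f"lambda = {lam:.0e}: remaining fraction after 1000 steps = "
          f"{remaining_fraction(lam, 1000):.3e}")
```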
Through comprehensive experimental validation across various PDEs (1D Poisson, 2D Poisson, Biharmonic, Allen-Cahn) and network structures (RFMs, shallow NNs, and deep NNs), the paper confirms these theoretical insights, consistently showing AD's faster convergence and lower training errors.
The work builds upon established foundations in:
* Physics-Informed Neural Networks (PINNs): A prominent framework for using neural networks to solve differential equations.
* Automatic Differentiation (AD): The cornerstone technique for exact gradient computation in deep learning, enabling backpropagation.
* Numerical Differentiation (e.g., Finite Difference): Traditional methods for approximating derivatives, often simpler to apply but less precise than AD (a small comparison sketch follows this list).
* Random Feature Models: A simplified neural network architecture that reduces training to a convex optimization problem, facilitating analytical study.
* Spectral Analysis (Singular Value Decomposition, Eigenvalue Analysis): Standard tools from linear algebra for understanding matrix properties and their impact on system solutions and optimization dynamics.
* Optimization Algorithms (Gradient Descent): The fundamental iterative methods used to train neural networks.
* Concepts of Spectral Entropy and Effective Rank: From information theory and linear algebra, providing a basis for the newly defined “truncated entropy.”
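As a concrete illustration of the Automatic Differentiation and Numerical Differentiation items above, the hedged sketch below compares a second derivative computed with `torch.autograd` against a central finite difference on a toy function; the test function, evaluation point, and step size are arbitrary choices, not taken from the paper.

```python
import torch

# A smooth stand-in for a network output u(x); in a PINN, the PDE residual
# requires derivatives of u with respect to the input x.
def u(x):
    return torch.sin(3.0 * x) * torch.exp(-x)

x = torch.tensor(0.7, dtype=torch.float64, requires_grad=True)

# Automatic differentiation: exact derivatives (up to floating point) via the chain rule.
ux = torch.autograd.grad(u(x), x, create_graph=True)[0]
uxx_ad = torch.autograd.grad(ux, x)[0]

# Central finite difference for u'': O(h^2) truncation error, plus round-off
# that grows as h shrinks.
h = 1e-3
uxx_fd = (u(x + h) - 2.0 * u(x) + u(x - h)) / h**2

# Closed-form reference: u''(x) = (-8 sin(3x) - 6 cos(3x)) e^{-x}.
xv = x.detach()
uxx_exact = (-8.0 * torch.sin(3.0 * xv) - 6.0 * torch.cos(3.0 * xv)) * torch.exp(-xv)
print("AD error:", abs(uxx_ad.item() - uxx_exact.item()))
print("FD error:", abs(uxx_fd.item() - uxx_exact.item()))
```

Shrinking `h` reduces the \(O(h^2)\) truncation error but eventually amplifies floating-point round-off, the usual accuracy trade-off that FD faces and AD avoids.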