The Infinite-Width Limit: A Comprehensive Analysis of Neural Tangent Kernels, Feature Learning, and Scaling Laws

1. Introduction: The Unreasonable Effectiveness of Overparameterization

The theoretical understanding of deep neural networks has undergone a fundamental transformation over the last decade. Historically, statistical learning theory relied on concepts such as the Vapnik-Chervonenkis (VC) dimension and Rademacher complexity to explain the generalization capabilities of machine learning models. These classical frameworks suggested a trade-off between model complexity and generalization: as the number of parameters increases, the model’s capacity to fit the training data improves, but the risk of overfitting—memorizing noise rather than learning signal—escalates correspondingly. This view implies a “sweet spot” of model complexity, beyond which test performance should degrade.1

However, the empirical reality of modern deep learning stands in stark contradiction to this classical wisdom. State-of-the-art neural networks are massively overparameterized, often possessing parameters numbering in the billions or trillions, far exceeding the number of training samples available. Yet, rather than overfitting, these models exhibit a phenomenon known as “double descent”: as model size increases beyond the interpolation threshold (where training error reaches zero), test error frequently continues to decrease, defying the U-shaped curve predicted by traditional bias-variance trade-offs.1

To resolve this paradox, the theoretical community turned to asymptotic analysis, specifically studying the behavior of neural networks as their width—the number of neurons in hidden layers—approaches infinity. This “infinite-width limit” serves a role analogous to the thermodynamic limit in statistical physics: by taking the number of components to infinity, individual fluctuations average out, revealing deterministic macroscopic laws that govern the system’s behavior.3

This line of inquiry led to the discovery of the Neural Tangent Kernel (NTK) in 2018, a mathematical object that describes the training dynamics of infinitely wide networks as a linear evolution in function space.4 The NTK framework provided the first rigorous proof that massive, overparameterized networks can be trained to global optimality. It established a duality between training neural networks with gradient descent and performing kernel ridge regression, suggesting that deep learning could be understood through the lens of kernel methods.4

Despite its mathematical elegance, the NTK regime (often termed “lazy training”) failed to capture arguably the most critical aspect of deep learning: the ability to learn data-dependent features. Empirical evidence demonstrated a significant performance gap between NTK predictors and finite-width networks on complex tasks like ImageNet, suggesting that practical networks operate in a different regime.5 This realization catalyzed the development of alternative scaling theories, most notably the Maximal Update Parameterization ($\mu$P) and the Mean-Field limit, which preserve feature learning and representation alignment even as width approaches infinity.7

This report provides an exhaustive analysis of these regimes. It details the mathematical foundations of the Gaussian Process (GP) limit at initialization, the freezing of the kernel in the NTK regime, and the specific scaling laws required to unlock feature learning. It further explores the application of these theories to modern architectures, including Residual Networks (ResNets) and Transformers, identifying critical scaling adjustments (such as Depth-$\mu$P and attention head scaling) necessary to maintain stability and expressivity in deep, large-scale models.

2. The Gaussian Process Limit at Initialization

Before analyzing the dynamics of training, it is essential to understand the statistical properties of neural networks at initialization. The foundational link between wide neural networks and probabilistic methods was established by Radford Neal in 1994, fundamentally shifting the perspective from geometric to probabilistic analysis.3

2.1 Neal’s Theorem and the Central Limit Theorem

Neal proved that a single-hidden-layer neural network with random weights converges to a Gaussian Process (GP) as the number of hidden units approaches infinity.3 The intuition relies on the Central Limit Theorem (CLT). Consider a neuron in the first hidden layer. Its pre-activation value is a weighted sum of the inputs. If the weights are independent and identically distributed (i.i.d.) with zero mean and finite variance, and the number of inputs is large, the pre-activation distribution approaches a Gaussian.

When we extend this to a network with a large number of hidden units, the output of the network becomes a normalized sum of many i.i.d. variables (the contributions of each hidden neuron). Consequently, as the width tends to infinity, the distribution of functions represented by the network at initialization converges to a Gaussian Process. This means that for any finite set of input points $\{x_1, \dots, x_k\}$, the joint distribution of the network outputs $\{f(x_1), \dots, f(x_k)\}$ is a multivariate Gaussian distribution.3

This result implies that, at initialization, a wide network is simply a random field defined entirely by its mean function (usually zero) and its covariance function (kernel). This “Neural Network Gaussian Process” (NNGP) allows researchers to perform exact Bayesian inference with infinite-width networks without ever training a finite model, simply by computing the kernel and applying standard GP regression formulas.10

2.2 Recursive Covariance Propagation

For deep networks, the Gaussian behavior propagates through the layers. This can be formalized recursively. Let $n_l$ be the width of layer $l$. As $n_l \to \infty$ sequentially (or simultaneously), the pre-activations $z^{(l)}$ at each layer behave as Gaussian processes.3

The covariance kernel $\Sigma^{(l)}(x, x')$ at layer $l$ describes the similarity between the network’s representations of two inputs $x$ and $x'$. For a standard Multilayer Perceptron (MLP) with nonlinearity $\sigma$, the kernel propagates as follows:

  1. Base Case (Input Layer):

    $$\Sigma^{(1)}(x, x') = \frac{\sigma_w^2}{d_{in}} \langle x, x' \rangle + \sigma_b^2$$

    Here, $\sigma_w^2$ and $\sigma_b^2$ are the variances of the weights and biases, respectively.
  2. Inductive Step (Hidden Layers):

    $$\Sigma^{(l+1)}(x, x') = \sigma_w^2 \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Lambda^{(l)})} [\sigma(u)\sigma(v)] + \sigma_b^2$$

    where $\Lambda^{(l)}$ is the covariance matrix of the pre-activations at the previous layer:

    $$\Lambda^{(l)} = \begin{pmatrix} \Sigma^{(l)}(x, x) & \Sigma^{(l)}(x, x') \\ \Sigma^{(l)}(x', x) & \Sigma^{(l)}(x', x') \end{pmatrix}$$

This recursive formula reveals that the covariance at layer $l+1$ is determined by the expected product of the activations from layer $l$.3 This propagation allows for the analytical computation of the NNGP kernel for arbitrary depths, effectively solving the “forward pass” of the infinite network in closed form.
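
To make the recursion concrete, the following sketch iterates it for a ReLU MLP, where the Gaussian expectation has a closed form (the arc-cosine kernel of degree 1). The depth, variances, and inputs are arbitrary illustrative choices, not values taken from the cited papers.

```python
import numpy as np

def relu_expectation(k_xx, k_xy, k_yy):
    """E[ReLU(u) ReLU(v)] for (u, v) ~ N(0, [[k_xx, k_xy], [k_xy, k_yy]]),
    via the arc-cosine kernel formula."""
    norm = np.sqrt(k_xx * k_yy)
    theta = np.arccos(np.clip(k_xy / norm, -1.0, 1.0))
    return norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def nngp_kernel(x, y, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Sigma^{(depth)}(x, y) for a ReLU MLP, computed with the recursion above."""
    d_in = x.shape[0]
    k_xx = sigma_w2 / d_in * (x @ x) + sigma_b2   # base case Sigma^{(1)}
    k_yy = sigma_w2 / d_in * (y @ y) + sigma_b2
    k_xy = sigma_w2 / d_in * (x @ y) + sigma_b2
    for _ in range(depth - 1):                    # inductive step
        k_xy_new = sigma_w2 * relu_expectation(k_xx, k_xy, k_yy) + sigma_b2
        k_xx = sigma_w2 * relu_expectation(k_xx, k_xx, k_xx) + sigma_b2
        k_yy = sigma_w2 * relu_expectation(k_yy, k_yy, k_yy) + sigma_b2
        k_xy = k_xy_new
    return k_xy

x, y = np.array([1.0, 0.0]), np.array([0.8, 0.6])
for depth in (2, 5, 20):
    print(depth, nngp_kernel(x, y, depth))
```

With $\sigma_w^2 = 2$ and $\sigma_b^2 = 0$ (the He choice for ReLU), the diagonal entries stay fixed while the off-diagonal entry drifts only slowly toward its fixed point, previewing the edge-of-chaos behavior discussed in Section 2.3.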

2.3 The Phase Transition: Order vs. Chaos

The properties of the NNGP kernel at deep layers depend critically on the choice of initialization variances ($\sigma_w^2, \sigma_b^2$) and the nonlinearity $\sigma$. Research has identified an “Order-to-Chaos” phase transition in deep networks.13

  • Ordered Phase: If the weights are small, the correlations between inputs decay slowly or converge to a fixed point (often 1). The network maps all inputs to similar outputs, resulting in a very smooth, low-frequency bias.
  • Chaotic Phase: If the weights are large, the correlations decay exponentially with depth. Two slightly different inputs $x$ and $x'$ will have nearly orthogonal representations in deep layers. This corresponds to a highly sensitive, “chaotic” function that is difficult to train.
  • Edge of Chaos: The optimal initialization usually lies on the boundary between these phases (e.g., $\sigma_w^2 = 2$ for ReLU networks, known as He initialization). In this regime, the correlation between inputs is preserved through the layers, allowing signals to propagate deeply without vanishing or exploding.14

This analysis of the NNGP provides the static picture. However, deep learning is fundamentally about dynamics—how the network changes to fit the data. This requires the introduction of the Neural Tangent Kernel.

3. The Neural Tangent Kernel (NTK): Linearization of Training

While the GP limit describes the network before training, the Neural Tangent Kernel (NTK) describes how it evolves during training. Introduced by Jacot et al. in 2018, the NTK framework was a breakthrough because it extended the infinite-width analysis to the optimization trajectory of gradient descent.4

3.1 The Geometry of Parameter Space vs. Function Space

Consider a neural network $f(x; \theta)$ parameterized by a vector $\theta \in \mathbb{R}^P$. We train this network to minimize a loss function $\mathcal{L}$ using gradient descent with a learning rate $\eta$:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

In the continuous time limit (gradient flow), the evolution of the parameters is $\frac{d\theta}{dt} = -\nabla_\theta \mathcal{L}$. Crucially, we are often interested not in the parameters themselves, but in the evolution of the network’s output $f(x; \theta)$ in function space. Using the chain rule, the time derivative of the output for an input $x$ is:

$$\frac{df(x; \theta_t)}{dt} = \nabla_\theta f(x; \theta_t)^T \frac{d\theta_t}{dt} = - \nabla_\theta f(x; \theta_t)^T \nabla_\theta \mathcal{L}$$

If we consider the Mean Squared Error loss over a dataset, the gradient $\nabla_\theta \mathcal{L}$ is a sum over training points $x_j$:

$$\nabla_\theta \mathcal{L} = \sum_{j=1}^N (f(x_j; \theta_t) - y_j) \nabla_\theta f(x_j; \theta_t)$$

Substituting this back, we obtain the evolution equation for the function:

$$\frac{df(x; \theta_t)}{dt} = - \sum_{j=1}^N \underbrace{\langle \nabla_\theta f(x; \theta_t), \nabla_\theta f(x_j; \theta_t) \rangle}_{\Theta_t(x, x_j)} (f(x_j; \theta_t) - y_j)$$

The term $\Theta_t(x, x') = \langle \nabla_\theta f(x; \theta_t), \nabla_\theta f(x'; \theta_t) \rangle$ is the Neural Tangent Kernel at time $t$.4 It is a symmetric, positive semi-definite kernel that governs the dynamics of the network outputs.
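
The same object can be measured directly in a finite network by taking inner products of parameter gradients. The sketch below does this for a one-hidden-layer ReLU network with hand-coded gradients; the architecture, width, and NTK-style $1/\sqrt{n}$ output scaling are illustrative assumptions rather than anything prescribed by the source.

```python
import numpy as np

def param_gradient(x, W, v):
    """Gradient of f(x) = v^T relu(W x) / sqrt(n) with respect to (W, v), flattened."""
    n = v.shape[0]
    pre = W @ x                                   # pre-activations, shape (n,)
    act = np.maximum(pre, 0.0)
    df_dv = act / np.sqrt(n)
    df_dW = ((v * (pre > 0)) / np.sqrt(n))[:, None] * x[None, :]
    return np.concatenate([df_dW.ravel(), df_dv])

def empirical_ntk(X, W, v):
    """Theta_0(x_i, x_j) = <grad_theta f(x_i), grad_theta f(x_j)> for all pairs."""
    J = np.stack([param_gradient(x, W, v) for x in X])   # Jacobian, shape (N, P)
    return J @ J.T

rng = np.random.default_rng(0)
d, n, N = 3, 4096, 5
W, v = rng.standard_normal((n, d)), rng.standard_normal(n)
X = rng.standard_normal((N, d))
print(np.round(empirical_ntk(X, W, v), 3))
```

Re-running this with fresh random draws of $(W, v)$ at large $n$ produces nearly identical Gram matrices, which is exactly the deterministic-kernel property discussed next.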

3.2 The “Frozen” Kernel Property

The most profound discovery of NTK theory is the behavior of $\Theta_t$ as the network width $n$ tends to infinity.

  1. Deterministic Initialization: By the Law of Large Numbers, the empirical kernel $\Theta_0$ at initialization converges to a deterministic limiting kernel $\Theta_\infty$ that depends only on the architecture and nonlinearity, not on the random draw of weights.4
  2. Stability During Training: Most surprisingly, as $n \to \infty$, the kernel $\Theta_t$ remains constant throughout the training process: $\Theta_t \approx \Theta_0$.

This “frozen kernel” property arises because in a sufficiently wide network, individual weights need to change only by an infinitesimal amount $O(1/\sqrt{n})$ to effect a finite change in the output. Since the weights barely move, the feature map $\nabla_\theta f(x; \theta)$ (which depends on the weights) also remains approximately constant.1

3.3 Equivalence to Kernel Ridge Regression

Because the kernel $\Theta_t$ is constant ($\Theta_t \approx \Theta_\infty$), the training dynamics become linear. The differential equation governing the output becomes:

$$\frac{df_t(x)}{dt} = - \sum_{j=1}^N \Theta_\infty(x, x_j) (f_t(x_j) - y_j)$$

This is a linear ordinary differential equation (ODE) that can be solved analytically. The solution path $f_t(x)$ is identical to the path taken by gradient descent on a linear model with fixed features $\phi(x) = \nabla_\theta f(x; \theta_0)$.
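
Concretely, writing $\Theta$ for the Gram matrix $\Theta_\infty(X_{train}, X_{train})$ and restricting the dynamics to the training inputs, the linear ODE has the standard closed-form solution

$$f_t(X_{train}) - Y_{train} = e^{-\Theta t} \left( f_0(X_{train}) - Y_{train} \right),$$

so the residual along each eigendirection of $\Theta$ decays at a rate set by the corresponding eigenvalue: large-eigenvalue (typically smooth, low-frequency) components are fit first.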

Furthermore, if the loss function is convex (like MSE), the network converges to a unique global minimum. The final function learned by the infinite-width network is exactly the kernel regression solution (kernel ridge regression with zero ridge) using the NTK:

$$f_{\infty}(x) = \Theta_\infty(x, X_{train}) (\Theta_\infty(X_{train}, X_{train}))^{-1} Y_{train}$$

(assuming zero initialization for simplicity).4

This equivalence provided a rigorous explanation for the training stability of overparameterized networks. It showed that “training” a massive network in this regime is effectively just projecting the target function onto the tangent space defined by the random initialization.
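
A minimal sketch of this predictor, assuming a kernel function theta(x, y) is available (for instance the empirical estimate above, or the analytic recursion of Section 3.4); the jitter term and the use of np.linalg.solve are numerical conveniences, not part of the theory:

```python
import numpy as np

def ntk_predict(theta, X_train, Y_train, X_test, jitter=1e-8):
    """f(x) = Theta(x, X) Theta(X, X)^{-1} Y in the zero-initialization limit,
    given a kernel function theta(x, y)."""
    K_train = np.array([[theta(a, b) for b in X_train] for a in X_train])
    K_cross = np.array([[theta(a, b) for b in X_train] for a in X_test])
    # Tiny diagonal jitter for numerical stability (acts like a vanishing ridge).
    alpha = np.linalg.solve(K_train + jitter * np.eye(len(X_train)), Y_train)
    return K_cross @ alpha
```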

3.4 Computing the NTK

Like the NNGP, the NTK can be computed recursively for deep networks. For fully connected networks, the recursion involves both the covariance of the activations ($\Sigma^{(l)}$) and the derivative of the nonlinearity ($\dot{\sigma}$).3

Let $\dot{\Sigma}^{(l+1)}(x, x') = \sigma_w^2 \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Lambda^{(l)})}[\dot{\sigma}(u)\dot{\sigma}(v)]$ denote the derivative covariance, with $\Lambda^{(l)}$ defined as in Section 2.2. The NTK recursion is:

$$\Theta^{(l+1)}(x, x') = \Theta^{(l)}(x, x') \dot{\Sigma}^{(l+1)}(x, x') + \Sigma^{(l+1)}(x, x')$$

The first term propagates the contributions of all earlier layers’ parameters through the current layer (via the derivative kernel $\dot{\Sigma}^{(l+1)}$), while the second term captures the contribution of the current layer’s own weights. This recursive structure allows efficient computation of the exact infinite-width limit for various architectures, including CNNs (CNTK) and Graph Neural Networks.6
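
For ReLU nonlinearities, both Gaussian expectations in the recursion have closed forms (the degree-1 and degree-0 arc-cosine kernels), so the limiting NTK of an MLP can be evaluated exactly. The sketch below is a minimal, bias-free version with $\sigma_w^2 = 2$, mirroring the NNGP code in Section 2.2; it is an illustration, not a reference implementation.

```python
import numpy as np

def relu_ntk(x, y, depth, sigma_w2=2.0):
    """Infinite-width NTK Theta^{(depth)}(x, y) of a bias-free ReLU MLP."""
    d_in = x.shape[0]
    k_xx = sigma_w2 / d_in * (x @ x)              # Sigma^{(1)} entries
    k_yy = sigma_w2 / d_in * (y @ y)
    k_xy = sigma_w2 / d_in * (x @ y)
    theta = k_xy                                  # Theta^{(1)} = Sigma^{(1)}
    for _ in range(depth - 1):
        norm = np.sqrt(k_xx * k_yy)
        angle = np.arccos(np.clip(k_xy / norm, -1.0, 1.0))
        # Sigma^{(l+1)} and Sigma_dot^{(l+1)} from the arc-cosine formulas.
        k_xy_new = sigma_w2 * norm * (np.sin(angle) + (np.pi - angle) * np.cos(angle)) / (2 * np.pi)
        k_dot = sigma_w2 * (np.pi - angle) / (2 * np.pi)
        k_xx, k_yy = sigma_w2 * k_xx / 2, sigma_w2 * k_yy / 2   # E[relu(u)^2] = k/2
        theta = theta * k_dot + k_xy_new          # NTK recursion
        k_xy = k_xy_new
    return theta

x, y = np.array([1.0, 0.0]), np.array([0.8, 0.6])
print(relu_ntk(x, y, depth=3))
```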

4. The Lazy vs. Rich Dichotomy: When Theory Meets Practice

While the NTK framework provides a powerful theoretical tool, it introduced a significant dissonance. Empirical studies quickly revealed that the NTK approximation does not fully capture the performance of state-of-the-art neural networks. Real networks, especially those trained with standard parameterizations and hyperparameters, often outperform their corresponding NTK kernels, sometimes by large margins.5

This discrepancy led to the categorization of training regimes into two distinct phases: Lazy Training (the NTK regime) and Rich Training (the Feature Learning regime).18

4.1 Characteristics of the Lazy (NTK) Regime

In the lazy regime, the network behaves as a linearized model.

  • Mechanism: The weights $\theta$ stay very close to their initialization $\theta_0$. The change $\|\theta – \theta_0\|$ is small enough that the curvature of the loss landscape is negligible.
  • Feature Evolution: The internal representations (hidden layer features) are static. The network does not learn “new” features; it reweights the random features provided at initialization.
  • Inductive Bias: The generalization properties are determined entirely by the spectral properties of the initial kernel $\Theta_0$. If the target function aligns well with the eigenfunctions of the NTK (usually low-frequency components), learning is fast and generalization is good. If not, the model fails to learn efficiently.21
  • Scaling: This regime is reached when the network output is scaled up at initialization (large $\alpha$ in the notation of Section 4.3 below) or when the NTK parameterization is used with width-independent learning rates. Either choice suppresses the required weight movement relative to the initialization, locking the features in place.

4.2 Characteristics of the Rich (Feature Learning) Regime

In the rich regime, the network significantly departs from its initialization, and the internal representations evolve to adapt to the data structure.

  • Mechanism: The weights move significantly, traversing the non-linear manifold of the loss landscape. The Taylor expansion around $\theta_0$ breaks down.
  • Feature Evolution: The hidden layers actively learn features. For example, in a CNN, early layers evolve from random noise to Gabor-like edge detectors. In a transformer, attention heads specialize to track syntactic dependencies. This adaptation is crucial for transfer learning and generalization on complex tasks.5
  • Mean-Field Limit: This regime is mathematically described by the Mean-Field limit (distinct from the NTK limit). Here, the network is viewed as a distribution of particles (neurons). As width $n \to \infty$, the training dynamics are described by a Wasserstein gradient flow on the probability distribution of the weights (a McKean-Vlasov-type distributional dynamics).24 Unlike the NTK, the kernel in the Mean-Field limit is data-dependent and evolves over time.

4.3 The Role of Initialization Scale ($\alpha$)

The transition between Lazy and Rich regimes can be controlled by a scaling parameter $\alpha$ that governs the magnitude of the model’s output at initialization.20

Consider a network parameterized as $f(x) = \alpha \sum_{i=1}^n v_i \sigma(w_i^T x)$.

  • Large $\alpha$ (Lazy): If $\alpha$ is large (e.g., $O(1)$ while weights are small), the output is sensitive to small changes in $v_i$. The model can fit the data with infinitesimal weight updates, preserving the linearization.
  • Small $\alpha$ (Rich): If $\alpha$ is small (e.g., $O(1/n)$), the initial output is small. The gradients are such that the weights must grow and align with the signal to reduce the error. This forces the weights to travel far from initialization, inducing feature learning.19

Chizat et al. demonstrated that by simply varying this scale, one can interpolate between the kernel regime and the feature learning regime, showing that they are two ends of a continuum.20
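
This interpolation is easy to observe numerically: train the same small network at several output scales $\alpha$ and record how far the weights travel. The sketch below uses toy data, full-batch gradient descent, and a learning rate divided by $\alpha^2$ so that the function-space step size is comparable across scales (mirroring the rescaled objective in the lazy-training analysis); all constants are arbitrary and chosen only for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 5, 256, 20
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)                     # arbitrary regression targets

def relative_weight_movement(alpha, steps=3000, base_lr=0.2 / 256):
    """Train f(x) = alpha * v^T relu(W x) with full-batch GD and return
    ||theta_T - theta_0|| / ||theta_0||."""
    W = rng.standard_normal((n, d)) / np.sqrt(d)
    v = rng.standard_normal(n) / np.sqrt(n)
    W0, v0 = W.copy(), v.copy()
    lr = base_lr / alpha**2                    # compensate for the output scale
    for _ in range(steps):
        pre = X @ W.T                          # (N, n) pre-activations
        act = np.maximum(pre, 0.0)
        err = (alpha * act @ v - y) / N        # d(mean MSE)/d(output)
        grad_v = alpha * act.T @ err
        grad_W = alpha * ((err[:, None] * (pre > 0)) * v).T @ X
        v -= lr * grad_v
        W -= lr * grad_W
    move = np.sqrt(np.sum((W - W0) ** 2) + np.sum((v - v0) ** 2))
    return move / np.sqrt(np.sum(W0 ** 2) + np.sum(v0 ** 2))

for alpha in (0.3, 1.0, 3.0, 10.0, 30.0):
    print(f"alpha = {alpha:5.1f}   relative movement = {relative_weight_movement(alpha):.4f}")
```

The qualitative trend is the point: as $\alpha$ grows, the network fits the same data while moving its weights less and less, i.e., it becomes progressively lazier.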

5. Architectural Limits and Scaling Laws

The general theory of infinite-width networks applies to all architectures, but the specific mechanics of convergence and the quality of the limit depend heavily on the architecture type (ResNet vs. Transformer vs. CNN) and the specific parameterization used.

5.1 Deep Residual Networks (ResNets)

ResNets introduce skip connections ($x_{l+1} = x_l + f_l(x_l)$), which fundamentally alter signal propagation.

  • Signal Explosion: In a standard ResNet, if the residual branch $f_l$ has variance $O(1)$, the variance of the signal $x_L$ at the output grows linearly with depth $L$. For extremely deep networks ($L \to \infty$), this leads to exploding forward signals and gradients.
  • Depth-$\mu$P: To enable feature learning in infinitely deep ResNets, the residual branch must be scaled down. The Depth-$\mu$P scaling suggests multiplying the residual branch by $1/\sqrt{L}$.25 This ensures that the total variance at the output remains $O(1)$ regardless of depth.
  • Impact on Features: The Depth-$\mu$P analysis 25 notes that this scaling maximizes “feature diversity.” Without it, deep networks either fall into the lazy regime (if the initialization is large) or suffer from rank collapse (if the initialization is too small). With the $1/\sqrt{L}$ scaling, the contribution of each layer is balanced, allowing the network to learn complex hierarchical features (see the numerical sketch below).
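
The variance argument above is easy to check numerically. The sketch below propagates random inputs through a toy residual stack whose branch output has $O(1)$ variance (here, a random linear map of the RMS-normalized stream), with and without the $1/\sqrt{L}$ multiplier; the sizes and the linear branch are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_variance(L, width=512, scale_branch=False, batch=64):
    """Variance of the residual stream after L random residual blocks."""
    x = rng.standard_normal((batch, width))
    gamma = 1.0 / np.sqrt(L) if scale_branch else 1.0
    for _ in range(L):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        h = x / x.std(axis=1, keepdims=True)   # keep the branch input at unit scale
        x = x + gamma * (h @ W.T)              # skip connection plus (scaled) branch
    return x.var()

for L in (4, 16, 64, 256):
    print(f"L={L:4d}   plain: {output_variance(L):7.1f}   "
          f"1/sqrt(L): {output_variance(L, scale_branch=True):5.2f}")
```

Without the multiplier the output variance grows roughly linearly in $L$; with it, the variance stays $O(1)$ at every depth.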

5.2 Transformers and Attention Collapse

Transformers pose a unique challenge because they possess two distinct dimensions that can be taken to infinity: the embedding width $N$ (or $d_{model}$) and the number of attention heads $H$.

  • Infinite Embedding Dimension ($N \to \infty$):
    • The Scaling Issue: Standard attention uses a scaling factor of $1/\sqrt{N}$ in the softmax: $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{N}})V$.
    • Attention Collapse: Recent research by Bordelon et al. 26 and Yang 28 reveals that in the strict $N \to \infty$ limit with standard parameterization, the attention mechanism degenerates. Specifically, all attention heads collapse to the same dynamics, effectively behaving as a single-head model. This occurs because the fluctuations that distinguish heads are suppressed by the Law of Large Numbers.
    • The Fix: To preserve head diversity and feature learning in the $N \to \infty$ limit, the scaling must be adjusted. Some theories suggest a $1/N$ scaling (the $\mu$P scaling for attention) or specific reparameterizations of the query/key weights to maintain $O(1)$ updates that are distinct across heads.26
  • Infinite Heads ($H \to \infty$):
    • Hron et al. 26 investigated the limit where the number of heads $H \to \infty$ while keeping $N$ fixed. In this regime, the output of the multi-head attention block converges to a Gaussian Process. This limit is often more stable and easier to analyze but may not capture the “feature selection” capability of finite-head transformers where specific heads attend to specific syntactic structures.27

5.3 Convolutional Networks (CNNs) and the CNTK

The NTK for convolutional networks (CNTK) incorporates the weight sharing and locality of convolutions.

  • Performance: The CNTK performs significantly better than the MLP NTK on image tasks because it encodes translation invariance. On CIFAR-10, CNTK (with global average pooling and data augmentation) can achieve ~89% accuracy.6
  • The Gap: Despite this high performance, it still lags behind standard ResNets (which reach >96% on CIFAR-10). The gap is even wider on ImageNet. This confirms that while the inductive bias of convolution is powerful, the adaptive features learned by deep CNNs (hierarchical composition of parts) are not fully captured by the static CNTK kernel.30

6. Maximal Update Parameterization ($\mu$P): The “Grand Unified Theory” of Scaling

The limitations of the NTK/Lazy regime and the instability of standard parameterization (SP) in large models led to the development of the Tensor Programs framework and Maximal Update Parameterization ($\mu$P) by Greg Yang and collaborators.7

6.1 The Problem with Standard Parameterization (SP)

In standard parameterization (e.g., PyTorch default), weights are typically initialized with variance $\sigma^2 \propto 1/n$ to ensure that activations have $O(1)$ variance. However, the learning rate is usually treated as a scalar independent of width.

  • Vanishing Updates: As width $n \to \infty$, the gradient updates to the hidden layers in SP scale as $O(1/n)$ or $O(1/\sqrt{n})$. This means that for very wide networks, the weights effectively stop moving. The network enters the Lazy regime by default, or worse, different layers learn at vastly different speeds, leading to instability.

6.2 The $\mu$P Solution: Ensuring Maximal Feature Learning

$\mu$P is a principled set of scaling rules derived to ensure that every layer learns features maximally. This means that as width $n \to \infty$, the change in the pre-activations $\Delta z$ due to a gradient update is $O(1)$—neither vanishing (lazy) nor exploding (unstable).7

The derivation relies on “Tensor Programs,” a formal language for tracking the scaling exponents of every computation in the network graph. By balancing the exponents, $\mu$P prescribes specific scalings for initialization and learning rates.

Table 1: Scaling Rules for Standard (SP), NTK, and $\mu$P Parameterizations ($n = \text{width}$) 7

| Parameter Type | Quantity | Standard Param (SP) | NTK Param | $\mu$P (Maximal Update) |
|---|---|---|---|---|
| Input Weights ($W_{in}$) | Init Var | $\propto 1$ | $\propto 1$ | $\propto 1$ |
| Input Weights ($W_{in}$) | LR | $\propto 1$ | $\propto 1$ | $\propto n$ |
| Hidden Weights ($W_{hid}$) | Init Var | $\propto 1/n$ | $\propto 1$ | $\propto 1/n$ |
| Hidden Weights ($W_{hid}$) | LR | $\propto 1$ | $\propto 1$ | $\propto 1/n$ |
| Output Weights ($W_{out}$) | Init Var | $\propto 1/n$ | $\propto 1$ | $\propto 1/n^2$ |
| Output Weights ($W_{out}$) | LR | $\propto 1$ | $\propto 1$ | $\propto 1/n$ |

Note: In the table above, the scaling factors ensure the forward pass is $O(1)$. The Learning Rate (LR) scalings are the critical differentiators. In $\mu$P, the input layer gets a massive learning rate boost ($O(n)$) to drive feature learning from the raw data, while the output layer is scaled down to prevent logits from exploding.
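
A programmatic restatement of Table 1 is given below as a sketch: it expresses the initialization and SGD learning-rate multipliers relative to a tuned base width, which is how the rules are typically applied in practice. Real implementations (for example the open-source mup package) handle many additional details, such as attention layers and optimizer-specific rules.

```python
def mup_multipliers(width, base_width=128):
    """Width-dependent multipliers for init std and SGD learning rate (per Table 1),
    relative to a base width at which hyperparameters were tuned."""
    m = width / base_width                       # relative width
    return {
        # layer type:     (init_std multiplier, learning-rate multiplier)
        "input_weights":  (1.0,            m),         # Init Var ~ 1,     LR ~ n
        "hidden_weights": (1.0 / m**0.5,   1.0 / m),   # Init Var ~ 1/n,   LR ~ 1/n
        "output_weights": (1.0 / m,        1.0 / m),   # Init Var ~ 1/n^2, LR ~ 1/n
    }

# Example: scale hyperparameters tuned at width 128 up to width 4096.
for name, (std_mult, lr_mult) in mup_multipliers(4096).items():
    print(f"{name:15s}  init_std x {std_mult:.4f}   lr x {lr_mult:.4f}")
```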

6.3 Practical Application: $\mu$Transfer

The most significant practical application of this theory is $\mu$Transfer.32

  • The Principle: Because $\mu$P ensures that training dynamics (loss curves, update magnitudes) converge to a stable limit as $n \to \infty$, the optimal hyperparameters for a very wide model are approximately the same as those for a small-width model.
  • The Workflow:
  1. Parametrize the target architecture (e.g., GPT-3) using $\mu$P.
  2. Train a small “proxy” model (e.g., width 128) and tune hyperparameters (learning rate, schedule, initialization) aggressively.
  3. Copy the optimal hyperparameters directly to the multi-billion parameter model.
  • Results: Empirical results show that this method yields hyperparameters that are near-optimal for the large model, bypassing the need for expensive tuning at scale. This technique was reportedly used to tune models like GPT-3 and huge ResNets with a fraction of the compute cost.34

7. Empirical Validation and Performance Gaps

The theoretical distinction between Lazy and Rich regimes is validated by robust empirical data demonstrating performance gaps across various tasks.

7.1 CIFAR-10 and ImageNet Benchmarks

Comparison of accuracy on standard vision benchmarks highlights the limitation of pure kernel methods.

Table 2: Test Accuracy Comparison (Approximate) 2

| Model / Method | CIFAR-10 Accuracy | ImageNet Top-1 Accuracy |
|---|---|---|
| ResNet-50 (Standard) | ~96% | ~76% |
| Convolutional NTK (CNTK) | ~89% (w/ Augmentation) | ~60-65% (Estimated) |
| Simple NTK (MLP) | ~60% | < 40% |
| $\mu$P ResNet | ~96% | ~76% |

The gap on ImageNet (~10-15%) is critical. It underscores that while CNTK captures translation invariance, it fails to capture the deep hierarchical abstractions (feature learning) required for high-resolution, complex object recognition. The CNTK performs well on “small data” (e.g., CIFAR-10 with few samples), but fails to scale with dataset size as effectively as feature-learning networks.22

7.2 Word Embeddings and Semantic Structure

A striking demonstration of the difference is found in Word2Vec.

  • NTK: Embeddings are essentially fixed random vectors. They possess no semantic structure; the vector arithmetic King - Man + Woman results in a random vector, not Queen.5
  • $\mu$P / Feature Learning: The infinite-width limit of $\mu$P learns embeddings that exhibit correct semantic clustering and algebraic relationships, matching the behavior of finite-width networks trained with SGD.8

8. Finite-Width Corrections and the Neural Tangent Hierarchy

While infinite-width theories are powerful, real networks are finite. How much does “finiteness” matter?

8.1 The $1/n$ Expansion

Researchers have developed perturbation theories to calculate corrections to the infinite-width limit, typically as a power series in $1/n$ (where $n$ is width).35

  • First-Order Correction ($1/n$): Introduces the first interactions between features. The kernel $\Theta_t$ begins to evolve.
  • Neural Tangent Hierarchy: The evolution of the kernel is coupled to 3-point and 4-point correlation functions (higher-order tensors). This hierarchy of coupled differential equations describes how the network departs from Gaussianity.

8.2 Depth vs. Width: The Aspect Ratio

A critical finding is that the validity of the NTK approximation depends on the ratio of depth $L$ to width $n$.

  • If $L/n \to 0$ (Wide and Shallow): The NTK approximation is excellent.
  • If $L/n \sim O(1)$ (Deep and Narrow): The NTK becomes stochastic. The variance of the kernel scales exponentially with $L/n$. In this regime, the “frozen kernel” assumption breaks down completely, and the training dynamics are far more complex and chaotic.35 This suggests that for very deep networks (like modern Transformers), the infinite-width approximation is only valid if the width is scaled significantly faster than the depth.

9. Practical Guide: When to Use Which Theory?

For deep learning practitioners and researchers, understanding these regimes is not just theoretical—it informs model design and debugging.

9.1 When to use NTK / Kernel Methods?

  • Small Data: When training data is scarce (e.g., < 10,000 samples), NTK and CNTK often outperform deep networks because they are less prone to overfitting and provide strong regularization.2
  • Uncertainty Quantification: The NNGP limit allows for exact Bayesian inference, providing calibrated uncertainty estimates that are often better than ensemble methods for small datasets.10
  • Architecture Search: Computing the NTK condition number or a related score at initialization can serve as a cheap proxy for predicting the trainability of a candidate architecture in Neural Architecture Search (NAS), without running full training (a minimal sketch follows below).38
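
A minimal sketch of such a proxy: a toy two-layer network, finite-difference gradients, and the ratio of the largest to smallest eigenvalue of the empirical NTK Gram matrix at initialization. Published NAS scores use autodiff and various refinements; everything here is an illustrative assumption.

```python
import numpy as np

def mlp(params, x):
    """Tiny two-layer ReLU network with NTK-style 1/sqrt(n) output scaling."""
    W1, W2 = params
    return W2 @ np.maximum(W1 @ x, 0.0) / np.sqrt(W1.shape[0])

def ntk_gram(params, X, eps=1e-4):
    """Empirical NTK Gram matrix at initialization via finite-difference gradients."""
    flat = np.concatenate([p.ravel() for p in params])
    shapes = [p.shape for p in params]

    def f(theta, x):
        pieces, i = [], 0
        for s in shapes:
            size = int(np.prod(s))
            pieces.append(theta[i:i + size].reshape(s))
            i += size
        return mlp(pieces, x)

    J = np.array([[(f(flat + eps * e, x) - f(flat - eps * e, x)) / (2 * eps)
                   for e in np.eye(flat.size)] for x in X])
    return J @ J.T

rng = np.random.default_rng(0)
d, n = 4, 32
params = [rng.standard_normal((n, d)), rng.standard_normal(n)]
X = rng.standard_normal((8, d))
eig = np.linalg.eigvalsh(ntk_gram(params, X))
print("NTK condition number at init:", eig[-1] / eig[0])
```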

9.2 When to use Feature Learning / $\mu$P?

  • Large Data / Foundation Models: For ImageNet, LLMs, and large-scale pretraining, feature learning is non-negotiable. The Lazy regime will underperform significantly.
  • Transfer Learning: If the goal is to pretrain on one dataset and finetune on another, you must be in the Rich/$\mu$P regime. NTK features are fixed and cannot adapt to the new task.5
  • Hyperparameter Tuning: Use $\mu$P scaling rules to transfer hyperparameters from small proxy models to large production models, saving massive compute resources.32

10. Conclusion

The study of infinite-width limits has evolved from a tool for proving convergence (NTK) to a framework for engineering better large-scale models ($\mu$P). The initial discovery of the Neural Tangent Kernel provided a comforting, albeit simplified, view of deep learning as kernel regression. However, the field has arguably moved “beyond the NTK,” recognizing that the deviations from this limit—the richness of feature learning, the adaptation of the kernel, and the interactions in deep layers—are where the true power of artificial intelligence resides.

The “Lazy vs. Rich” dichotomy serves as a compass for modern theory. We now understand that “infinity” is not a single destination; the path taken there matters. By choosing the correct parameterization ($\mu$P), we can ensure that even as networks grow to trillion-parameter scales, they retain the dynamic, adaptive capabilities that define intelligent learning, rather than degenerating into static kernel machines. The future of deep learning theory lies in mastering these scaling laws to make the training of foundation models more predictable, stable, and efficient.

Data Sources

3 Neal (1996) – Priors for Infinite Networks.

4 Jacot et al. (2018) – Neural Tangent Kernel.

1 Double Descent & Generalization.

18 Lazy vs Rich Regimes (Chizat et al.).

5 Feature Learning Limits (Yang & Hu).

7 Maximal Update Parameterization ($\mu$P).

8 Tensor Programs IV.

7 $\mu$P Scaling Rules.

26 Infinite Transformer Limits (Hron et al.).

25 Depth-$\mu$P (ResNet Scaling).

32 $\mu$Transfer.

26 Attention Collapse (Bordelon et al.).

2 NTK vs Finite Net Empirical Gaps.

6 Convolutional NTK Performance.

27 Infinite Head Limits.