{"id":9096,"date":"2025-12-26T10:47:34","date_gmt":"2025-12-26T10:47:34","guid":{"rendered":"https:\/\/uplatz.com\/blog\/?p=9096"},"modified":"2025-12-26T10:47:34","modified_gmt":"2025-12-26T10:47:34","slug":"the-infinite-width-limit-a-comprehensive-analysis-of-neural-tangent-kernels-feature-learning-and-scaling-laws","status":"publish","type":"post","link":"https:\/\/uplatz.com\/blog\/the-infinite-width-limit-a-comprehensive-analysis-of-neural-tangent-kernels-feature-learning-and-scaling-laws\/","title":{"rendered":"The Infinite-Width Limit: A Comprehensive Analysis of Neural Tangent Kernels, Feature Learning, and Scaling Laws"},"content":{"rendered":"<h2><b>1. Introduction: The Unreasonable Effectiveness of Overparameterization<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The theoretical understanding of deep neural networks has undergone a fundamental transformation over the last decade. Historically, statistical learning theory relied on concepts such as the Vapnik-Chervonenkis (VC) dimension and Rademacher complexity to explain the generalization capabilities of machine learning models. These classical frameworks suggested a trade-off between model complexity and generalization: as the number of parameters increases, the model&#8217;s capacity to fit the training data improves, but the risk of overfitting\u2014memorizing noise rather than learning signal\u2014escalates correspondingly. This view implies a &#8220;sweet spot&#8221; of model complexity, beyond which test performance should degrade.<\/span><span style=\"font-weight: 400;\">1<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, the empirical reality of modern deep learning stands in stark contradiction to this classical wisdom. State-of-the-art neural networks are massively overparameterized, often possessing parameters numbering in the billions or trillions, far exceeding the number of training samples available. 
Yet, rather than overfitting, these models exhibit a phenomenon known as &#8220;double descent&#8221;: as model size increases beyond the interpolation threshold (where training error reaches zero), test error frequently continues to decrease, defying the U-shaped curve predicted by traditional bias-variance trade-offs.<\/span><span style=\"font-weight: 400;\">1<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To resolve this paradox, the theoretical community turned to asymptotic analysis, specifically studying the behavior of neural networks as their width\u2014the number of neurons in hidden layers\u2014approaches infinity. This &#8220;infinite-width limit&#8221; serves a role analogous to the thermodynamic limit in statistical physics: by taking the number of components to infinity, individual fluctuations average out, revealing deterministic macroscopic laws that govern the system&#8217;s behavior.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This line of inquiry led to the discovery of the Neural Tangent Kernel (NTK) in 2018, a mathematical object that describes the training dynamics of infinitely wide networks as a linear evolution in function space.<\/span><span style=\"font-weight: 400;\">4<\/span><span style=\"font-weight: 400;\"> The NTK framework provided the first rigorous proof that massive, overparameterized networks can be trained to global optimality. It established a duality between training neural networks with gradient descent and performing kernel ridge regression, suggesting that deep learning could be understood through the lens of kernel methods.<\/span><span style=\"font-weight: 400;\">4<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Despite its mathematical elegance, the NTK regime\u2014often termed &#8220;lazy training&#8221;\u2014failed to capture the arguably most critical aspect of deep learning: the ability to learn data-dependent features. 
Empirical evidence demonstrated a significant performance gap between NTK predictors and finite-width networks on complex tasks like ImageNet, suggesting that practical networks operate in a different regime.<\/span><span style=\"font-weight: 400;\">5<\/span><span style=\"font-weight: 400;\"> This realization catalyzed the development of alternative scaling theories, most notably the Maximal Update Parameterization ($\\mu$P) and the Mean-Field limit, which preserve feature learning and representation alignment even as width approaches infinity.<\/span><span style=\"font-weight: 400;\">7<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This report provides an exhaustive analysis of these regimes. It details the mathematical foundations of the Gaussian Process (GP) limit at initialization, the freezing of the kernel in the NTK regime, and the specific scaling laws required to unlock feature learning. It further explores the application of these theories to modern architectures, including Residual Networks (ResNets) and Transformers, identifying critical scaling adjustments (such as Depth-$\\mu$P and attention head scaling) necessary to maintain stability and expressivity in deep, large-scale models.<\/span><\/p>\n<h2><b>2. The Gaussian Process Limit at Initialization<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Before analyzing the dynamics of training, it is essential to understand the statistical properties of neural networks at initialization. 
The foundational link between wide neural networks and probabilistic methods was established by Radford Neal in 1994, fundamentally shifting the perspective from geometric to probabilistic analysis.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<h3><b>2.1 Neal\u2019s Theorem and the Central Limit Theorem<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Neal proved that a single-hidden-layer neural network with random weights converges to a Gaussian Process (GP) as the number of hidden units approaches infinity.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> The intuition relies on the Central Limit Theorem (CLT). Consider a neuron in the first hidden layer. Its pre-activation value is a weighted sum of the inputs. If the weights are independent and identically distributed (i.i.d.) with zero mean and finite variance, and the number of inputs is large, the pre-activation distribution approaches a Gaussian.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When we extend this to a network with an infinite number of hidden units, the output of the network becomes a sum of infinitely many i.i.d. variables (the contributions of each hidden neuron). Consequently, the distribution of functions represented by the network at initialization converges to a Gaussian Process. This means that for any finite set of input points $\\{x_1, \\dots, x_k\\}$, the joint distribution of the network outputs $\\{f(x_1), \\dots, f(x_k)\\}$ is a multivariate Gaussian distribution.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This result implies that, at initialization, a wide network is simply a random field defined entirely by its mean function (usually zero) and its covariance function (kernel). 
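This convergence is easy to check numerically. The sketch below is illustrative only (it assumes a one-hidden-layer ReLU network with unit weight variance and no biases, not a setup from any cited paper): it compares the empirical covariance of the network outputs over many random weight draws with the closed-form Gaussian expectation $\mathbb{E}[\sigma(u)\sigma(v)]$, which for ReLU is the degree-1 arc-cosine kernel.

```python
import numpy as np

# Minimal sketch (illustrative assumptions: one hidden layer, ReLU,
# sigma_w = 1, no biases). The covariance of the random-network outputs
# f(x) = v^T relu(W x / sqrt(d)) / sqrt(n), taken over weight draws,
# should match the Gaussian expectation E[relu(u) relu(v)].
rng = np.random.default_rng(0)
d, n, draws = 3, 512, 4000
x1 = np.array([1.0, 0.5, -0.2])
x2 = np.array([-0.3, 1.0, 0.8])

outs = np.empty((draws, 2))
for t in range(draws):
    W = rng.standard_normal((n, d))
    v = rng.standard_normal(n)
    h1 = np.maximum(W @ x1 / np.sqrt(d), 0.0)
    h2 = np.maximum(W @ x2 / np.sqrt(d), 0.0)
    outs[t] = v @ h1 / np.sqrt(n), v @ h2 / np.sqrt(n)
emp_cov = (outs[:, 0] * outs[:, 1]).mean()

# Closed-form NNGP covariance for ReLU (arc-cosine kernel of degree 1).
K11, K22, K12 = x1 @ x1 / d, x2 @ x2 / d, x1 @ x2 / d
theta = np.arccos(np.clip(K12 / np.sqrt(K11 * K22), -1.0, 1.0))
nngp_cov = (np.sqrt(K11 * K22) / (2 * np.pi)
            * (np.sin(theta) + (np.pi - theta) * np.cos(theta)))

print(f"empirical {emp_cov:.4f}  vs  analytic {nngp_cov:.4f}")
```

The two numbers agree up to Monte Carlo error; widening the network additionally makes the joint distribution of the outputs Gaussian, which is the content of Neal's theorem.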
This &#8220;Neural Network Gaussian Process&#8221; (NNGP) allows researchers to perform exact Bayesian inference with infinite-width networks without ever training a finite model, simply by computing the kernel and applying standard GP regression formulas.<\/span><span style=\"font-weight: 400;\">10<\/span><\/p>\n<h3><b>2.2 Recursive Covariance Propagation<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">For deep networks, the Gaussian behavior propagates through the layers. This can be formalized recursively. Let $n_l$ be the width of layer $l$. As $n_l \\to \\infty$ sequentially (or simultaneously), the pre-activations $z^{(l)}$ at each layer behave as Gaussian processes.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The covariance kernel $\\Sigma^{(l)}(x, x&#8217;)$ at layer $l$ describes the similarity between the network&#8217;s representations of two inputs $x$ and $x&#8217;$. For a standard Multilayer Perceptron (MLP) with nonlinearity $\\sigma$, the kernel propagates as follows:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Base Case (Input Layer):<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">$$\\Sigma^{(1)}(x, x&#8217;) = \\frac{\\sigma_w^2}{d_{in}} \\langle x, x&#8217; \\rangle + \\sigma_b^2$$<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">Here, $\\sigma_w^2$ and $\\sigma_b^2$ are the variances of the weights and biases, respectively.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Inductive Step (Hidden Layers):<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">$$\\Sigma^{(l+1)}(x, x&#8217;) = \\sigma_w^2 
\\mathbb{E}_{(u, v) \\sim \\mathcal{N}(0, \\Lambda^{(l)})} [\\sigma(u)\\sigma(v)] + \\sigma_b^2$$<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">where $\\Lambda^{(l)}$ is the covariance matrix of the pre-activations at the previous layer:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">$$\\Lambda^{(l)} = \\begin{pmatrix} \\Sigma^{(l)}(x, x) &amp; \\Sigma^{(l)}(x, x&#8217;) \\\\ \\Sigma^{(l)}(x&#8217;, x) &amp; \\Sigma^{(l)}(x&#8217;, x&#8217;) \\end{pmatrix}$$<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">This recursive formula reveals that the covariance at layer $l+1$ is determined by the expected product of the activations from layer $l$.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> This propagation allows for the analytical computation of the NNGP kernel for arbitrary depths, effectively solving the &#8220;forward pass&#8221; of the infinite network in closed form.<\/span><\/p>\n<h3><b>2.3 The Phase Transition: Order vs. Chaos<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The properties of the NNGP kernel at deep layers depend critically on the choice of initialization variances ($\\sigma_w^2, \\sigma_b^2$) and the nonlinearity $\\sigma$. Research has identified an &#8220;Order-to-Chaos&#8221; phase transition in deep networks.<\/span><span style=\"font-weight: 400;\">13<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Ordered Phase:<\/b><span style=\"font-weight: 400;\"> If the weights are small, the correlations between inputs decay slowly or converge to a fixed point (often 1). 
The network maps all inputs to similar outputs, resulting in a very smooth, low-frequency bias.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Chaotic Phase:<\/b><span style=\"font-weight: 400;\"> If the weights are large, the correlations decay exponentially with depth. Two slightly different inputs $x$ and $x&#8217;$ will have nearly orthogonal representations in deep layers. This corresponds to a highly sensitive, &#8220;chaotic&#8221; function that is difficult to train.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Edge of Chaos:<\/b><span style=\"font-weight: 400;\"> The optimal initialization usually lies on the boundary between these phases (e.g., $\\sigma_w^2 = 2$ for ReLU networks, known as He initialization). In this regime, the correlation between inputs is preserved through the layers, allowing signals to propagate deeply without vanishing or exploding.<\/span><span style=\"font-weight: 400;\">14<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">This analysis of the NNGP provides the static picture. However, deep learning is fundamentally about <\/span><i><span style=\"font-weight: 400;\">dynamics<\/span><\/i><span style=\"font-weight: 400;\">\u2014how the network changes to fit the data. This requires the introduction of the Neural Tangent Kernel.<\/span><\/p>\n<h2><b>3. The Neural Tangent Kernel (NTK): Linearization of Training<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">While the GP limit describes the network <\/span><i><span style=\"font-weight: 400;\">before<\/span><\/i><span style=\"font-weight: 400;\"> training, the Neural Tangent Kernel (NTK) describes how it evolves <\/span><i><span style=\"font-weight: 400;\">during<\/span><\/i><span style=\"font-weight: 400;\"> training. Introduced by Jacot et al. 
in 2018, the NTK framework was a breakthrough because it extended the infinite-width analysis to the optimization trajectory of gradient descent.<\/span><span style=\"font-weight: 400;\">4<\/span><\/p>\n<h3><b>3.1 The Geometry of Parameter Space vs. Function Space<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Consider a neural network $f(x; \\theta)$ parameterized by a vector $\\theta \\in \\mathbb{R}^P$. We train this network to minimize a loss function $\\mathcal{L}$ using gradient descent with a learning rate $\\eta$:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$\\theta_{t+1} = \\theta_t &#8211; \\eta \\nabla_\\theta \\mathcal{L}(\\theta_t)$$<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In the continuous time limit (gradient flow), the evolution of the parameters is $\\frac{d\\theta}{dt} = -\\nabla_\\theta \\mathcal{L}$. Crucially, we are often interested not in the parameters themselves, but in the evolution of the network&#8217;s output $f(x; \\theta)$ in function space. 
Using the chain rule, the time derivative of the output for an input $x$ is:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$\\frac{df(x; \\theta_t)}{dt} = \\nabla_\\theta f(x; \\theta_t)^T \\frac{d\\theta_t}{dt} = &#8211; \\nabla_\\theta f(x; \\theta_t)^T \\nabla_\\theta \\mathcal{L}$$<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If we consider the Mean Squared Error loss over a dataset, the gradient $\\nabla_\\theta \\mathcal{L}$ is a sum over training points $x_j$:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$\\nabla_\\theta \\mathcal{L} = \\sum_{j=1}^N (f(x_j; \\theta_t) &#8211; y_j) \\nabla_\\theta f(x_j; \\theta_t)$$<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Substituting this back, we obtain the evolution equation for the function:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$\\frac{df(x; \\theta_t)}{dt} = &#8211; \\sum_{j=1}^N \\underbrace{\\langle \\nabla_\\theta f(x; \\theta_t), \\nabla_\\theta f(x_j; \\theta_t) \\rangle}_{\\Theta_t(x, x_j)} (f(x_j; \\theta_t) &#8211; y_j)$$<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The term $\\Theta_t(x, x&#8217;) = \\langle \\nabla_\\theta f(x; \\theta_t), \\nabla_\\theta f(x&#8217;; \\theta_t) \\rangle$ is the <\/span><b>Neural Tangent Kernel<\/b><span style=\"font-weight: 400;\"> at time $t$.<\/span><span style=\"font-weight: 400;\">4<\/span><span style=\"font-weight: 400;\"> It is a symmetric, positive semi-definite kernel that governs the dynamics of the network outputs.<\/span><\/p>\n<h3><b>3.2 The &#8220;Frozen&#8221; Kernel Property<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The most profound discovery of NTK theory is the behavior of $\\Theta_t$ as the network width $n$ tends to infinity.<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Deterministic Initialization:<\/b><span style=\"font-weight: 400;\"> By the Law of Large Numbers, the empirical kernel $\\Theta_0$ at initialization converges 
to a deterministic limiting kernel $\\Theta_\\infty$ that depends only on the architecture and nonlinearity, not on the random draw of weights.<\/span><span style=\"font-weight: 400;\">4<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Stability During Training:<\/b><span style=\"font-weight: 400;\"> Most surprisingly, as $n \\to \\infty$, the kernel $\\Theta_t$ remains <\/span><i><span style=\"font-weight: 400;\">constant<\/span><\/i><span style=\"font-weight: 400;\"> throughout the training process: $\\Theta_t \\approx \\Theta_0$.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">This &#8220;frozen kernel&#8221; property arises because in a sufficiently wide network, individual weights need to change only by an infinitesimal amount $O(1\/\\sqrt{n})$ to effect a finite change in the output. Since the weights barely move, the feature map $\\nabla_\\theta f(x; \\theta)$ (which depends on the weights) also remains approximately constant.<\/span><span style=\"font-weight: 400;\">1<\/span><\/p>\n<h3><b>3.3 Equivalence to Kernel Ridge Regression<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Because the kernel $\\Theta_t$ is constant ($\\Theta_t \\approx \\Theta_\\infty$), the training dynamics become linear. The differential equation governing the output becomes:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$\\frac{df_t(x)}{dt} = &#8211; \\eta \\sum_{j=1}^N \\Theta_\\infty(x, x_j) (f_t(x_j) &#8211; y_j)$$<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is a linear ordinary differential equation (ODE) that can be solved analytically. The solution path $f_t(x)$ is identical to the path taken by gradient descent on a linear model with fixed features $\\phi(x) = \\nabla_\\theta f(x; \\theta_0)$.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Furthermore, if the loss function is convex (like MSE), the network converges to a unique global minimum. 
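These dynamics can be sanity-checked at finite width: for a sufficiently wide network and a small learning rate, one gradient-descent step should change the outputs by approximately $-\eta \, \Theta_0 (f - y)$. Below is a minimal numpy sketch (illustrative, with manual gradients for the toy model $f(x) = v^\top \tanh(Wx)/\sqrt{n}$; all variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eta = 2, 4096, 1e-2
X = rng.standard_normal((4, d))          # 4 training inputs
y = np.array([1.0, -1.0, 0.5, 0.0])

W = rng.standard_normal((n, d))
v = rng.standard_normal(n)

def f(W, v):
    # f(x) = v^T tanh(W x) / sqrt(n), evaluated on all rows of X
    return np.tanh(X @ W.T) @ v / np.sqrt(n)

def per_example_grads(W, v):
    # Gradient of each f(x_i) w.r.t. all parameters, flattened (W part first).
    A = np.tanh(X @ W.T)                     # (4, n) activations
    gv = A / np.sqrt(n)                      # d f_i / d v_k
    gW = (1 - A**2) * v / np.sqrt(n)         # d f_i / d W_kj, before x_j factor
    GW = np.einsum('in,id->ind', gW, X).reshape(4, -1)
    return np.concatenate([GW, gv], axis=1)  # (4, n*d + n)

G = per_example_grads(W, v)
Theta = G @ G.T                              # empirical NTK on the training set

f0 = f(W, v)
err = f0 - y
gradL = G.T @ err                            # gradient of 0.5*sum((f-y)^2)
W1 = W - eta * gradL[:n * d].reshape(n, d)   # one gradient-descent step
v1 = v - eta * gradL[n * d:]
f1 = f(W1, v1)

pred = f0 - eta * Theta @ err                # linearized (frozen-kernel) prediction
print("max deviation from NTK prediction:", np.abs(f1 - pred).max())
```

At this width the deviation is tiny; shrinking $n$ makes the linearization visibly break down, which is exactly the finite-width correction the lazy-training analysis neglects.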
The final function learned by the infinite-width network is exactly the kernel ridge regression solution using the NTK:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$f_{\\infty}(x) = \\Theta_\\infty(x, X_{train}) (\\Theta_\\infty(X_{train}, X_{train}))^{-1} Y_{train}$$<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">(assuming zero initialization for simplicity).4<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This equivalence provided a rigorous explanation for the training stability of overparameterized networks. It showed that &#8220;training&#8221; a massive network in this regime is effectively just projecting the target function onto the tangent space defined by the random initialization.<\/span><\/p>\n<h3><b>3.4 Computing the NTK<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Like the NNGP, the NTK can be computed recursively for deep networks. For fully connected networks, the recursion involves both the covariance of the activations ($\\Sigma^{(l)}$) and the derivative of the nonlinearity ($\\dot{\\sigma}$).<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let $\\dot{\\Sigma}^{(l)}(x, x&#8217;) = \\sigma_w^2 \\mathbb{E}[\\dot{\\sigma}(u)\\dot{\\sigma}(v)]$. The NTK recursion is:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$\\Theta^{(l+1)}(x, x&#8217;) = \\Theta^{(l)}(x, x&#8217;) \\dot{\\Sigma}^{(l+1)}(x, x&#8217;) + \\Sigma^{(l+1)}(x, x&#8217;)$$<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The first term represents the backpropagation of gradients through the weights, while the second term captures the contribution of the weights in the current layer. This recursive structure allows efficient computation of the exact infinite-width limit for various architectures, including CNNs (CNTK) and Graph Neural Networks.<\/span><span style=\"font-weight: 400;\">6<\/span><\/p>\n<h2><b>4. The Lazy vs. 
Rich Dichotomy: When Theory Meets Practice<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">While the NTK framework provides a powerful theoretical tool, it introduced a significant dissonance. Empirical studies quickly revealed that the NTK approximation does not fully capture the performance of state-of-the-art neural networks. Real networks, especially those trained with standard parameterizations and hyperparameters, often outperform their corresponding NTK kernels, sometimes by large margins.<\/span><span style=\"font-weight: 400;\">5<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This discrepancy led to the categorization of training regimes into two distinct phases: <\/span><b>Lazy Training<\/b><span style=\"font-weight: 400;\"> (the NTK regime) and <\/span><b>Rich Training<\/b><span style=\"font-weight: 400;\"> (the Feature Learning regime).<\/span><span style=\"font-weight: 400;\">18<\/span><\/p>\n<h3><b>4.1 Characteristics of the Lazy (NTK) Regime<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">In the lazy regime, the network behaves as a linearized model.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Mechanism:<\/b><span style=\"font-weight: 400;\"> The weights $\\theta$ stay very close to their initialization $\\theta_0$. The change $\\|\\theta &#8211; \\theta_0\\|$ is small enough that the curvature of the loss landscape is negligible.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Feature Evolution:<\/b><span style=\"font-weight: 400;\"> The internal representations (hidden layer features) are static. The network does not learn &#8220;new&#8221; features; it reweights the random features provided at initialization.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Inductive Bias:<\/b><span style=\"font-weight: 400;\"> The generalization properties are determined entirely by the spectral properties of the initial kernel $\\Theta_0$. 
If the target function aligns well with the eigenfunctions of the NTK (usually low-frequency components), learning is fast and generalization is good. If not, the model fails to learn efficiently.<\/span><span style=\"font-weight: 400;\">21<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Scaling:<\/b><span style=\"font-weight: 400;\"> This regime is reached when the initialization scale is large, or when the network output is divided by $\\sqrt{n}$ (the NTK parameterization) rather than by $n$ (the mean-field parameterization). Either choice suppresses the gradient signal relative to the initialization, locking the features in place.<\/span><\/li>\n<\/ul>\n<h3><b>4.2 Characteristics of the Rich (Feature Learning) Regime<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">In the rich regime, the network significantly departs from its initialization, and the internal representations evolve to adapt to the data structure.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Mechanism:<\/b><span style=\"font-weight: 400;\"> The weights move significantly, traversing the non-linear manifold of the loss landscape. The Taylor expansion around $\\theta_0$ breaks down.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Feature Evolution:<\/b><span style=\"font-weight: 400;\"> The hidden layers actively learn features. For example, in a CNN, early layers evolve from random noise to Gabor-like edge detectors. In a transformer, attention heads specialize to track syntactic dependencies. This adaptation is crucial for transfer learning and generalization on complex tasks.<\/span><span style=\"font-weight: 400;\">5<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Mean-Field Limit:<\/b><span style=\"font-weight: 400;\"> This regime is mathematically described by the <\/span><b>Mean-Field limit<\/b><span style=\"font-weight: 400;\"> (distinct from the NTK limit). Here, the network is viewed as a distribution of particles (neurons). 
As width $n \\to \\infty$, the training dynamics are described by a Wasserstein gradient flow on the probability distribution of the weights (the Vlasov equation).<\/span><span style=\"font-weight: 400;\">24<\/span><span style=\"font-weight: 400;\"> Unlike the NTK, the kernel in the Mean-Field limit is <\/span><i><span style=\"font-weight: 400;\">data-dependent<\/span><\/i><span style=\"font-weight: 400;\"> and evolves over time.<\/span><\/li>\n<\/ul>\n<h3><b>4.3 The Role of Initialization Scale (<\/b><b>$\\alpha$<\/b><b>)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The transition between Lazy and Rich regimes can be controlled by a scaling parameter $\\alpha$ that governs the magnitude of the model&#8217;s output at initialization.20<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Consider a network parameterized as $f(x) = \\alpha \\sum_{i=1}^n v_i \\sigma(w_i^T x)$.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Large $\\alpha$ (Lazy):<\/b><span style=\"font-weight: 400;\"> If $\\alpha$ is large (e.g., $O(1)$ while weights are small), the output is sensitive to small changes in $v_i$. The model can fit the data with infinitesimal weight updates, preserving the linearization.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Small $\\alpha$ (Rich):<\/b><span style=\"font-weight: 400;\"> If $\\alpha$ is small (e.g., $O(1\/n)$), the initial output is small. The gradients are such that the weights <\/span><i><span style=\"font-weight: 400;\">must<\/span><\/i><span style=\"font-weight: 400;\"> grow and align with the signal to reduce the error. This forces the weights to travel far from initialization, inducing feature learning.<\/span><span style=\"font-weight: 400;\">19<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Chizat et al. 
demonstrated that by simply varying this scale, one can interpolate between the kernel regime and the feature learning regime, showing that they are two ends of a continuum.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<h2><b>5. Architectural Limits and Scaling Laws<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The general theory of infinite-width networks applies to all architectures, but the specific mechanics of convergence and the quality of the limit depend heavily on the architecture type (ResNet vs. Transformer vs. CNN) and the specific parameterization used.<\/span><\/p>\n<h3><b>5.1 Deep Residual Networks (ResNets)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">ResNets introduce skip connections ($x_{l+1} = x_l + f_l(x_l)$), which fundamentally alter signal propagation.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Signal Explosion:<\/b><span style=\"font-weight: 400;\"> In a standard ResNet, if the residual branch $f_l$ has variance $O(1)$, the variance of the signal $x_L$ at the output grows linearly with depth $L$. For extremely deep networks ($L \\to \\infty$), this leads to exploding forward signals and gradients.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Depth-$\\mu$P:<\/b><span style=\"font-weight: 400;\"> To enable feature learning in infinitely deep ResNets, the residual branch must be scaled down. 
The <\/span><b>Depth-$\\mu$P<\/b><span style=\"font-weight: 400;\"> scaling suggests multiplying the residual branch by $1\/\\sqrt{L}$.<\/span><span style=\"font-weight: 400;\">25<\/span><span style=\"font-weight: 400;\"> This ensures that the total variance at the output remains $O(1)$ regardless of depth.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Impact on Features:<\/b><span style=\"font-weight: 400;\"> Depth-$\\mu$P has also been shown to maximize &#8220;feature diversity.&#8221;<\/span><span style=\"font-weight: 400;\">25<\/span><span style=\"font-weight: 400;\"> Without this scaling, deep networks either fall into the lazy regime (if initialization is large) or suffer from rank collapse (if initialization is too small). With $1\/\\sqrt{L}$ scaling, the contribution of each layer is balanced, allowing the network to learn complex hierarchical features.<\/span><\/li>\n<\/ul>\n<h3><b>5.2 Transformers and Attention Collapse<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Transformers pose a unique challenge because they possess two distinct dimensions that can be taken to infinity: the embedding width $N$ (or $d_{model}$) and the number of attention heads $H$.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Infinite Embedding Dimension ($N \\to \\infty$):<\/b><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The Scaling Issue:<\/b><span style=\"font-weight: 400;\"> Standard attention uses a scaling factor of $1\/\\sqrt{N}$ in the softmax: $\\text{Attention}(Q, K, V) = \\text{softmax}(\\frac{QK^T}{\\sqrt{N}})V$.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Attention Collapse:<\/b><span style=\"font-weight: 400;\"> Recent research by Bordelon et al. 
<\/span><span style=\"font-weight: 400;\">26<\/span><span style=\"font-weight: 400;\"> and Yang <\/span><span style=\"font-weight: 400;\">28<\/span><span style=\"font-weight: 400;\"> reveals that in the strict $N \\to \\infty$ limit with standard parameterization, the attention mechanism degenerates. Specifically, all attention heads collapse to the same dynamics, effectively behaving as a single-head model. This occurs because the fluctuations that distinguish heads are suppressed by the Law of Large Numbers.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The Fix:<\/b><span style=\"font-weight: 400;\"> To preserve head diversity and feature learning in the $N \\to \\infty$ limit, the scaling must be adjusted. Some theories suggest a $1\/N$ scaling (the $\\mu$P scaling for attention) or specific reparameterizations of the query\/key weights to maintain $O(1)$ updates that are distinct across heads.<\/span><span style=\"font-weight: 400;\">26<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Infinite Heads ($H \\to \\infty$):<\/b><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Hron et al. <\/span><span style=\"font-weight: 400;\">26<\/span><span style=\"font-weight: 400;\"> investigated the limit where the number of heads $H \\to \\infty$ while keeping $N$ fixed. In this regime, the output of the multi-head attention block converges to a Gaussian Process. 
This limit is often more stable and easier to analyze but may not capture the &#8220;feature selection&#8221; capability of finite-head transformers where specific heads attend to specific syntactic structures.<\/span><span style=\"font-weight: 400;\">27<\/span><\/li>\n<\/ul>\n<h3><b>5.3 Convolutional Networks (CNNs) and the CNTK<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The NTK for convolutional networks (CNTK) incorporates the weight sharing and locality of convolutions.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Performance:<\/b><span style=\"font-weight: 400;\"> The CNTK performs significantly better than the MLP NTK on image tasks because it encodes translation invariance. On CIFAR-10, CNTK (with global average pooling and data augmentation) can achieve ~89% accuracy.<\/span><span style=\"font-weight: 400;\">6<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Gap:<\/b><span style=\"font-weight: 400;\"> Despite this high performance, it still lags behind standard ResNets (which reach &gt;96% on CIFAR-10). The gap is even wider on ImageNet. This confirms that while the inductive bias of convolution is powerful, the <\/span><i><span style=\"font-weight: 400;\">adaptive<\/span><\/i><span style=\"font-weight: 400;\"> features learned by deep CNNs (hierarchical composition of parts) are not fully captured by the static CNTK kernel.<\/span><span style=\"font-weight: 400;\">30<\/span><\/li>\n<\/ul>\n<h2><b>6. 
Maximal Update Parameterization (<\/b><b>$\\mu$<\/b><b>P): The &#8220;Grand Unified Theory&#8221; of Scaling<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The limitations of the NTK\/Lazy regime and the instability of standard parameterization (SP) in large models led to the development of the <\/span><b>Tensor Programs<\/b><span style=\"font-weight: 400;\"> framework and <\/span><b>Maximal Update Parameterization ($\\mu$P)<\/b><span style=\"font-weight: 400;\"> by Greg Yang and collaborators.<\/span><span style=\"font-weight: 400;\">7<\/span><\/p>\n<h3><b>6.1 The Problem with Standard Parameterization (SP)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">In standard parameterization (e.g., PyTorch default), weights are typically initialized with variance $\\sigma^2 \\propto 1\/n$ to ensure that activations have $O(1)$ variance. However, the learning rate is usually treated as a scalar independent of width.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Vanishing Updates:<\/b><span style=\"font-weight: 400;\"> As width $n \\to \\infty$, the gradient updates to the hidden layers in SP scale as $O(1\/n)$ or $O(1\/\\sqrt{n})$. This means that for very wide networks, the weights effectively stop moving. The network enters the Lazy regime by default, or worse, different layers learn at vastly different speeds, leading to instability.<\/span><\/li>\n<\/ul>\n<h3><b>6.2 The <\/b><b>$\\mu$<\/b><b>P Solution: Ensuring Maximal Feature Learning<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">$\\mu$P is a principled set of scaling rules derived to ensure that <\/span><b>every layer learns features maximally<\/b><span style=\"font-weight: 400;\">. 
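Concretely, the rules fix, for each class of weight tensor, how its initialization variance and learning rate must scale with width. A small hypothetical helper makes this explicit (the width exponents follow the $\mu$P column of Table 1 below; `base_width`, `base_std`, and `base_lr` are placeholder values, in practice tuned on a small proxy model, not prescriptions from this article):

```python
import math

# Hypothetical helper encoding the width-scaling exponents of muP
# (per the muP column of Table 1). Scalings are applied relative to a
# tuned base width, as in the muTransfer workflow; base_std and base_lr
# are illustrative placeholders.
def mup_init_and_lr(layer: str, width: int, base_width: int = 128,
                    base_std: float = 0.02, base_lr: float = 0.01):
    """Return (init_std, learning_rate) for one weight tensor."""
    m = width / base_width                       # width multiplier n / n_base
    if layer == "input":    # init var O(1),    LR grows like n
        return base_std, base_lr * m
    if layer == "hidden":   # init var ~ 1/n,   LR shrinks like 1/n
        return base_std / math.sqrt(m), base_lr / m
    if layer == "output":   # init var ~ 1/n^2, LR shrinks like 1/n
        return base_std / m, base_lr / m
    raise ValueError(f"unknown layer type: {layer}")

for layer in ("input", "hidden", "output"):
    print(layer, mup_init_and_lr(layer, width=4096))
```

Because these exponents keep the per-step feature updates width-independent, hyperparameters tuned at `base_width` remain near-optimal at the target width, which is the basis of the $\mu$Transfer recipe discussed below.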
This means that as width $n \\to \\infty$, the change in the pre-activations $\\Delta z$ due to a gradient update is $O(1)$\u2014neither vanishing (lazy) nor exploding (unstable).<\/span><span style=\"font-weight: 400;\">7<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The derivation relies on &#8220;Tensor Programs,&#8221; a formal language for tracking the scaling exponents of every computation in the network graph. By balancing the exponents, $\\mu$P prescribes specific scalings for initialization and learning rates.<\/span><\/p>\n<p><b>Table 1: Scaling Rules for Standard (SP), NTK, and $\\mu$P Parameterizations ($n = \\text{width}$)<\/b> <span style=\"font-weight: 400;\">7<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Parameter Type<\/b><\/td>\n<td><b>Standard Param (SP)<\/b><\/td>\n<td><b>NTK Param<\/b><\/td>\n<td><b>\u03bcP (Maximal Update)<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Input Weights ($W_{in}$)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1$<\/span><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto n$<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Hidden Weights ($W_{hid}$)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1\/n$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1\/n$<\/span><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1\/n$<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Output Weights ($W_{out}$)<\/b><\/td>\n<td><span 
style=\"font-weight: 400;\">Init Var $\\propto 1\/n$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Init Var $\\propto 1\/n^2$<\/span><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1$<\/span><\/td>\n<td><span style=\"font-weight: 400;\">LR $\\propto 1\/n$<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><i><span style=\"font-weight: 400;\">Note: In the table above, the scaling factors ensure the forward pass is $O(1)$. The Learning Rate (LR) scalings are the critical differentiators. In $\\mu$P, the input layer gets a massive learning rate boost ($O(n)$) to drive feature learning from the raw data, while the output layer is scaled down to prevent logits from exploding.<\/span><\/i><\/p>\n<h3><b>6.3 Practical Application: <\/b><b>$\\mu$<\/b><b>Transfer<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The most significant practical application of this theory is <\/span><b>$\\mu$Transfer<\/b><span style=\"font-weight: 400;\">.<\/span><span style=\"font-weight: 400;\">32<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Principle:<\/b><span style=\"font-weight: 400;\"> Because $\\mu$P ensures that training dynamics (loss curves, update magnitudes) converge to a stable limit as $n \\to \\infty$, the optimal hyperparameters for a maximal-width model are approximately the same as those for a small-width model.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Workflow:<\/b><\/li>\n<\/ul>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Parametrize the target architecture (e.g., GPT-3) using $\\mu$P.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Train a small &#8220;proxy&#8221; model (e.g., width 128) and tune hyperparameters (learning rate, schedule, 
initialization) aggressively.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Copy the optimal hyperparameters <\/span><i><span style=\"font-weight: 400;\">directly<\/span><\/i><span style=\"font-weight: 400;\"> to the multi-billion parameter model.<\/span><\/li>\n<\/ol>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Results:<\/b><span style=\"font-weight: 400;\"> Empirical results show that this method yields hyperparameters that are near-optimal for the large model, bypassing the need for expensive tuning at scale. This technique was reportedly used to tune models like GPT-3 and huge ResNets with a fraction of the compute cost.<\/span><span style=\"font-weight: 400;\">34<\/span><\/li>\n<\/ul>\n<h2><b>7. Empirical Validation and Performance Gaps<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The theoretical distinction between Lazy and Rich regimes is validated by robust empirical data demonstrating performance gaps across various tasks.<\/span><\/p>\n<h3><b>7.1 CIFAR-10 and ImageNet Benchmarks<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Comparison of accuracy on standard vision benchmarks highlights the limitation of pure kernel methods.<\/span><\/p>\n<p><b>Table 2: Test Accuracy Comparison (Approximate)<\/b> <span style=\"font-weight: 400;\">2<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Model \/ Method<\/b><\/td>\n<td><b>CIFAR-10 Accuracy<\/b><\/td>\n<td><b>ImageNet Top-1 Accuracy<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>ResNet-50 (Standard)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">~96%<\/span><\/td>\n<td><span style=\"font-weight: 400;\">~76%<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Convolutional NTK (CNTK)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">~89% (w\/ Augmentation)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">~60-65% (Estimated)<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Simple NTK (MLP)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">~60%<\/span><\/td>\n<td><span 
style=\"font-weight: 400;\">&lt; 40%<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>$\\mu$P ResNet<\/b><\/td>\n<td><span style=\"font-weight: 400;\">~96%<\/span><\/td>\n<td><span style=\"font-weight: 400;\">~76%<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">The gap on ImageNet (~10-15%) is critical. It underscores that while CNTK captures translation invariance, it fails to capture the deep hierarchical abstractions (feature learning) required for high-resolution, complex object recognition. The CNTK performs well on &#8220;small data&#8221; (e.g., CIFAR-10 with few samples), but fails to scale with dataset size as effectively as feature-learning networks.<\/span><span style=\"font-weight: 400;\">22<\/span><\/p>\n<h3><b>7.2 Word Embeddings and Semantic Structure<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">A striking demonstration of the difference is found in Word2Vec.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>NTK:<\/b><span style=\"font-weight: 400;\"> Embeddings are essentially fixed random vectors. They possess no semantic structure; the vector arithmetic King &#8211; Man + Woman results in a random vector, not Queen.<\/span><span style=\"font-weight: 400;\">5<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>$\\mu$P \/ Feature Learning:<\/b><span style=\"font-weight: 400;\"> The infinite-width limit of $\\mu$P learns embeddings that exhibit correct semantic clustering and algebraic relationships, matching the behavior of finite-width networks trained with SGD.<\/span><span style=\"font-weight: 400;\">8<\/span><\/li>\n<\/ul>\n<h2><b>8. Finite-Width Corrections and the Neural Tangent Hierarchy<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">While infinite-width theories are powerful, real networks are finite. 
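The Word2Vec contrast above can be illustrated with a toy experiment. The two-feature "learned" embeddings below are invented for illustration (a royalty axis and a gender axis), not actual Word2Vec vectors; the point is only that the analogy King &#8211; Man + Woman resolves correctly under structured embeddings and fails under frozen random ones:

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest(vec, vocab):
    # Cosine-similarity nearest neighbour over a small candidate vocabulary.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(vec, vocab[w]))

words = ["king", "queen", "man", "woman"]

# NTK-style embeddings: effectively frozen random vectors, no semantic structure.
random_emb = {w: rng.normal(size=50) for w in words}

# Toy "feature-learned" embeddings: axes = (royalty, gender) plus small noise.
feats = {"king": (1, 1), "queen": (1, -1), "man": (0, 1), "woman": (0, -1)}
learned_emb = {w: np.array(f, dtype=float) + 0.05 * rng.normal(size=2)
               for w, f in feats.items()}

for name, emb in [("random", random_emb), ("learned", learned_emb)]:
    analogy = emb["king"] - emb["man"] + emb["woman"]
    candidates = {w: emb[w] for w in words if w != "king"}
    # The "learned" toy recovers queen; the random one lands on an unrelated word.
    print(name, "->", nearest(analogy, candidates))
```

With random vectors the analogy direction is dominated by the raw +woman term rather than any gender/royalty structure, so the retrieved neighbour is not "queen".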
How much does &#8220;finiteness&#8221; matter?<\/span><\/p>\n<h3><b>8.1 The <\/b><b>$1\/n$<\/b><b> Expansion<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Researchers have developed perturbation theories to calculate corrections to the infinite-width limit, typically as a power series in $1\/n$ (where $n$ is width).<\/span><span style=\"font-weight: 400;\">35<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>First-Order Correction ($1\/n$):<\/b><span style=\"font-weight: 400;\"> Introduces the first interactions between features. The kernel $\\Theta_t$ begins to evolve.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Neural Tangent Hierarchy:<\/b><span style=\"font-weight: 400;\"> The evolution of the kernel is coupled to 3-point and 4-point correlation functions (higher-order tensors). This hierarchy of coupled differential equations describes how the network departs from Gaussianity.<\/span><\/li>\n<\/ul>\n<h3><b>8.2 Depth vs. Width: The Aspect Ratio<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">A critical finding is that the validity of the NTK approximation depends on the ratio of depth $L$ to width $n$.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>If $L\/n \\to 0$ (Wide and Shallow):<\/b><span style=\"font-weight: 400;\"> The NTK approximation is excellent.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>If $L\/n \\sim O(1)$ (Deep and Narrow):<\/b><span style=\"font-weight: 400;\"> The NTK becomes stochastic. The variance of the kernel scales exponentially with $L\/n$. 
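The claim that the kernel "begins to evolve" at finite width can be checked empirically. The sketch below assumes a one-hidden-layer ReLU network in NTK parameterization and measures the relative change in the empirical kernel $\Theta(x, x)$ after a single SGD step; the function name `kernel_move` and all sizes are ours for illustration:

```python
import numpy as np

d = 5
probe_rng = np.random.default_rng(0)
x = probe_rng.normal(size=d)                 # probe input for the kernel
x_tr = probe_rng.normal(size=d)              # one training example
y_tr = 5.0                                   # regression target

def grads(W, v, xx):
    # Parameter gradient of f(x) = v . relu(W x) / sqrt(n)  (NTK parameterization).
    n = len(v)
    h = W @ xx
    gv = np.maximum(h, 0) / np.sqrt(n)
    gW = ((v * (h > 0))[:, None] * xx) / np.sqrt(n)
    return np.concatenate([gW.ravel(), gv])

def kernel_move(n, lr=0.1):
    # Relative change in the empirical NTK Theta(x, x) after one SGD step.
    r = np.random.default_rng(n)
    W, v = r.normal(size=(n, d)), r.normal(size=n)
    g0 = grads(W, v, x)
    k0 = g0 @ g0
    f = v @ np.maximum(W @ x_tr, 0) / np.sqrt(n)
    g = 2 * (f - y_tr) * grads(W, v, x_tr)   # gradient of the squared loss
    W = W - lr * g[: n * d].reshape(n, d)
    v = v - lr * g[n * d:]
    g1 = grads(W, v, x)
    return abs(g1 @ g1 - k0) / k0

for n in [20, 2000, 200000]:
    print(n, kernel_move(n))   # relative kernel movement shrinks as width grows
```

At small widths the kernel visibly moves after a single step, while at large widths it is nearly frozen, consistent with the $1/n$-expansion picture in which kernel evolution is a finite-width correction.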
In this regime, the &#8220;frozen kernel&#8221; assumption breaks down completely, and the training dynamics are far more complex and chaotic.<\/span><span style=\"font-weight: 400;\">35<\/span><span style=\"font-weight: 400;\"> This suggests that for very deep networks (like modern Transformers), the infinite-width approximation is only valid if the width is scaled significantly faster than the depth.<\/span><\/li>\n<\/ul>\n<h2><b>9. Practical Guide: When to Use Which Theory?<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">For deep learning practitioners and researchers, understanding these regimes is not just theoretical\u2014it informs model design and debugging.<\/span><\/p>\n<h3><b>9.1 When to use NTK \/ Kernel Methods?<\/b><\/h3>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Small Data:<\/b><span style=\"font-weight: 400;\"> When training data is scarce (e.g., &lt; 10,000 samples), NTK and CNTK often outperform deep networks because they are less prone to overfitting and provide strong regularization.<\/span><span style=\"font-weight: 400;\">2<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Uncertainty Quantification:<\/b><span style=\"font-weight: 400;\"> The NNGP limit allows for exact Bayesian inference, providing calibrated uncertainty estimates that are often better than ensemble methods for small datasets.<\/span><span style=\"font-weight: 400;\">10<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Architecture Search:<\/b><span style=\"font-weight: 400;\"> Computing the NTK condition number or score at initialization can be a cheap proxy for predicting the trainability of a neural architecture without running full training (Neural Architecture Search).<\/span><span style=\"font-weight: 400;\">38<\/span><\/li>\n<\/ul>\n<h3><b>9.2 When to use Feature Learning \/ <\/b><b>$\\mu$<\/b><b>P?<\/b><\/h3>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Large Data \/ Foundation Models:<\/b><span 
style=\"font-weight: 400;\"> For ImageNet, LLMs, and large-scale pretraining, feature learning is non-negotiable. The Lazy regime will underperform significantly.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Transfer Learning:<\/b><span style=\"font-weight: 400;\"> If the goal is to pretrain on one dataset and finetune on another, you <\/span><i><span style=\"font-weight: 400;\">must<\/span><\/i><span style=\"font-weight: 400;\"> be in the Rich\/$\\mu$P regime. NTK features are fixed and cannot adapt to the new task.<\/span><span style=\"font-weight: 400;\">5<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Hyperparameter Tuning:<\/b><span style=\"font-weight: 400;\"> Use $\\mu$P scaling rules to transfer hyperparameters from small proxy models to large production models, saving massive compute resources.<\/span><span style=\"font-weight: 400;\">32<\/span><\/li>\n<\/ul>\n<h2><b>10. Conclusion<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The study of infinite-width limits has evolved from a tool for proving convergence (NTK) to a framework for engineering better large-scale models ($\\mu$P). The initial discovery of the Neural Tangent Kernel provided a comforting, albeit simplified, view of deep learning as kernel regression. However, the field has arguably moved &#8220;beyond the NTK,&#8221; recognizing that the <\/span><i><span style=\"font-weight: 400;\">deviations<\/span><\/i><span style=\"font-weight: 400;\"> from this limit\u2014the richness of feature learning, the adaptation of the kernel, and the interactions in deep layers\u2014are where the true power of artificial intelligence resides.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The &#8220;Lazy vs. Rich&#8221; dichotomy serves as a compass for modern theory. We now understand that &#8220;infinity&#8221; is not a single destination; the path taken there matters. 
By choosing the correct parameterization ($\mu$P), we can ensure that even as networks grow to trillion-parameter scales, they retain the dynamic, adaptive capabilities that define intelligent learning, rather than degenerating into static kernel machines. The future of deep learning theory lies in mastering these scaling laws to make the training of foundation models more predictable, stable, and efficient.<\/span><\/p>\n<h3><b>Data Sources<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">3 Neal (1996) &#8211; Priors for Infinite Networks.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">4 Jacot et al. (2018) &#8211; Neural Tangent Kernel.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">1 Double Descent &amp; Generalization.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">18 Lazy vs Rich Regimes (Chizat et al.).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">5 Feature Learning Limits (Yang &amp; Hu).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">7 Maximal Update Parameterization ($\mu$P).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">8 Tensor Programs IV.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">7 $\mu$P Scaling Rules.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">26 Infinite Transformer Limits (Hron et al.).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">25 Depth-$\mu$P (ResNet Scaling).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">32 $\mu$Transfer.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">26 Attention Collapse (Bordelon et al.).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">2 NTK vs Finite Net Empirical Gaps.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">6 Convolutional NTK Performance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">27 Infinite Head Limits.<\/span><\/p>\n