From First Principles to High-Fidelity Synthesis: A Mathematical Exposition of Diffusion and Score-Based Generative Models

Abstract

This report provides a comprehensive mathematical treatment of diffusion-based generative models. We begin by deriving the foundational principles of Denoising Diffusion Probabilistic Models (DDPMs), detailing the forward and reverse Markovian processes and the formulation of the training objective. We then explore the parallel paradigm of Score-Based Generative Models (SGMs), examining the roles of the score function and Langevin dynamics. The central thesis of this work is the unification of these two frameworks through the lens of continuous-time Stochastic Differential Equations (SDEs), demonstrating how both DDPMs and SGMs emerge as specific discretizations of a more general process. Finally, we conduct an in-depth case study of Stable Diffusion, a Latent Diffusion Model (LDM), dissecting the mathematical functions of its core components—the Variational Autoencoder (VAE), the U-Net, and the CLIP-based cross-attention mechanism—to illustrate how these theoretical principles are operationalized to achieve state-of-the-art, high-fidelity image synthesis.


Section 1: The Probabilistic Framework of Denoising Diffusion

The conceptual bedrock for modern generative systems like Stable Diffusion is the Denoising Diffusion Probabilistic Model (DDPM). This framework provides a rigorous probabilistic method for transforming noise into coherent data. The process is inspired by non-equilibrium statistical physics, which describes how molecules move from areas of high concentration to low concentration.1 In machine learning, this physical analogy is adapted to describe the systematic destruction and subsequent reconstruction of data distributions.1

1.1 The Forward Process: A Markovian Corruption of Data

The forward process, also known as the diffusion process, is a fixed, non-learned procedure that systematically degrades a data sample by iteratively adding small amounts of Gaussian noise.4 This gradual corruption is mathematically defined as a Markov chain, where the state at a given time step depends only on the state of the previous time step.2

Given an initial data point $x_0$ sampled from the true data distribution $q(x_0)$, a sequence of increasingly noisy latent variables $x_1, x_2, \dots, x_T$ is generated. The transition from step $t-1$ to $t$ is defined by a conditional probability distribution, specifically a Gaussian kernel 6:

 

$$q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}x_{t-1}, \beta_t\mathbf{I})$$

 

Here, $\mathcal{N}(\cdot; \mu, \sigma^2\mathbf{I})$ denotes a Gaussian distribution with mean $\mu$ and covariance $\sigma^2\mathbf{I}$. The sequence $\{\beta_t\}_{t=1}^T$ is a pre-defined variance schedule, where $\beta_t \in (0, 1)$. Typically, these values are small and increase over the timesteps, such as a linear schedule from $\beta_1 = 0.0001$ to $\beta_T = 0.02$ for $T=1000$ steps.5 The term $\sqrt{1 - \beta_t}$ scales the mean of the previous state, a crucial factor that prevents the variance of the data from exploding as noise is added.6 After a sufficiently large number of steps $T$, the final state $x_T$ becomes indistinguishable from a standard isotropic Gaussian distribution, $x_T \approx \mathcal{N}(0, \mathbf{I})$.1

A critical property for making the training of these models computationally feasible is the ability to sample $x_t$ for any arbitrary timestep $t$ directly from the initial data point $x_0$. This avoids the need to iteratively apply the Markovian transition kernel $t$ times. By defining $\alpha_t := 1-\beta_t$ and the cumulative product $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$, we can derive a closed-form expression for the marginal distribution $q(x_t|x_0)$ 8:

 

$$q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)\mathbf{I})$$
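
One way to see where this closed form comes from is to unroll the recursion by two steps and merge the noise terms. This is only a sketch: it assumes the per-step noises are independent standard Gaussians, and the symbols $\boldsymbol{\epsilon}_{t-1}$, $\boldsymbol{\epsilon}_{t-2}$, and $\bar{\boldsymbol{\epsilon}}$ are introduced here for the derivation alone.

$$\begin{aligned} x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\boldsymbol{\epsilon}_{t-2} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}} \end{aligned}$$

The last line uses the fact that two independent zero-mean Gaussians with variances $\alpha_t(1-\alpha_{t-1})$ and $1-\alpha_t$ sum to a Gaussian with variance $1-\alpha_t\alpha_{t-1}$. Repeating the same step down to $x_0$ turns the product of the $\alpha_s$ into $\bar{\alpha}_t$.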

 

Using the reparameterization trick, this sampling operation can be expressed as a simple function: $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is a noise sample drawn from a standard normal distribution, $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$.8 This property is the cornerstone of the DDPM training loop. Without it, generating a single training sample would require a sequential computation of $t$ steps, making the process prohibitively slow. This closed-form solution allows for the random sampling of timesteps for each data point in a training batch, enabling massive parallelization and making the training of deep diffusion models on large datasets practical.3
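
The one-shot sampling rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`linear_beta_schedule`, `cumulative_alphas`, `q_sample`) are hypothetical, data points are represented as lists of floats, and the linear schedule uses the example values quoted earlier.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (index 0 holds beta_1)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed, x0 is a list of floats."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return x_t, eps

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Because `q_sample` needs only $\bar{\alpha}_t$, every element of a training batch can be assigned its own random timestep without simulating the chain. With this schedule, `alpha_bars[-1]` is very close to zero, so $x_T$ retains almost no signal from $x_0$.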

 

1.2 The Reverse Process: Learning to Denoise

 

The generative power of a diffusion model lies in its ability to reverse the forward process. Starting with a sample of pure noise, $x_T \sim \mathcal{N}(0, \mathbf{I})$, the model must learn to iteratively remove the noise, tracing a trajectory back to a plausible sample from the original data distribution $q(x_0)$.5 This requires learning the reverse transition probabilities, $p(x_{t-1}|x_t)$.

However, the true reverse posterior distribution, $q(x_{t-1}|x_t)$, is computationally intractable. Its calculation would require access to the entire data distribution, which is precisely what we are trying to model.5 To overcome this, the reverse process is approximated using a neural network parameterized by $\theta$. Since the forward transitions are Gaussian, it is a common and effective choice to also model the reverse transitions as Gaussian distributions 6:

 

$$p_\theta(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \boldsymbol{\mu}_\theta(x_t, t), \boldsymbol{\Sigma}_\theta(x_t, t))$$

 

The central task of the neural network is to learn the parameters of these reverse transitions: the mean $\boldsymbol{\mu}_\theta(x_t, t)$ and the covariance $\boldsymbol{\Sigma}_\theta(x_t, t)$ for each timestep $t$.

 

1.3 The Variational Objective: From Log-Likelihood to a Simplified Loss

 

The training objective is formulated to make the learned reverse process $p_\theta$ a good approximation of the true (but intractable) reverse process. This is formally achieved by maximizing the log-likelihood of the training data, $\log p_\theta(x_0)$. The problem is framed within the context of variational inference, where the goal is to maximize the Variational Lower Bound (VLB), also known as the Evidence Lower Bound (ELBO), on the log-likelihood.5

Written as a loss to be minimized (the negative of the bound), the VLB for a diffusion model decomposes into a sum of terms, each corresponding to a specific timestep in the process 8:

 

$$\mathcal{L}_{vlb} = \mathbb{E}_q \left[ -\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \right] = \mathbb{E}_q \left[ L_T + \sum_{t=2}^{T} L_{t-1} + L_0 \right]$$

 

This expression consists of three types of terms:

  • $L_T = D_{KL}(q(x_T|x_0) \| p_\theta(x_T))$: This term compares the distribution of the final noisy latent with the prior (a standard Gaussian). Because $q(x_T|x_0)$ is approximately $\mathcal{N}(0, \mathbf{I})$ and the prior is fixed to $\mathcal{N}(0, \mathbf{I})$, this term has no learnable parameters and can be ignored during training.

  • $L_{t-1} = D_{KL}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t))$ for $t > 1$: These terms are Kullback-Leibler (KL) divergences that measure the discrepancy between the learned reverse transition and the true reverse posterior. The true posterior $q(x_{t-1}|x_t, x_0)$ is tractable when conditioned on the original data $x_0$.

  • $L_0 = -\log p_\theta(x_0|x_1)$: This is a reconstruction term that measures the likelihood of the final data sample given the first denoised latent.

While optimizing the full VLB is a valid objective, the work by Ho et al. (2020) revealed that a much simpler objective function not only works but often produces higher-quality image samples.7 The key insight is that the KL divergence term, $L_{t-1}$, which compares two Gaussian distributions, is primarily minimized by matching their means. Instead of parameterizing the neural network to directly predict the mean $\boldsymbol{\mu}_\theta(x_t, t)$, it was found to be more stable and effective to reparameterize the mean in terms of the noise $\boldsymbol{\epsilon}$ that was mixed into $x_0$ to produce $x_t$, and to train the network to predict this noise.6
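
For reference, the tractable posterior and the mean reparameterization described above take the following standard forms in the DDPM formulation. The symbols $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$ are notation introduced here and are not used elsewhere in this report.

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\boldsymbol{\mu}}_t(x_t, x_0), \tilde{\beta}_t\mathbf{I}), \quad \tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t, \quad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$

Substituting $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$, which follows from the closed-form forward sample, gives

$$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\right)$$

so a network that predicts $\boldsymbol{\epsilon}$ fully determines the mean: $\boldsymbol{\mu}_\theta(x_t, t)$ is the same expression with $\boldsymbol{\epsilon}$ replaced by $\boldsymbol{\epsilon}_\theta(x_t, t)$.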

This reparameterization leads to a simplified loss function that is a simple Mean Squared Error (MSE) between the true noise and the noise predicted by the neural network, $\boldsymbol{\epsilon}_\theta$ 8:

 

$$\mathcal{L}_{simple} = \mathbb{E}_{t,x_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(x_t, t) \right\|^2 \right] = \mathbb{E}_{t,x_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}, t) \right\|^2 \right]$$

 

This shift from optimizing the full VLB to minimizing a simple MSE on the noise was a pivotal development. It represents a pragmatic trade-off in generative modeling. The VLB is a principled objective for maximizing the data log-likelihood, which is a measure of how well the model captures the true data density. However, the simplified MSE objective, while less directly tied to likelihood, was empirically shown to produce samples of superior visual fidelity. This choice prioritizes perceptual quality over theoretical density modeling, a decision that has profoundly influenced the field and enabled the success of models like Stable Diffusion. To achieve competitive log-likelihoods, later work reintroduced a learned variance and a hybrid loss function, but for high-quality image synthesis, the simple noise-prediction objective remains the standard.8
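
The simplified objective can be estimated by plain Monte Carlo sampling over $t$, $x_0$, and $\boldsymbol{\epsilon}$. The sketch below is a toy under stated assumptions: the data are scalars, the "dataset" is a single repeated value `C` so that the exact noise can be recovered analytically, and the two predictors (`oracle`, `zero_model`) are hypothetical stand-ins for a neural network.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_1..alpha_bar_T for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def simple_loss(eps_model, data, alpha_bars, rng, n_samples=5000):
    """Monte Carlo estimate of E_{t, x0, eps} |eps - eps_model(x_t, t)|^2 for scalar data."""
    T = len(alpha_bars)
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.choice(data)        # x0 ~ q(x0)
        t = rng.randint(1, T)        # t ~ Uniform{1, ..., T}
        eps = rng.gauss(0.0, 1.0)    # eps ~ N(0, 1)
        a_bar = alpha_bars[t - 1]
        x_t = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps
        total += (eps - eps_model(x_t, t)) ** 2
    return total / n_samples

alpha_bars = make_alpha_bars()
C = 3.0          # toy dataset: every sample equals C
data = [C]

def oracle(x_t, t):
    """Exact noise for the point-mass dataset: invert x_t = sqrt(a)*C + sqrt(1-a)*eps."""
    a_bar = alpha_bars[t - 1]
    return (x_t - math.sqrt(a_bar) * C) / math.sqrt(1.0 - a_bar)

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return 0.0

rng = random.Random(0)
loss_oracle = simple_loss(oracle, data, alpha_bars, rng)
loss_zero = simple_loss(zero_model, data, alpha_bars, rng)
```

The oracle drives the loss to (numerically) zero, while the zero predictor sits near 1, the variance of $\boldsymbol{\epsilon}$. A real model is trained by gradient descent on the same quantity, with the network in place of `oracle`.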

 

Section 2: Score-Based Generative Modeling

 

Parallel to the development of DDPMs, another powerful paradigm for generative modeling emerged: score-based generative models (SGMs). Instead of learning the probabilities of transitioning between states, these models learn the gradient field of the data distribution itself.

 

2.1 The Score Function: Gradient Fields of Log-Density

 

For a given probability density function $p(x)$, its score function is formally defined as the gradient of the logarithm of the density with respect to the data variable $x$ 16:

 

$$\mathbf{s}(x) := \nabla_x \log p(x)$$

 

Intuitively, the score function defines a vector field over the data space. At any point $x$, the vector $\mathbf{s}(x)$ points in the direction in which the probability density $p(x)$ increases most steeply.20 This vector field provides a local guide for how to perturb a point to make it more likely under the target distribution. A significant mathematical advantage of the score function is its independence from the normalizing constant of the probability distribution. If $p(x) = \frac{\tilde{p}(x)}{Z}$, where $Z$ is the (often intractable) normalizing constant, then $\nabla_x \log p(x) = \nabla_x (\log \tilde{p}(x) - \log Z) = \nabla_x \log \tilde{p}(x)$. This property allows score-based models to sidestep the difficult problem of computing the normalizing constant, which is a major challenge for many other energy-based models.19 This effectively reframes the task of generative modeling from density estimation to vector field estimation.
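
The independence from $Z$ is easy to check numerically. The following sketch uses a one-dimensional Gaussian (the values `mu=1.0` and `sigma=2.0` are arbitrary illustrative choices) and compares a finite-difference score of the unnormalized log-density with that of the normalized one.

```python
import math

def log_unnormalized(x, mu=1.0, sigma=2.0):
    """log p~(x) for a 1-D Gaussian without its normalizing constant."""
    return -0.5 * ((x - mu) / sigma) ** 2

def log_normalized(x, mu=1.0, sigma=2.0):
    """log p(x) = log p~(x) - log Z, with Z = sigma * sqrt(2*pi)."""
    return log_unnormalized(x, mu, sigma) - math.log(sigma * math.sqrt(2.0 * math.pi))

def numerical_score(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

def analytic_score(x, mu=1.0, sigma=2.0):
    """Closed-form score of N(mu, sigma^2): -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

x = 3.0
score_unnorm = numerical_score(log_unnormalized, x)
score_norm = numerical_score(log_normalized, x)
score_exact = analytic_score(x)   # -(3 - 1) / 4 = -0.5
```

All three values agree up to finite-difference error, since the constant $\log Z$ disappears under differentiation.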

 

2.2 Learning the Score: Denoising Score Matching

 

The objective of a score-based model is to train a neural network $s_\theta(x)$ to approximate the true data score, $\nabla_x \log p_{data}(x)$. This is typically done by minimizing the Fisher Divergence between the model’s score and the data’s score.19 However, this objective requires access to the ground-truth score, which is unknown.

The solution to this problem is a technique called score matching. A practical and robust variant is denoising score matching.20 This approach addresses two key challenges: the unavailability of the true data score and the fact that real-world data often resides on a low-dimensional manifold within a high-dimensional space.25 On this manifold, the score may be ill-defined. To resolve this, the training data is perturbed with Gaussian noise at various levels, or scales, $\{\sigma_i\}_{i=1}^L$. The model is then trained to estimate the score of these perturbed data distributions, $p_{\sigma_i}(\tilde{x}) = \int p_{data}(x) \mathcal{N}(\tilde{x}; x, \sigma_i^2 \mathbf{I}) dx$. The training objective becomes a weighted sum of score matching losses over all noise scales 20:

 

$$\mathcal{L}(\theta) = \sum_{i=1}^{L} \lambda(\sigma_i) \mathbb{E}_{p_{\sigma_i}(\tilde{x})} \left[ \left\| s_\theta(\tilde{x}, \sigma_i) - \nabla_{\tilde{x}} \log p_{\sigma_i}(\tilde{x}) \right\|^2 \right]$$

 

The marginal score of the perturbed distribution, $\nabla_{\tilde{x}} \log p_{\sigma_i}(\tilde{x})$, cannot be evaluated directly because it still involves $p_{data}$. Denoising score matching sidesteps this: up to a constant that does not depend on $\theta$, the objective is unchanged if the target is replaced by the score of the Gaussian perturbation kernel, $\nabla_{\tilde{x}} \log \mathcal{N}(\tilde{x}; x, \sigma_i^2\mathbf{I}) = -(\tilde{x} - x)/\sigma_i^2$, which is simply the added noise, negated and rescaled. The use of multiple noise scales is not merely a heuristic; it is a solution to a fundamental geometric problem. It effectively “lifts” the data from its low-dimensional manifold into the full ambient space, ensuring the density is non-zero everywhere and the score function is well-defined and smoother, making it easier for a neural network to learn.19 This multi-scale perturbation acts as a form of curriculum learning, enabling the model to learn a complete vector field that can guide a sample from a simple noise distribution all the way back to the intricate data manifold.
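
A small numerical experiment can make the denoising target concrete. The sketch below is a toy under explicit assumptions: the data are drawn from $\mathcal{N}(0, 1)$ so that the perturbed marginal $\mathcal{N}(0, 1 + \sigma^2)$ and its score are known in closed form, a single noise level is used, and `true_marginal_score` and `zero_score` are hypothetical stand-ins for a trained network.

```python
import random

def dsm_loss(score_model, sigma, n_samples=20000, seed=0):
    """Monte Carlo denoising score matching loss at one noise level.

    Data are drawn from N(0, 1) (an assumption of this toy example), so the
    perturbed marginal is N(0, 1 + sigma^2) and its true score is known.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)                      # clean sample
        x_tilde = x + sigma * rng.gauss(0.0, 1.0)    # perturbed sample
        target = -(x_tilde - x) / sigma ** 2         # score of N(x_tilde; x, sigma^2)
        total += (score_model(x_tilde) - target) ** 2
    return total / n_samples

sigma = 0.5

def true_marginal_score(x_tilde):
    """Score of the perturbed marginal N(0, 1 + sigma^2)."""
    return -x_tilde / (1.0 + sigma ** 2)

def zero_score(x_tilde):
    """A deliberately wrong model that ignores its input."""
    return 0.0

loss_true = dsm_loss(true_marginal_score, sigma)
loss_zero = dsm_loss(zero_score, sigma)
```

The true marginal score attains the lower loss even though it was never used as a training target. Its loss does not reach zero: the denoising objective differs from the marginal score matching objective by a constant that does not depend on the model, which is why minimizing one also minimizes the other.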

 

2.3 Sampling via Langevin Dynamics: Traversing the Probability Manifold

 

Once the score model $s_\theta(x, \sigma_i)$ is trained, it can be used to generate new data samples via an iterative Markov Chain Monte Carlo (MCMC) procedure known as Langevin Dynamics.23 This algorithm uses the learned score to iteratively guide a random sample toward regions of high probability density.

The update rule for Langevin dynamics is given by 23:

 

$$\mathbf{x}_{k+1} \gets \mathbf{x}_k + \frac{\alpha}{2} s_\theta(\mathbf{x}_k) + \sqrt{\alpha} \mathbf{z}_k, \quad \text{where } \mathbf{z}_k \sim \mathcal{N}(0, I)$$

 

Each step consists of two parts: a small step in the direction of the score (a gradient ascent step on the log-density) and the addition of a small amount of Gaussian noise. The noise term helps the sampling process explore the full distribution and avoid getting trapped in local modes.
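
The update rule can be sketched directly. In this minimal example the exact score of a one-dimensional Gaussian target stands in for a trained $s_\theta$; the target parameters, step size, chain length, and starting point are all illustrative assumptions.

```python
import math
import random

def langevin_sample(score, x_init, step_size, n_steps, rng):
    """Run x <- x + (alpha/2) * score(x) + sqrt(alpha) * z for n_steps iterations."""
    x = x_init
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x

MU, SIGMA = 2.0, 1.0   # target N(MU, SIGMA^2), chosen only for illustration

def gaussian_score(x):
    """Exact score of the target, standing in for a trained s_theta."""
    return -(x - MU) / SIGMA ** 2

rng = random.Random(0)
samples = [
    langevin_sample(gaussian_score, x_init=-5.0, step_size=0.05, n_steps=400, rng=rng)
    for _ in range(1000)
]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

Although every chain starts far from the target at $x = -5$, the empirical mean and standard deviation of the final states end up close to `MU` and `SIGMA`. A finite step size introduces a small bias, which shrinks as `step_size` decreases.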

For high-quality sample generation, this procedure is performed in an “annealed” fashion, known as Annealed Langevin Dynamics.24 The process begins with a sample drawn from a broad noise distribution corresponding to the highest noise scale, $x_L \sim \mathcal{N}(0, \sigma_L^2 \mathbf{I})$. Then, Langevin dynamics is applied for a number of steps using the score model for that noise level, $s_\theta(x, \sigma_L)$. The process is then repeated for progressively smaller noise levels, $s_\theta(x, \sigma_{L-1}), \dots, s_\theta(x, \sigma_1)$, gradually refining the sample until it conforms to the learned data distribution.

 

Section 3: A Unifying Perspective via Stochastic Differential Equations

 

The frameworks of DDPMs and SGMs, though developed with different formalisms, are deeply connected. This connection is elegantly revealed through the lens of continuous-time stochastic processes, described by Stochastic Differential Equations (SDEs). This perspective provides a “grand unified theory” for this class of generative models, showing that the discrete-time models are simply different practical implementations of the same underlying continuous-time process.27

 

3.1 Continuous-Time Diffusion: The Forward SDE

 

The discrete, step-by-step noising process of DDPMs can be generalized to a continuous-time process. This continuous evolution of the data distribution is described by an SDE.26 A general form of such an SDE is:

 

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t) dt + g(t) d\mathbf{w}$$

 

Here, $t \in [0, T]$ is a continuous time variable, $\mathbf{f}(\mathbf{x}, t)$ is a vector-valued function called the drift coefficient, $g(t)$ is a scalar function called the diffusion coefficient, and $\mathbf{w}$ is a standard Wiener process (also known as Brownian motion), with $d\mathbf{w}$ representing infinitesimal white noise.19 This SDE defines a smooth transformation that gradually converts a complex data distribution $p_0$ at time $t=0$ into a simple, tractable prior distribution $p_T$ (e.g., a Gaussian) at time $t=T$.
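
A forward SDE of this form can be simulated with the Euler-Maruyama scheme, which replaces $dt$ with a small step and $d\mathbf{w}$ with a Gaussian increment of variance $dt$. The sketch below is illustrative only: the state is a scalar, and the drift $f(x, t) = -\tfrac{1}{2}\beta x$ and diffusion $g(t) = \sqrt{\beta}$ with constant `BETA` are assumed choices picked so that the process relaxes to a standard Gaussian.

```python
import math
import random

def euler_maruyama(x0, drift, diffusion, t_end, n_steps, rng):
    """Simulate dx = f(x, t) dt + g(t) dw from t = 0 to t_end for a scalar state."""
    dt = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)   # Wiener increment ~ N(0, dt)
        x = x + drift(x, t) * dt + diffusion(t) * dw
        t += dt
    return x

BETA = 2.0   # constant noise rate, an illustrative choice

def drift(x, t):
    """Drift f(x, t) pulling the state toward zero."""
    return -0.5 * BETA * x

def diffusion(t):
    """Diffusion coefficient g(t)."""
    return math.sqrt(BETA)

rng = random.Random(0)
finals = [
    euler_maruyama(3.0, drift, diffusion, t_end=5.0, n_steps=250, rng=rng)
    for _ in range(2000)
]
mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
```

Every path starts at the same "data point" $x_0 = 3$, yet by $t = 5$ the ensemble has mean near 0 and variance near 1: the information about $x_0$ has been diffused away, as the forward process is designed to do.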

 

3.2 The Reverse-Time SDE: The Central Role of the Score

 

A foundational result in stochastic calculus, established by Anderson (1982), states that a diffusion process described by an SDE has a corresponding reverse-time SDE. This reverse SDE describes a process that transforms samples from the prior distribution $p_T$ back into samples from the original data distribution $p_0$.26 The reverse-time SDE is given by:

 

$$d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_x \log p_t(\mathbf{x})] dt + g(t) d\bar{\mathbf{w}}$$

 

In this equation, $d\bar{\mathbf{w}}$ is a reverse-time Wiener process, and $dt$ represents an infinitesimal negative time step. The most crucial component of this equation is the term $\nabla_x \log p_t(\mathbf{x})$ in the drift coefficient. This is precisely the time-dependent score function of the perturbed data distribution $p_t$ at time $t$. This formula provides the profound and unifying insight: the process of reversing diffusion is mathematically governed by the score function of the data distribution at every point in time.29 To generate data, one must first estimate this score function with a neural network $s_\theta(\mathbf{x}, t)$ and then solve the reverse-time SDE using a numerical solver.
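
To see the reverse-time SDE generate data, it helps to pick a case where the score is known exactly. The sketch below assumes scalar data distributed as $\mathcal{N}(M, S^2)$ and a forward SDE with drift $-\tfrac{1}{2}\beta x$ and diffusion $\sqrt{\beta}$ for constant $\beta$, so that every marginal $p_t$ is Gaussian. The closed-form `score` stands in for a learned $s_\theta(\mathbf{x}, t)$, and all constants are illustrative.

```python
import math
import random

BETA, T_END = 2.0, 5.0        # constant-rate SDE, illustrative values
M, S = 2.0, 0.5               # toy data distribution N(M, S^2)

def marginal(t):
    """Mean and variance of p_t when x_0 ~ N(M, S^2) under dx = -0.5*BETA*x dt + sqrt(BETA) dw."""
    decay = math.exp(-BETA * t)
    return M * math.sqrt(decay), S ** 2 * decay + (1.0 - decay)

def score(x, t):
    """Exact score of the Gaussian marginal p_t, standing in for a learned s_theta(x, t)."""
    mean_t, var_t = marginal(t)
    return -(x - mean_t) / var_t

def reverse_sde_sample(rng, n_steps=500):
    """Euler-Maruyama integration of the reverse-time SDE from t = T_END down to t = 0."""
    dt = T_END / n_steps
    x = rng.gauss(0.0, 1.0)                   # x_T drawn from the prior N(0, 1)
    for i in range(n_steps):
        t = T_END - i * dt
        f = -0.5 * BETA * x                   # forward drift f(x, t)
        g2 = BETA                             # g(t)^2
        z = rng.gauss(0.0, 1.0)
        # Stepping backwards in time flips the sign of the drift term.
        x = x - (f - g2 * score(x, t)) * dt + math.sqrt(g2 * dt) * z
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
```

Starting from pure standard-normal noise, the samples end with mean near `M` and standard deviation near `S`, up to discretization and sampling error. In practice the only change is that `score` is replaced by the trained network.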

 

3.3 Connecting the Dots: DDPMs and SGMs as SDE Discretizations

 

The SDE framework elegantly encapsulates both DDPMs and SGMs as special cases, corresponding to different choices for the drift and diffusion coefficients, $\mathbf{f}$ and $g$.27

  • Score-Matching with Langevin Dynamics (SMLD) as Variance Exploding (VE) SDE: The original score-based models correspond to an SDE where the variance of the perturbed data increases over time. This is a discretization of the VE-SDE 27:

    $$d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}} d\mathbf{w}$$

    In this formulation, the drift coefficient $\mathbf{f}(\mathbf{x}, t)$ is zero.33
  • DDPM as Variance Preserving (VP) SDE: The DDPM formulation corresponds to an SDE designed to keep the variance of the perturbed data bounded, approaching a fixed value. This is a discretization of the VP-SDE 27, which is a form of the Ornstein-Uhlenbeck process 34:

    $$d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x} dt + \sqrt{\beta(t)} d\mathbf{w}$$

Furthermore, the SDE framework clarifies that the simplified MSE loss used in DDPMs is equivalent to a weighted denoising score matching objective.19 The noise prediction network $\epsilon_\theta(x_t, t)$ of a DDPM is implicitly learning a scaled version of the score function: $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1-\bar{\alpha}_t}$.27
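
The scaling between score and noise can be checked directly from the closed-form marginal of Section 1.1. Conditioned on $x_0$, $q(x_t|x_0)$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}x_0$ and covariance $(1-\bar{\alpha}_t)\mathbf{I}$, and $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, so

$$\nabla_{x_t} \log q(x_t|x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}x_0}{1-\bar{\alpha}_t} = -\frac{\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}}{1-\bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}$$

Replacing $\boldsymbol{\epsilon}$ with the network prediction gives the stated relation, and substituting it into the noise-prediction error shows the reweighting explicitly:

$$\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(x_t, t) \right\|^2 = (1-\bar{\alpha}_t) \left\| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t|x_0) \right\|^2$$

That is, the simplified loss is a denoising score matching loss with weight $1-\bar{\alpha}_t$ at timestep $t$.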

 

3.4 The Probability Flow ODE: A Deterministic Path to Generation

 

For any given SDE, there exists a corresponding Ordinary Differential Equation (ODE), known as the Probability Flow (PF) ODE. The trajectories of this ODE have the same marginal probability densities $p_t(x)$ at each time $t$ as the original SDE.22 The PF-ODE is given by:

 

$$\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(\mathbf{x})$$

 

This equation is entirely deterministic; it lacks the stochastic Wiener process term $d\mathbf{w}$.33 This has two major implications. First, it enables deterministic sampling. By solving this ODE backwards in time (from $t=T$ to $t=0$) with a numerical solver, one can map a single point in the prior noise space to a unique, corresponding point in the data space. This creates a one-to-one mapping that is invaluable for tasks like image editing and inversion.29 Second, this deterministic and invertible transformation, much like a normalizing flow, allows for the exact computation of the data log-likelihood using the instantaneous change of variables formula—a capability not present in the stochastic sampling process.23 The PF-ODE thus elevates diffusion models from purely stochastic samplers to a class of invertible, flow-like models, granting them the “best of both worlds”: the flexible training of score-matching with the desirable theoretical properties of normalizing flows.
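
The deterministic mapping can be illustrated with a plain Euler solver. As a sketch under stated assumptions, the example below uses scalar Gaussian data $\mathcal{N}(M, S^2)$ and the drift $-\tfrac{1}{2}\beta x$, diffusion $\sqrt{\beta}$ with constant $\beta$, for which both the score and the exact ODE solution are available in closed form. All constants and function names are illustrative.

```python
import math

BETA, T_END = 2.0, 5.0        # constant-rate SDE, illustrative values
M, S = 2.0, 0.5               # toy data distribution N(M, S^2)

def marginal(t):
    """Mean and variance of the Gaussian marginal p_t."""
    decay = math.exp(-BETA * t)
    return M * math.sqrt(decay), S ** 2 * decay + (1.0 - decay)

def score(x, t):
    """Exact score of p_t, standing in for a learned s_theta(x, t)."""
    mean_t, var_t = marginal(t)
    return -(x - mean_t) / var_t

def pf_ode_solve(x_T, n_steps=1000):
    """Euler integration of dx/dt = f - 0.5 * g^2 * score, from t = T_END down to t = 0."""
    dt = T_END / n_steps
    x = x_T
    for i in range(n_steps):
        t = T_END - i * dt
        velocity = -0.5 * BETA * x - 0.5 * BETA * score(x, t)
        x = x - velocity * dt          # step backwards in time
    return x

x_T = 1.0
x0_a = pf_ode_solve(x_T)
x0_b = pf_ode_solve(x_T)
# For Gaussian marginals the flow preserves the standardized coordinate (x - mean_t) / std_t.
mean_T, var_T = marginal(T_END)
x0_exact = M + S * (x_T - mean_T) / math.sqrt(var_T)
```

Two runs from the same $x_T$ return exactly the same $x_0$, and the numerical result matches the closed-form image of $x_T$ up to Euler error. This is the one-to-one noise-to-data correspondence that makes inversion and editing possible.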

| Feature | Denoising Diffusion Probabilistic Models (DDPM) | Score-Based Generative Models (SGM) | Unified SDE Framework |
|---|---|---|---|
| Time Domain | Discrete ($t \in \{1, \dots, T\}$) | Discrete (multiple noise scales $\sigma_i$) | Continuous ($t \in [0, T]$) |
| Forward Process | Markov chain adding Gaussian noise | Perturbation with multiple noise kernels | Forward-time SDE |
| Reverse Process | Learned Gaussian transitions $p_\theta(x_{t-1} \mid x_t)$ | Annealed Langevin Dynamics (MCMC) | Reverse-time SDE |
| Core Learned Object | Noise prediction $\epsilon_\theta(x_t, t)$ | Score function $s_\theta(x, \sigma_i)$ | Time-dependent score $s_\theta(x_t, t)$ |
| Training Objective | Simplified MSE on noise (reweighted VLB) | Denoising Score Matching | Continuous Score Matching |
| Underlying SDE | Variance Preserving (VP) SDE | Variance Exploding (VE) SDE | General form: $d\mathbf{x} = \mathbf{f}(\mathbf{x},t)dt + g(t)d\mathbf{w}$ |

 

Section 4: Case Study – The Mathematical Architecture of Stable Diffusion

 

Stable Diffusion is a prominent example of a Latent Diffusion Model (LDM), which applies the theoretical principles of diffusion models in a computationally efficient manner to achieve state-of-the-art text-to-image synthesis. Its success stems not from a single new theory, but from a masterful engineering integration of several existing mathematical concepts into a synergistic and practical system. The architecture reveals a fundamental decoupling of tasks: perceptual compression, semantic understanding, and conditional denoising.

 

4.1 Perceptual Compression: The Variational Autoencoder (VAE)

 

A primary challenge for diffusion models is the immense computational cost of operating directly in the high-dimensional pixel space of images. A model trained on raw pixels must expend a significant portion of its capacity modeling perceptually meaningless, high-frequency details, leading to long training times and expensive inference.41

The LDM framework solves this by performing the diffusion process in a much smaller, lower-dimensional latent space.44 This compression is handled by a pre-trained Variational Autoencoder (VAE). The VAE consists of two components:

  1. An encoder $\mathcal{E}$ that maps an image $x \in \mathbb{R}^{H \times W \times 3}$ to a compressed latent representation $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$. The spatial dimensions are significantly reduced (e.g., by a factor of 8, so $h=H/8, w=W/8$), focusing the model on the semantic content of the image.41
  2. A decoder $\mathcal{D}$ that reconstructs the image from the latent space, $\tilde{x} = \mathcal{D}(z)$.

The VAE’s mathematical role is to learn a continuous and structured latent space. The encoder outputs the parameters (mean $\mu$ and log-variance $\log \sigma^2$) of a Gaussian distribution $q(z|x)$ for each input image. The latent code $z$ is then sampled from this distribution via the reparameterization trick: $z = \mu + \sigma \cdot \epsilon$, with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.13 The VAE is trained by optimizing the ELBO, which comprises a reconstruction loss and a KL-divergence term: $L_{VAE} = \|x - \mathcal{D}(\mathcal{E}(x))\|^2 + D_{KL}(q(z|x) \| p(z))$.13 The KL term regularizes the latent space, encouraging it to be smooth and well-organized (close to a standard Gaussian prior), which is essential for the subsequent diffusion process to work effectively.47
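
The two VAE ingredients named above, the reparameterized sample and the KL regularizer, have simple closed forms for a diagonal Gaussian posterior and a standard normal prior. The sketch below uses plain Python lists for vectors; the function names are hypothetical, and the default of 4 latent channels in `latent_shape` is an assumption for illustration rather than something stated in this report.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) = 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def latent_shape(height, width, factor=8, channels=4):
    """Latent tensor shape for a given image size; channels=4 is an assumed value."""
    return (height // factor, width // factor, channels)

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
shape = latent_shape(512, 512)
```

The KL term is zero exactly when the posterior equals the prior ($\mu = 0$, $\log \sigma^2 = 0$) and grows as the encoder's output drifts away from it. With a downsampling factor of 8, a $512 \times 512$ image maps to a $64 \times 64$ latent grid, which is where the computational savings come from.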

 

4.2 The Denoising Engine: The U-Net in Latent Space

 

The core of Stable Diffusion’s generative process is a U-Net architecture that operates entirely on the latent representations $z_t$. Its function is to predict the noise $\epsilon$ that was added to a latent vector $z$ at a given timestep $t$, conditioned on that timestep and a text prompt.50

The U-Net features a symmetric encoder-decoder structure. The encoder path consists of a series of down-sampling blocks (containing residual connections and convolutions) that extract hierarchical features from the latent input. The decoder path mirrors this structure, using up-sampling blocks to progressively reconstruct the latent vector’s original dimensions.51

Two key architectural features are critical to its function:

  • Skip Connections: These connections link layers in the encoder directly to their corresponding layers in the decoder. By concatenating feature maps, they allow the network to pass high-resolution spatial information directly across the model, which is vital for preserving detail and enabling accurate noise prediction.51
  • Time Embedding: The discrete timestep $t$ is transformed into a high-dimensional vector using sinusoidal embeddings, similar to positional encodings in transformers. This embedding is then injected into the residual blocks of the U-Net, allowing the network’s behavior to be conditioned on the current noise level in the diffusion trajectory.41
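
A sinusoidal timestep embedding can be sketched as follows. This follows the Transformer-style construction with geometrically spaced frequencies; the exact frequency spacing, the `max_period` value, and the ordering of sine and cosine components vary between implementations and are assumptions here.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines

emb_0 = timestep_embedding(0, dim=8)
emb_500 = timestep_embedding(500, dim=8)
```

Each timestep maps to a distinct, bounded vector, and nearby timesteps map to nearby vectors in the low-frequency components. In a U-Net this vector is typically passed through a small learned network before being added to the residual block activations.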

 

4.3 Conditional Synthesis: The Cross-Attention Mechanism

 

To guide image generation based on a text prompt, the denoising process must be conditioned on the prompt’s semantic meaning. This is achieved through a two-stage process involving a text encoder and a cross-attention mechanism.

  1. Text Encoder (CLIP): The input text prompt is first converted into a sequence of numerical embeddings by a pre-trained text encoder. Stable Diffusion uses the powerful text encoder from the CLIP (Contrastive Language-Image Pre-training) model.54 CLIP is trained to align text and image representations in a shared multimodal embedding space, endowing it with a rich semantic understanding of language.55 The output is a sequence of context vectors $\tau_\theta(y)$ that numerically represent the prompt.
  2. Cross-Attention: These context vectors are then integrated into the U-Net using a cross-attention mechanism, which is borrowed from the Transformer architecture.42 At various layers within the U-Net, the spatial feature maps (representing the image) are treated as a sequence of Query vectors ($Q$). The CLIP text embeddings serve as the Key ($K$) and Value ($V$) vectors. The attention operation is then computed as 43:

    $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

    This mechanism allows each part of the image representation (each query vector) to “attend to” the most relevant parts of the text prompt. It computes a weighted average of the text-derived Value vectors, where the weights are determined by the semantic similarity between the image features and the text features. This fusion of information allows the text prompt to guide the denoising process at a granular level.
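
The attention formula can be written out with plain lists to show the shapes involved. This is a minimal sketch: the learned linear projections that produce $Q$, $K$, and $V$ from the image features and text embeddings are omitted, multi-head splitting is ignored, and the small matrices are made-up values for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of vectors.

    Q: n_pixels x d_k (image features); K: n_tokens x d_k and V: n_tokens x d_v (text features).
    Returns the attended outputs and the attention weights.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 spatial positions
K = [[4.0, 0.0], [0.0, 4.0]]               # 2 text tokens
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
out, attn = cross_attention(Q, K, V)
```

The first spatial position attends mostly to the first token, the second to the second token, and the third splits its attention evenly. The output has one row per spatial position, so the text information is injected without changing the spatial layout of the feature map.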

 

4.4 The Inference Loop: Schedulers and the Denoising Trajectory

 

At inference time, the process begins with a random latent vector $z_T \sim \mathcal{N}(0, \mathbf{I})$. The model then iteratively denoises this latent for a specified number of steps.

The scheduler (or sampler) is a crucial component that dictates the precise update rule for each denoising step.56 It determines the sequence of timesteps and calculates the next, less noisy latent $z_{t-1}$ based on the current latent $z_t$ and the U-Net’s noise prediction $\epsilon_\theta(z_t, t, \tau_\theta(y))$.

While the original DDPM scheduler is stochastic and typically requires on the order of a thousand steps (one per training timestep), a key innovation for practical use was the development of more efficient schedulers like DDIM (Denoising Diffusion Implicit Models).56 The DDIM scheduler formulates a non-Markovian reverse process that can be made deterministic. This allows for much faster sampling, producing high-quality images in as few as 20-50 steps by taking larger “jumps” along the denoising trajectory.59 This dramatic increase in sampling speed was essential for making diffusion models widely accessible.
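
The deterministic DDIM update can be sketched in scalar form. Each step first forms an estimate of the clean sample from the predicted noise, then re-noises that estimate to the next (lower) noise level using the same predicted noise. The example is a toy under stated assumptions: the linear schedule from Section 1.1 is reused, the "dataset" is a single value `C`, and the analytic `oracle` stands in for the trained U-Net. It is not the code of any particular scheduler library.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_0 = 1 followed by alpha_bar_1..alpha_bar_T for a linear beta schedule."""
    alpha_bars = [1.0]
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - a_bar_t) * eps_pred) / math.sqrt(a_bar_t)
    return math.sqrt(a_bar_prev) * x0_pred + math.sqrt(1.0 - a_bar_prev) * eps_pred

def ddim_sample(eps_model, alpha_bars, n_steps, rng):
    """Denoise from t = T to t = 0 using n_steps evenly spaced timesteps."""
    T = len(alpha_bars) - 1
    timesteps = [round(T * k / n_steps) for k in range(n_steps, -1, -1)]  # T, ..., 0
    x = rng.gauss(0.0, 1.0)
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps_pred = eps_model(x, t)
        x = ddim_step(x, eps_pred, alpha_bars[t], alpha_bars[t_prev])
    return x

alpha_bars = make_alpha_bars()
C = 3.0   # toy dataset concentrated at a single value

def oracle(x_t, t):
    """Exact noise for the point-mass dataset, standing in for the trained U-Net."""
    a_bar = alpha_bars[t]
    return (x_t - math.sqrt(a_bar) * C) / math.sqrt(1.0 - a_bar)

sample = ddim_sample(oracle, alpha_bars, n_steps=20, rng=random.Random(0))
```

With a perfect noise predictor, 20 steps of 50 timesteps each recover the data value, which illustrates why DDIM can skip most of the original 1000 timesteps. With a real network the prediction is imperfect, so the step count trades speed against sample quality.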

After the final denoising step produces a clean latent vector $z_0$, the VAE’s decoder $\mathcal{D}$ is used in a single forward pass to transform $z_0$ back into the final, full-resolution pixel-space image $\tilde{x} = \mathcal{D}(z_0)$.44

This modular architecture, separating perceptual encoding, semantic interpretation, and the core denoising logic, represents a powerful and efficient design pattern. It allows for the reuse of powerful pre-trained components (VAE, CLIP) and focuses the diffusion training on the central task of conditional denoising in an efficient latent space, enabling the generation of complex, high-fidelity images from textual descriptions.