While Denoising Diffusion Probabilistic Models (DDPMs) have shown remarkable success in generating high-fidelity samples, they typically require many sampling steps (e.g., $T = 1000$) to produce a single sample, which makes inference slow. Denoising Diffusion Implicit Models (DDIMs) (Song et al., 2020) address this limitation.
A key insight of DDIM is that the DDPM objective function does not strictly depend on the Markovian nature of the forward noising process. This allows DDIM to use the same trained DDPM model but employ a different, more flexible sampling procedure.
Recall that DDPMs define a forward process $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\,x_{t-1}, (1-\alpha_t)I)$ and train a noise-prediction network by minimizing the (simplified) objective: $$L_{simple} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, t)\|^2\right]$$
This objective only depends on the marginal $q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$. It does not explicitly depend on the step-by-step Markovian structure of $q(x_t|x_{t-1})$.
Recall the KL divergence term in the DDPM objective: $$ \mathbb{D}_{KL}(q(x_{t-1}|x_t, x_0) \,\|\, p_{\theta}(x_{t-1}|x_t)) = \frac{1}{2\sigma_t^2} \frac{\beta_t^2}{\alpha_t(1-\bar{\alpha}_t)} \|\epsilon - \epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, t)\|^2 $$ in which the forward process posterior $q(x_{t-1}|x_t, x_0)$ is defined as: $$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$$
where the mean and variance are $$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x_0, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t,$$ and the reverse process is parameterized as $$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I).$$
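To ground the notation, here is a minimal sketch of how these schedule quantities are typically precomputed; the linear schedule endpoints follow the common convention from Ho et al. (2020) and are an assumption, not part of the derivation:

```python
import torch

# Linear beta schedule (endpoints follow the common DDPM convention).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_t,  t = 1..T
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # alpha_bar_t = prod_{s<=t} alpha_s

# Posterior variance: beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])
beta_tilde = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
```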
Let's define a more general forward process posterior whose mean is linear in $x_t$ and $x_0$, with scalar coefficients $A_t$ and $B_t$: $$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1};\, A_t x_t + B_t x_0,\, \tilde{\sigma}^2_t I)$$
and, in the same way, a more general reverse process that replaces $x_0$ with the model's estimate $\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$: $$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\, A_t x_t + B_t \hat{x}_0(x_t),\, \tilde{\sigma}^2_t I)$$
The KL term in this case becomes (using $x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}$): $$ \mathbb{D}_{KL}(q(x_{t-1}|x_t, x_0) \,\|\, p_{\theta}(x_{t-1}|x_t)) = \frac{1}{2\tilde{\sigma}^2_{t}} \|(A_t x_t + B_t \hat{x}_0) - (A_t x_t + B_t x_0)\|^2 $$ $$ = \frac{B_t^2\,(1-\bar{\alpha}_t)}{2\tilde{\sigma}^2_{t}\,\bar{\alpha}_t} \|\epsilon - \epsilon_\theta\|^2. $$
And since many DDPM implementations ignore the time-dependent weighting anyway (the $L_{simple}$ objective), we can see that the objective function is not strictly dependent on the Markovian nature of the forward process: it can be optimized with respect to any forward process of this form, as long as the marginals $q(x_t|x_0)$ stay the same.
DDIM leverages this by proposing a class of non-Markovian forward processes that share the same marginals $q(x_t|x_0)$ as the DDPM forward process.
The authors of DDIM propose the following non-Markovian forward process: $$ q_\sigma(x_{1:T}|x_0) = q_\sigma(x_T|x_0)\prod_{t=2}^T q_\sigma(x_{t-1}|x_t, x_0) $$ with this new kernel definition: $$ q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}x_0 + \sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t} \cdot \frac{x_t - \sqrt{\bar{\alpha}_t}x_0}{\sqrt{1 - \bar{\alpha}_t}},\ \sigma^2_t I\right) $$ $$ = \mathcal{N}(x_{t-1};\ \mu_q = A_t x_t + B_t x_0,\ \Sigma_q = \tilde{\sigma}^2_{t} I) $$
As discussed in Sec. 2.1, if we can prove that the marginals are unchanged, i.e. $$q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I) \quad \text{for all } t,$$ we don't need to train a different model and we can reuse the DDPM one.
Mathematical Proof 🧮
Okay, let's detail the proof. The core of this demonstration is to show that the marginal distribution $q_\sigma(x_t|x_0)$ induced by the DDIM kernel coincides with the DDPM marginal $\mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$ at every timestep.
Recall: Gaussian Marginalization Rule
Before diving into the proof, let's state the Gaussian marginalization rule that we will use.
If we have two random variables $x$ and $y$ such that:

- The distribution of $x$ is Gaussian: $p(x) = \mathcal{N}(x \mid \mu_x, \Sigma_x)$
- The conditional distribution of $y$ given $x$ is also Gaussian and linear in $x$: $p(y|x) = \mathcal{N}(y \mid Ax + b, \Sigma_y)$

Then the marginal distribution of $y$ is also Gaussian: $$p(y) = \mathcal{N}(y \mid A\mu_x + b,\ \Sigma_y + A\Sigma_x A^T)$$
In our context, for the inductive step:

- $x$ will represent $x_t$ (the state at timestep $t$).
- $y$ will represent $x_{t-1}$ (the state at timestep $t-1$).
- $p(x)$ will be $q_\sigma(x_t|x_0)$.
- $p(y|x)$ will be the DDIM kernel $q_\sigma(x_{t-1}|x_t, x_0)$.
- $p(y)$ will be the target marginal $q_\sigma(x_{t-1}|x_0)$.
Proof by Induction
We want to prove the following statement for every $t \in \{1, \dots, T\}$: $$q_\sigma(x_t|x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}x_0,\ (1-\bar{\alpha}_t)I)$$ This states that the distribution of the noisy sample $x_t$ given the clean data $x_0$ is the same under DDIM's non-Markovian process as under DDPM's Markovian one.
1. Base Case:
Let's consider the timestep $t = T$. The DDIM forward process defines $q_\sigma(x_T|x_0) = \mathcal{N}(x_T;\ \sqrt{\bar{\alpha}_T}x_0,\ (1-\bar{\alpha}_T)I)$ directly, so the statement holds at $t = T$ by construction.
2. Inductive Hypothesis:
Assume the statement holds at timestep $t$, i.e., $q_\sigma(x_t|x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}x_0,\ (1-\bar{\alpha}_t)I)$. In the notation of the marginalization rule: $\mu_x = \sqrt{\bar{\alpha}_t}x_0$, $\Sigma_x = (1-\bar{\alpha}_t)I$.
3. Inductive Step:
We need to show that $q_\sigma(x_{t-1}|x_0) = \mathcal{N}(x_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}x_0,\ (1-\bar{\alpha}_{t-1})I)$. We use the DDIM non-Markovian kernel $q_\sigma(x_{t-1}|x_t, x_0)$ as $p(y|x)$ and read off its parameters:
- $A = \frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}}{\sqrt{1 - \bar{\alpha}_t}}I$
- $b = \left(\sqrt{\bar{\alpha}_{t-1}} - \frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}\,\sqrt{\bar{\alpha}_t}}{\sqrt{1 - \bar{\alpha}_t}}\right)x_0$
- $\Sigma_y = \sigma^2_t I$
Now, we apply the Gaussian marginalization rule to find $q_\sigma(x_{t-1}|x_0)$.
Mean of $q_\sigma(x_{t-1}|x_0)$: $A\mu_x + b$
$$\text{Mean} = \left(\frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}}{\sqrt{1 - \bar{\alpha}_t}}I\right)(\sqrt{\bar{\alpha}_t}x_0) + \left(\sqrt{\bar{\alpha}_{t-1}} - \frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}\,\sqrt{\bar{\alpha}_t}}{\sqrt{1 - \bar{\alpha}_t}}\right)x_0$$
$$= \left( \frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}\,\sqrt{\bar{\alpha}_t}}{\sqrt{1 - \bar{\alpha}_t}} + \sqrt{\bar{\alpha}_{t-1}} - \frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}\,\sqrt{\bar{\alpha}_t}}{\sqrt{1 - \bar{\alpha}_t}} \right) x_0$$
The two terms with fractions cancel out:
$$\text{Mean} = \sqrt{\bar{\alpha}_{t-1}}x_0$$
This matches the mean required for $q_\sigma(x_{t-1}|x_0)$.
Covariance of $q_\sigma(x_{t-1}|x_0)$: $\Sigma_y + A\Sigma_x A^T$
$$\text{Covariance} = \sigma^2_t I + \left(\frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}}{\sqrt{1 - \bar{\alpha}_t}}I\right) ((1-\bar{\alpha}_t)I) \left(\frac{\sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}}{\sqrt{1 - \bar{\alpha}_t}}I\right)^T$$
Since $I^T = I$:
$$= \sigma^2_t I + \frac{(1-\bar{\alpha}_{t-1} - \sigma^2_t)}{(1 - \bar{\alpha}_t)} (1-\bar{\alpha}_t)\, I \cdot I \cdot I$$
$$= \sigma^2_t I + (1-\bar{\alpha}_{t-1} - \sigma^2_t)I$$
$$= (\sigma^2_t + 1-\bar{\alpha}_{t-1} - \sigma^2_t)I$$
$$\text{Covariance} = (1-\bar{\alpha}_{t-1})I$$
This matches the covariance required for $q_\sigma(x_{t-1}|x_0)$.
Therefore, we have shown that if $q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$, then $q_\sigma(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}x_0, (1-\bar{\alpha}_{t-1})I)$.
4. Conclusion of Induction:
Since the base case holds at $t = T$ and the inductive step carries the statement from $t$ down to $t-1$, by induction the statement holds for all $t \in \{1, \dots, T\}$.
This means that for any timestep $t$, the DDIM non-Markovian process has exactly the same marginals $q_\sigma(x_t|x_0)$ as the DDPM forward process, so a model trained with the DDPM objective can be reused as-is.
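The induction can also be checked numerically: in one dimension, draw $x_t \sim q_\sigma(x_t|x_0)$, push the samples through the DDIM kernel, and compare the empirical moments of $x_{t-1}$ against the claimed marginal. A minimal sketch (the schedule, the choice of $\sigma_t$, and the scalar setup are assumptions for illustration):

```python
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

t = 500                               # arbitrary timestep (1-indexed)
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
sigma_t = 0.1                         # any sigma with sigma_t^2 < 1 - alpha_bar_{t-1}
x0 = 0.7                              # a fixed scalar "data point"

# Sample x_t from the inductive-hypothesis marginal q(x_t | x_0).
n = 1_000_000
x_t = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * torch.randn(n)

# Push the samples through the DDIM kernel q_sigma(x_{t-1} | x_t, x_0).
mean = ab_prev.sqrt() * x0 + (1 - ab_prev - sigma_t**2) ** 0.5 * \
       (x_t - ab_t.sqrt() * x0) / (1 - ab_t).sqrt()
x_prev = mean + sigma_t * torch.randn(n)

# Empirical moments should match the claimed marginal at t-1.
print(x_prev.mean().item(), (ab_prev.sqrt() * x0).item())  # approximately equal
print(x_prev.var().item(), (1 - ab_prev).item())           # approximately equal
```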
The DDIM forward process imposes a generative (reverse) process that is a specific instance of a more general family. Instead of the DDPM sampling step:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(x_t, t) \right) + \sigma_t \epsilon_t$$ where $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ for DDPMs,
DDIM introduces a more general sampling step. The key is how the variance $\sigma_t^2$ is chosen.
Starting from the DDIM kernel: $$ q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}x_0 + \sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t} \cdot \frac{x_t - \sqrt{\bar{\alpha}_t}x_0}{\sqrt{1 - \bar{\alpha}_t}},\ \sigma^2_t I\right) $$ and since we proved that the marginal is the same: $$ q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I) \implies x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \implies x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}} $$ substituting this into the kernel gives $$ q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}\;\epsilon,\ \sigma^2_t I\right) $$ and since we are minimizing the KL with the reverse process $p_\theta(x_{t-1}|x_t)$: $$ p_\theta(x_{t-1}|x_t) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma^2_t}\;\epsilon_\theta,\ \sigma^2_t I\right) $$
we end up with the DDIM sampling step via the reparameterization trick:
$$x_{t-1} = \underbrace{\sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0(x_t)}_{\text{component pointing to predicted } x_0} + \underbrace{\sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \boldsymbol{\epsilon}_{\theta}(x_t,t)}_{\text{component related to noise direction}} + \underbrace{\sigma_t \epsilon_t}_{\text{random noise}}$$
where $\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$ is the model's prediction of the clean sample and $\epsilon_t \sim \mathcal{N}(0, I)$ is fresh Gaussian noise.
The stochasticity parameter $\eta \in [0, 1]$ sets $\sigma_t = \eta\sqrt{\tilde{\beta}_t}$ and interpolates between a fully deterministic sampler and the original DDPM sampler:
- If $\eta = 1$: then $\sigma_t^2 = \tilde{\beta}_t$. The DDIM sampling step becomes: $$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{1-\bar{\alpha}_{t-1} - \tilde{\beta}_t} \cdot \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t) + \sqrt{\tilde{\beta}_t}\, \boldsymbol{\epsilon}_t$$ With some algebra, one can show that $\sqrt{1-\bar{\alpha}_{t-1} - \tilde{\beta}_t} = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{\sqrt{1-\bar{\alpha}_t}}$ (a quick numeric check of this identity follows below). Substituting $\hat{\mathbf{x}}_0$ and simplifying recovers the DDPM sampling equation: $$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) + \sqrt{\tilde{\beta}_t}\, \boldsymbol{\epsilon}_t$$ So DDPM is a special case of DDIM when $\eta = 1$. This process is stochastic.
- If $\eta = 0$: then $\sigma_t^2 = 0$ and the DDIM sampling step becomes deterministic: $$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{1-\bar{\alpha}_{t-1}} \cdot \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)$$ Substituting $\hat{\mathbf{x}}_0(\mathbf{x}_t) = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t))$: $$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1-\bar{\alpha}_{t-1}} \cdot \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)$$ This is the Denoising Diffusion Implicit Model (DDIM). Since no random noise $\boldsymbol{\epsilon}_t$ is added at any step (beyond the initial $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$), the entire generation process from $\mathbf{x}_T$ to $\mathbf{x}_0$ is deterministic given the model $\boldsymbol{\epsilon}_{\theta}$.
This means that for a fixed starting noise $\mathbf{x}_T$, the model always produces the same $\mathbf{x}_0$: the mapping from latent to image is a deterministic function.
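As a quick sanity check on the "some algebra" step in the $\eta = 1$ case above, the identity $\sqrt{1-\bar{\alpha}_{t-1} - \tilde{\beta}_t} = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{\sqrt{1-\bar{\alpha}_t}}$ can be verified numerically (the linear schedule here is just an assumption for the check):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

for t in (2, 100, 500, 999):          # a few arbitrary timesteps (1-indexed)
    a_t, ab_t, ab_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    beta_tilde = (1 - ab_prev) / (1 - ab_t) * betas[t - 1]
    lhs = (1 - ab_prev - beta_tilde).sqrt()
    rhs = a_t.sqrt() * (1 - ab_prev) / (1 - ab_t).sqrt()
    assert torch.isclose(lhs, rhs), (t, lhs, rhs)
print("identity holds on all tested timesteps")
```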
The term "implicit" in DDIM refers to the fact that the model implicitly defines a generative process. Unlike DDPMs, where the reverse process
The original DDPM paper (Ho et al., 2020) required $T = 1000$ sequential denoising steps to generate a single sample. DDIM can produce samples of comparable quality in as few as 20-100 steps, a 10x-50x wall-clock speed-up.
How is this achieved?
Instead of sampling at every timestep $t \in \{1, \dots, T\}$, DDIM performs updates only on a sub-sequence $\tau = (\tau_1, \dots, \tau_S) \subseteq \{1, \dots, T\}$ with $S \ll T$.
The DDIM update rule uses $\bar{\alpha}_{\tau_i}$ and $\bar{\alpha}_{\tau_{i-1}}$ (where $\tau_i$ and $\tau_{i-1}$ are consecutive elements of the sub-sequence, not necessarily adjacent timesteps of the original schedule).
And the update to get $\mathbf{x}_{\tau_{i-1}}$ (for $\eta = 0$) is: $$\mathbf{x}_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}}\, \hat{\mathbf{x}}_0(\mathbf{x}_{\tau_i}) + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}} \cdot \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{\tau_i}, \tau_i)$$ One simple way to construct such a sub-sequence is sketched below.
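This is a minimal sketch assuming uniform striding (the DDIM paper also considers quadratic spacing); `make_tau` is a hypothetical helper name:

```python
def make_tau(T=1000, S=50):
    """Uniformly spaced sub-sequence tau_1 < ... < tau_S of {1, ..., T}."""
    stride = T // S
    return list(range(stride, T + 1, stride))   # e.g. [20, 40, ..., 1000]

# Reverse for sampling (x_T -> x_0), matching the pseudo-code later on.
timesteps = make_tau()[::-1]                    # [1000, 980, ..., 20]
```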
Why does this work?
- Consistency of the $\hat{\mathbf{x}}_0$ prediction: the model $\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t)$ is trained to predict the noise component, which in turn yields a prediction of $\mathbf{x}_0$. This prediction is relatively stable across different $t$ values, and DDIM leverages it by using the predicted $\hat{\mathbf{x}}_0$ directly to guide the trajectory towards a cleaner image.
- No compounding noise (for $\eta = 0$): in DDPM, noise is added at each step ($\sigma_t \boldsymbol{\epsilon}_t$), and small errors in $\boldsymbol{\epsilon}_{\theta}$ can be amplified by this noise. In deterministic DDIM ($\eta = 0$) no new noise is injected; the process "denoises" directly towards the predicted $\hat{\mathbf{x}}_0$, which allows larger jumps between timesteps without significant degradation in quality. The path from $\mathbf{x}_T$ to $\mathbf{x}_0$ is much smoother.
The deterministic nature of DDIM (when $\eta = 0$) makes the generative mapping invertible, which enables encoding real images into the latent space.
Encoding:
Because the model defines a deterministic map from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$ when $\eta = 0$, the same update can be run in the opposite direction to map a real image $\mathbf{x}_0$ to the latent $\mathbf{x}_T$ that generates it.
If the reverse (denoising) step is: $$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{1-\bar{\alpha}_{t-1}} \cdot \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t)$$
Then the encoding (noising) step becomes: $$\mathbf{x}_{t} = \sqrt{\bar{\alpha}_{t}}\, \hat{\mathbf{x}}_0(\mathbf{x}_{t-1}) + \sqrt{1-\bar{\alpha}_{t}} \cdot \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t-1}, t-1)$$
Where: $$ \hat{\mathbf{x}}_0(\mathbf{x}_{t-1}) = \frac{1}{\sqrt{\bar{\alpha}_{t-1}}}\left(\mathbf{x}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t-1}, t-1)\right) $$
In this formulation, we are leveraging the fact that for sufficiently small steps $\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t-1}, t-1) \approx \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t)$, so the deterministic denoising step can be approximately inverted.
By iteratively applying this deterministic noising step starting from a real image $\mathbf{x}_0$, we obtain its latent code $\mathbf{x}_T$; running the deterministic sampler on $\mathbf{x}_T$ then reconstructs (approximately) the original image.
This property is extremely useful for:
- Image Reconstruction: Verifying model fidelity.
- Image Manipulation/Editing: modifying the latent code $\mathbf{x}_T$ and then decoding can lead to controlled edits of the original image. For example, one can interpolate between the latent codes of two images, as sketched after this list.
- Semantic Latent Spaces: the structure of the latent space $\mathbf{x}_T$ learned via DDIM can capture semantic features of the data.
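For the interpolation use case above, the DDIM paper interpolates between two latent codes with spherical linear interpolation (slerp) rather than a straight line, so the interpolated latents keep a norm typical of Gaussian samples. A minimal sketch (the function name and signature are mine, not from the paper):

```python
import torch

def slerp(z1, z2, lam):
    """Spherical linear interpolation between two latent codes z1 and z2."""
    z1_flat, z2_flat = z1.flatten(), z2.flatten()
    cos_theta = torch.dot(z1_flat, z2_flat) / (z1_flat.norm() * z2_flat.norm())
    theta = torch.acos(cos_theta.clamp(-1.0, 1.0))
    return (torch.sin((1 - lam) * theta) * z1
            + torch.sin(lam * theta) * z2) / torch.sin(theta)

# Usage sketch: decode each interpolated latent with the deterministic sampler.
# z1, z2 = torch.randn(3, 64, 64), torch.randn(3, 64, 64)
# frames = [ddim_sample(model, slerp(z1, z2, lam), timesteps)
#           for lam in torch.linspace(0, 1, 8)]
```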
| Feature | DDPM | DDIM ($\eta = 0$) | DDIM ($\eta = 1$) |
|---|---|---|---|
| Training | Learns $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ via $L_{simple}$ | Uses DDPM-trained model | Uses DDPM-trained model |
| Forward Process | Markovian (fixed noise schedule) | Implies a non-Markovian process (same $q(x_t \vert x_0)$) | Same as DDPM |
| Generative Steps | $T$ steps (e.g., 1000) | Sub-sequence of $S \ll T$ steps (e.g., 20-100) | Typically all $T$ steps |
| Stochasticity | Stochastic (noise $\sigma_t \boldsymbol{\epsilon}_t$ added at each step) | Deterministic (no noise beyond the initial $\mathbf{x}_T$) | Stochastic (same as DDPM) |
| Invertibility | Not directly invertible | Invertible (allows encoding $\mathbf{x}_0 \to \mathbf{x}_T$) | Not directly invertible |
| Sampling Speed | Slow | Fast | Slow |
| Sample Quality | High | High (often better than DDPM for the same number of model evaluations when few steps are used) | High (similar to DDPM) |
| Typical Use | Standard high-quality generation | Fast generation, image editing, latent space exploration | Equivalent to DDPM sampling |
The key is to use the same network $\boldsymbol{\epsilon}_\theta$, trained once with the DDPM objective, and only change the sampling loop, as the pseudo-code below shows.
```python
# DDIM sampling and encoding pseudo-code (eta = 0 gives the deterministic process).
import math

import torch

# Assumed linear beta schedule (as in Ho et al., 2020); in practice these
# values come from the training configuration of the model.
T = 1000
_betas = torch.linspace(1e-4, 0.02, T)
_alphas = 1.0 - _betas
_alpha_bars = torch.cumprod(_alphas, dim=0)


def get_alpha(t):
    """alpha_t for a 1-indexed timestep t."""
    return _alphas[t - 1].item()


def get_alpha_bar(t):
    """alpha_bar_t (cumulative product of alphas) for a 1-indexed timestep t."""
    return _alpha_bars[t - 1].item()


def ddim_sample(model, x_T, timesteps):
    """
    Sample from a trained diffusion model using DDIM (eta = 0, deterministic).

    Args:
        model: The trained epsilon_theta model that predicts noise.
        x_T: Starting noise (typically sampled from N(0, I)).
        timesteps: Decreasing list of timesteps to use for sampling (a subset
            of the original schedule). For example, if the original T = 1000,
            we might use timesteps = [1000, 950, 900, ..., 50].

    Returns:
        The generated sample (x_0, up to the final step of the sub-sequence).
    """
    x_t = x_T
    for i in range(len(timesteps) - 1):
        # Current and next timestep in the sub-sequence.
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        # Cumulative-product alphas for both timesteps.
        alpha_bar_t = get_alpha_bar(t)
        alpha_bar_prev = get_alpha_bar(t_prev)
        # Predict the noise component with the model.
        pred_noise = model(x_t, t)
        # Predict x_0 from x_t and the predicted noise.
        pred_x_0 = (x_t - math.sqrt(1 - alpha_bar_t) * pred_noise) / math.sqrt(alpha_bar_t)
        # DDIM update (eta = 0): move towards pred_x_0 along the noise direction.
        x_t = math.sqrt(alpha_bar_prev) * pred_x_0 + math.sqrt(1 - alpha_bar_prev) * pred_noise
    return x_t  # This is (approximately) x_0 after the final iteration
def ddim_encode(model, x_0, timesteps):
    """
    Encode a real image x_0 into its latent representation x_T using DDIM.

    Args:
        model: The trained epsilon_theta model that predicts noise.
        x_0: The real image to encode.
        timesteps: Increasing list of timesteps to use for encoding (a subset
            of the original schedule, in the reverse order of sampling). For
            example, if the original T = 1000, we might use
            timesteps = [50, 100, ..., 1000].

    Returns:
        The latent representation x_T.
    """
    x_t = x_0
    for i in range(len(timesteps) - 1):
        # Current (lower) and next (higher) timestep in the sub-sequence.
        t_prev = timesteps[i]
        t = timesteps[i + 1]
        # Cumulative-product alphas for both timesteps.
        alpha_bar_prev = get_alpha_bar(t_prev)
        alpha_bar_t = get_alpha_bar(t)
        # Predict the noise component at the current (lower) timestep.
        pred_noise = model(x_t, t_prev)
        # Predict x_0 from the current state and the predicted noise.
        pred_x_0 = (x_t - math.sqrt(1 - alpha_bar_prev) * pred_noise) / math.sqrt(alpha_bar_prev)
        # DDIM forward (encoding) step: deterministically re-noise towards timestep t.
        x_t = math.sqrt(alpha_bar_t) * pred_x_0 + math.sqrt(1 - alpha_bar_t) * pred_noise
    return x_t  # This is x_T after the final iteration
def ddim_sample_with_eta(model, x_T, timesteps, eta=0.0):
    """
    Sample from a trained diffusion model using DDIM with controllable stochasticity.

    Args:
        model: The trained epsilon_theta model that predicts noise.
        x_T: Starting noise (typically sampled from N(0, I)).
        timesteps: Decreasing list of timesteps to use for sampling.
        eta: Controls stochasticity (0.0 = deterministic DDIM, 1.0 = DDPM-like).

    Returns:
        The generated sample.
    """
    x_t = x_T
    for i in range(len(timesteps) - 1):
        # Current and next timestep in the sub-sequence.
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        # Schedule values for both timesteps.
        alpha_t = get_alpha(t)
        alpha_bar_t = get_alpha_bar(t)
        alpha_bar_prev = get_alpha_bar(t_prev)
        # beta_t and the DDPM posterior variance beta_tilde_t.
        beta_t = 1 - alpha_t
        beta_tilde = (1 - alpha_bar_prev) / (1 - alpha_bar_t) * beta_t
        # sigma_t interpolates between deterministic (eta=0) and DDPM (eta=1).
        sigma_t = eta * math.sqrt(beta_tilde)
        # Predict the noise component with the model.
        pred_noise = model(x_t, t)
        # Predict x_0 from x_t and the predicted noise.
        pred_x_0 = (x_t - math.sqrt(1 - alpha_bar_t) * pred_noise) / math.sqrt(alpha_bar_t)
        # Coefficient of the "direction pointing to x_t" (noise) term.
        direction_coefficient = math.sqrt(1 - alpha_bar_prev - sigma_t**2)
        # DDIM update with stochasticity: deterministic part plus fresh noise.
        x_t = (math.sqrt(alpha_bar_prev) * pred_x_0
               + direction_coefficient * pred_noise
               + sigma_t * torch.randn_like(x_t))
    return x_t  # This is (approximately) x_0 after the final iteration
```
References

- Song, J., Meng, C., & Ermon, S. (2020). "Denoising Diffusion Implicit Models." arXiv:2010.02502 (published at ICLR 2021).
- Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." arXiv:2006.11239.

