Introduction to Diffusion Models and Diffusion Policy

Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in image synthesis and text-to-image generation. In robotics, they have been adapted into diffusion policies, which change how robots learn and execute complex manipulation and navigation tasks. Unlike traditional policy learning methods that output a single deterministic action or a simple stochastic action (for example, a unimodal Gaussian), diffusion policies model the action distribution through an iterative denoising process, allowing robots to capture multi-modal behaviors and handle ambiguous situations more effectively. This article introduces diffusion models, their mathematical foundations through both Ordinary Differential Equation (ODE) and Stochastic Differential Equation (SDE) formulations, and their practical application in robotics through diffusion policies.

Fundamentals of Diffusion Models

Diffusion models are generative models that learn to generate data by reversing a gradual noising process. The core idea is inspired by non-equilibrium thermodynamics: starting from a simple noise distribution (typically Gaussian), the model learns to iteratively denoise the data until it produces a realistic sample from the target distribution.

The Forward Diffusion Process

The forward diffusion process gradually adds Gaussian noise to data over a series of timesteps. Given an initial data point $\mathbf{x}_0$ sampled from the data distribution $q(\mathbf{x}_0)$, we define a forward process that produces a sequence of increasingly noisy versions $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$, where $T$ is the total number of diffusion steps.

The forward process is defined as:

\[q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t | \mathbf{x}_{t-1})\]

where each step adds Gaussian noise:

\[q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})\]

Here, $\beta_t$ is a variance schedule that controls how much noise is added at each timestep $t$. Typically, $\beta_t$ increases with $t$, meaning more noise is added as we progress through the diffusion process.

A key insight is that we can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without going through all intermediate steps:

\[q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})\]

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. This allows efficient training by sampling random timesteps.
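As a minimal sketch of the closed-form shortcut, the following pure-Python snippet builds a schedule, accumulates $\bar{\alpha}_t$, and samples $\mathbf{x}_t$ directly from $\mathbf{x}_0$. The linear schedule and its endpoints (`1e-4` to `0.02`) are assumptions chosen for illustration, and all function names are hypothetical. The helper `iterated_moments` composes the single-step Gaussians one at a time, so it can be compared against the closed-form coefficients.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], increasing linearly (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha_bar(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T], alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng=random):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-based, x0 is a list of floats."""
    ab = alpha_bars[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, noise)]
    return x_t, noise

def iterated_moments(betas, t):
    """Mean coefficient and variance of x_t given x_0, composing t single steps."""
    coeff, var = 1.0, 0.0
    for b in betas[:t]:
        coeff *= math.sqrt(1.0 - b)   # each step scales the mean by sqrt(1 - beta_s)
        var = (1.0 - b) * var + b     # old variance shrinks, then beta_s is added
    return coeff, var
```

For any $t$, `iterated_moments(betas, t)` should agree with $(\sqrt{\bar{\alpha}_t},\ 1-\bar{\alpha}_t)$ up to floating-point error, which is the reason intermediate steps can be skipped during training.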

The Reverse Diffusion Process

The reverse process is what we learn during training. We want to model $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$, which reverses the forward process. The reverse process is parameterized by a neural network with parameters $\theta$:

\[p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\]

where $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$ is the prior distribution (pure noise).

The reverse step is also Gaussian:

\[p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))\]

Training Objective

The training objective for diffusion models is derived from variational inference. The key insight is to predict the noise $\boldsymbol{\epsilon}$ that was added to $\mathbf{x}_0$ to obtain $\mathbf{x}_t$, rather than predicting $\mathbf{x}_{t-1}$ directly.

Given that $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the training loss becomes:

\[\mathcal{L} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \right]\]

where $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is a neural network that predicts the noise given the noisy sample $\mathbf{x}_t$ and timestep $t$.
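The loss can be estimated one sample at a time, as in the sketch below. It is an illustration under stated assumptions rather than a training loop: the linear schedule is assumed, and `zero_model` and `make_oracle_model` are hypothetical placeholders standing in for the neural network $\boldsymbol{\epsilon}_\theta$, chosen so that the two extremes of the loss are easy to see.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for t = 1..T under an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta_t = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta_t
        alpha_bars.append(running)
    return alpha_bars

def diffusion_loss(x0, eps_model, alpha_bars, rng):
    """One-sample Monte Carlo estimate of E ||eps - eps_theta(x_t, t)||^2."""
    t = rng.randint(1, len(alpha_bars))        # t ~ Uniform{1, ..., T}
    ab = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # eps ~ N(0, I)
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

def zero_model(x_t, t):
    """Untrained placeholder that always predicts zero noise."""
    return [0.0 for _ in x_t]

def make_oracle_model(x0, alpha_bars):
    """Placeholder that knows x0 and therefore recovers the noise exactly."""
    def model(x_t, t):
        ab = alpha_bars[t - 1]
        return [(xt - math.sqrt(ab) * x) / math.sqrt(1.0 - ab)
                for xt, x in zip(x_t, x0)]
    return model
```

Averaged over many draws, `zero_model` gives a loss close to the data dimension (the expected squared norm of standard Gaussian noise), while the oracle drives it to zero. A trained network sits between these two, since it has to infer the noise from $\mathbf{x}_t$ alone.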

ODE Formulation: Deterministic Sampling

The Ordinary Differential Equation (ODE) formulation provides a deterministic view of the diffusion process, enabling faster and more stable sampling. This formulation is based on the observation that the diffusion process can be described as a continuous-time ODE.

Probability Flow ODE

In the continuous-time limit, the forward diffusion process can be written as a stochastic differential equation. However, there exists a corresponding deterministic ODE, called the Probability Flow ODE, that has the same marginal distributions:

\[\frac{d\mathbf{x}}{dt} = -\frac{1}{2}\beta(t)\left[\mathbf{x} + \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\]

where $\beta(t)$ is the continuous-time noise schedule and $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the score function.

The score function can be approximated using the learned noise predictor:

\[\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \approx -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}\]

This leads to the practical ODE:

\[\frac{d\mathbf{x}}{dt} = -\frac{1}{2}\beta(t)\left[\mathbf{x} - \frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}\right]\]
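The score approximation used here can be motivated with a short derivation. This is a sketch of the standard denoising score matching argument, not something specific to this article. Because $q(\mathbf{x}_t | \mathbf{x}_0)$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}\mathbf{x}_0$ and covariance $(1-\bar{\alpha}_t)\mathbf{I}$, and $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$, its score is:

\[\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{1-\bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}\]

The marginal score is the average of this conditional score over all clean samples that could have produced $\mathbf{x}_t$, and the minimizer of the noise-prediction loss is the matching average of $\boldsymbol{\epsilon}$:

\[\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = -\frac{\mathbb{E}[\boldsymbol{\epsilon} | \mathbf{x}_t]}{\sqrt{1-\bar{\alpha}_t}} \approx -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}\]

In continuous time, $\bar{\alpha}_t$ is read as $\exp\left(-\int_0^t \beta(s)\,ds\right)$, the continuous counterpart of the discrete product.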

Advantages of ODE Formulation

The ODE formulation offers several practical advantages:

Deterministic Sampling: Given the same initial noise, the ODE produces the same output, enabling reproducible results.
Faster Sampling: ODE solvers can use adaptive step sizes and higher-order methods, allowing fewer function evaluations than the standard discrete diffusion steps.
Exact Likelihood Computation: The ODE formulation enables exact likelihood computation through the change of variables formula, useful for model evaluation and likelihood-based training.
Latent Space Interpolation: The deterministic nature allows smooth interpolation in the latent space, useful for generating intermediate samples.
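For the likelihood claim, the relevant formula is the instantaneous change of variables. Writing the Probability Flow ODE as $\frac{d\mathbf{x}}{dt} = \tilde{\mathbf{f}}(\mathbf{x}, t)$ (the symbol $\tilde{\mathbf{f}}$ is introduced here only as shorthand for its right-hand side), the log-density of a data point follows from integrating the divergence of the drift along its trajectory:

\[\log p_0(\mathbf{x}(0)) = \log p_T(\mathbf{x}(T)) + \int_0^T \nabla \cdot \tilde{\mathbf{f}}(\mathbf{x}(t), t)\,dt\]

In practice the result is exact only up to ODE solver error, and when the learned score replaces the true one it is the likelihood of the learned model. For high-dimensional data the divergence is usually estimated stochastically (for example with a Hutchinson-style trace estimator) rather than computed exactly.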

Numerical Integration Methods

Common ODE solvers used with diffusion models include:

Euler Method: Simple first-order method. Because sampling integrates backward in time, each step is $\mathbf{x}_{t-\Delta t} = \mathbf{x}_t - \Delta t \cdot \frac{d\mathbf{x}}{dt}$
Heun’s Method: Second-order Runge-Kutta method for better accuracy
DPM-Solver: Specialized solver designed for diffusion models that can achieve high-quality samples with very few steps (e.g., 10-20 steps instead of 1000)
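The first two solvers can be sketched in a few lines. To keep the example self-contained, the noise predictor is replaced by an analytic one for 1-D Gaussian data, for which the ODE also has a closed-form solution to compare against. The linear schedule with `BETA_MIN = 0.1` and `BETA_MAX = 20.0`, the stopping time `t_end = 1e-3` (which avoids dividing by $\sqrt{1-\bar{\alpha}_t} = 0$ at $t = 0$), and all function names are assumptions made for illustration. DPM-Solver is not implemented here.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear schedule beta(t) on t in [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha_bar(t):
    """Continuous-time alpha-bar: exp(-integral of beta from 0 to t)."""
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))

def make_gaussian_eps(mean, std):
    """Analytic stand-in for eps_theta when the data is 1-D Gaussian N(mean, std^2)."""
    def eps(x, t):
        ab = alpha_bar(t)
        var_t = std * std * ab + (1.0 - ab)            # variance of p_t
        score = -(x - math.sqrt(ab) * mean) / var_t    # exact score of p_t
        return -math.sqrt(1.0 - ab) * score
    return eps

def ode_drift(x, t, eps):
    """Right-hand side of the practical ODE written with the noise predictor."""
    return -0.5 * beta(t) * (x - eps(x, t) / math.sqrt(1.0 - alpha_bar(t)))

def euler_sample(x, eps, n_steps, t_start=1.0, t_end=1e-3):
    """First-order integration from t_start down to t_end."""
    dt = (t_start - t_end) / n_steps
    for i in range(n_steps):
        t = t_start - i * dt
        x = x - dt * ode_drift(x, t, eps)              # minus sign: time runs backward
    return x

def heun_sample(x, eps, n_steps, t_start=1.0, t_end=1e-3):
    """Second-order integration: average the slopes at both ends of each step."""
    dt = (t_start - t_end) / n_steps
    for i in range(n_steps):
        t = t_start - i * dt
        d1 = ode_drift(x, t, eps)
        x_pred = x - dt * d1                           # Euler predictor
        d2 = ode_drift(x_pred, t - dt, eps)
        x = x - 0.5 * dt * (d1 + d2)                   # trapezoidal corrector
    return x

def exact_flow(x_start, mean, std, t_start=1.0, t_end=1e-3):
    """Closed-form ODE solution for Gaussian data, used only as a reference."""
    def moments(t):
        ab = alpha_bar(t)
        return math.sqrt(ab) * mean, math.sqrt(std * std * ab + (1.0 - ab))
    m1, s1 = moments(t_start)
    m0, s0 = moments(t_end)
    return m0 + s0 * (x_start - m1) / s1
```

Heun's method evaluates the network twice per step, so a fair comparison with Euler counts function evaluations rather than steps. The same initial value always maps to the same output, which is the determinism property described in the previous section.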

SDE Formulation: Stochastic Sampling

The Stochastic Differential Equation (SDE) formulation provides a more general framework that encompasses both the forward and reverse diffusion processes as continuous-time stochastic processes.

Forward SDE

The forward diffusion process can be written as a forward SDE:

\[d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}\]

where:

$\mathbf{f}(\mathbf{x}, t)$ is the drift coefficient
$g(t)$ is the diffusion coefficient
$d\mathbf{w}$ is the increment of a Wiener process (Brownian motion)

For the variance-preserving diffusion process:

\[\mathbf{f}(\mathbf{x}, t) = -\frac{1}{2}\beta(t)\mathbf{x}, \quad g(t) = \sqrt{\beta(t)}\]
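A quick way to build intuition for these coefficients is to simulate the forward SDE with the Euler-Maruyama scheme, as sketched below for a scalar state. The linear $\beta(t)$ with endpoints `0.1` and `20.0` on $t \in [0, 1]$ is an assumed schedule, and the function names are hypothetical.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear schedule beta(t) on t in [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def simulate_forward_vp(x0, n_steps, rng, t_end=1.0):
    """Euler-Maruyama simulation of dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dw."""
    dt = t_end / n_steps
    x = x0
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta(t) * x                  # f(x, t): pulls x toward zero
        diffusion = math.sqrt(beta(t))              # g(t): scales the injected noise
        x = x + drift * dt + diffusion * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

def sample_moments(x0, n_paths, n_steps, rng):
    """Empirical mean and variance of x at t_end over independent paths."""
    xs = [simulate_forward_vp(x0, n_steps, rng) for _ in range(n_paths)]
    mean = sum(xs) / n_paths
    var = sum((x - mean) ** 2 for x in xs) / n_paths
    return mean, var
```

Starting every path from the same point, the empirical mean at the final time should be near zero and the variance near one: the drift erases the starting value while the diffusion term replaces it with unit-variance noise.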

Reverse SDE

The reverse-time SDE that generates samples is:

\[d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt + g(t)d\bar{\mathbf{w}}\]

where $\bar{\mathbf{w}}$ is a Wiener process running backward in time and $dt$ is a negative infinitesimal timestep. The key term is $g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, which corrects the drift so that the process reverses the diffusion.
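The reverse SDE can be sketched with the same Euler-Maruyama idea, stepping from $t = 1$ toward $t = 0$. In the snippet below the score is known analytically because the data is assumed to be 1-D Gaussian; in a real model it would come from the trained network. The schedule, the small positive stopping time `t_end = 1e-3`, and the function names are illustrative assumptions.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear schedule beta(t) on t in [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha_bar(t):
    """Continuous-time alpha-bar: exp(-integral of beta from 0 to t)."""
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))

def make_gaussian_score(mean, std):
    """Exact score of p_t when the data is 1-D Gaussian N(mean, std^2)."""
    def score(x, t):
        ab = alpha_bar(t)
        var_t = std * std * ab + (1.0 - ab)
        return -(x - math.sqrt(ab) * mean) / var_t
    return score

def reverse_sde_sample(score, n_steps, rng, t_start=1.0, t_end=1e-3):
    """Euler-Maruyama integration of the reverse-time VP SDE."""
    dt = (t_start - t_end) / n_steps
    x = rng.gauss(0.0, 1.0)                      # start from the prior N(0, 1)
    for i in range(n_steps):
        t = t_start - i * dt
        f = -0.5 * beta(t) * x                   # forward drift
        g2 = beta(t)                             # g(t)^2
        drift = f - g2 * score(x, t)             # reverse-time drift
        # Time runs backward, so the drift is subtracted; fresh noise is added.
        x = x - drift * dt + math.sqrt(g2 * dt) * rng.gauss(0.0, 1.0)
    return x
```

Unlike the deterministic ODE, two runs from the same starting noise generally end at different points, because new noise is injected at every step. Over many runs the samples should follow the data distribution, here $\mathcal{N}(\text{mean}, \text{std}^2)$.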

Variance-Exploding vs Variance-Preserving

There are two main classes of diffusion SDEs:

Variance-Preserving (VP) SDE: The variance of $\mathbf{x}_t$ stays bounded (it remains at one throughout if the data has unit variance). This is the most common formulation:

\[d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x}dt + \sqrt{\beta(t)}d\mathbf{w}\]

Variance-Exploding (VE) SDE: The variance grows unbounded:

\[d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}d\mathbf{w}\]

where $\sigma(t)$ is a time-dependent noise scale.

Advantages of SDE Formulation

Theoretical Rigor: Provides a complete mathematical framework connecting forward and reverse processes
Flexibility: Can model different noise schedules and diffusion types
Stochasticity: The Brownian motion term introduces randomness, which can help explore the data distribution
Unified Framework: Connects diffusion models to other generative models (e.g., score-based models)

Diffusion Policy for Robotics

Diffusion policies adapt the diffusion model framework to learn action distributions for robotic control. Instead of generating images, the model generates action sequences that the robot should execute.

Problem Formulation

In imitation learning, we have a dataset $\mathcal{D} = \{(\mathbf{o}_i, \mathbf{a}_i)\}$ of observation-action pairs. Traditional methods learn a deterministic policy $\pi(\mathbf{a} | \mathbf{o})$ or a simple Gaussian policy. Diffusion policies model the action distribution as:

\[\pi(\mathbf{a} | \mathbf{o}) = \int p(\mathbf{a}_T) \prod_{t=1}^{T} p_\theta(\mathbf{a}_{t-1} | \mathbf{a}_t, \mathbf{o}) d\mathbf{a}_{1:T}\]

where $\mathbf{a}$ is the action and $\mathbf{o}$ is the observation (e.g., camera image, proprioceptive state).

Architecture

A diffusion policy network takes as input:

Observation $\mathbf{o}$: Current robot state and sensor readings
Noisy action $\mathbf{a}_t$: The action at diffusion timestep $t$
Timestep $t$: The current diffusion step

The network outputs a prediction of the noise $\boldsymbol{\epsilon}_\theta(\mathbf{a}_t, \mathbf{o}, t)$ that was added to the clean action.

Action Sequence Generation

For robotics, we often need to generate action sequences rather than single actions. The diffusion policy can generate a sequence of $H$ actions $\mathbf{a}^{0:H-1}$:

Sample initial noise: $\mathbf{a}_T^{0:H-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
For $t = T$ down to $1$:
- Predict noise: $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{a}_t^{0:H-1}, \mathbf{o}, t)$
- Denoise: $\mathbf{a}_{t-1}^{0:H-1} = \text{denoise}(\mathbf{a}_t^{0:H-1}, \hat{\boldsymbol{\epsilon}}, t)$
Return $\mathbf{a}_0^{0:H-1}$ as the action sequence

The robot executes the first action $\mathbf{a}_0^0$ (or, as in Chi et al. (2023), the first few actions of the sequence), then re-plans from a fresh observation (receding horizon control).
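The procedure above can be sketched end to end. The text leaves `denoise` unspecified, so this example assumes the DDPM ancestral update of Ho et al. (2020) with variance $\sigma_t^2 = \beta_t$. Everything else is a toy stand-in: actions are scalars, the observation is a 1-D position, `expert_plan` is a hypothetical demonstrator, and `make_oracle_eps` replaces the trained network with a function that returns the exact noise for that demonstrator's plan.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.1):
    """Linear beta schedule (an assumed choice) and its cumulative alpha-bar products."""
    betas = [beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    return betas, alpha_bars

def denoise(a_t, eps_hat, t, betas, alpha_bars, rng):
    """One DDPM ancestral step a_t -> a_{t-1}; t is 1-based."""
    beta_t, ab_t = betas[t - 1], alpha_bars[t - 1]
    scale = beta_t / math.sqrt(1.0 - ab_t)
    mean = [(a - scale * e) / math.sqrt(1.0 - beta_t) for a, e in zip(a_t, eps_hat)]
    if t == 1:
        return mean                      # the final step adds no noise
    sigma = math.sqrt(beta_t)            # variance choice sigma_t^2 = beta_t
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def sample_action_sequence(obs, eps_model, horizon, betas, alpha_bars, rng):
    """Run the full reverse chain from Gaussian noise to an action sequence."""
    actions = [rng.gauss(0.0, 1.0) for _ in range(horizon)]   # a_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        eps_hat = eps_model(actions, obs, t)
        actions = denoise(actions, eps_hat, t, betas, alpha_bars, rng)
    return actions

def expert_plan(obs, horizon, goal=0.0):
    """Hypothetical demonstrator: each action closes 20% of the remaining gap."""
    plan, pos = [], obs
    for _ in range(horizon):
        step = 0.2 * (goal - pos)
        plan.append(step)
        pos += step
    return plan

def make_oracle_eps(alpha_bars, horizon):
    """Stand-in for a trained network: returns the exact noise for the expert plan."""
    def eps_model(a_t, obs, t):
        ab = alpha_bars[t - 1]
        plan = expert_plan(obs, horizon)
        return [(a - math.sqrt(ab) * p) / math.sqrt(1.0 - ab)
                for a, p in zip(a_t, plan)]
    return eps_model

def run_receding_horizon(start, control_steps, eps_model, horizon, betas, alpha_bars, rng):
    """Plan a sequence, execute only its first action, observe, and re-plan."""
    pos = start
    for _ in range(control_steps):
        plan = sample_action_sequence(pos, eps_model, horizon, betas, alpha_bars, rng)
        pos += plan[0]
    return pos
```

With the oracle predictor the sampled sequence reproduces the expert plan, which makes the mechanics easy to check. A trained network would only approximate the noise, and with multi-modal demonstrations different noise draws could land on different valid plans.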

Key Advantages for Robotics

Multi-Modal Action Distributions: Unlike deterministic policies, diffusion policies can represent multiple valid actions for the same observation, crucial when there are multiple ways to accomplish a task.
Temporal Consistency: By generating action sequences, the policy naturally maintains temporal smoothness, important for stable robot control.
Robustness: The iterative denoising process can recover from poor initial samples, making the policy more robust to distribution shift.
Data Efficiency: Can learn complex behaviors from relatively small demonstration datasets.

Practical Implications: ODE vs SDE

The choice between ODE and SDE formulations has significant practical implications for deploying diffusion policies in robotics.

Sampling Speed

ODE Formulation:

Typically requires 10-50 function evaluations for high-quality samples
Deterministic, reproducible results
Faster for real-time control applications
Can use advanced ODE solvers (DPM-Solver, DPM-Solver++) for even fewer steps

SDE Formulation:

Usually requires 50-1000 steps for good quality
Stochastic sampling introduces variability
Slower, but can explore the distribution better
More suitable for offline planning or when sampling speed is not critical

Sample Quality and Diversity

ODE Formulation:

Produces high-quality, deterministic samples
Diversity comes only from the initial noise draw; a fixed initial noise always yields the same sample
Better for tasks requiring precise, repeatable actions
Suitable for safety-critical applications where consistency matters

SDE Formulation:

Can produce more diverse samples due to stochasticity
Better exploration of the action distribution
Useful when multiple valid action modes exist
May introduce unwanted variability in robot behavior

Computational Requirements

ODE Formulation:

Lower computational cost per sample (fewer steps)
Can leverage efficient ODE solvers
Better suited for on-robot deployment with limited compute
Enables real-time control at higher frequencies

SDE Formulation:

Higher computational cost (more steps typically needed)
May require GPU acceleration for real-time performance
Better suited for offline planning or simulation

Implementation Considerations

For robotics applications, consider the following:

Real-Time Constraints: If the robot needs to make decisions at high frequency (e.g., 10-30 Hz), ODE formulation with fast solvers is preferable.
Action Diversity Needs: If the task benefits from exploring multiple action modes (e.g., manipulation with multiple grasp strategies), SDE formulation may be better.
Hardware Limitations: Resource-constrained robots should prefer ODE formulation for its efficiency.
Safety Requirements: Deterministic ODE sampling provides more predictable behavior, important for safety-critical applications.
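The real-time constraint can be turned into a rough step budget with simple arithmetic. The helper below is a back-of-the-envelope sketch: the function name, the per-evaluation latency, and the overhead figure are hypothetical, and it assumes inference runs synchronously inside the control period. The `actions_per_plan` parameter covers the case where several actions from one sampled sequence are executed before re-planning.

```python
def max_denoising_steps(control_hz, ms_per_eval, overhead_ms=0.0, actions_per_plan=1):
    """Largest number of network evaluations that fits in one planning period."""
    period_ms = 1000.0 * actions_per_plan / control_hz   # time between re-plans
    budget_ms = period_ms - overhead_ms                  # minus sensing, I/O, etc.
    return max(0, int(budget_ms // ms_per_eval))

# Hypothetical numbers: 8 ms per network evaluation, 20 ms fixed overhead.
steps_at_10hz = max_denoising_steps(10, 8.0, overhead_ms=20.0)
steps_at_30hz = max_denoising_steps(30, 8.0, overhead_ms=20.0)
steps_chunked = max_denoising_steps(30, 8.0, overhead_ms=20.0, actions_per_plan=8)
```

Under these assumed numbers, re-planning every control tick leaves room for 10 evaluations at 10 Hz but only 1 at 30 Hz, while executing 8 actions per plan at 30 Hz raises the budget to 30. This is why few-step solvers and multi-action execution matter for high-frequency control.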

Hybrid Approaches

Many practical implementations use hybrid approaches:

Training: Train once with the standard noise-prediction objective on the forward (SDE) noising process; the same network can then drive either sampler
Inference: Use ODE formulation for faster, deterministic sampling
Adaptive Sampling: Start with ODE for fast initial samples, use SDE refinement when needed

Training Diffusion Policies

Training a diffusion policy follows similar principles to training image diffusion models, with modifications for the robotics domain.

Dataset Preparation

The training dataset consists of:

Observations: $\mathbf{o}_i$ - sensor readings, images, proprioceptive state
Actions: $\mathbf{a}_i$ - demonstrated actions (joint velocities, end-effector poses, etc.)
Action Sequences: For sequence-based policies, include action sequences $\mathbf{a}_i^{0:H-1}$

Loss Function

The training loss is similar to standard diffusion models:

\[\mathcal{L} = \mathbb{E}_{(\mathbf{o}, \mathbf{a}) \sim \mathcal{D}, t \sim \mathcal{U}(1,T), \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{a}_t, \mathbf{o}, t)\|^2 \right]\]

where $\mathbf{a}_t = \sqrt{\bar{\alpha}_t}\mathbf{a} + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ is the noisy action.

Observation Encoding

The observation $\mathbf{o}$ needs to be encoded into a representation that the diffusion model can use. Common approaches:

CNN Encoders: For image observations
Transformer Encoders: For multi-modal observations or sequences
Proprioceptive Encoders: MLPs for joint positions, velocities, etc.

In the simplest design, the encoded observation is concatenated with the noisy action and timestep embedding before being fed to the diffusion model; other conditioning strategies are listed in the next section.

Conditioning Strategies

Different conditioning strategies affect how observations influence action generation:

Concatenation: Simply concatenate observation features with noisy actions
Cross-Attention: Use attention mechanisms to condition on observations
FiLM (Feature-wise Linear Modulation): Modulate intermediate features based on observations
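Two of these strategies are simple enough to sketch without a deep learning framework. The snippet below uses plain Python lists in place of tensors; the function names and the single linear layers that produce the FiLM scale and shift are illustrative assumptions, not a prescribed architecture. Cross-attention is omitted.

```python
def concat_condition(noisy_action, obs_features, t_embedding):
    """Concatenation: one flat input vector for the denoising network."""
    return list(noisy_action) + list(obs_features) + list(t_embedding)

def linear(weights, bias, x):
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(hidden, obs_features, scale_params, shift_params):
    """FiLM: per-feature scale and shift of hidden activations, predicted from the observation."""
    scale = linear(*scale_params, obs_features)   # one scale value per hidden feature
    shift = linear(*shift_params, obs_features)   # one shift value per hidden feature
    return [s * h + b for s, h, b in zip(scale, hidden, shift)]

# Tiny usage example with hand-picked (hypothetical) weights.
hidden = [1.0, 2.0]
obs = [3.0]
scale_params = ([[0.0], [0.0]], [1.0, 1.0])    # scale is always 1
shift_params = ([[1.0], [-1.0]], [0.0, 0.0])   # shift is +obs and -obs
modulated = film(hidden, obs, scale_params, shift_params)
```

Concatenation injects the observation only at the input, whereas FiLM can be applied at every layer, so the observation keeps influencing the features as they are processed.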

Applications in Robotics

Diffusion policies have shown success in various robotic domains:

Manipulation Tasks

Grasping: Learning diverse grasping strategies from demonstrations
Assembly: Complex manipulation sequences for assembly tasks
Tool Use: Using tools with multiple valid manipulation modes

Mobile Robotics

Navigation: Planning paths with multiple valid routes
Obstacle Avoidance: Generating smooth avoidance trajectories

Human-Robot Interaction

Handover Tasks: Adapting to different handover scenarios
Collaborative Manipulation: Coordinating with human partners

Challenges and Limitations

While diffusion policies offer significant advantages, they also face challenges:

Computational Cost: Even with ODE solvers, diffusion policies require multiple forward passes, making them slower than simple policy networks.
Hyperparameter Sensitivity: The noise schedule $\beta_t$ and number of steps $T$ significantly affect performance and need careful tuning.
Distribution Shift: Like all imitation learning methods, performance degrades when test conditions differ from training.
Sequential Inference: Generating action sequences requires careful handling of observation updates during execution.

Summary

Diffusion models represent a powerful paradigm for generative modeling that has been successfully adapted to robotics through diffusion policies. The mathematical foundations can be understood through both ODE and SDE formulations, each offering distinct advantages. The ODE formulation provides deterministic, fast sampling suitable for real-time control, while the SDE formulation offers a more complete theoretical framework and better exploration capabilities. In practice, the choice between formulations depends on the specific requirements of the robotic application, including real-time constraints, need for action diversity, and computational resources available. Diffusion policies excel at learning multi-modal action distributions from demonstrations, making them particularly valuable for complex manipulation and navigation tasks where multiple valid solutions exist.

References

Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., & Song, S. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. Robotics: Science and Systems.

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.

Lu, C., et al. (2022). DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35, 5775-5787.

Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. International Conference on Learning Representations.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations.
