Flow Matching and Diffusion Models 3


In our previous post, we defined the theoretical training targets for Flow and Diffusion models. However, computing these global targets directly is often mathematically intractable. In this post, we bridge the gap between theory and practice. We will explore:

  • Conditional Objectives: How to replace global targets with tractable, point-wise conditions using the Law of Total Expectation.
  • Flow Matching (FM): Breaking down the training procedure for Gaussian paths and velocity fields.
  • Score Matching (SM): Understanding how diffusion models learn to “denoise” by approximating the score function $\nabla \log p_t(x)$.
  • Practical Algorithms: Step-by-step training pseudocode for both FM and DDPM-style models.

By the end of this post, you will see how complex SDEs and ODEs collapse into a simple, elegant loss function: predicting the noise or the velocity.

Training the Generative Model

Flow Matching

Recall that the flow matching loss is defined as

\[\begin{align} \mathcal{L}_\text{FM}(\theta) &= \mathbb{E}_{t\sim \text{Unif},x\sim p_t}[\|u_t^\theta(x)-u_t^{target}(x)\|^2]\\ &= \mathbb{E}_{t\sim \text{Unif},z\sim p_{target}, x\sim p_t(\cdot\mid z)}[\|u_t^\theta(x)-u_t^{target}(x)\|^2] \end{align}\]

The second equality follows from the law of total expectation. In practice, however, the marginal target \(u_t^{target}(x)\) is hard to compute, while the conditional velocity field \(u_t^{target}(x \mid z)\) is tractable. Let us therefore define the conditional flow matching loss

\[\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t\sim \text{Unif}, z\sim p_{target}, x\sim p_t(\cdot\mid z)}[\|u_t^\theta(x)-u_t^{target}(x\mid z)\|^2]\]

Now we prove that \(\mathcal{L}_{\text{FM}} = \mathcal{L}_{\text{CFM}} + C\) for a constant \(C\) independent of \(\theta\).

Proof. \(\begin{align} \mathcal{L}_\text{FM}&= \mathbb{E}_{t\sim \text{Unif},x\sim p_t}[\|u_t^\theta(x)-u_t^{target}(x)\|^2]\\ &= \mathbb{E}_{t\sim \text{Unif},x\sim p_t}[\|u_t^\theta(x)\|^2-2u_t^\theta(x)^Tu_t^{target}(x)]+C_1\\ &= \mathbb{E}_{t\sim \text{Unif},z\sim p_{target}, x\sim p_t(\cdot\mid z)}[\|u_t^\theta(x)\|^2-2u_t^\theta(x)^Tu_t^{target}(x\mid z)]+C_1\\ &= \mathbb{E}_{t\sim \text{Unif}, z\sim p_{target}, x\sim p_t(\cdot\mid z)}[\|u_t^\theta(x)-u_t^{target}(x\mid z)\|^2]-C_2+C_1\\ &= \mathcal{L}_{\text{CFM}} + C \end{align}\) Here \(C_1 = \mathbb{E}[\|u_t^{target}(x)\|^2]\) and \(C_2 = \mathbb{E}[\|u_t^{target}(x\mid z)\|^2]\) do not depend on \(\theta\). The third equality applies the law of total expectation to the cross term, since \(u_t^{target}(x) = \mathbb{E}_{z\sim p(z\mid x)}[u_t^{target}(x\mid z)]\). \(\square\)

Once \(u_t^\theta\) is trained, we can simulate the flow model

\[\text{d}X_t = u_t^\theta(X_t)\text{d}t,\quad X_0\sim p_{init}\]
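Once trained, sampling amounts to integrating this ODE numerically, for example with the Euler method. Below is a minimal NumPy sketch: `u` is a hypothetical stand-in for the trained \(u_t^\theta\), chosen here as the exact CondOT velocity toward a point mass at \(z = 2\) so the result can be sanity-checked.

```python
import numpy as np

def euler_sample(u, x0, n_steps=200):
    """Integrate dX_t = u_t(X_t) dt from t=0 to t=1 with the Euler method.

    u: callable (t, x) -> velocity, standing in for the trained u_t^theta.
    x0: initial samples from p_init, shape (batch, d).
    """
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * u(t, x)
    return x

# Example: for the CondOT path to a point mass at z, the conditional velocity
# is u_t(x) = (z - x) / (1 - t); integrating it transports noise onto z.
z = np.array([2.0])
u = lambda t, x: (z - x) / (1.0 - t + 1e-6)  # small eps avoids division by zero at t=1
x0 = np.random.default_rng(0).normal(size=(512, 1))  # X_0 ~ p_init = N(0, I)
x1 = euler_sample(u, x0)
```

With this exactly-known velocity field, all integrated samples should land (numerically) on \(z = 2\); with a learned \(u_t^\theta\) they would instead approximate samples from \(p_{target}\).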

Flow Matching Training Procedure (here for the Gaussian CondOT path \(p_t(x\mid z) = \mathcal{N}(tz, (1-t)^2I_d)\)). Require: a dataset of samples \(z\sim p_{target}\) and a neural network \(u_t^\theta\). For each mini-batch of data do

  • Sample a data example \(z\) from the dataset
  • Sample a random time \(t\sim \text{Unif}_{[0,1]}\)
  • Sample noise \(\epsilon\sim \mathcal{N}(0, I_d)\)
  • Set \(x = tz+(1-t)\epsilon\)
  • Compute loss \(\mathcal{L}(\theta) = \|u_t^\theta(x)-(z-\epsilon)\|^2\)
  • Update model parameter \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\)
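The loop above can be sketched end-to-end in NumPy. As an illustrative assumption (not the procedure's prescription), the network is replaced by a simple linear model on the features \((x, t, 1)\), trained by plain SGD on a toy 1-D dataset; in practice \(u_t^\theta\) is a deep network.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=0.5, size=(4096, 1))  # samples z ~ p_target (toy 1-D data)

def features(x, t):
    # Feature map for the toy linear stand-in for u_t^theta.
    return np.concatenate([x, np.full_like(x, t), np.ones_like(x)], axis=1)

W = np.zeros((1, 3))  # parameters of the linear model
lr = 0.01

def cfm_loss(W, z, t, eps):
    x = t * z + (1.0 - t) * eps
    pred = features(x, t) @ W.T
    return float(np.mean((pred - (z - eps)) ** 2))

val = (data[:256], 0.5, rng.normal(size=(256, 1)))  # fixed batch to monitor progress
loss_before = cfm_loss(W, *val)

for step in range(3000):
    z = data[rng.integers(len(data), size=64)]  # sample data examples z
    t = rng.uniform()                           # sample t ~ Unif[0, 1]
    eps = rng.normal(size=z.shape)              # sample noise ~ N(0, I_d)
    x = t * z + (1.0 - t) * eps                 # CondOT path sample
    feat = features(x, t)
    pred = feat @ W.T
    grad = 2.0 * (pred - (z - eps)).T @ feat / len(z)  # MSE gradient w.r.t. W
    W -= lr * grad                              # gradient descent step
```

After training, `cfm_loss(W, *val)` drops well below `loss_before`: the model has learned (within its limited linear class) to regress the conditional velocity \(z - \epsilon\).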

Score Matching

Recall the SDE that shares the same marginal distributions \(p_t\):

\[\text{d}X_t = \left[u_t^{target}(X_t)+\frac{\sigma^2_t}{2}\nabla\log p_t(X_t)\right]\text{d}t+\sigma_t\text{d}W_t\]
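This SDE can be simulated with the Euler–Maruyama scheme. A minimal sketch, assuming given callables `u`, `score`, and `sigma` as hypothetical stand-ins for the trained networks and the noise schedule:

```python
import numpy as np

def euler_maruyama(u, score, sigma, x0, n_steps=500, seed=0):
    """Simulate dX_t = [u_t(X_t) + (sigma_t^2 / 2) * score_t(X_t)] dt + sigma_t dW_t.

    u, score: callables (t, x); sigma: callable t -> diffusion coefficient.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        drift = u(t, x) + 0.5 * sigma(t) ** 2 * score(t, x)
        x = x + drift * dt + sigma(t) * np.sqrt(dt) * rng.normal(size=x.shape)
    return x

# Sanity check: with u = 0, score(t, x) = -x (the score of N(0, I)) and constant
# sigma = sqrt(2), this is Langevin dynamics with stationary distribution N(0, I),
# so samples started at N(0, I) should keep that marginal.
rng = np.random.default_rng(1)
x0 = rng.normal(size=(2048, 1))
x1 = euler_maruyama(lambda t, x: 0.0, lambda t, x: -x, lambda t: np.sqrt(2.0), x0)
```

The sanity check exploits the fact that when the drift's flow part vanishes, the remaining score term alone leaves the target distribution invariant.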

In the same way, we define the score matching and conditional score matching losses:

\[\begin{align} \mathcal{L}_{\text{SM}} &= \mathbb{E}_{t\sim \text{Unif}, z\sim p_{target}, x\sim p_t(\cdot\mid z)}[\|s_t^\theta(x)-\nabla\log p_t(x)\|^2]\\ \mathcal{L}_{\text{CSM}} &= \mathbb{E}_{t\sim \text{Unif}, z\sim p_{target}, x\sim p_t(\cdot\mid z)}[\|s_t^\theta(x)-\nabla\log p_t(x\mid z)\|^2] \end{align}\]

Consider the Gaussian path from the last post, \(p_t(x\mid z) = \mathcal{N}(\mu_t(z), \sigma_t^2I_d)\), whose conditional score is \(\nabla\log p_t(x\mid z) = -\frac{x-\mu_t(z)}{\sigma^2_t}\). Substituting \(x = \mu_t(z)+\sigma_t\epsilon\) with \(\epsilon\sim\mathcal{N}(0,I_d)\) yields

\[\mathcal{L}_{\text{CSM}} = \mathbb{E}_{t\sim \text{Unif}, z\sim p_{target}, \epsilon\sim\mathcal{N}(0,I_d)}\left[\frac{1}{\sigma^2_t}\|\sigma_ts_t^\theta(\mu_t(z)+\sigma_t\epsilon)+\epsilon\|^2\right]\]

Notice that the score network \(s_t^\theta\) essentially learns to predict the noise that was used to corrupt a data sample \(z\), which is why this training loss is also called denoising score matching in earlier work. In Denoising Diffusion Probabilistic Models (DDPM), the constant weight \(\frac{1}{\sigma_t^2}\) is dropped and \(s_t^\theta\) is reparameterized into a noise-prediction network \(\epsilon_t^\theta: \mathbb{R}^d\times[0,1]\rightarrow\mathbb{R}^d\) via:

\[-\sigma_ts_t^\theta(x) = \epsilon_t^\theta(x) \Rightarrow \mathcal{L}_\text{DDPM}=\mathbb{E}_{t\sim\text{Unif}, z\sim p_{target},\epsilon\sim\mathcal{N}(0,I_d)}[\|\epsilon_t^\theta(\mu_t(z)+\sigma_t\epsilon)-\epsilon\|^2]\]
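This reparameterization is only a change of variables: substituting \(\epsilon_t^\theta = -\sigma_t s_t^\theta\) gives \(\sigma_t s_t^\theta + \epsilon = \epsilon - \epsilon_t^\theta\), so the conditional score matching and DDPM losses agree up to the constant weight \(1/\sigma_t^2\). A quick numerical check with arbitrary values:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_t = 0.3
s = rng.normal(size=(5, 2))    # arbitrary score-network outputs s_t^theta(x)
eps = rng.normal(size=(5, 2))  # the noise used to corrupt the sample
eps_theta = -sigma_t * s       # DDPM reparameterization

# Per-sample conditional score matching term (with its 1/sigma_t^2 weight)
csm_term = np.sum((sigma_t * s + eps) ** 2, axis=1) / sigma_t ** 2
# Per-sample DDPM term (unweighted noise-prediction error)
ddpm_term = np.sum((eps_theta - eps) ** 2, axis=1)
# csm_term equals ddpm_term / sigma_t**2 for every sample.
```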

Score Matching Training Procedure (here for a Gaussian probability path \(p_t(x\mid z) = \mathcal{N}(\mu_t(z), \sigma_t^2I_d)\)). Require: a dataset of samples \(z\sim p_{target}\) and a neural network \(s_t^\theta\). For each mini-batch of data do

  • Sample a data example \(z\) from the dataset
  • Sample a random time \(t\sim \text{Unif}_{[0,1]}\)
  • Sample noise \(\epsilon \sim \mathcal{N}(0,I_d)\)
  • Set \(x_t = \mu_t(z)+\sigma_t\epsilon\)
  • Compute loss \(\mathcal{L}(\theta) = \|s_t^\theta(x_t) + \frac{\epsilon}{\sigma_t}\|^2\)
  • Update model parameter \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\)
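As with flow matching, this loop can be sketched in NumPy. Again the score network is replaced by an illustrative linear model, and the path \(\mu_t(z) = tz\), \(\sigma_t = 1 - 0.9t\) is an assumed example, with \(\sigma_t\) bounded away from zero so the \(\epsilon/\sigma_t\) target stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=(4096, 1))  # samples z ~ p_target (toy 1-D data)

# Assumed Gaussian path: mu_t(z) = t*z, sigma_t = 1 - 0.9*t.
mu = lambda t, z: t * z
sigma = lambda t: 1.0 - 0.9 * t

def features(x, t):
    # Feature map for the toy linear stand-in for s_t^theta.
    return np.concatenate([x, np.full_like(x, t), np.ones_like(x)], axis=1)

W = np.zeros((1, 3))  # parameters of the linear model
lr = 0.002

def sm_loss(W, z, t, eps):
    x = mu(t, z) + sigma(t) * eps
    pred = features(x, t) @ W.T
    return float(np.mean((pred + eps / sigma(t)) ** 2))

val = (data[:256], 0.5, rng.normal(size=(256, 1)))  # fixed batch to monitor progress
loss_before = sm_loss(W, *val)

for step in range(3000):
    z = data[rng.integers(len(data), size=64)]  # sample data examples z
    t = rng.uniform()                           # sample t ~ Unif[0, 1]
    eps = rng.normal(size=z.shape)              # sample noise ~ N(0, I_d)
    x = mu(t, z) + sigma(t) * eps               # x_t = mu_t(z) + sigma_t * eps
    feat = features(x, t)
    pred = feat @ W.T
    grad = 2.0 * (pred + eps / sigma(t)).T @ feat / len(z)  # MSE gradient w.r.t. W
    W -= lr * grad                              # gradient descent step
```

After training, `sm_loss(W, *val)` falls below `loss_before`: even this crude model learns the negative-slope dependence on \(x\) that characterizes a Gaussian score.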