Doob’s h-Transform and Conditional Markov Chains
In this post, I want to give an intuitive introduction to Doob’s h-transform. It is often presented either in a very abstract way or in a way that is too tied to a specific application. My goal here is to first give a general and intuitive picture, and then explain how it appears in guided diffusion models. This post is not meant to be fully rigorous, and some notions are used in a slightly informal way, but I hope it gives a good starting point for understanding this construction.
A useful way to think about Doob’s h-transform is as a way of tilting the transitions of a Markov chain so that the new chain satisfies some desired property, while remaining as close as possible to the original dynamics.
1. A (Not So) Simple Example
Let us start with a simple example that, upon close inspection, proves surprisingly subtle. Consider a random walk (X_t) on \mathbb Z, and suppose we wish to construct a new process that behaves like the original chain while remaining positive forever.
Consider the birth-death chain on \mathbb Z defined by P(x, x+1) = p, \qquad P(x, x-1) = q, \qquad p + q = 1.
Let \tau = \inf\{t \ge 0 : X_t \le 0\} be the first hitting time of the non-positive integers. We would like to describe the law of the chain conditioned on the event \{\tau = \infty\}.
To do this, it is convenient to restrict attention to the positive integers so that the conditioning is well-defined. Let \mathbb P_x denote the law of the chain started from X_0 = x.
Now fix x, y \ge 1. By the Markov property, \mathbb P_x(X_1 = y, \tau = \infty) = \mathbb P_x(X_1 = y)\,\mathbb P_y(\tau = \infty). Therefore, \mathbb P_x(X_1 = y \mid \tau = \infty) = \frac{\mathbb P_x(X_1 = y, \tau = \infty)}{\mathbb P_x(\tau = \infty)}.
This suggests defining a new transition kernel for positive integers by P^h(x, y) = \frac{P(x, y)\,h(y)}{h(x)}, \qquad x, y \ge 1, where h(x) = \mathbb P_x(\tau = \infty). This new kernel is called the Doob h-transform of the original chain.
We can see that for this to be a valid transition kernel, h(x) must be strictly positive inside the domain we are considering (here \mathbb{N}) and harmonic: for every x \ge 1, h(x) = \sum_y P(x, y)\,h(y), or equivalently, h = Ph.
In the birth-death example, the harmonicity condition becomes h(x) = p\,h(x+1) + q\,h(x-1), \qquad x \ge 1, with boundary condition h(0) = 0.
Now there are three cases to consider. Let us start with the easy one: p > q. Then one obtains h(x) = 1 - \left(\frac{q}{p}\right)^x, \qquad x \ge 1.
Therefore, h(x) > 0 and the transformed chain has transition probabilities P^h(x, x+1) = p\,\frac{1 - (q/p)^{x+1}}{1 - (q/p)^x}, and P^h(x, x-1) = q\,\frac{1 - (q/p)^{x-1}}{1 - (q/p)^x}.
In particular, at x = 1, P^h(1, 0) = q\,\frac{h(0)}{h(1)} = 0, as expected: under the tilted law, the chain never leaves the positive integers.
Let us now focus on the case p \le q. In this case, one can show that \mathbb P_x(\tau = \infty) = 0 for all x. So we cannot naively define the same h-transform. However, we still want to tilt our dynamics to avoid the negative numbers.
To do this, one classical way is to condition on the event \tau \ge n, and then, if possible, take the limit as n \to \infty.
Define the process \mathbb Q_x = \mathbb P_x(\cdot \mid \tau \ge n). Using the Markov property, we have \mathbb Q_x(X_{k+1} = y \mid X_k = x) = \frac{P(x, y)\,\mathbb P_y(\tau \ge n-k-1)}{\mathbb P_x(\tau \ge n-k)} = \frac{P(x, y)\,h_{n-k-1}(y)}{h_{n-k}(x)}, where h_n(x) = \mathbb P_x(\tau \ge n).
This example highlights a slightly more general requirement on the function h the time-inhomogeneous family (h_k)_{k \ge 0} has to satisfy the recurrence relation h_{k+1} = P_k h_k, where P_k is a potentially time-inhomogeneous transition kernel.
If p = q, one can show that h_n(x) \sim C \frac{x}{\sqrt{n}}, therefore we can define a limiting transition kernel P^h(x, y) = \frac{y}{x} P(x, y). If p < q, a little more calculation gives h_n(x) \sim C x \left(\sqrt{\frac{q}{p}}\right)^x (2\sqrt{pq})^n n^{-3/2}. Interestingly, the resulting function is not harmonic for the original killed chain, but it is an eigenvector with eigenvalue 2\sqrt{pq}.
However, in both cases one obtains the same limiting kernel: P^h(x, x+1) = \frac12\left(1 + \frac1x\right), \qquad P^h(x, x-1) = \frac12\left(1 - \frac1x\right).
A nice property of this kernel is that we clearly see the repulsion effect that gets less and less prominent as x grows, until it becomes almost perfectly symmetric like the original chain. This is exactly what we would expect for the case p = q. For the case p < q, however, it is quite surprising: it seems like the downward drift has disappeared. This is because the positive conditioning is incompatible with that downward drift in the infinite time limit, so the most “likely” behavior is to exactly cancel it and stay at the edge of recurrence.
The cases p \le q where we use the limit of the conditioning on \{\tau \ge n\} is actually a very delicate topic, and we do not claim to be fully rigorous about it here. One can find more details Sawyer’ book on Martin boundaries and random walks, listed in the references.
2. Doob-h and Stopping Times
The example above is actually a special case of a very classical use of the Doob h-transform of a Markov chain: conditioning on a stopping event.
Let us consider two sets A and B in \mathcal X, and the stopping time \tau = \inf\{t \ge 0 : X_t \in A \cup B\}. We want to condition on the event \{X_\tau \in A\}, that is, on the chain hitting A before B, or equivalently on the event \{\tau_A < \tau_B\} where \tau_A = \inf\{t \ge 0 : X_t \in A\}, \qquad \tau_B = \inf\{t \ge 0 : X_t \in B\}.
Then we can define the function h(x) = \mathbb P_x(\tau_A < \tau_B). This function is harmonic on \mathcal X \setminus (A \cup B), and it is strictly positive on the states from which A can be reached before B with positive probability. Similarly to the previous example, we can define a new transition kernel on \mathcal X \setminus (A \cup B) by P^h(x, y) = \frac{P(x, y) h(y)}{h(x)}. Using exactly the same argument as above, one checks that this new kernel corresponds to the law of the original chain conditioned on hitting A before B.
However, in order to simulate from this new kernel, we need to be able to compute h. In general, this is not an easy task but here h solves a linear system.
Let I = \mathcal X \setminus (A \cup B) denote the interior states, then for x \in I, h(x) = \sum_{y \in I} P(x, y) h(y) + \sum_{y \in A} P(x, y), because the contribution of B vanishes once we impose the boundary values h = 1 on A and h = 0 on B.
Equivalently, if P_{II} denotes the interior-to-interior block of the transition kernel and P_{IA} the interior-to-A block: h_I = P_{II} h_I + P_{IA}\mathbf{1}.
This is the finite-state Dirichlet problem, and it can be solved by standard linear algebra. Uniqueness follows from the discrete maximum principle (informally, if there were two different solutions, their difference would be harmonic and would have to attain its maximum or minimum on the boundary and since it is zero on the boundary, it must be zero everywhere).
For the animation, I also impose a finite time budget and condition on the event \{\tau_A < \tau_B \text{ and } \tau_A < N\}. This makes the tilting function time-dependent. If h_n(x) denotes the probability of reaching A before B within the next n steps, then h_0 = 0, \qquad h_{n+1} = P_{II} h_n + P_{IA}\mathbf{1}, where P_{II} is the interior-to-interior kernel and P_{IA} collects transitions from the interior to the exit set A. The transformed walk at time k then uses the remaining-time function h_{N-k}, which is why the green shading in the animation, corresponding to the value of h at each square, changes over time.
3. Taking a Step Back: The General Setting
Up until now, we have derived correspondences between Doob’s h-transform and conditioning on certain events in very specific settings. Let us now formalize this in a more general setting.
Let D \subset \mathbb{R}^d be an open set, let \tau_D = \inf\{t \ge 0 : X_t \notin D\} be the first exit time of D, and let (\mathcal F_t) be the filtration of the canonical process.
Let A \subset \partial D be a subset of the boundary, and suppose we want to condition our process on the event \{X_{\tau_D} \in A\}. We define h(x) = \mathbb P_x(X_{\tau_D} \in A).
Then for any F_t \in \mathcal F_t, \mathbb P_x(F_t \mid X_{\tau_D} \in A) = \frac{\mathbb P_x(F_t, X_{\tau_D} \in A)}{\mathbb P_x(X_{\tau_D} \in A)} = \frac{\mathbb E_x[\mathbb{1}_{F_t}\mathbb{1}_{X_{\tau_D} \in A}]}{h(x)}.
By the Markov property, \mathbb E_x[\mathbb{1}_{F_t}\mathbb{1}_{X_{\tau_D} \in A}] = \mathbb E_x[\mathbb{1}_{F_t} h(X_t)]. Therefore, \mathbb P_x(F_t \mid X_{\tau_D} \in A) = \frac{\mathbb E_x[\mathbb{1}_{F_t} h(X_t)]}{h(x)}.
Taking F_t = \Omega shows that h is harmonic for the original chain.
This construction suggests the right general setup to relate the conditioned process to the original process in terms of their generators. Define the semigroup P_t f(x) = \mathbb E_x[f(X_t)]. Let h be a harmonic function, so that P_t h = h.
Now define M_t^h = \frac{h(X_t)}{h(X_0)}. Then M_t^h is a martingale with respect to (\mathcal F_t) under \mathbb P_x. Therefore, for each t we can define a new probability measure on \mathcal F_t by \left.\frac{d\mathbb Q_x}{d\mathbb P_x}\right|_{\mathcal F_t} = M_t^h.
By construction, under the conditioned process we have \mathbb E_x^h[f(X_t)] = \mathbb E_x[M_t^h f(X_t)] = \frac{P_t(hf)(x)}{h(x)}. Therefore, the transformed semigroup is P_t^h f(x) = \frac{P_t(hf)(x)}{h(x)}.
It is straightforward to verify that the transformed law is again Markov.
Let \mathcal L be the generator of the original process. Since \mathcal L h = 0, we obtain \mathcal L^h f(x) = \lim_{t \to 0} \frac{P_t^h f(x) - f(x)}{t} = \lim_{t \to 0} \frac{\frac{P_t(hf)(x)}{h(x)} - f(x)}{t} = \frac{\mathcal L(hf)(x)}{h(x)}.
4. Continuous-Time and Space Processes
Now that we introduced a few examples as well as a more general setting, let us see what happens in the continuous-time and continuous-space setting. The same ideas apply, and the results are often even more elegant.
Let (X_t)_{t \ge 0} be a diffusion on \mathbb R defined by the SDE dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t, where b is the drift, \sigma is the diffusion coefficient, and W_t is a standard Brownian motion.
The generator of this diffusion is given by \mathcal L f(x) = \sum_{i=1}^d b_i(x)\frac{\partial f}{\partial x_i}(x) + \frac12 \sum_{i,j=1}^d a_{ij}(x)\frac{\partial^2 f}{\partial x_i \partial x_j}(x), where a(x) = \sigma(x)\sigma(x)^T is the diffusion matrix.
Let h be a positive harmonic function for \mathcal L, so that \mathcal L h = 0. We saw that \mathcal L^h f(x) = \frac{\mathcal L(hf)(x)}{h(x)}. Therefore, using the product rule, we have \mathcal L(hf)(x) = h(x)\mathcal L f(x) + f(x)\mathcal L h(x) + \langle a(x)\nabla h(x), \nabla f(x) \rangle = h(x)\mathcal L f(x) + \langle a(x)\nabla h(x), \nabla f(x) \rangle.
So, \mathcal L^h f(x) = \mathcal L f(x) + \left\langle a(x)\frac{\nabla h(x)}{h(x)}, \nabla f(x) \right\rangle = \mathcal L f(x) + \langle a(x)\nabla \log h(x), \nabla f(x) \rangle.
We can absorb the gradient term into the drift to see that the transformed process is again a diffusion with the same diffusion coefficient but with a different drift b^h(x) = b(x) + a(x)\nabla \log h(x).
Bessel-Type Conditioning
Let us go back to our first example of the random walk conditioned to stay positive. The case p = q can be seen as a diffusion process with b(x) = 0 and \sigma(x) = 1. The generator is \mathcal L f(x) = \frac12 f''(x).
Therefore we need to solve the Dirichlet problem \frac12 h''(x) = 0 \quad \text{in } D, \qquad h(0) = 0.
This gives h(x) = Cx for some constant C > 0, and we can take C = 1. Therefore, \mathcal L^h f(x) = \frac{1}{h}\mathcal L(hf)(x) = \frac{1}{x}\frac12(xf''(x) + 2f'(x)) = \frac12 f''(x) + \frac{1}{x} f'(x).
This is the generator of the 3-dimensional Bessel process, which can be seen as the norm of a 3-dimensional Brownian motion.
Besides its generator, let’s explicitly compute the dynamics. Since b^h(x) = (\log x)' = \frac{1}{x}, its dynamics are dX_t = \frac{1}{X_t}\,dt + dW_t.
This is the continuous counterpart of the discrete repulsion effect we saw above.
Brownian Bridges
Now let us condition a Brownian motion to hit a certain point z at a certain time T.
This enters the same framework as before, with h(t, x) = \mathbb P(X_T \in dz \mid X_t \in dx). In this case, we know that X_T \mid X_t has a normal distribution with mean X_t and variance T - t. Therefore, the tilted process has dynamics dX_t = \nabla \log h(t, X_t)\,dt + dW_t = \frac{z - X_t}{T - t}\,dt + dW_t.
This is exactly the dynamics of a Brownian bridge.
Here, we computed h directly with probabilistic arguments, but we could also solve the PDE \partial_t h + \mathcal L h = 0 with terminal condition h(T, x) = \delta_z(x) to obtain the same result.
We can see this in discrete time as well. If we consider the Markov chain defined by X_{k+1} = X_k + \xi_k, where \xi_k are i.i.d. standard normal random variables, then the Doob h-transform of the transition kernel of this chain is given by P^h(k, x, y) = \frac{P(x, y) h(k+1, y)}{h(k, x)} = \frac{\frac{1}{\sqrt{2\pi}} e^{-\frac{(y-x)^2}{2}} \frac{1}{\sqrt{2\pi(T-k-1)}} e^{-\frac{(z-y)^2}{2(T-k-1)}}}{\frac{1}{\sqrt{2\pi(T-k)}} e^{-\frac{(z-x)^2}{2(T-k)}}}.
5. Doob-h, Soft Conditioning, and Guided Diffusion Models
Until now, we have only seen the Doob h-transform as conditioning on events. More generally, we may want to tilt the law by a soft reward rather than by a hard constraint.
To keep the notation simple, let us work on a fixed time horizon T. Consider the tilted path measure \tilde{\mathbb P} \propto e^{r(X_T)}\mathbb P, where r(x) is a reward function. The hard-conditioning case can be viewed heuristically as an extreme reward that strongly favors some terminal set over the others.
The key step is to write \frac{d\tilde{\mathbb P}}{d\mathbb P} = \frac{e^{r(X_T)}}{\mathbb E_{\mathbb P}[e^{r(X_T)}]}. Then, just like before, we define h(t, x) = \mathbb E[e^{r(X_T)} \mid X_t = x], and the Markov property gives \tilde{\mathbb P}(X_t \in dy \mid X_s \in dx) = \frac{\mathbb P(X_t \in dy \mid X_s \in dx) h(t, y)}{h(s, x)}. Again, h solves the backward PDE \partial_t h + \mathcal L h = 0 with terminal condition h(T, x) = e^{r(x)}.
In learned diffusion models, we typically only have access to an estimate of the score, s_\theta(t, x) \approx \nabla \log p_t(x). Assuming a variance-exploding forward process, the original backward SDE is dX_t = \left(b(t, X_t) + \varepsilon_t s_\theta(t, X_t)\right)dt + \sqrt{2\varepsilon_t}\,dW_t, the Doob transform adds the correction predicted by the generator formula: dX_t = \left(b(t, X_t) + \varepsilon_t s_\theta(t, X_t) + 2\varepsilon_t \nabla \log h(t, X_t)\right)dt + \sqrt{2\varepsilon_t}\,dW_t.
The real difficulty is hidden in the term \nabla \log h(t, X_t), since it requires estimating h(t, x) = \mathbb E[e^{r(X_T)} \mid X_t = x]. In practice this conditional expectation is usually approximated rather than computed exactly.
To go a little deeper: SDE, ODE and guidance
The following derivations address in more detail the connection between the Doob h-transform and the forward and backward SDE in diffusion. In particular, we will see how (surprisingly) we can recover an ODE for the guided process. We thank Yazid Janati for the fruitful discussions on this topic which led to the addition of this paragraph.
Let’s consider the general backward time SDE dX_t = \left(b(t, X_t) + \varepsilon_t \nabla \log p_t(X_t)\right)dt + \sqrt{2\varepsilon_t}\,dW_t, where b is a drift term and \varepsilon_t is a time-dependent noise scale. Given the discussion above, the Doob h-transform of this SDE is given by dX_t = \left(b(t, X_t) + \varepsilon_t \nabla \log p_t(X_t) + 2\varepsilon_t \nabla \log h(t, X_t)\right)dt + \sqrt{2\varepsilon_t}\,dW_t.
Suppose that we want to recover an ODE for the guided process. We can write the Fokker-Planck equation for the density p_t^h of the guided process as \partial_t p_t^h = -\nabla \cdot \left(p_t^h \left(b + \varepsilon_t \nabla \log p_t + 2\varepsilon_t \nabla \log h\right)\right) + \varepsilon_t \Delta p_t^h. Using the identity \Delta p = \nabla \cdot (p \nabla \log p), this becomes \partial_t p_t^h = -\nabla \cdot \left(p_t^h \left(b + \varepsilon_t \nabla \log p_t + 2\varepsilon_t \nabla \log h\right)\right) + \varepsilon_t \nabla \cdot \left(p_t^h \nabla \log p_t^h\right).
Now p_t^h \propto h(t, \cdot)\,p_t, so \nabla \log p_t^h = \nabla \log h + \nabla \log p_t. Substituting this identity into the previous equation yields \partial_t p_t^h = -\nabla \cdot \left(p_t^h \left(b + \varepsilon_t \nabla \log h\right)\right). Therefore, the probability-flow ODE associated with the guided SDE is dX_t = \left(b(t, X_t) + \varepsilon_t \nabla \log h(t, X_t)\right)dt.
It is worth noting that for each \varepsilon_t > 0, the h function is different since it is defined as h(t, x) = \mathbb E[e^{r(X_T)} \mid X_t = x] where the expectation is taken with respect to the law of the SDE with noise scale \varepsilon_t. For practical reasons, we might want to use the h function that corresponds to the “canonical” SDE, i.e. the one that makes the forward process an Ornstein-Uhlenbeck process. One reason is that the denoiser is trained using this joint law so we are able to estimate E(X_T \mid X_t = x) which can be used tu approximate h (as in Chung et al., 2023).
One thing that is a little surprising is that it looks like to get the ODE, we have to go through the SDE and then go back to an ODE, because taking the ODE first, i.e. taking \varepsilon_t \to 0 in the original SDE does not allow any guidance.
However, in the case of the canonical SDE, we can also directly find the drift of the guided ODE by writing \nabla \log p_t^h(x_t) = \nabla \log \int q_{t|0}(x_t|x_0) p_0(x_0) h_0(x_0) dx_0, where q_{t|0} is the transition kernel of the original SDE. So \nabla \log p_t^h(x_t) = \nabla \log \int p_{0|t}(x_0|x_t) h_0(x_0) p_t(x_t) dx_0 = \nabla \log p_t(x_t) + \nabla \log \int p_{0|t}(x_0|x_t) h_0(x_0) dx_0 = \nabla \log p_t(x_t) + \nabla \log h(t, x_t). Which is the same as the one we find for the canonical SDE.
6. Conclusion: a general recipe for Doob’s h-transform
Let me end with the recipe that keeps showing up in all the examples above.
- Start from a Markov process whose dynamics you understand.
- Decide what event, endpoint, or reward you want to impose.
- Compute the corresponding tilting function h.
- Tilt the semigroup or generator by h.
- Read off the new dynamics.
The only real difference from one example to the next is the equation satisfied by h. There are three main cases.
1. Harmonic case: \mathcal L h = 0
This is the classical time-homogeneous Doob transform. It appears when we condition on the outcome of a stopping event, for instance hitting one part of the boundary before another one.
In that case, h is a positive harmonic function for the killed process, and the transformed generator is \mathcal L^h f = \frac{\mathcal L(hf)}{h}.
This is the setting of conditioning a process to avoid a set, hit one target before another, or more generally satisfy a time-independent boundary condition.
2. Space-time harmonic case: \partial_t h + \mathcal L h = 0
This is the time-dependent version. It appears when the conditioning is imposed at a fixed terminal time, or more generally when the tilt depends explicitly on time.
Typical examples are Brownian bridges or terminal-time soft constraints of the form h(t,x)=\mathbb E[g(X_T)\mid X_t=x].
Here h is no longer harmonic in space alone, but space-time harmonic, and the transform becomes time-inhomogeneous.
3. Feynman-Kac case
More generally, we may tilt the law not only by a terminal payoff, but also by a running weight accumulated along the path. Then h is no longer characterized just by harmonicity, but by a backward equation of Feynman-Kac type.
For example, if h(t,x) = \mathbb E\left[ \exp\left(\int_t^\tau V(s,X_s)\,ds\right) g(X_\tau) \,\middle|\, X_t=x \right], then h typically solves an equation of the form \partial_t h + \mathcal L h + Vh = 0, with the appropriate terminal or boundary condition.
So the general picture is always the same: identify the right linear equation for h, solve it if possible, and then use it to tilt the dynamics.
What makes Doob’s h-transform so useful is that it turns conditioning problems into linear ones. The probabilistic statement may look complicated, but once the right h-function is identified, the transformed process often becomes completely explicit.
7. Bonus: Don’t Do Probability, Solve PDEs
An interesting aspect of Doob’s h-transform, which we somewhat brushed over in Section 2, is that we often reduce a hard probabilistic expectation to a linear algebra or PDE problem.
This is what we saw in the example of stopping times. We use harmonicity and boundary conditions to solve for h(x) = \mathbb P_x(\tau_A < \tau_B) through a linear system.
More generally:
- We can compute the expected exit time u(x) = \mathbb E_x[\tau_D] by solving \mathcal L u = -1 with boundary condition u = 0 on \partial D.
- We can compute the expected payoff u(x) = \mathbb E_x[g(X_{\tau_D})] by solving \mathcal L u = 0 with boundary condition u = g on \partial D.
- We can compute the running cost plus terminal cost u(x) = \mathbb E_x\left[\int_0^{\tau_D} f(X_t)\,dt + g(X_{\tau_D})\right] by solving \mathcal L u = -f with boundary condition u = g on \partial D.
Even though the resulting PDEs can still be difficult and often have no closed-form solution this is a very powerful tool, as it transforms hard probabilistic problems into more analytic ones, which can be tackled with a completely different set of tools. In particular, it allows us to leverage the rich theory of PDEs to understand the behavior of conditioned processes and to compute quantities of interest that would be otherwise intractable.
References
- Doob, J. L. (1957). Conditional Brownian motion and the boundary limits of harmonic functions. Bulletin de la Société Mathématique de France, 85, 431-458.
- Doob’s h-transform: theory and examples. Note from Alex Bloemendal
- Markov Chains and Mixing Times. Levin, Peres, Wilmer (2009).
- Martin boundaries and random walks, Stanley A Sawyer, Contemporary Mathematics,