Introduction To Probabilistic Programming
1.1 Model-based Reasoning
A model is an artificial construct designed to respond in the same way as the system we would like to understand. As computers have evolved, numerical models have come to the forefront and computer simulations have replaced physical models. Numerical models emulate stochasticity, i.e. they use pseudorandom number generators to simulate genuinely random phenomena and other uncertainties. Running a simulator with stochastic value generation leads to an explosion of possible simulation outcomes. Effective stochastic modeling means writing a program that can produce any of these outcomes, each corresponding to a particular setting of the random values. Stochastic numerical simulation aims to computationally encompass the complete distribution of possible outcomes.
When we write “model”, we will generally mean a stochastic simulator and the measurable values it produces. However, this is not the only notion of model that one can adopt.
Models produce values for things we can measure in the real world; we call such measured values observations. What counts as an observation is specific to the model, experiment, and query. Generally one does not observe every detail produced by a model, and sometimes one simply cannot.
Models can be used in various ways. One way is to use them to falsify theories. For this, one needs to encode the theory as a model and then simulate from it many times – if the population distribution of observations generated by the model is not in agreement with observations generated by the real world process, then there is evidence that the theory can be falsified. Models can also be used to make decisions.
All model types have parameters. Fitting these parameters, when few, can sometimes be performed manually, by intensive theory-based reasoning and a priori experimentation, by measuring conditional subcomponents of a simulator, or by simply fiddling with parameters to see which values produce the most realistic outputs.
Automated model fitting describes the process of using algorithms to determine either point or distributional estimates for model parameters and structure.
1.1.1 Model Denotation
To elaborate on what is meant by model denotation, we first look at a simple statistical model and see how it is denoted. Statistical models are typically denoted mathematically, subsequently manipulated algebraically, and then “solved” computationally. By “solved”, we mean that an inference problem, involving conditioning on the values of a subset of the variables in the model, is answered.
A simple model one could write down is a beta-Bernoulli model for generating a coin flip from a potentially biased coin. Such a model is typically denoted:
x ~ Beta(α, β)
y ~ Bernoulli(x)
where α and β are parameters, x is a latent variable (the bias of the coin), and y is the value of the flipped coin.
We can ascribe a meaning to the symbol ~ and the keywords Beta and Bernoulli.
For example, Beta(a, b) means that given the value of arguments a and b, we can construct what is effectively an object with two methods.
- The first method is a probability density (or distribution) function that computes p(x | a, b) = x^(a−1) (1 − x)^(b−1) / B(a, b), where B(a, b) is the beta function.
- The second method is something which can draw exact samples from the said distribution.
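To make this concrete, here is a minimal Python sketch (hypothetical, not tied to any particular library) of such a two-method object for the beta distribution:

import math
import random

class Beta:
    """Minimal sketch of a 'distribution object' for Beta(a, b)."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def pdf(self, x):
        # Beta density: x^(a-1) (1-x)^(b-1) / B(a, b),
        # with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b).
        B = math.gamma(self.a) * math.gamma(self.b) / math.gamma(self.a + self.b)
        return x ** (self.a - 1) * (1 - x) ** (self.b - 1) / B

    def sample(self):
        # Draw an exact sample from Beta(a, b).
        return random.betavariate(self.a, self.b)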
We can also intuit not only that some variables in a model are to be observed (here, for instance, y), but also that there is an inference objective, here for instance to characterize p(x|y).
We will generally focus on conditioning as the goal, namely the characterisation of some conditional distribution given a specification of a model in the form of a joint distribution. This involves the extensive use of Bayes rule:

p(X | Y) = p(Y | X) p(X) / p(Y)
Bayes rule tells us how to derive a conditional probability from a joint probability. Conditioning tells us how to rationally update our beliefs. Updating our beliefs is what learning and inference are all about.
The constituents of Bayes rule are:
- p(Y|X) - the likelihood
- p(X) - the prior
- p(Y) - the marginal likelihood (or evidence)
- p(X|Y) - the posterior
For our purposes, a model is the joint distribution p(Y, X) = p(Y|X)p(X) of the observations Y and the random choices made in the generative model X, known as the latent variables.
An important thing to understand is that conditioning a joint distribution, i.e. the fundamental Bayesian update, describes a huge number of problems succinctly.
We can build some intuition about the power of both programming languages for model denotation and automated conditioning by considering X, Y pairs for which denoting the joint distribution p(X, Y) is realistically only doable in a probabilistic programming language, and for which the posterior distribution p(X|Y) is of interest.
Consider the pair “scene description” and “image”. Visualising what p(X, Y) is, is difficult. However, thinking about p(X) as a distribution over possible scene graphs is not too hard – for example, as writing a simulator that stochastically generates reasonably plausible scene graphs. Given that p(X, Y) = p(Y|X)p(X), all we need in order to get the joint distribution is a way to get from a scene graph to an observable image. There are many kinds of renderers that do this – although they are generally deterministic, this is fine when specifying a joint distribution, because they map from a latent scene description to an observable pixel space, and with the addition of some image-level pixel noise they form a perfectly valid likelihood.
An example of this is considering the image Y to be a Captcha image and the scene description X to include the obscured string. Let us consider alternative ways to solve a Captcha problem. A non-probabilistic programming approach would require gathering a very large number of Captchas, hand-labelling them all, and then designing and training a neural network to regress from the image to a text string. The probabilistic programming approach in contrast, merely requires one to write a program that generates Captchas that are stylistically similar to the type of Captchas one would like to break, i.e. writing a model of Captchas, in a probabilistic programming language. Conditioning such a model on its observable output, the Captcha image, will yield a posterior distribution over text strings. This kind of conditioning is what probabilistic programming evaluators do.
The figure below shows a representation of the output of such a conditioning computation:
Each Captcha and bar-plot pair consists of a held-out Captcha image and a truncated marginal posterior distribution over unique string interpretations (marginal distribution meaning it gives the probabilities of various values of a subset of variables without reference to the values of the other variables). The middle of the bottom row expresses that the noise on the Captcha makes it difficult to distinguish whether the string is “aG8BPY” or “aG8RPY” – the posterior distribution P(X|Y) arrived at by conditioning reflects this uncertainty.
This simple example aims to liberate the idea of what a model is (a joint distribution produced by adding stochastic choice to normal computer programs) and what the output of a conditioning computation can be like. What probabilistic programming languages do is to allow denotation of any such model. This tutorial covers how to develop inference algorithms that allow computational characterization of the posterior distribution of interest.
1.1.2 Conditioning
We aim to demonstrate what the mathematical operations involved in conditioning are like, and why the problem of conditioning is generally hard. Consider the example of coin-flips and let us write out the joint probability density for the distribution of X and Y. Assume that the symbol Y denotes the observed outcome of the coin flip, where heads is encoded as 1 and tails is encoded as 0. We denote the bias of the coin, i.e. the probability it comes up heads, using the symbol x, and encode it as a real number between 0 and 1.
Then using standard definitions for the distributions indicated by the joint denotation in:
x ~ Beta(α, β)
y ~ Bernoulli(x)
we can write:

p(x, y) = p(y | x) p(x) = x^y (1 − x)^(1−y) · x^(α−1) (1 − x)^(β−1) / B(α, β)

and use rules of algebra to simplify this expression to:

p(x, y) = x^(y+α−1) (1 − x)^(β−y) / B(α, β)
Our implicit objective here is not to compute the value of the joint probability of some variables, but to do conditioning instead, for example to compute p(x | y = 1). Using Bayes rule, this is theoretically easy to do. It is just:

p(x | y = 1) = p(x, y = 1) / ∫ p(x, y = 1) dx
In this special case, the rules of algebra and semantics-preserving transformations of integrals can be used to algebraically solve for an analytic form of this posterior distribution.
However, the integral in the denominator is the complicating crux of Bayesian inference. This integral is in general intractable as it involves integrating over the entire space of latent variables. Consider the Captcha example: simply summing over the latent character sequence itself would require an exponential-time operation.
This specific example distribution of coin-flipping has a special property called conjugacy, which means that this integral can be performed by inspection, by identifying that the integrand is the same as the non-constant part of the beta distribution.
…
The result of the conditioning operation is a distribution p(x|y) parameterized by the observed or given quantity. Unfortunately, this distribution will in general not have an analytic form, because we usually won’t be so lucky that the normalizing integral has an algebraic analytic solution, nor that it will be easily calculable.
However, remember that the ~ operator is overloaded to mean two things:
- Evaluation of a probability density (or distribution) function.
- A means of drawing exact samples from the said distribution.
Neither of these are possible in general. The latter can be approximated, often without being able to do the former. For this reason, our focus will be on sampling-based characterisations of conditional distributions in general.
1.1.3 Query
Having a handle on the resulting posterior distribution or a method for drawing samples from it, allows us to ask queries in general. These are best expressed in integral form as well. For instance, we could ask “what is the probability that the bias of the coin is greater than 0.7, given that the coin came up heads?”.
This is mathematically denoted as:

p(x > 0.7 | y = 1) = ∫ I(x > 0.7) p(x | y = 1) dx

where I(·) is an indicator function which evaluates to 1 when its argument is true and 0 otherwise. In this instance, the query can be computed directly using the cumulative distribution function of the beta distribution.
Fortunately, we can still answer queries when we only have the ability to sample from the posterior distribution, owing to the Markov strong law of large numbers, which states that for general distributions p and functions f:

(1/L) Σ_{l=1}^{L} f(X^l) → ∫ f(X) p(X) dX as L → ∞, where X^l ~ p(X)
There are two things to note here:
- The distribution on the RHS is approximated by a set of L samples on the LHS.
- Different functions f can be evaluated at the same sample points chosen to represent the probability distribution p after the samples have been generated.
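As a hedged illustration, the following Python sketch estimates the query above by sampling. It assumes the conjugate posterior Beta(α + 1, β) for a single observed head (a standard conjugacy result, discussed above); the prior parameters here are arbitrary:

import random

alpha, beta = 2.0, 2.0  # hypothetical prior parameters
L = 100_000

# Observing heads (y = 1) under a Beta prior yields the conjugate
# posterior Beta(alpha + 1, beta); sample from it directly.
samples = [random.betavariate(alpha + 1, beta) for _ in range(L)]

# Monte Carlo estimate of E[I(x > 0.7) | y = 1] = p(x > 0.7 | y = 1).
estimate = sum(1.0 for x in samples if x > 0.7) / L
print(estimate)  # approx 0.35 for this prior; checkable against the beta CDF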
1.2 Probabilistic Programming
One view of probabilistic programming is that it is about automating Bayesian inference. In this view, probabilistic programming concerns 1) the development of syntax and semantics for languages that denote conditional inference problems, and 2) the development of corresponding evaluators or solvers that computationally characterize the denoted conditional distribution.
Computer science has largely been about finding ways to efficiently evaluate programs, given parameter values, to produce some output.
The typical computer science programming pipeline is to write a program, specify the values of its arguments, then evaluate the program to produce an output. The typical statistical modeling approach is to start with the output (the observations or data Y), then specify a usually abstract generative model p(X, Y), and finally use algebra and inference techniques to characterize the posterior distribution p(X|Y) of the unknown quantities in the model given the observed quantities. Probabilistic programming is about performing Bayesian inference using the tools of computer science: programming languages for model denotation, and statistical inference algorithms for computing the conditional distribution of program inputs that could have given rise to an observed program output.
For instance, reasoning about the bias of a coin is an example of the kind of inference that probabilistic programming systems can do. Our data is the outcome of a coin flip. Our model, specified in the forward direction, stipulates how a coin and its bias are generated; the coin flip outcome is then observed and analysed under this model.
- One challenge, the writing of the model, is a major focus of statistics research where useful models are painstakingly designed for every new important problem.
- The other challenge is computational, and is what Bayes rule gives us a theoretical framework in which to calculate: to computationally characterize the posterior distribution of the latent quantities (bias of a coin) given the observed quantity (heads or tails)
When performing inference in probabilistic programming systems, what counts as observable are the outputs generated by the forward computation. The inference objective is to computationally characterize the posterior distribution of all of the latent quantities (the random choices made during the forward execution of the program), given that the program produces a particular output.
Probabilistic programs are ordinary functional or imperative programs with two added constructs:
- The ability to draw values at random from distributions.
- The ability to condition values of variables in a program via observations.
The meaning of a probabilistic program is that it simultaneously denotes a joint and conditional distribution, the latter by syntactically indicating where conditioning will occur (what we condition on – i.e. which random variable values will be observed).
Almost all languages have pseudo-random value generators or packages; what they lack in comparison to probabilistic programming languages is syntactic constructs for conditioning and evaluators that implement conditioning. Languages that include these constructs are called probabilistic programming languages. Languages that do not include these constructs, but are used for forward modeling, are called stochastic simulation languages, or more simply, programming languages.
There are many libraries for constructing graphical models and performing inference (graphical models are probabilistic models for which a graph expresses the conditional dependence structure between random variables); this software works by programmatically constructing a data structure which represents a model, and then given observations, runs graphical model inference. What distinguishes between this kind of approach and probabilistic programming is that a program is used to explicitly construct a model as a data structure, rather than considering the “model” that arises implicitly from direct evaluation of the program itself.
In probabilistic programming systems, a model data structure is either:
- Constructed explicitly via a non-standard interpretation of the probabilistic program itself (in which case inference is performed by compiling the model data structure to a probability density function).
- A general Markov model whose state is the evolving evaluation environment generated by the probabilistic programming language evaluator (in which case inference is performed by employing methods that are fundamentally generative).
(On the difference between a density function and a distribution: the distribution is the shape, the density is the function. If you have the density you have the distribution, but the density may not have an analytical form, i.e. it may not be a computable function, so you may want to approximate the distribution. The density refers to the mathematical function, whereas the distribution refers to the shape; in most contexts one can be replaced with the other and things will still make sense.)
2. A Probabilistic Programming Language Without Recursion
We present the key ideas of probabilistic programming using a first-order probabilistic programming language (FOPPL).
The restrictions that we impose are that:
- Functions must be first-order, i.e. functions cannot take other functions as arguments.
- Functions cannot be recursive.
These two restrictions result in a language where models describe distributions over a finite number of random variables.
We can compile any program in the FOPPL to a data structure that represents the corresponding graphical model. This is a very useful property when reasoning about inference, as it allows us to make use of existing theory and algorithms for inference in graphical models. It also means the computation graph of any FOPPL program can be completely determined in advance: although conditional branching might dictate that not all nodes of the computation graph are active, in the sense of being on the control-flow path, every FOPPL program can be unrolled to a computation graph in which all possible control-flow paths are explicitly and completely enumerated at compile time. FOPPL programs hence have static computation graphs.
2.1 Syntax
We define the FOPPL in terms of two sets of production rules: one for expressions e
and another for programs q
.
v ::= variable
c ::= constant value or primitive operation
f ::= procedure
e ::= c | v | (let [v e1] e2) | (if e1 e2 e3)
| (f e1 ... en) | (c e1 ... en)
| (sample e) | (observe e1 e2)
q ::= e | (defn f [v1 ... vn] e) q
The two forms of expressions, sample and observe, are what make the FOPPL a probabilistic programming language.

- A sample form (sample e) represents an unobserved random variable, i.e. a latent variable. We call these latent variables because they are typically variables we sample from a prior distribution to represent some parameter we do not know much about yet. The sample statements hence define a prior p(X) on the random variables. The form accepts a single expression e (which must evaluate to a “distribution object”) and returns a value that is a sample from this distribution. Distributions are constructed using primitives provided by the FOPPL – for example, (normal 0.0 1.0) evaluates to a standard normal distribution.

- An observe form (observe e1 e2) represents an observed random variable. These are observed because we typically use a latent variable to parameterise a distribution, from which we can then directly observe a value. The observe statements hence define a likelihood p(Y|X), the probability of the observed variable Y conditioned on the latent variable X. The form accepts an argument e1, which must evaluate to a distribution, and conditions the distribution on the next argument e2, which is the value of the random variable.
To elaborate on how sample represents a latent variable and observe represents an observed variable, consider the following:

x ~ Beta(α, β)
y ~ Bernoulli(x)

Here, x is a latent variable that we sample from a prior distribution, for example the bias of a coin. Then, y is a variable that we directly observe from a distribution parameterised by our latent variable, for example the value of a flipped coin. We can't observe the bias of the coin, but we can sample an initial guess for it as a latent variable. Using this, we can directly observe values produced from a distribution parameterised by the latent variable, in order to indirectly infer the latent variable.
Let’s now illustrate what a program in the FOPPL looks like. The program below shows a simple univariate Bayesian linear regression model.
(defn observe-data [slope intercept x y]
  (let [fx (+ (* slope x) intercept)]
    (observe (normal fx 1.0) y)))

(let [slope (sample (normal 0.0 10.0))]
  (let [intercept (sample (normal 0.0 10.0))]
    (let [y1 (observe-data slope intercept 1.0 2.1)]
      (let [y2 (observe-data slope intercept 2.0 3.9)]
        (let [y3 (observe-data slope intercept 3.0 5.3)]
          (let [y4 (observe-data slope intercept 4.0 7.7)]
            (let [y5 (observe-data slope intercept 5.0 10.2)]
              [slope intercept])))))))
The function observe-data conditions the generative model given a pair (x, y), by observing the value y from a normal distribution with mean slope · x + intercept. The function returns the observed value, which is ignored in our case: we are not interested in the returned value; we are using observe-data to score/condition our distribution, which is a side-effect of the function call.

The program defines a prior on the slope and intercept, using the primitive normal to create a normal distribution object. After conditioning this prior with 5 observed data points, the program returns a pair [slope intercept], which is a sample from the posterior distribution conditioned on the 5 observed values.
2.2 Syntactic Sugar
foreach

The foreach construct has the following syntax:
(foreach c
[v_1 e_1 ... v_n e_n]
e_1' ... e_k'
)
This desugars into a vector containing c let constructs:
(vector
(let [ v_1 (get e_1 0)
...
v_n (get e_n 0)
]
e_1' ... e_k'
)
...
(let [ v_1 (get e_1 (c - 1))
...
v_n (get e_n (c - 1))
]
e_1' ... e_k'
)
)
The foreach form associates each variable v_i with a sequence e_i, and then maps over the values in these sequences for a total of c steps, returning a vector of results. If the length of any of the bound sequences is less than c, this results in a runtime error. A sketch of these semantics is given below.
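A minimal Python sketch of these semantics (the function name and calling convention are illustrative, not part of the FOPPL):

def foreach(c, seqs, body):
    """Sketch of foreach semantics: bind the i-th element of each
    sequence and evaluate the body, collecting c results."""
    for s in seqs:
        if len(s) < c:
            raise RuntimeError("bound sequence shorter than c")
    return [body(*(s[i] for s in seqs)) for i in range(c)]

# e.g. foreach(3, [[1, 2, 3], [10, 20, 30]], lambda x, y: x + y) == [11, 22, 33]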
Using the foreach form, we can write our linear regression model without having to make use of the function observe-data.
Old code:
(defn observe-data [slope intercept x y]
  (let [fx (+ (* slope x) intercept)]
    (observe (normal fx 1.0) y)))

(let [slope (sample (normal 0.0 10.0))]
  (let [intercept (sample (normal 0.0 10.0))]
    (let [y1 (observe-data slope intercept 1.0 2.1)]
      (let [y2 (observe-data slope intercept 2.0 3.9)]
        (let [y3 (observe-data slope intercept 3.0 5.3)]
          (let [y4 (observe-data slope intercept 4.0 7.7)]
            (let [y5 (observe-data slope intercept 5.0 10.2)]
              [slope intercept])))))))
New code:
(let [y-values [2.1 3.9 5.3 7.7 10.2]
      slope (sample (normal 0.0 10.0))
      intercept (sample (normal 0.0 10.0))]
  (foreach 5
    [x (range 1 6)
     y y-values]
    (let [fx (+ (* slope x) intercept)]
      (observe (normal fx 1.0) y))))
The reason we use a constant c
for the number of loop iterations is so that we can guarantee that the number of iterations is known at compile time.
loop
The loop
construct has the following syntax:
(loop c e f e_1 ... e_n)
Here, c must be a non-negative integer constant and f a function. Desugaring this yields the following, where v_0, ..., v_(c-1) and a_1, ..., a_n are fresh variables:
(let [ a_1 e_1
a_2 e_2
...
a_n e_n
]
(let [ v_0 (f 0 e a_1 ... a_n) ]
(let [ v_1 (f 1 v_0 a_1 ... a_n) ]
(let [ v_2 (f 2 v_1 a_1 ... a_n)]
...
(let [ v_(c-1) (f (c-1) v_(c-2) a_1 ... a_n) ]
v_(c-1)
)
)
)
)
)
Equivalently, in Haskell-style let notation:

let a_1 = e_1
a_2 = e_2
...
a_n = e_n
in (let v_0 = (f 0 e a_1 ... a_n)
in (let v_1 = (f 1 v_0 a_1 ... a_n)
in (let v_2 = (f 2 v_1 a_1 ... a_n)
in ...
(let v_(c-1) = (f (c-1) v_(c-2) a_1 ... a_n)
in v_(c-1)
)
)
)
)
)
We can see that the a's are bound to each of the expressions e_1, ..., e_n, all of which are passed to the function f in each loop iteration. The variables v are bound to the return value of f and then passed as the next value of the parameter e of f; they hence represent recursively accumulated computations. A sketch of these semantics follows.
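The desugaring above behaves like a fold. A minimal Python sketch of the intended semantics (names are illustrative):

def loop(c, e, f, *args):
    """Sketch of loop semantics: thread an accumulator through c calls
    to f, passing the iteration index and the extra arguments each time."""
    acc = e
    for i in range(c):
        acc = f(i, acc, *args)
    return acc

# e.g. summing squares of 0..4:
# loop(5, 0.0, lambda i, acc: acc + i * i)  ->  30.0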
To demonstrate loop, we show a new variant of the linear regression example:
(defn regr-step [n r2 xs ys slope intercept]
  (let [x (get xs n)
        y (get ys n)
        fx (+ (* slope x) intercept)
        r (- y fx)]
    (observe (normal fx 1.0) y)
    (+ r2 (* r r))))
(let [xs [1.0 2.0 3.0 4.0 5.0]
      ys [2.1 3.9 5.3 7.7 10.2]
      slope (sample (normal 0.0 10.0))
      bias (sample (normal 0.0 10.0))
      r2 (loop 5 0.0 regr-step xs ys slope bias)]
  [slope bias r2])
The same program in Haskell-style notation:

(defn regr-step [n r2 xs ys slope intercept]
  (do let x = get xs n
          y = get ys n
          fx = (slope * x) + intercept
          r = y - fx
      observe (normal fx 1.0) y
      return $ r2 + (r * r)))

(let xs = [1.0 2.0 3.0 4.0 5.0]
     ys = [2.1 3.9 5.3 7.7 10.2]
     slope = sample (normal 0.0 10.0)
     bias = sample (normal 0.0 10.0)
     r2 = loop 5 0.0 regr-step xs ys slope bias
 in [slope bias r2])
In this version of the program, we not only observe a sequence of values y_n according to the normal distribution centered around f(x_n), but also compute the sum of squared residuals r² = Σ_n (y_n − f(x_n))². To do this, we define a function regr-step which accepts an argument n (the index of the loop iteration) and an argument r2 which represents the sum of squares for the preceding data points. The value of the entire loop is the value of the final call to regr-step, which is the sum of all the squared residuals.
The difference between loop and foreach is that:

- loop can be used to accumulate a result over the course of iterations.
- foreach provides a much more specific loop type that evaluates a single expression repeatedly with different values for its variables.

From a statistical perspective, we can think of them as:

- loop defines a sequence of dependent variables.
- foreach creates conditionally independent variables.
2.3 Examples
Likelihood: the likelihood function is a model of the data – it can be any stochastic algorithm, even just sampling from a distribution. Remember that the data is fixed; the model parameters are the parameters of the likelihood function.
2.3.1 Gaussian Mixture Model
A Gaussian mixture model is a density estimation model often used for clustering, in which each data point y_n is assigned to a latent class z_n. We will consider the following generative model:

σ_k ~ Gamma(1.0, 1.0) for k = 1, 2, 3
µ_k ~ Normal(0.0, 10.0) for k = 1, 2, 3
π ~ Dirichlet(1.0, 1.0, 1.0)
z_n ~ Discrete(π) for n = 1, ..., 7
y_n | z_n ~ Normal(µ_{z_n}, σ_{z_n})
The following program shows a translation of this generative model to the FOPPL:
(I’m assuming discrete takes a list of n probabilities which sum to 1, and assigns them correspondingly to the values [0 .. n-1].)
(let [ data [1.1 2.1 2.0 1.9 0.0 -0.1 -0.05]
likelihoods (foreach 3 []
(let [ mu (sample (normal 0.0 10.0))
sigma (sample (gamma 1.0 1.0))
]
(normal mu sigma)
)
)
pi (sample (dirichlet [1.0 1.0 1.0]))
z-prior (discrete pi)
]
(foreach 7 [y data]
(let [z (sample z-prior)]
(observe (get likelihoods z) y)
z
)
)
)
The same program in Haskell-style notation:

(let data = [1.1 2.1 2.0 1.9 0.0 -0.1 -0.05]
     likelihoods = (foreach 3 []
         (let mu = sample (normal 0.0 10.0)
              sigma = sample (gamma 1.0 1.0)
          in normal mu sigma))
     pi = sample (dirichlet [1.0 1.0 1.0])
     z-prior = discrete pi
 in (foreach 7 [y in data]
      (do let z = sample z-prior
          observe (get likelihoods z) y
          return z)))
In this model, we first sample the mean µ and standard deviation σ for 3 mixture components, giving us a list of 3 normal distributions.
Sampling from a Dirichlet distribution of 3 dimensions (1.0, 1.0, 1.0) will return a sample such as [0.26592092 0.23521457 0.49886451]. Using this as the parameter of a discrete distribution gives a distribution where the integers [0 1 2] are weighted respectively by [0.26592092 0.23521457 0.49886451]; sampling from this discrete distribution will hence return an index from [0 1 2] with these probabilities.
For each observation y, we then sample a class assignment label z from the prior distribution. The value of this sample will be in [0 1 2] and is used as an index to get one of the likelihood distributions from likelihoods. After this, we observe y according to the likelihood distribution likelihoods[z]. Finally, we return the class assignment index z. The return value of the entire program is the sequence of latent class assignments for each of the 7 data points. We can use this result to ask questions such as “are these two data points similar?”.
2.3.2 Hidden Markov Model
The following program denotes a hidden Markov model (HMM) with known initial-state, transition, and observation distributions, governing 16 sequential observations.
(defn hmm-step [t states data trans-dists likelihoods]
  (let [z (sample (get trans-dists (last states)))]
    (observe (get likelihoods z) (get data t))
    (append states z)))

(let [data [0.9 0.8 0.7 0.0 -0.025 -5.0 -2.0 -0.1
            0.0 0.13 0.45 6 0.2 0.3 -1 -1]
      trans-dists [(discrete [0.10 0.50 0.40])
                   (discrete [0.20 0.20 0.60])
                   (discrete [0.15 0.15 0.70])]
      likelihoods [(normal -1.0 1.0)
                   (normal 1.0 1.0)
                   (normal 0.0 1.0)]
      states [(sample (discrete [0.33 0.33 0.34]))]]
  (loop 16 states hmm-step data trans-dists likelihoods))
The same program in Haskell-style notation:

(defn hmm-step [t states data trans-dists likelihoods]
  (do let z = sample (get trans-dists (last states))
      observe (get likelihoods z) (get data t)
      append states z))

(let data = [0.9 0.8 0.7 0.0 -0.025 -5.0 -2.0 -0.1
             0.0 0.13 0.45 6 0.2 0.3 -1 -1]
     trans-dists = [(discrete [0.10 0.50 0.40]),
                    (discrete [0.20 0.20 0.60]),
                    (discrete [0.15 0.15 0.70])]
     likelihoods = [(normal -1.0 1.0),
                    (normal 1.0 1.0),
                    (normal 0.0 1.0)]
     states = [(sample (discrete [0.33 0.33 0.34]))]
 in loop 16 states hmm-step data trans-dists likelihoods)
Here:

- data is a vector of data points.
- trans-dists is a vector of transition distributions.
- likelihoods is a vector of state likelihoods.

We then loop over the data using the function hmm-step, which at every loop iteration does three things:

- It first samples a new state z from the transition distribution associated with the preceding state (or, in the first iteration, the initial state).
- It then observes the current data point at time t according to the likelihood component of the current state.
- Finally, it appends the state z to the sequence states.
The vector of accumulated latent states is the return value of the program, and thus is the object whose joint posterior distribution is of interest.
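Again for intuition only, a hedged numpy sketch that forward-simulates the HMM (inference would instead condition on the fixed data vector):

import numpy as np

rng = np.random.default_rng(0)
T = 16

# Known model components, mirroring the program above.
trans = np.array([[0.10, 0.50, 0.40],
                  [0.20, 0.20, 0.60],
                  [0.15, 0.15, 0.70]])
like_means = np.array([-1.0, 1.0, 0.0])

# Forward-simulate the latent chain and the observations.
states = [rng.choice(3, p=[0.33, 0.33, 0.34])]
obs = []
for t in range(T):
    z = rng.choice(3, p=trans[states[-1]])      # next state from transition dist
    obs.append(rng.normal(like_means[z], 1.0))  # observation from state likelihood
    states.append(z)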
2.4 A Simple Purely Deterministic Language
The FOPPL can be understood in two different ways:
- A language for specifying graphical-model data structures on which traditional inference algorithms may be run.
- A language that requires a non-standard interpretation in some implementing language to characterize the denoted posterior distribution.
In the case of graphical-model construction, it will be necessary to have a language for purely deterministic expressions. This language will be used to express link functions in the graphical model (a link function provides the relationship between a linear predictor and the mean of the distribution function, where a linear predictor is a linear function whose value is used to predict the outcome of a dependent variable). In contrast to the usual definition of link functions from statistics, the purely deterministic language will encode functions that take the values of parent random variables and produce distribution objects for their children. These link functions cannot have random variables inside them; such a variable would instead be another node in the graphical model.
2.4.1 Deterministic Expressions
Expressions in the FOPPL that do not involve user-defined procedure calls and involve only deterministic computations, such as (+ 17 (/ 2.0 6.0)), will be called “0th-order expressions”.
To identify and work with these deterministic expressions, we define a language with the following grammar:
c ::= constant value or primitive operation
v ::= variable
E ::= c | v | (if E1 E2 E3) | (c E1 ... En)
Note that neither sample nor observe statements appear in this syntax, and that procedure calls are only allowed for primitive operations, not for user-defined procedures. These constraints ensure that expressions E cannot depend on any probabilistic choices or conditioning.
3. Graph-Based Inference
3.1 Compilation to a Graphical Model
Programs written in the FOPPL specify probabilistic models over finitely many random variables. We will make this aspect clear by presenting the translation of these programs into finite graphical models. In later sections, we will show how this translation can be exploited to adapt inference algorithms for graphical models to probabilistic programs.
In the following relation, translation is specified by ⇓.
ρ, φ, e ⇓ G, E
The inputs:
- ρ is an environment, mapping procedure names to their definitions.
- φ is a logical predicate for the flow control context.
- e is an expression we intend to compile.
This expression e is translated to:

- G, a graphical model. Vertices in G represent random variables, and arcs represent dependencies between them. For each random variable in G, we define a probability density or mass function in the graph.
- E, an expression in the deterministic sub-language (described in section 2.4.1). This expression is deterministic in the sense that it involves neither sample nor observe; it describes the return value of the original expression e in terms of the random variables in G.

For observed random variables, we additionally record the observed value, as well as a logical predicate that indicates whether the observe expression is on the control-flow path, conditioned on the values of the latent variables.
Definition of a Graphical Model
We define a graphical model G as a tuple (V, A, ℙ, Ƴ) where:

- V is a set of vertices representing random variables.
- A is a set of directed edges (V × V) that represent conditional dependencies between random variables.
- ℙ is a map from vertices to deterministic expressions that specify the probability density function for each random variable.
- Ƴ is a partial map such that, for each observed random variable, it contains a pair (E, Φ), where E is the deterministic expression for the observed value and Φ is a predicate expression that evaluates to true when this observation is on the control-flow path. (A control-flow path is a path that might be traversed through a program during its execution.)
Before presenting a set of translation rules that can be used to compile any FOPPL program to a graphical model, we will illustrate the intended translation using a simple example:
(let [z (sample (bernoulli 0.5))
      mu (if (= z 0) -1.0 1.0)
      d (normal mu 1.0)
      y 0.5]
  (observe d y)
  z)
The program first samples z from a Bernoulli prior, based on which it sets a likelihood parameter µ to -1.0 or 1.0. It then observes a value y = 0.5 from a normal distribution with mean µ. This program defines a joint distribution p(y = 0.5, z). The inference problem is then to characterize the posterior distribution p(z | y = 0.5).
The below figure shows the graphical model and pure deterministic link functions that correspond to the above program:
In the evaluation relation ρ, φ, e ⇓ G, E:

- e is the source code of the program.
- ρ is the empty map, as there are no procedure definitions.
- φ (the control-flow predicate) is true at the top level of the program.

The graphical model G = (V, A, ℙ, Ƴ) and the result expression E that the program translates to are:

V = {z, y}
A = {(z, y)}
ℙ = [ z → p_bern z 0.5,
      y → p_norm y (if (= z 0) -1.0 1.0) 1.0 ]
Ƴ = [ y → 0.5 ]
E = z
Here:

- V, the vertex set of G, contains two random variables.
- A, the arc set of G, contains a single pair (z, y) to mark the conditional dependence relationship between these two variables.
- ℙ, the mapping of random variables to probability density functions, states that:
  - The probability density for z is defined as the target-language expression p_bern z 0.5. Here, p_bern refers to a function in the target language that implements the probability density function for the Bernoulli distribution.
  - The probability density for y is defined using p_norm, which implements the probability density function for the normal distribution.
- Ƴ, the mapping of observed random variables to pairs (E, Φ), contains a single entry that holds the observed value for y.
Assigning Symbols to Variable Nodes
In the above example, we used the mathematical symbols:
- z to refer to the random variable associated with the expression
(sample (bernoulli 0.5))
. - y to refer to the observed variable with expression
(observe d y)
.
In general, there will be one node in the network for each sample
or observe
expression that is evaluated in a program.
In the above example, there also happens to be a program variable z
that holds the value of the sample
expression for node z, and a program variable y
that holds the observed value for node y. This is of course not necessarily always the case. In programs that have procedures, the same sample
and observe
expressions in the procedure body can be evaluated multiple times.
We will later define a general set of translation rules that compile a FOPPL program to a graphical model, in which we assign each vertex in the graphical model a newly generated unique symbol. Assigning a label to each vertex is a way of assigning a unique address to each and every random variable in the program.
If Expressions in Graphical Models
When compiling a program to a graphical model, if
expressions require special consideration. Consider a simple mixture model, in which only the mean is treated as an unknown variable:
let z = sample (bernoulli 0.5)
mu = sample (normal (if z = 0 then -1.0 else 1.0) 1.0)
d = normal mu 1.0
y = 0.5
in observe d y
z
This is of course a really strange way of writing a mixture model. We define a single likelihood parameter µ, which is distributed according to Normal(−1, 1) when z = 0 and according to Normal(1, 1) when z = 1.
Typically we would think of a mixture model as having two components with parameter µ_0 and µ_1 respectively, where z selects the component. A more natural way to write the model might be:
let z = sample (bernoulli 0.5)
mu_0 = sample (normal -1.0 1.0)
mu_1 = sample (normal 1.0 1.0)
d_0 = normal mu_0 1.0
d_1 = normal mu_1 1.0
y = 0.5
in observe (if z = 0 then d_0 else d_1) y
z
Here we sample parameters µ_0 and µ_1, which then define two component likelihoods d_0 and d_1. The variable z then selects the component likelihood for an observation y.
The first program defines a density on three variables, p(y, µ, z). The second program defines a joint density on four variables, p(y, µ_0, µ_1, z). Nonetheless, it seems intuitive that these programs are equivalent in some sense. The equivalence we would want here is that both programs define the same marginal posterior on z:
p(z | y) = ∫ p(z, µ | y) dµ = ∫∫ p(z, µ_0, µ_1 | y) dµ_0 dµ_1
Recall that the number of nodes in the graphical model depends on the number of sample and observe expressions. The essential difference between these two programs is that:

- In the first program, the if-expression is placed inside the sample expression for mu. This results in a graph with 3 nodes.
- In the second program, the if-expression is placed outside the sample expressions for mu_0 and mu_1. This results in a graph with 4 nodes.
However, the distribution on the return values of the programs should be equivalent.
It also matters how we evaluate if-expressions – lazily or eagerly. Consider the following program:
let z = sample (bernoulli 0.5)
mu = if z = 0 then sample (normal -1.0 1.0) else sample (normal 1.0 1.0)
d = normal mu 1.0
y = 0.5
in observe d y
z
In lazy evaluation, we would first evaluate the predicate z = 0
and then either evaluate the first sample
expression or the second sample
expression, but never both.
In eager evaluation, we first evaluate the predicate and both branches, before returning the value of one of the branches based on the predicate value.
The problem with eager evaluation of observe in if-expressions:
Consider the following code:
let z = sample (bernoulli 0.5)
    mu_0 = sample (normal -1.0 1.0)
    mu_1 = sample (normal 1.0 1.0)
    y = 0.5
in if (z = 0)
   then (observe (normal mu_0 1.0) y)
   else (observe (normal mu_1 1.0) y)
When performing eager evaluation, we would be calling observe on two variables y_0 and y_1, both with value 0.5 – this means we are conditioning the probability distribution on both variables, even though we only want to do so for one of them. When performing lazy evaluation, only one of the two branches is included in the probability density. The lazy interpretation is much more natural here.
The problem with lazy evaluation of sample in if-expressions:
Consider the following code:
let z = sample (bernoulli 0.5)
mu = if (z = 0)
then sample (normal -1.0 1.0)
else sample (normal 1.0 1.0)
d = normal mu 1.0
y = 0.5
in observe d y
z
Lazy evaluation of if-expressions makes it difficult to characterize the support of the probability distribution defined by a program when branches contain sample expressions.
How do we define the probabilities for mu_0 and mu_1? One choice is to set the probability of unevaluated branches to 1: we would include either p(mu_0 | z = 0) or p(mu_1 | z = 1) in the probability, and assume a probability of 1 for the unevaluated branch. This would be equivalent to:

- Defining ℙ(mu_0) as if (z = 0) then p_norm mu_0 -1.0 1.0 else 1.0
- Defining ℙ(mu_1) as if (z != 0) then p_norm mu_1 1.0 1.0 else 1.0
However, we cannot marginalize out the unevaluated branches, i.e. compute ∫ dµ_0 p(µ_0 | z = 1) and ∫ dµ_1 p(µ_1 | z = 0), if we assume p(µ_0 | z = 1) = 1 and p(µ_1 | z = 0) = 1: a constant density of 1 does not integrate to 1 over an unbounded domain.
Conclusion:

We need to understand that observe and sample expressions affect the marginal posterior on a program output in very different ways:

- sample expressions that are not on the control-flow path cannot affect the values of any expressions outside of their branch. This means they can be safely incorporated into the model as auxiliary variables, since they do not affect the marginal posterior on the return value.
- observe expressions do not have this guarantee: they change the posterior distribution on the return value when incorporated into a graphical model.
Based on this intuition, the solution to our problem is:
- We can assign probability 1 to the observed variables that are not on the same control flow path. Since observed variables have constant values, the interpretability of their support for characterizing probability distributions is not an issue in the same way it is with sampled variables.
- Conversely, we assign the same probability to sampled variables, regardless of the branch they occur in. How this is accomplished will be described in later sections.
Translation Rules
ρ, φ, e ⇓ G, E
- Constants and Variables: we translate constants c and variables z to themselves and the empty graphical model.
──────────────────── ────────────────────
ρ, φ, c ⇓ G_empty, c ρ, φ, z ⇓ G_empty, z
- Let: we translate (let [v e_1] e_2) by first translating e_1, then substituting the outcome of this translation for v in e_2, and finally translating the result of this substitution:

ρ, φ, e_1 ⇓ G_1, E_1     ρ, φ, e_2[v := E_1] ⇓ G_2, E_2
──────────────────────────────────────────────────────
ρ, φ, (let [v e_1] e_2) ⇓ G_1 ⊕ G_2, E_2
G_1 ⊕ G_2
is the concatenation of two disjoint graphical models. When G_1 = (V_1, A_1, P_1, Y_1)
and G_2 = (V_2, A_2, P_2, Y_2)
:
G_1 ⊕ G_2 = (V1 ∪ V2, A1 ∪ A2, P1 ⊕ P2, Y1 ⊕ Y2)
where P1 ⊕ P2
and Y1 ⊕ Y2
are the concatenation of two finite maps with disjoint domains (disjoint meaning having no elements in common). This operator ⊕
assumes that the input graphical models G_1
and G_2
use disjoint sets of vertices. This assumption always holds because every graphical model created by our translation uses fresh vertices, which do not appear in other networks generated.
- If: when we translate the sub-expressions for the alternative branches, we conjoin the logical predicate φ with the expression E_1 or its negation – recall that the role of the logical predicate is to include observe statements that are on the current control-flow path, and to exclude observe statements that are off it.

ρ, φ, e_1 ⇓ G_1, E_1     ρ, (φ && E_1), e_2 ⇓ G_2, E_2     ρ, (φ && not E_1), e_3 ⇓ G_3, E_3
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (if e_1 e_2 e_3) ⇓ G_1 ⊕ G_2 ⊕ G_3, (if E_1 E_2 E_3)
The set of rules so far for an expression e does not extend the graphical models of e's sub-expressions with any new vertices. This is because the programming constructs involved in these rules perform deterministic computations, not probabilistic ones, and graphical-model vertices are used to express random variables. The next two rules, for sample and observe, show how graphical models are extended with new random variables.
- Sample:

ρ, φ, e ⇓ (V, A, ℙ, Ƴ), E     Choose a fresh variable v     Z = FREE-VARS(E)     F = SCORE(E, v)
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (sample e) ⇓ (V ∪ {v}, A ∪ {(z, v) | z ∈ Z}, ℙ ⊕ [v → F], Ƴ), v
This rule states that we translate (sample e) in four steps.

- First, we translate the argument e to a graphical model (V, A, ℙ, Ƴ) and a deterministic expression E. Both the argument e and its translation E represent the same distribution, from which (sample e) samples.
- Second, we choose a fresh variable v that will represent the sampled random variable.
- Third, we collect all the free variables in E that are used as random variables of the network, and set Z to the set of these random variables.
- Finally, we convert the expression E (which denotes a distribution) to the probability density function F of the distribution. This conversion is done by calling SCORE on the distribution E and the variable v, which is defined as follows:
SCORE((if E_1 E_2 E_3), v) = (if E_1 F_2 F_3)
    when F_i = SCORE(E_i, v) for i ∈ {2, 3} and neither is ⊥
SCORE((c E_1 ... E_n), v) = (p_c v E_1 ... E_n)
    when c is a constructor for a distribution and p_c is its probability density function
SCORE(E, v) = ⊥
    when E is not one of the above cases (i.e. E is not a distribution)
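A minimal Python sketch of SCORE, representing target-language expressions as nested tuples (the representation and the set of recognized distribution constructors are illustrative assumptions):

def score(E, v):
    """Sketch of SCORE: convert a distribution-valued expression E into
    a density expression in the fresh variable v (None plays the role of ⊥)."""
    if isinstance(E, tuple) and E[0] == 'if':
        _, E1, E2, E3 = E
        F2, F3 = score(E2, v), score(E3, v)
        if F2 is None or F3 is None:
            return None
        return ('if', E1, F2, F3)
    if isinstance(E, tuple) and E[0] in ('normal', 'bernoulli', 'beta'):
        # (c E_1 ... E_n)  ->  (p_c v E_1 ... E_n)
        return ('p_' + E[0], v) + E[1:]
    return None  # E does not denote a distribution

# score(('normal', 'mu', 1.0), 'y') == ('p_normal', 'y', 'mu', 1.0)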
- Observe: the rule for observe expressions is analogous to the rule for sample expressions, but we additionally need to account for the observed value e_2 and the predicate φ.
ρ, φ, e_1 ⇓ G_1, E_1 ρ, φ, e_2 ⇓ G_2, E_2
(V, A, ℙ, Ƴ) = G_1 ⊕ G_2 Choose a fresh variable v
F_1 = SCORE(E_1, v) ≠ ⊥ F = if φ then F_1 else 1
Z = FREE-VARS(F_1) \ {v} FREE-VARS(E_2) ∩ V = ∅
B = {(z, v) : z ∈ Z}
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, observe e_1 e_2 ⇓ (V ∪ {v}, A ∪ B, ℙ ⊕ [v → F], Ƴ ⊕ [v → E_2]), E_2
The rule states that:

- First, we translate the sub-expressions e_1 and e_2 into the graphical models G_1 and G_2 and deterministic expressions E_1 and E_2, where E_1 is the distribution and E_2 is the value of the observed variable.
- Next, we construct a graphical network (V, A, ℙ, Ƴ) by merging the networks of the sub-expressions.
- Then, we pick a fresh variable v that will represent the observed random variable.
- We use SCORE to construct an expression F_1 that represents the probability density function of the observed variable v under the distribution E_1. (As with sample, the expression e_1 which translates to E_1 must be a distribution.)
- We then construct a new expression F = if φ then F_1 else 1 to ensure that the probability of the observed variable evaluates to 1 if the observe expression occurs in a branch that was not taken. The free variables in this expression F are the union of the free variables in E_1 and φ, plus the new variable v.
- We add a set of arcs B to the graphical network, consisting of edges from all the free variables in F (i.e. the free variables in the probability density function of v under distribution E_1), excluding v itself, to the observed random variable v.
In order for this notion of an observed random variable to make sense, the expression E_2
for the observed random variable must be fully deterministic. For this reason we require that FREE-VARS(E_2) ∩ V = ∅
, which ensures that E_2
cannot reference any other random variables in the graphical model G_1 ⊕ G_2
.
An important consequence of E_2
being a value is that although the return value of an observe
(being the value of the observed variable) may be used in subsequent computation, no graphical model edges will be generated with the observed random variable as a parent. An alternative rule could return a null
value in place of E_2
, which may be safer in terms of ensuring clarity to the programmer, so that there is no way to imagine that an edge could be created where one was not.
- Procedure Call: the remaining two cases are those for a user-defined procedure
f
and a primitive functionc
.
For primitive functions, we first translate the arguments e_1 to e_n. Then we translate the expression for the call by substituting the translated arguments into the original expression and merging the graphs for the arguments.

ρ, φ, e_i ⇓ G_i, E_i     forall i ∈ [1 ... n]
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (c e_1 ... e_n) ⇓ G_1 ⊕ ... ⊕ G_n, (c E_1 ... E_n)
For user-defined procedures, we first translate the arguments e_1 to e_n. Then we transform the procedure body by replacing all instances of each parameter v_i with the expression for the corresponding argument E_i, and translate the transformed body; see the reconstructed rule below.
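Written out in the same style as the other rules, the user-defined procedure case would look roughly as follows (a reconstruction, not quoted from the source):

ρ, φ, e_i ⇓ G_i, E_i     forall i ∈ [1 ... n]
ρ(f) = (defn f [v_1 ... v_n] e')     ρ, φ, e'[v_1 := E_1, ..., v_n := E_n] ⇓ G, E
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (f e_1 ... e_n) ⇓ G_1 ⊕ ... ⊕ G_n ⊕ G, E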
3.2 Evaluating the Probability Density
We make explicit how we can use this representation of a probabilistic program to evaluate the probability of a particular setting of the variables in V
. The Bayesian network G = (V, A, ℙ, Ƴ)
that we construct by compiling a FOPPL
program is a mathematical representation of a directed graphical model. Like any graphical model, G
defines a probability density on its variables V
.
In a directed graphical model, each node v ∈ V
has a set of parents:
parents(v) := {u : (u, v) ∈ A}
The joint probability of all variables can be expressed as a product over conditional probabilities.
p(V) = ∏ p(v | parents(v))
v∈V
In our graph G, each term p(v | parents(v)) is represented as a deterministic expression ℙ(v) = (c v E_1 ... E_n), where:

- c is either a probability density function (for continuous variables) or a probability mass function (for discrete variables).
- E_1 ... E_n are expressions that evaluate to the parameters θ_1 ... θ_n of that probability density/mass function.
Implicit in this notation is the fact that each expression has some set of free variables. In order to evaluate an expression to a value, we must specify values for each of these free variables. In other words, we can think of each expression E_i as a mapping from values of its free variables to a parameter value. By construction, the set of parents parents(v) is nothing but the free variables of ℙ(v) other than v itself; we can read ℙ(v) as the probability of v conditioned on its ancestor/parent random variables in the graph.
parents(v) = FREE-VARS(ℙ(v)) \ {v}
Recall that ℙ
is a map from vertices to deterministic expressions that specify the probability density or mass function for each random variable. The expression ℙ(v)
can hence be thought of as a function that maps the random variable v
and its parents parents(v)
to a probability or a probability density that is the probability of v
conditioned on its parents. Therefore, we will treat these two as equivalent:
p(v | parents(v)) ≡ ℙ(v)
This means that the conditional probability of random variable v
given its parent random variables can be looked up in ℙ
by doing ℙ(v)
.
We can decompose the joint probability p(V) into a prior and a likelihood term.

- Recall that Ƴ is a mapping from random variables to deterministic expressions representing their values. In the translation rule for observe, we require that the expression Ƴ(v) for an observed value cannot have free variables. Each expression Ƴ(v) will hence simplify to a constant when we perform partial evaluation.
- We will use Y to refer to all the nodes in V that correspond to observed random variables, which is to say Y = dom(Ƴ). Observed random variables are those produced by observe expressions.
- We will use X to refer to all the nodes in V that correspond to unobserved random variables, which is to say X = V \ Y. Unobserved random variables are those produced by sample expressions.
Since the observed nodes y ∈ Y
cannot have any children, we can re-express the joint probability p(V)
as:
p(V) = p(Y, X) = p(Y | X)p(X)
where
p(Y | X) = ∏ p(y | parents(y)) p(X) = ∏ p(x | parents(x))
y∈Y x∈X
In this manner, a probabilistic program defines a joint distribution p(Y, X)
.
The goal of probabilistic program inference is to characterize the posterior distribution:
p(X | Y) = p(X, Y) / p(Y),    where    p(Y) = ∫ p(X, Y) dX
3.2.1 Conditioning with Factors
Not all inference problems for probabilistic programs target a posterior p(X | Y)
that is defined in terms of unobserved and observed random variables. There are inference problems in which there is no notion of observed data Y
but it is possible to define some notion of loss, reward, or fitness, given a choice of an unobserved variable X
.
In a probabilistic program:

- The sample statements define a prior p(X) on the random variables.
- The observe statements define a likelihood p(Y | X) on the random variables.
To support a more general notion of soft constraints, we can replace the likelihood p(Y | X)
with a strictly positive potential ψ(X)
to define an unnormalized density γ(X)
:
γ(X) = ψ(X)p(X)
In this more general setting, the goal of inference is to characterize a target density π(x)
, which is defined as:
π(X) = γ(X) / Z = ψ(X) p(X) / Z,    where    Z = ∫ γ(X) dX
Here:

- The target density π(X) is analogous to the posterior p(X | Y).
- The strictly positive potential ψ(X) is analogous to the likelihood p(Y | X).
- The unnormalized density γ(X) is analogous to the joint distribution p(Y, X).
- The normalizing constant Z is analogous to the marginal likelihood p(Y).
From a language design perspective, we can now ask how the FOPPL would need to be extended in order to support this more general form of soft constraint.
For a probabilistic program in the FOPPL, the potential function is a product over terms:
ψ(X) = ∏ ψ_y(X_y)
y∈Y
where
ψ_y(X_y) := p(y = Ƴ(y) | parents(y)) ≡ ℙ(y)[y := Ƴ(y)]
X_y := FREE-VARS(ℙ(y)) \ {y} = parents(y)
Here, ψ_y(X_y) states that the potential p(y = Ƴ(y) | parents(y)) – the probability of random variable y taking the value Ƴ(y) (where Ƴ is the mapping of random variables to values) given the parent nodes of y – is equal to ℙ(y)[y := Ƴ(y)], i.e. the probability density function of y evaluated by substituting the value Ƴ(y) for y, whose remaining free variables are X_y.
One way to support arbitrary potential functions is to provide a special form (factor log-p) that takes an arbitrary log probability log-p as an argument. The factor expression lets the user manually set a probability for a new random variable, as opposed to observe, which sets the probability based on a distribution and the value of an observed variable.
We can then define a translation rule for the (factor log-p) expression, which:

- Inserts a new node v with probability ℙ(v) = exp log-p (applying exp to a log probability converts it back to a non-log probability).
- Inserts an observed value nil for the random variable v into the graph.
ρ, φ, e ⇓ (V, A, ℙ, Ƴ), E     F = (if φ then (exp E) else 1)
Choose a fresh variable v
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (factor e) ⇓ (V ∪ {v}, A, ℙ ⊕ [v → F], Ƴ ⊕ [v → nil]), nil
Comparing observe with factor: recall that the observe expression represents the likelihood p(Y|X), and the potential function ψ(X) is analogous to the likelihood p(Y|X). Hence factor is analogous to observe – the difference is that observe conditions using both observed and unobserved random variables, whereas factor conditions using only unobserved random variables. This is why we set the value of v in Ƴ to nil – we haven't observed a value. However, we set the probability density of the value of v to be the exponential of the log probability e. We can imagine that (factor e) is hence manually setting the probability of a random variable.
In practice, we don’t need to provide separate special forms for factor and observe, since each can be implemented as a special case of the other.
One way of doing so is to define (factor log-p) as a procedure which calls observe on a pseudo distribution (factor-dist log-p) parameterized by log-p, with a nil value:

(defn factor [log-p]
  (observe (factor-dist log-p) nil))

where factor-dist is a constructor for a pseudo distribution object with corresponding potential:

p_factor-dist(y ; log-p) := if y = nil
                            then exp log-p
                            else 0
Another way of doing this is by defining observe [dist v]
as a procedure which calls factor
with the log probability log-prob dist v
, where log-prob
is a primitive procedure that returns the log probability density for a value v
under a distribution dist
:
defn observe [dist v]
factor (log-prob dist v)
v
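As a rough illustration of this duality, here is a Python sketch in which the inference state is modeled as a running log weight; the names log_W, factor, and observe here are our own, not FOPPL primitives:

```python
from scipy.stats import norm

log_W = 0.0  # inference state: running log of the unnormalized weight

def factor(log_p):
    # Manually add an arbitrary log potential to the unnormalized density.
    global log_W
    log_W += log_p

def observe(dist, v):
    # Conditioning on value v under dist, as a special case of factor.
    factor(dist.logpdf(v))
    return v

observe(norm(0.0, 1.0), 0.5)  # adds log p(y = 0.5) under Normal(0, 1)
factor(-2.3)                  # a soft constraint with log potential -2.3
```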
3.2.2 Partial Evaluation
A necessary optimization for our compilation procedure is to perform a partial evaluation step. The partial evaluation step pre-evaluates expressions E
in the target language that do not contain any free variables, which means that they take on the same value in every execution of the program. Partial evaluation of these expressions is necessary to avoid the appearance of false edges between variables that are in fact not connected, in the sense that the value of the parent does not affect the conditional density of the child.
Our target language is very simple, so we only need to update the rules for if
-expressions and procedure calls to include a partial evaluation step.
- If-expressions
ρ, φ, e_1 ⇓ G_1, E_1
ρ, EVAL(φ && E_1), e_2 ⇓ G_2, E_2 ρ, EVAL(φ && not E_1), e_3 ⇓ G_3, E_3
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (if e_1 then e_2 else e_3) ⇓ (G_1 ⊕ G_2 ⊕ G_3), EVAL(if E_1 then E_2 else E_3)
- Procedure calls
ρ, φ, e_i ⇓ G_i, E_i forall i ∈ [1 ... n]
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (c e_1 ... e_n) ⇓ (G_1 ⊕ ... ⊕ G_n), EVAL(c E_1 ... E_n)
The expression EVAL(e)
is the partial evaluation operation, and can incorporate any number of rules for simplifying expressions. We will begin with these rules:
- EVAL
EVAL(if c_1 then E_2 else E_3) = E_2
when c_1 = true
EVAL(if c_1 then E_2 else E_3) = E_3
when c_1 = false
EVAL(c c_1 ... c_n) = c'
when calling c with arguments c_1 ... c_n evaluates to c'
EVAL(E) = E
in all other cases
These rules state:

- An if-expression (if E_1 then E_2 else E_3) can be simplified when E_1 can be fully evaluated to a constant c_1, by selecting the expression for the appropriate branch.
- A primitive procedure call can be evaluated when all of its arguments can be fully evaluated.
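A minimal sketch of these rules in Python, under our own assumed encoding of target-language expressions as tuples (constants fold through if-expressions and through primitive calls whose arguments are all known):

```python
# Expressions: constants, variable names (strings), ('if', e1, e2, e3),
# or ('call', op, *args) for primitive procedure calls.
PRIMITIVES = {'+': lambda a, b: a + b, 'not': lambda a: not a}

def is_const(e):
    return not isinstance(e, (tuple, str))

def partial_eval(e):
    if isinstance(e, tuple) and e[0] == 'if':
        _, e1, e2, e3 = e
        e1, e2, e3 = partial_eval(e1), partial_eval(e2), partial_eval(e3)
        if is_const(e1):                        # rule: branch on a known condition
            return e2 if e1 else e3
        return ('if', e1, e2, e3)
    if isinstance(e, tuple) and e[0] == 'call':
        _, op, *args = e
        args = [partial_eval(a) for a in args]
        if all(is_const(a) for a in args):      # rule: fold fully-constant calls
            return PRIMITIVES[op](*args)
        return ('call', op, *args)
    return e                                    # rule: leave everything else alone

# The false-edge example discussed below: the reference to v_2 disappears.
assert partial_eval(('if', True, 'v_1', 'v_2')) == 'v_1'
```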
We additionally need to modify the definition of the SCORE
function.
Recall that SCORE
converts an expression representing a distribution into its corresponding probability density function:
SCORE(if E_1 then E_2 else E_3, v) = if E_1 then F_2 else F_3
    when F_i = SCORE(E_i, v) for i ∈ {2, 3} and neither is ⊥
SCORE(c E_1 ... E_n, v) = (p_c v E_1 ... E_n)
    when c is a constructor for a distribution and p_c is its pdf or pmf
SCORE(E, v) = ⊥
    when E is not one of the above cases
We need to add one more rule:
SCORE(c, v) = p_c v
when c is a distribution and p_c is its pdf or pmf
To see how partial evaluation also lets us reduce the number of edges in the network, let’s consider the expression if true then v_1 else v_2
. This expression references two random variables v_1
and v_2
by name. After partial evaluation, this expression simplifies to v_1
, which eliminates the false dependence on v_2
.
Another practical advantage of partial evaluation is that it gives us a simple way to identify expressions in a program that are fully deterministic, since deterministic expressions will be partially evaluated to constants. This is useful when translating observe e1 e2
statements, in which the expression e_2
must be deterministic as it is the value of a random variable. In programs that use loop c v e e_1 ... e_n
, we can now substitute any fully deterministic expression for the number of loop iterations c
.
Lists, Vectors, and Hash Maps
Eliminating false edges/dependencies in the dependency graph becomes particularly important in programs that make use of data structures.
Consider the following example, which defines three time steps of a two-state Markov chain.
let A = [[0.9 0.1]
[0.1 0.9]]
x_1 = sample (discrete [0.5 0.5])
x_2 = sample (discrete (get A x_1))
x_3 = sample (discrete (get A x_2))
in [x_1, x_2, x_3]
Compiling this to a network will yield three variable nodes. If we refer to these nodes as v_1
, v_2
, and v_3
, then there will be arcs from v_1
to v_2
, and from v_2
to v_3.
Suppose we now rewrite this program using the loop
construct:
defn markov-step [n xs A]
let k = last xs
A_k = get A k
in append xs (sample (discrete A_k))
let A = [[0.9 0.1]
[0.1 0.9]]
x_1 = sample (discrete [0.5 0.5])
in loop 2 markov-step [x_1] A
In this version, each call to markov-step
:
- Accepts a vector of states
xs
- Gets the most recent state
k = last xs
(which will be either 0 or 1) - Then gets
A_k
, which is thek
‘th array fromA
- Uses
sample (discrete A_k)
to sample a value from a discrete distribution of values[0, 1]
with corresponding probabilities found in arrayA_k
- Appends this sampled value as the new state to the vector of states
xs
This program version generates the same sequence of random variables as the previous program, and has the advantage of allowing us to generalize to sequences of arbitrary length by varying the loop iteration constant. However, under the partial evaluation rules that we have specified so far, we would obtain a different set of edges. Just like the previous version of the program, this version will evaluate three sample
statements.
- For the first sample statement, sample (discrete [0.5 0.5]), there will be no arcs created (only the initial node v_1).
- For the second sample statement, sample (discrete A_k), there will be an arc from v_1 to v_2, since the expression for A_k expands to: get [[0.9 0.1] [0.1 0.9]] (last [v_1])
- However, for the third sample statement, there will be arcs from both v_1 and v_2 to v_3, since the expression for A_k expands to: get [[0.9 0.1] [0.1 0.9]] (last (append [v_1] v_2))
The extra arc from v_1
to v_3
is of course not necessary here, since the expression last (append [v_1] v_2)
will always evaluate to v_2
. What’s more, if we run this program to generate more than 3 states, the node v_n
for the n
th state will have incoming arcs from all preceding variables v_1 ... v_{n-1}
, whereas the only real arc in the Bayesian network is the one from v_{n-1}
.
We can eliminate these false arcs by implementing an additional set of partial evaluation rules for data structures vector
and hash-map
:
EVAL(vector E_1 ... E_n)
= [E_1 ... E_n]
EVAL(hash-map c_1 E_1 ... c_n E_n)
= {c_1 E_1 ... c_n E_n}
These rules ensure that expressions which construct data structures are partially evaluated to data structures containing expressions. We can similarly partially evaluate functions that add, get, or replace entries.
EVAL(append [E_1 ... E_n] E_{n+1})
= [E_1 ... E_n E_{n+1}]
EVAL(put {c_1 E_1 ... c_n E_n} c_k E'_k)
= {c_1 E_1 ... c_k E'_k ... c_n E_n}
EVAL(last [E_1 ... E_n]) = E_n
EVAL(get [E_1 ... E_n] k) = E_k
EVAL(get {c_1 E_1 ... c_n E_n} c_k) = E_k
In the Markov chain example, the expression for A_k
in the third sample statement then can be simplified:
get [[0.9 0.1]
[0.1 0.9]] (last (append [v_1] v_2))
~>
get [[0.9 0.1]
[0.1 0.9]] (last [v_1 v_2])
~>
get [[0.9 0.1]
[0.1 0.9]] v_2
This yields the correct dependency structure for the Bayesian network.
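Continuing the partial-evaluator sketch from earlier (the tuple encoding is again our own assumption), the data-structure rules become cases that operate on already-known vectors and maps:

```python
def partial_eval_ds(e):
    if not isinstance(e, tuple):
        return e
    op, *args = e
    args = [partial_eval_ds(a) for a in args]
    if op == 'append' and isinstance(args[0], list):
        return args[0] + [args[1]]               # rule for append
    if op == 'last' and isinstance(args[0], list):
        return args[0][-1]                       # rule for last
    if op == 'get' and isinstance(args[0], (list, dict)) \
            and not isinstance(args[1], (tuple, str)):
        return args[0][args[1]]                  # rule for get with a known key
    return (op, *args)

# The third Markov-chain lookup simplifies so that only v_2 is referenced:
A = [[0.9, 0.1], [0.1, 0.9]]
e = ('get', A, ('last', ('append', ['v_1'], 'v_2')))
assert partial_eval_ds(e) == ('get', A, 'v_2')
```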
3.3 Gibbs Sampling
So far, we have just defined a way to translate probabilistic programs into a data structure for finite graphical models. One important reason for doing so is that many existing inference algorithms are defined explicitly in terms of finite graphical models, and can now be applied directly to probabilistic programs written in the FOPPL.
We will consider such inference algorithms now, starting with a general family of Markov chain Monte Carlo (MCMC) algorithms. MCMC algorithms perform Bayesian inference by drawing samples from the posterior distribution P(X | Y)
, that is, the conditional distribution of the latent variables X ⊆ V
given the observed variables Y ⊆ V
. This is accomplished by simulating from a Markov chain whose transition function is defined such that the stationary distribution is the target posterior P(X | Y)
. These samples are then used to characterize the distribution of the return value r(X)
.
Procedurally, MCMC algorithms begin by initializing the latent variables to some value X_0, and repeatedly sample from a Markov transition density to produce a dependent sequence of samples X_1, ..., X_S. We will describe how these algorithms can be applied to perform inference in graphs produced by compiling a FOPPL program.
Metropolis-Hastings
The Metropolis-Hastings algorithm provides a general recipe for producing appropriate MCMC transition functions, by combining a proposal step with an accept/reject step. Given some proposal distribution q(X' | V)
, the MH algorithm simulates a candidate sample latent value X'
from q(X' | V)
conditioned on the value of the current sample X
, and then evaluates the acceptance probability by considering the candidate sample X'
and the current sample X
:
α(X', X) = min { 1, p(Y, X') q(X | V') / (p(Y, X) q(X' | V)) }
With probability α(X', X)
we accept the transition X → X'
. With probability 1 - α(X', X)
we reject the transition and retain the current sample X → X
.
When we repeatedly apply this transition function, we obtain a Markov process:

X' ∼ q(X' | V_{s-1}),    u ~ Uniform(0, 1)
X_s = X'        if u ≤ α(X', X_{s-1})
X_s = X_{s-1}   if u > α(X', X_{s-1})
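A minimal sketch of this transition function in Python, assuming a one-dimensional unnormalized target of our own choosing; with a symmetric Gaussian random-walk proposal, the q terms cancel in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gamma(x):
    return -0.5 * (x - 3.0) ** 2  # an unnormalized N(3, 1) target (assumed)

def metropolis_hastings(steps=10_000, step_size=1.0):
    x, samples = 0.0, []
    for _ in range(steps):
        x_prop = x + step_size * rng.normal()            # X' ~ q(X' | X)
        log_alpha = log_gamma(x_prop) - log_gamma(x)     # log acceptance ratio
        if rng.uniform() < np.exp(min(0.0, log_alpha)):  # accept w.p. min(1, alpha)
            x = x_prop
        samples.append(x)                                # X_s is X' or X_{s-1}
    return samples
```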
Gibbs Sampling
The Gibbs sampling algorithms are an important special case of Metropolis-Hastings which cycle through all the latent variables in the model and iteratively sample from the full conditional distributions. The full conditional distribution is the probability distribution of a variable (node) in a probabilistic graphical model conditioned on the value of all the other variables in the probabilistic graphical model.
p(x | Y, X \ {x}) = p(x | V \ {x})
Alternative Explanation:

Gibbs sampling is a special case of Metropolis-Hastings in which the newly proposed state is always accepted with probability one. Consider a D-dimensional posterior with parameters θ = {θ_1, ..., θ_D}. The basic idea of Gibbs sampling is that for each time step t, we sample θ by iteratively sampling a proposal for each of its dimensions θ_d from the conditional distribution P(θ_d | X, θ \ {θ_d}), where θ \ {θ_d} is θ without the d-th parameter:

Gibbs sampling:

for t = 1 ... T do
    θ^(t+1)_1 ∼ P(θ_1 | X, θ^(t)_2, θ^(t)_3, ..., θ^(t)_D)
    θ^(t+1)_2 ∼ P(θ_2 | X, θ^(t+1)_1, θ^(t)_3, ..., θ^(t)_D)
    ⋮
    θ^(t+1)_d ∼ P(θ_d | X, θ^(t+1)_1, ..., θ^(t+1)_{d-1}, θ^(t)_{d+1}, ..., θ^(t)_D)
    ⋮
    θ^(t+1)_D ∼ P(θ_D | X, θ^(t+1)_1, θ^(t+1)_2, ..., θ^(t+1)_{D-1})
In some special cases, these conditional distributions can be derived analytically and sampled from exactly. However, this is not possible in general, so as a general-purpose solution one turns to Metropolis-within-Gibbs algorithms, which instead apply a Metropolis-Hastings transition function that targets p(x | V \ {x}).
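As a concrete case where the full conditionals are available in closed form, here is a sketch of Gibbs sampling for a standard bivariate normal with correlation ρ (a model of our own choosing): each conditional is θ_1 | θ_2 ~ N(ρ θ_2, 1 - ρ²), and symmetrically for θ_2.

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.8            # correlation of the assumed bivariate normal target
theta = np.zeros(2)  # initialize (theta_1, theta_2)
samples = []
for t in range(5_000):
    # Sample each dimension from its full conditional given the other:
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho ** 2))
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho ** 2))
    samples.append(theta.copy())
```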
From an implementation perspective, given our compiled graph (V, A, ℙ, Ƴ)
we can compute the acceptance probability α(X', X)
by evaluating the expressions ℙ(v)
for each v ∈ V
, and substituting the values for the current sample X
and the proposal X'
. More precisely, if we use X
to refer to the set of unobserved variables, and χ
to refer to the map from unobserved variables to their values,
X = (x_1, ..., x_N) χ = [ x_1 → c_1, ..., x_N → c_N]
then we can use Ѵ = χ ⊕ Ƴ
to refer to the values of all variables (both unobserved and observed) and express the joint probability over the variables V
as:
p(V = Ѵ) = ∏ EVAL( ℙ(v)[V := Ѵ] )
v∈V
When we update a single variable x
using a kernel q(x | V)
, we are proposing a new mapping Ѵ' = Ѵ[x → c']
, where c'
is the candidate value proposed for x
.
The acceptance probability for changing the value of x
from c
to c'
then takes the form:
α(Ѵ', Ѵ) = min {1, p(V = Ѵ') q(x = c | V = Ѵ') / (p(V = Ѵ) q(x = c' | V = Ѵ))}
The important thing to note is that many terms in this ratio will cancel out:
- The joint probabilities p(V = Ѵ) are composed of a product of conditional density terms ∏ p(v | parents(v)).
- The individual expressions p(v | parents(v)) ≡ ℙ(v) depend on the value c, or its proposed alternative c', of the node x only if v = x or x ∈ parents(v), in other words, if x ∈ FREE-VARS(ℙ(v)).
If we define V_x to be the set of variables whose probability densities depend on x:

V_x := {v : x ∈ FREE-VARS(ℙ(v))} = {x} ∪ {v : x ∈ parents(v)}
Then we can decompose the joint distribution p(V) of all random variables into terms that depend on the variable x that we propose changing and terms that do not:

p(V) = ∏_{w ∈ V\V_x} p(w | parents(w)) · ∏_{v ∈ V_x} p(v | parents(v))
We now note that all terms w ∈ V \ V_x in the acceptance ratio cancel, taking the same value in both numerator and denominator. Denoting the values of a variable v as c_v and c'_v for the maps Ѵ and Ѵ' respectively, we can simplify the acceptance probability α to:

α(Ѵ', Ѵ) = min {1, [∏_{v ∈ V_x} p(v = c'_v | parents(v)) · q(x = c | V = Ѵ')] / [∏_{v ∈ V_x} p(v = c_v | parents(v)) · q(x = c' | V = Ѵ)]}
To implement a Gibbs sampler, we additionally need to specify some form of proposal. We will here assume a map ℚ from unobserved variables x to expressions in the target language.
ℚ := [ x_1 → E_1 , ... , x_N → E_N ]
For each variable x
, the expression E
defines a distribution, which can in principle depend on other unobserved variables X
.
We can then use this distribution to both generate samples, and evaluate the forward and reverse proposal densities q(x = c' | V = Ѵ)
and q(x = c' | V = Ѵ')
. To do so, we first evaluate the expression to a distribution:
d = EVAL(ℚ(x)[V := Ѵ])
We then assume that we have implementations of the functions:

- SAMPLE, which allows us to generate samples
- LOG-PROB, which allows us to evaluate the log density function for the distribution

SAMPLE(d) = c'
LOG-PROB(d, c') = log q(x = c' | V)
The below algorithm shows pseudo-code for a Gibbs sampler with this type of proposal.
global V, A, X, Y, ℙ, Ƴ -- A directed graphical model
global ℚ -- A map of proposal expressions
function ACCEPT(x, χ', χ)
    d <- EVAL(ℚ(x)[X := χ])
    d' <- EVAL(ℚ(x)[X := χ'])
    logα <- LOG-PROB(d', χ(x)) - LOG-PROB(d, χ'(x))
    V_x <- {v : x ∈ FREE-VARS(ℙ(v))}
    for v in V_x do
        logα <- logα + log EVAL(ℙ(v)[V := χ' ⊕ Ƴ])
        logα <- logα - log EVAL(ℙ(v)[V := χ ⊕ Ƴ])
    return exp(logα)
function GIBBS-STEP(χ)
for x in X do
d <- EVAL(ℚ(x)[X := χ])
χ' <- χ
χ'(x) <- SAMPLE(d)
α <- ACCEPT(x, χ', χ)
u ~ Uniform(0, 1)
if u < α
then χ <- χ'
return χ
function GIBBS(χ^0, T)
for t in 1 ... T do
χ^t <- GIBBS-STEP(χ^(t-1))
return χ^1, ..., χ^T
- V is the set of random variables
- A is the set of arcs between random variables
- X is the set of unobserved random variables
- Y is the set of observed random variables
- ℙ is the mapping of random variables V to their probability density function
- Ƴ is the mapping of observed random variables to their value
- ℚ is the mapping of unobserved random variables to their proposal probability distribution
- χ is the mapping of unobserved variables to their value
- Ѵ is the mapping of all random variables to their value (Ѵ = χ ⊕ Ƴ)
The function ACCEPT computes the acceptance ratio for one of the unobserved variables x in X, given the current value found in χ and the new proposed value found in χ':
- It takes one of the unobserved variables x in X, the current mapping χ of unobserved variables X to values, and the new proposed mapping χ' of unobserved variables X to values.
- Get the proposal-distribution expression for x (with the values of the unobserved variables X set to those found in the map χ), and evaluate it to a distribution d.
- Get the proposal-distribution expression for x again, this time with the values set to those found in the new map χ', and evaluate it to a distribution d'.
- Find the difference between the log probability of drawing the current value of x from the reverse-proposal distribution d' and the log probability of drawing the proposed value of x from the forward-proposal distribution d.
- Let V_x be the set of variables whose densities depend on x; accumulate into the log ratio the log density of each v ∈ V_x under the proposed values minus its log density under the current values, and return the exponential of the result.
The function GIBBS-STEP updates the values of the unobserved variables X at time t:
- It takes a mapping χ of unobserved variables to values.
- For each unobserved variable x in χ:
    - Get the proposal-distribution expression for x from the map ℚ (with the values of the unobserved variables X set to the values found in the map χ) and evaluate it to a distribution d.
    - Copy the current mapping χ of unobserved variables to values, yielding χ', which will represent the new proposal mapping.
    - Sample from the distribution d and set this as the value of the unobserved variable x in the new proposal mapping χ'.
    - Compute the acceptance ratio α by calling ACCEPT on the unobserved variable x, the current mapping χ, and the new proposal mapping χ'.
    - Sample a value u from a uniform distribution on [0, 1].
    - If u is less than the acceptance ratio α, use the new proposed value for x by keeping the new proposal mapping χ'. Otherwise, retain the current value for x by keeping the current mapping χ.
The function GIBBS computes the values of the unobserved variables χ for T steps, i.e. T states of a Markov chain:
- It takes the initial mapping χ^0 of unobserved variables to values at time t = 0, and the number of time steps T.
- For each time step t in 1 ... T, we compute the t-th set of values χ^t for the unobserved variables.
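The pseudo-code above translates fairly directly into Python. In the sketch below, log_dens, blanket, and propose are our own stand-ins for ℙ, V_x, and ℚ (not the text's exact machinery), and proposal distributions are assumed to expose sample() and log_prob():

```python
import math
import random

def gibbs_step(chi, latents, log_dens, blanket, propose):
    """One Metropolis-within-Gibbs sweep over the latent variables.

    chi      -- dict mapping latent variable names to current values
    log_dens -- log_dens[v](vals) = log p(v | parents(v)) evaluated at vals
    blanket  -- blanket[x] = V_x, the variables whose density depends on x
    propose  -- propose(x, vals) returns a proposal distribution for x
    """
    for x in latents:
        d = propose(x, chi)                   # forward proposal from Q(x)
        chi_new = dict(chi)
        chi_new[x] = d.sample()
        d_rev = propose(x, chi_new)           # reverse proposal
        log_alpha = d_rev.log_prob(chi[x]) - d.log_prob(chi_new[x])
        for v in blanket[x]:                  # only the V_x terms survive
            log_alpha += log_dens[v](chi_new) - log_dens[v](chi)
        if random.random() < math.exp(min(0.0, log_alpha)):
            chi = chi_new                     # accept; otherwise keep chi
    return chi
```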
Note: each sample
statement in our original program corresponds to sampling a parameter from the prior. The values (samples) produced from Gibbs inference are the same parameters but sampled from the posterior.
4. Evaluation-Based Inference I
In the previous chapter on Graph-Based Inference, our inference algorithms operated on a graph representation of a probabilistic model, which we created through a compilation of a program in our first-order probabilistic programming language. The construction of this graph is performed at compile time – we refer to graphs that can be constructed at compile time as having static support.
Many models exist in which the graph of conditional dependencies is dynamic, in the sense that it cannot be constructed prior to performing inference. One way these graphs arise is when the number of random variables is itself not known at compile time. These types of models are referred to as having dynamic support.
There are two basic strategies that we can employ to represent models with dynamic support.
The strategy we will use is to implement inference methods that dynamically instantiate random variables. For example, at each time step, an inference algorithm could decide whether any new objects have appeared in the field of view, and then create random variables for the positions of these objects as needed.
A particular strategy for dynamic instantiation of variables is to generate values for variables by simply running a program. We refer to such strategies as evaluation-based inference methods.
Evaluation-based methods differ from their compilation-based counterparts in that they do not require a representation of the dependency graph to be known prior to execution. Rather, the graph is either built up dynamically, or never explicitly constructed at all. This means that many evaluation-based strategies can be applied to models that can in principle instantiate an unbounded number of random variables.
One of the main things we will need to change in evaluation-based methods is how we deal with if
-expressions. In Graph-Based Inference, we realized that if
-expressions required special consideration in probabilistic programs: the question that we identified was whether lazy or eager evaluation should be used in if
expressions that contain sample
and/or observe
expressions. We showed that lazy evaluation is necessary for observe
expressions, since these expressions affect the posterior distribution on the program output as a side effect. However, for sample
expressions, we have a choice between evaluation strategies, because we can always treat variables in unused branches as auxiliary variables. Hence, we adopted an eager evaluation strategy in which both branches of the if
-expression are evaluated, but a symbolic flow control predicate determines when observe
expressions need to be incorporated into the likelihood.
The language we previously introduced was designed to ensure that programs always evaluate a bounded set of sample
and observe
expressions – because of this, programs that are written in the FOPPL can be safely eagerly evaluated. It is very easy to create a language in which this is no longer the case. For example, if we let function definitions be recursive:
defn sample-geometric [alpha]
let p = sample (bernoulli alpha)
in if (p == 1)
then 1
else (1 + sample-geometric alpha)
let alpha = sample (uniform 0 1)
k = sample-geometric alpha
in observe (poisson k) 15
alpha
Then eager evaluation of if-expressions would result in infinite recursion. The expression sample (bernoulli alpha) can in principle be evaluated an unbounded number of times, implying that the number of random variables in the graph is unbounded as well.
Although we can’t compile the above program to a static graph, it turns out that we can still perform inference in order to characterize the posterior on the program output. To do so, we rely on the fact that we can always simply run a program (using lazy evaluation for if
expressions) to generate a sample from the prior. Even though we might not be able to characterize the support of a probabilistic program, we can still generate a sample that by construction is guaranteed to be part of the support.
If we additionally keep track of the probabilities associated with each of the observe expressions evaluated in a program, then we can implement sampling algorithms that either evaluate a Metropolis-Hastings acceptance ratio, or assign an importance weight to each sample.
In this chapter, we will assume that programs are defined using the FOPPL previously described, but that a lazy evaluation strategy is used for if
-expressions.
4.1 Likelihood Weighting
The simplest evaluation-based method is likelihood weighting, which is a form of importance sampling in which the proposal distribution Q
matches the prior distribution. The idea is to weight the samples produced during inference, where the weights reflect the probability that a sample would not be rejected. This is an alternative to rejecting samples.
4.1.1 Background: Importance Sampling
Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results – during inference, these are used to approximate the posterior distribution p(X|Y)
with a set of (weighted) samples.
Importance sampling is one of these Monte Carlo techniques. The trick that importance sampling methods rely upon is that we can replace an expectation over the posterior p(X|Y)
, which is generally hard to sample from, with an expectation over a proposal distribution q(X)
, which is chosen to be easy to sample from.
- 𝔼[X] denotes the expectation of the random variable X.
- If X is a discrete random variable with outcomes x_1, x_2, ..., x_k occurring with probabilities p_1, p_2, ..., p_k, then the expectation of X is defined as 𝔼[X] = Σᵢ xᵢ pᵢ.
- If X is a continuous random variable with probability density function f(x), then the expected value is defined as 𝔼[X] = ∫ x f(x) dx, also written 𝔼_{f(X)}[X] = ∫ x f(x) dx.
Below shows how the expectation of our returned random variable r(X)
over our posterior p(X|Y)
can be replaced with the expectation of a different random variable r(X) * p(X|Y)/q(X)
over a proposal distribution q(X)
.
𝔼_p(X|Y)[r(X)] = ∫ p(X|Y)r(X) dX
= ∫ q(X) p(X|Y)/q(X) r(X) dX
= 𝔼_q(X) [p(X|Y)/q(X) * r(X)]
What is
r(X)
? Recall that in MCMC, Bayesian inference is performed by drawing samples from the posterior distribution, which is accomplished by simulating from a Markov chain whose stationary distribution is the target distributionπ
, being the posterior in this case. These samples are then used to characterize the distribution of the return valuer(X)
(i.e. the samples produced by inference).
What is Importance Sampling/Likelihood Weighting
The idea behind importance sampling is that certain values of random variables X
in a simulation have more impact on the parameter being estimated than others. If these important values are emphasized by sampling more frequently, then the estimator variance can be reduced. Hence, the basic methodology in importance sampling is to choose a “bias” distribution which encourages the important values. In order to prevent the use of the “bias” distribution resulting in a biased estimator, the simulation outputs are weighted for the correct use of the biased distribution.
Instead of creating a sample and then rejecting it, it is possible to mix sampling with inference to reason about the probability that a sample would be rejected. In importance sampling methods, each sample has a weight, and the sample average is computed using the weighted average of samples. Likelihood weighting is a form of importance sampling where the variables are sampled in the order defined by a belief network, and evidence is used to update the weights. The weights reflect the probability that a sample would not be rejected.
General Procedure of Likelihood Weighting

If we:

- Draw samples X^l ~ q(X) from the proposal distribution q(X)
- Define importance weights as w^l := p(X^l|Y)/q(X^l)

Then we can express our Monte Carlo estimate as an average over the weighted samples {(w^l, X^l)} for l = 1, ..., L:
𝔼_q(X) [p(X|Y)/q(X) * r(X)] ≃ 1/L Σ (w^l * r(X^l))
Unfortunately, we cannot calculate the importance weights w^l := p(X^l|Y)/q(X^l)
(i.e. the importance ratio p(X|Y)/q(X)
) because this requires evaluating the posterior p(X|Y)
which is what we did not know how to do in the first place. However, we can evaluate the joint distribution p(Y, X)
which allows us to define an unnormalized weight. The unnormalized weights W^l
are quantities that we can calculate directly, unlike the normalized weights w^l
.
W^l := p(Y, X^l) / q(X^l)
:= p(Y) * w^l
We still have a remaining problem: we don’t know how to calculate the normalization constant p(Y). However, we can derive an approximation to p(Y) using the same unnormalized weights W^l by considering the special case r(X) = 1:

p(Y) = 𝔼_q(X) [p(Y, X)/q(X) * 1] ≃ 1/L Σₗ W^l
Substituting w^l = W^l / p(Y), and replacing p(Y) with the estimate (1/L) Σₖ W^k, turns the weighted average into a self-normalized estimator.
To summarize, as long as we can evaluate the joint p(Y, X^l)
for a sample X^l ~ q(X)
, then we can perform importance sampling using unnormalized weights W^l
.
𝔼_q(X) [p(X|Y)/q(X) * r(X)] ≃ Σₗ (W^l / Σₖ W^k) * r(X^l)
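A sketch of the whole pipeline for a small model of our own choosing, x ~ Beta(2, 2) and y ~ Normal(x, 0.5) with y = 0.9 observed; proposing from the prior makes the unnormalized weight exactly the likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

L = 10_000
x = rng.beta(2.0, 2.0, size=L)             # X^l ~ q(X) = p(X), the prior
logW = norm.logpdf(0.9, loc=x, scale=0.5)  # log W^l = log p(y | x^l)

w = np.exp(logW - logW.max())
w /= w.sum()                               # self-normalized weights W^l / sum_k W^k

posterior_mean = np.sum(w * x)             # estimate of E_{p(x|y)}[r(X)], r(X) = X
log_p_y = np.log(np.mean(np.exp(logW)))    # estimate of log p(Y)
```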
4.1.2 Graph-based Implementation
Suppose that we compiled our program to a graphical model. We could then implement likelihood weighting using the following steps:
- For each sampled random variable
x ∈ X
, sample from the priorx^l ~ p(x | parents(x))
(i.e. its corresponding distribution). - For each observed random variable
y ∈ Y
, calculate the weightsW^l_y = p(y | parents(y)).
(i.e. the probability of observing it from its corresponding distribution). - Return the weighted set of return values
r(X^l)
.Σ (W^l/Σ(W^k)) * δ_r(X^l) where W^l := ∏ W^l_y
- Sampling from the prior for each x ∈ X is more or less trivial. The only thing we need to make sure of is that we sample all parents parents(x) before sampling x, which is to say that we need to loop over the nodes x ∈ X according to their topological order.
- The terms W^l_y can be calculated by evaluating the target language expression ℙ(y)[y := Ƴ(y)], substituting the sampled value x^l for any x ∈ parents(y).
4.1.3 Evaluation-based Implementation
The basic idea in implementing the same algorithm using an evaluation-based strategy, is that:
- We can generate samples by simply running the program. More precisely, we will sample a value
x ~ d
whenever we encounter an expressionsample d
; by definition, this will generate samples from the prior. - We can then calculate the likelihood
p(Y|X)
as a side-effect of running the program. To do so, we initialise a state variableσ
with a single entrylog W = 0
, which tracks the log of the unnormalized importance weight. Each time we encounter an expressionobserve d y
, we calculate the log likelihoodlog_p(d, y)
and update the log weight tolog W <- log W + log_p(d, y)
, ensuring thatlog W = log p(Y|X)
at the end of the execution.
To define this method more formally, let us specify what we mean by “running” the program.
Previously, we defined a program q
as:
q ::= e | (defn f [x1 ... xn] e) q
And our language contained eight expression types.
e ::= c | v | (let [v e1] e2) | (if e1 e2 e3) | (f e1 ... en)
| (c e1 ... en) | (sample e) | (observe e1 e2)
In order to “run” a FOPPL program, we will define a function that evaluates an expression e
to a value c
(as opposed to graph-based implementation that evaluates an expression e
to a graph G
and a deterministic expression E
).
global ρf
function EVAL(e, σ, ρx)
match e
case (sample d)
> algorithm specific
case (observe d y)
> algorithm specific
case c
return c, σ
case (let [v1 e1] e2)
c1, σ ← EVAL(e1, σ, ρx)
return EVAL(e2, σ, ρx[v1 → c1])
case (if e1 e2 e3)
c1, σ ← EVAL(e1, σ, ρx)
if c1
then return EVAL(e2, σ, ρx)
else return EVAL(e3, σ, ρx)
case (e0 e1 ... en)
for i in [1, ..., n] do
ci, σ ← EVAL(ei, σ, ρx)
match e0
case f
(v1, ..., vn), e0' ← ρf(f)
return EVAL(e0', σ, ρx[v1 → c1, ..., vn → cn])
case c
return (c (c1, ..., cn), σ)
The meaning of the variables used:
ρf
- a global environment for user-defined procedures.ρx
- a local environment mapping variablesv
to constantsc
.σ
- a mapping of inference state variables to the log of the unnormalized importance weightW
. This allows us to store variables needed for inference, which are computed as a side-effect of the computation.
It remains to define evaluation for sample
and observe
forms. As we described at a high-level, these evaluation rules are algorithm-dependent. For likelihood weighting, we want to draw from the prior when evaluating sample
expressions, and update the importance weight when evaluating observe
expressions. Below, we show pseudo-code for an implementation of these operations for likelihood weighting.
global ρf, e
function EVAL(e, σ, ρx)
match e
case (sample e')
d, σ ← EVAL(e', σ, ρx)
return (SAMPLE(d), σ)
case (observe e1 e2)
d1, σ ← EVAL(e1, σ, ρx)
c2, σ ← EVAL(e2, σ, ρx)
σ(log W) ← σ(log W) + LOG-PROB(d1, c2)
return c2, σ
function LIKELIHOOD-WEIGHTING(L)
    for l in 1, ..., L do
        σ ← [logW → 0]
        r^l, σ^l ← EVAL(e, σ, [])
        logW^l ← σ^l(logW)
    return ((r^1, logW^1), ..., (r^L, logW^L))
This algorithm generates a sequence of weighted samples by simply running the program repeatedly. Unlike the algorithms that were defined in the graphical inference chapter, this algorithm does not require any explicit representation of the graph of conditional dependencies between variables. This implementation of likelihood weighting does not even track how many sample
and observe
statements a program evaluates. Instead, it draws from the prior as needed, and accumulates log probabilities when evaluating observe
expressions.
Aside 1: Relationship between Evaluation and Inference Rules
In order to evaluate an expression e, we first evaluate its sub-expressions and then compute the value of the expression from the values of its sub-expressions. When defining the translation (inference) rules for graphical models, we implicitly followed the same pattern.
Inference rules do not only formally specify how our translation should behave, but also give us a recipe for how to implement a recursive TRANSLATE
operation for each expression type.
We can similarly use inference rules to define our evaluation-based likelihood weighting model.
Aside 2: Side Effects and Referential Transparency
If we do not include sample and observe in our syntax, then our FOPPL is not only deterministic, but also pure in a functional sense, i.e. there are no side effects. An implication of this is that it would be referentially transparent. Once we incorporate sample and observe into our language, our language is no longer functionally pure.
A sample
expression does not always evaluate to the same value, and is therefore referentially opaque; by extension, any expression containing a sample
form as a sub-expression is also opaque.
An observe
expression observe e1 e2
always evaluates to the same value as long as e2
is referentially transparent. However, observe
expressions have a side effect which is that they increment the log weight stored in the inference state σ(logW)
.
The distinction between referentially transparent and opaque expressions also implicitly showed up in our compilation procedure for graphical models – there, we translate an opaque program into a set of target-language expressions for conditional probabilities which were referentially transparent. In these target-language expressions, each sub-expression corresponding to sample
or observe
was replaced with a free variable v
. If a translated expression has no free variables, then the original untranslated expression is referentially transparent. Partial evaluation exploited this property to replace all target-language expressions without free variables with their values. We also relied on this property to ensure that observe
forms observe e1 e2
always contained a referentially transparent expression for the observed value e2.
4.2 Metropolis-Hastings
Previously, we used evaluation to generate samples from the program prior while calculating the likelihood associated with these samples as a side-effect of computation. We can use this same strategy to define MCMC algorithms. With Gibbs sampling, we explicitly made use of the conditional dependency graph in order to identify the minimal set of variables needed to compute the acceptance ratio.
Metropolis-Hastings methods generate a Markov chain of program return values by accepting or rejecting a newly proposed sample according to the following algorithm:
- Initialise the current sample
X
. ReturnX^1
←X
. - For each subsequent sample
s = 2, ..., S
:- Generate a proposal
X' ~ q(X' | X)
- Calculate the acceptance ratio
α = p(Y', X') q(X | X') / (p(Y, X) q(X' | X))
- Update the current sample X ← X' with probability min(1, α); otherwise keep X. Return X^s ← X.
- Generate a proposal
An evaluation-based MH
sampler needs to do two things.
- It needs to be able to run the program to generate a proposal, conditioned on the values
Ⲭ
ofsample
expressions that were previously evaluated. - It needs to be able to compute the acceptance ratio
α
as a side effect.
Let's consider a simplified version of this algorithm: independent proposals from the prior.
Suppose that we defined q(X'|X) = p(X')
, in other words, the probability of the proposed latent variable X'
given the previous latent variable X
according to the proposal distribution q
is equal to the probability of the proposed latent variable X'
according to the prior distribution. In other words, at each step we generate a proposal sample X' ~ p(X)
from the program prior, which is independent of the previous sample X
.
The acceptance ratio simplifies to:
α = p(Y' | X') / p(Y | X)
In other words, when we propose X'
from the prior, the acceptance ratio is simply the ratio of the likelihoods. Since our likelihood weighting algorithm computes σ(log W) = log p(Y|X)
as a side effect, we can re-use the evaluator previously implemented, and simply evaluate the acceptance ratio as p(Y'|X')/p(Y|X)
, where p(Y'|X')
is the likelihood of the proposal and p(Y|X)
is the likelihood of the previous sample.
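For the same assumed beta-normal model as in the likelihood-weighting sketch above, independent proposals from the prior give exactly this likelihood-ratio acceptance rule:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_likelihood(x):
    return norm.logpdf(0.9, loc=x, scale=0.5)  # log p(y | x), y = 0.9 observed

x = rng.beta(2.0, 2.0)                         # initialize from the prior
chain = []
for s in range(10_000):
    x_prop = rng.beta(2.0, 2.0)                # X' ~ p(X), independent of X
    log_alpha = log_likelihood(x_prop) - log_likelihood(x)
    if rng.uniform() < np.exp(min(0.0, log_alpha)):
        x = x_prop
    chain.append(x)
```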
4.2.1 Single-site Proposals
The previous algorithm was so simple because we have side-stepped the difficult operations in the more general MH algorithm, and assumed that q(X'|X) = p(X')
.
In order to generate a proposal, we have to:
- Run our program in a manner that generates a sample
X' ~ q(X'|X)
which is conditioned on the values associated with our previous sample. - Calculate the probability of the reverse proposal
q(X|X')
in order to evaluate the acceptance ratio.
Both of these operations are complicated by the fact that X
and X'
potentially refer to different subsets of sample expressions.
let z = sample ( bernoulli 0.5)
µ = if (z == 0)
then sample (normal -1.0 1.0)
else sample (normal 1.0 1.0)
d = normal µ 1.0
y = 0.5
in observe d y
z
With graphical models, we would compile this model to a Bayesian network containing:
- X = {z, µ0, µ1}: three latent variables
- Y = {y}: one observed variable
If we now evaluate if
-expressions lazily in evaluation-based inference, this means that we will either sample µ0
when z == 0
or µ1
when z == 1
, but not both. This introduces a complication: what happens if we update z = 0
to z = 1
in the proposal? This now implies that X
contains a variable µ0
which is not defined for X'
, so q(X'|X)
cannot be evaluated. Conversely, X'
needs to instantiate a value for the variable µ1
which was not defined for X
.
In order to define an evaluation-based algorithm for constructing a proposal, we will need to construct a map σ(Ⲭ)
, such that Ⲭ(x)
refers to the value of a latent variable x
.
In order to calculate the acceptance ratio, we will also need to construct a map σ(logℙ)
. With graphical-based inference, we used a map ℙ(v)
that evaluates the density for each random variable v ∈ X ∪ Y
. In our evaluation-based algorithm, we will use σ(logℙ(v))
to store the log density σ(logℙ(x)) = LOG-PROB(d, Ⲭ(x))
for each sample expression (sample d)
and to store the log density σ(logℙ(y)) = LOG-PROB(d, c)
for each observe expression (observe d c)
.
Let's now define the most commonly used evaluation-based proposal for probabilistic programming systems: the single-site Metropolis-Hastings update.
In this algorithm, we change the value of one variable x0, keeping the values of other variables fixed whenever possible. To do so, we sample x0 from the program prior (as well as any variables which aren’t yet in the current Ⲭ, i.e. haven’t been instantiated). For all other latent variables, we reuse the values Ⲭ(x).
This strategy can be summarized in the following algorithm:

- Pick a variable x0 ∈ Ⲭ at random from the current sample.
- Construct a proposal Ⲭ', ℙ' by rerunning the program:
    - For expressions sample d with variable x:
        - If x = x0 or x ∉ dom(Ⲭ), then sample Ⲭ'(x) ~ d. Otherwise, reuse the value Ⲭ'(x) ← Ⲭ(x).
        - Calculate the probability ℙ'(x) ← PROB(d, Ⲭ'(x)).
    - For expressions observe d c with variable y:
        - Calculate the probability ℙ'(y) ← PROB(d, c).
The acceptance ratio is then given by:

α = p(Y', X') q(X | X') / (p(Y, X) q(X' | X))
  = [p(Y', X') / q(X' | X, x0)] * [q(X | X', x0) / p(Y, X)] * [q(x0 | X') / q(x0 | X)]
- The ratio q(x0 | X') / q(x0 | X) accounts for the relative probability of selecting the initial site. Since x0 is chosen uniformly at random, this ratio is the size of the set X divided by the size of the set X':

q(x0 | X') / q(x0 | X) = |X| / |X'|
- The joint probability p(Y', X') is simply the product of the probabilities of the latent and observed variables found in ℙ':

p(Y', X') = ∏_{y∈Y'} ℙ'(y) ∏_{x∈X'} ℙ'(x)

where X' = dom(Ⲭ') and Y' = dom(ℙ') \ X'.
- To calculate the probability q(X' | X, x0), we decompose the set of variables X' = X'sampled ∪ X'reused into the set of sampled variables X'sampled and the set of reused variables X'reused. The set of sampled variables is given by X'sampled = {x0} ∪ (dom(Ⲭ') \ dom(Ⲭ)), i.e. the randomly chosen variable x0 that we resample, and any latent variables in Ⲭ' that aren't in the domain of Ⲭ. Conversely, Xsampled = {x0} ∪ (dom(Ⲭ) \ dom(Ⲭ')).
Since all variables in X'sampled were sampled from the program prior, the proposal probability is:

q(X' | X, x0) = ∏_{x ∈ X'sampled} ℙ'(x)
- The ratio p(Y', X') / q(X' | X, x0) simplifies to:

p(Y', X') / q(X' | X, x0) = [∏_{y∈Y'} ℙ'(y) ∏_{x∈X'} ℙ'(x)] / ∏_{x∈X'sampled} ℙ'(x)
                          = ∏_{y∈Y'} ℙ'(y) ∏_{x∈X'reused} ℙ'(x)
The ratio p(Y, X) / q(X | X', x0) is similarly:

p(Y, X) / q(X | X', x0) = [∏_{y∈Y} ℙ(y) ∏_{x∈X} ℙ(x)] / ∏_{x∈Xsampled} ℙ(x)
                        = ∏_{y∈Y} ℙ(y) ∏_{x∈Xreused} ℙ(x)
The set of reused variables Xreused is identical to X'reused:

X'reused = (dom(Ⲭ') ∩ dom(Ⲭ)) \ {x0} = Xreused
- Putting all the terms together, the acceptance ratio becomes:

α = min {1, [|dom(Ⲭ)| ∏_{y∈Y'} ℙ'(y) ∏_{x∈X'reused} ℙ'(x)] / [|dom(Ⲭ')| ∏_{y∈Y} ℙ(y) ∏_{x∈Xreused} ℙ(x)]}
Addressing Transformation
In defining the acceptance ratio, we have assumed that we can associate a variable x
or y
with each sample
or observe
expression. We did this in our compilation of a graphical model. In the context of evaluation-based methods, this type of unique identifier for a sample
/observe
expression is commonly referred to as an address.
If needed, unique addresses can be constructed dynamically at run-time. For programs in the FOPPL, we can create addresses using a source code transformation that is similar to our compilation in graph-based inference. In this transformation, we replace each expression (sample e) with (sample v e), in which v is a newly created variable. Similarly, we replace (observe e1 e2) with (observe v e1 e2).
4.3 Sequential Monte Carlo
SMC:
sample(..)
sample(..)
observe(..) -- y_0
sample(..)
sample(..)
sample(..)
observe(..) -- y_1
sample(..)
...
observe(..) -- y_n
We create N truncations of the program, one at each observe expression yₙ.
For n = 1 ... N
:
For l = 1 ... L
:
- We run the program L
times to generate L
sets of samples up to the n
th observe expression y_n
, (For sample expressions preceding y_n-1
, we reuse sample values from the k
th particle set of the previous run n-1
, where k
is selected from a discrete distribution weighted by each of the L
particle sets' log probability. For sample expressions between y_n-1
and y_n
, we sample new values from the prior.). This creates {(Xˡₙ)}ᴸₗ₌₁
.
- We compute the log probability at the statement y_n
, which is logΛˡₙ - logΛᵏₙ₋₁
(the total log probability up to observe state y_n
minus the total log probability up to observe statement y_n-1
, reusing the sampled values from the k
th particle set up to y_n-1
).
- The importance weight logWˡₙ
is then calculated as logWˡₙ = logΛˡₙ - logΛᵏₙ₋₁ + logẐₙ₋₁
, where logẐₙ₋₁
is log (1/L Σₗ Wˡₙ₋₁)
(the log mean of the weights of the previous particle sets). This creates {(Wˡₙ)}ᴸₗ₌₁
.
- This also gives a running estimate of the log normalizing constant up to observe statement y_n: logẐₙ = log (1/L Σₗ Wˡₙ), i.e. the log mean weight of the n-th particle set (the Ẑₙ₋₁ factor is already folded into the weights Wˡₙ).
Previously, in Metropolis-Hastings, we used a likelihood-weighting algorithm where in each run of the program we guess by sampling an X^l from the program prior, and then check whether this is in fact a good proposal by calculating a weight W^l = p(Y|X^l) according to the probabilities of the observe expressions in the program. Unfortunately, this is not necessarily efficient. In order to end up with highly weighted samples, we have to generate reasonable values for all random variables X. Even if we are always improving our proposed sample at each program run, it may take a very long time to end up with samples that are weighted highly enough to consider them good, because the frequency with which we generate good proposals decreases exponentially with the number of sample statements.
Sequential Monte Carlo (SMC) methods solve this problem by turning a sampling problem for high dimensional distributions into a sequence of sampling problems for lower dimensional distributions. In their most general form, SMC methods consider a sequence of N
unnormalized densities γ1(X1), ..., γN(XN)
.
Recall that a FOPPL program defines an unnormalized density:

γ(X) = ψ(X)p(X)
     = p(Y, X)

where ψ(X) is a strictly positive potential which replaces the likelihood p(Y|X).

Here, γ1(X1) is typically a low-dimensional distribution for which it is easy to perform importance sampling, whereas γN(XN) is a high-dimensional distribution, for which we want to generate samples. Each γn(Xn) in between increases in dimensionality to interpolate between these two distributions. For a FOPPL program, we can define γN(XN) (the density from which we want to generate samples) as γ(X) = p(Y, X), i.e. the joint density associated with the program.
Given a set of unnormalized densities γn(Xn)
, SMC sequentially generates L
weighted samples:
{(Xˡₙ, Wˡₙ)}
by performing importance sampling for each of the N
normalized densities:
πₙ(Xₙ) = γₙ(Xₙ)/Zₙ
To do this, we:
-
Initialise a weighted set of particles
{(Xˡₙ, Wˡₙ)}ᴸₗ₌₁
using importance sampling:For
l = 1, ..., L
whereL
is the number of particles:Xˡ₁ ~ q₁(X₁) Wˡ₁ := γ₁(Xˡ₁)/q₁(Xˡ₁)
where the weight is the unnormalized density divided by the proposal density.
-
For each subsequent generation
n = 2, ..., N
:For
l = 1, ..., L
whereL
is the number of particles:- Resampling Step: Select a particle
Xᵏₙ₋₁
from the preceding set, where the indexk
is found by sampling an ancestor indexk = aˡₙ₋₁
with probability proportional toWᵏₙ₋₁
. In other words, given the previous set of weighted samples{(Xˡₙ₋₁, Wˡₙ₋₁)}ᴸₗ₌₁
, select a sampleXᵏₙ₋₁
from the set with probability according to its weight in the set.
Xᵏₙ₋₁, where k = aˡₙ₋₁ ~ Discrete (W¹ₙ₋₁ / Σₗ Wˡₙ₋₁ , . . . , Wᴸₙ₋₁ / Σₗ Wˡₙ₋₁ )
- Construct New Proposals: Generate a proposal
Xˡₙ
conditioned on the selected particleXᵏₙ₋₁
from the preceding set:
Xˡₙ ~ qₙ(Xˡₙ | Xᵏₙ₋₁)
And define the importance weights:

Wˡₙ = [γₙ(Xˡₙ) / ( γₙ₋₁(Xᵏₙ₋₁) * qₙ(Xˡₙ | Xᵏₙ₋₁) )] * Ẑₙ₋₁
γₙ(Xˡₙ)
is the unnormalized density of the proposal Xˡₙ.γₙ₋₁(Xᵏₙ₋₁)
is the unnormalized density of the selected particleXᵏₙ₋₁
from the preceding set.qₙ(Xˡₙ | Xᵏₙ₋₁)
is the proposal density of the proposalXˡₙ
conditioned on the selected particleXᵏₙ₋₁
from the preceding set.Ẑₙ₋₁
is the average weight of the preceding set of particles,1/L ΣWˡₙ₋₁
- Resampling Step: Select a particle
We can think of a resampling step as performing natural selection on the particle set: particles Xᵏₙ₋₁ with a high weight Wᵏₙ₋₁ will be used more often to construct proposals. In other words, SMC uses the weight of a sample at generation n - 1 as a heuristic for the weight that proposals constructed from it will attain at generation n.
4.3.1 Defining Intermediate Densities with Breakpoints
A FOPPL program defines an unnormalized distribution γ(X) = p(Y, X)
.
When inference is performed with SMC, we define the final density as γN (XN) = γ(X)
.
In order to define intermediate densities γₙ(Xₙ) = p(Yₙ, Xₙ)
, we consider a sequence of truncated programs that evaluate successively larger subsets of sample
and observe
expressions.
X1 ⊆ X2 ⊆ . . . ⊆ XN = X
Y1 ⊆ Y2 ⊆ . . . ⊆ YN = Y
A truncated program is a program that halts at a breakpoint. Breakpoints can be specified explicitly by the user, constructed using program analysis, or even defined dynamically at run-time. The sequence of breakpoints needs to satisfy the following two properties:
- The breakpoint for generation
n
must always occur after the breakpoint for generationn - 1
. - Each breakpoint needs to occur at an expression that is evaluated in every execution of a program. This means that breakpoints should not be associated with expressions inside branches of
if
-expressions.
In this section, we will assume that we first apply the addressing transformation (assigning fresh variables to sample and observe statements) to the FOPPL program during desugaring. We then assume that the user identifies a sequence of symbols y₁, ..., y_{N-1} for observe expressions that satisfy the two properties above.
4.3.2 Calculating the Importance Weight
So far we have defined a notion of intermediate densities γₙ(Xₙ)
for FOPPL programs.
We now need to specify a mechanism for generating proposals from a distribution qₙ(Xₙ|Xₙ₋₁). The SMC analogue of likelihood weighting is to simply sample from the program prior p(Xₙ|Xₙ₋₁), which is sometimes known as a bootstrap proposal. Hence, we set:

qₙ(Xₙ|Xₙ₋₁) = p(Xₙ|Xₙ₋₁)

so that proposing Xₙ ~ qₙ(Xₙ|Xₙ₋₁) amounts to sampling Xₙ ~ p(Xₙ|Xₙ₋₁) from the prior.
We know that the unnormalized density is γₙ(Xₙ) = p(Yₙ, Xₙ)
. For this proposal, we can express the unnormalized density γₙ(Xₙ)
in terms of γₙ₋₁(Xₙ₋₁)
:
γₙ(Xₙ) = p(Yₙ, Xₙ)
= p(Yₙ | Yₙ₋₁, Xₙ) * p(Xₙ | Xₙ₋₁) * p(Yₙ₋₁, Xₙ₋₁)
= p(Yₙ | Yₙ₋₁, Xₙ) * p(Xₙ | Xₙ₋₁) * γₙ₋₁(Xₙ₋₁)
If we substitute this expression into the equation for calculating the importance weights for a new proposal for particle l
at particle set n
:
Wˡₙ = [γₙ(Xˡₙ) / ( γₙ₋₁(Xᵏₙ₋₁) * qₙ(Xˡₙ | Xᵏₙ₋₁) )] * Ẑₙ₋₁
    = [p(Yₙ | Xˡₙ) / p(Yₙ₋₁ | Xᵏₙ₋₁)] * Ẑₙ₋₁
    = Ẑₙ₋₁ * ∏_{y ∈ Yₙ\Yₙ₋₁} p(y | Xˡₙ)
In other words, up to the factor Ẑₙ₋₁, the importance weight at each generation is the joint probability of the observe expressions that have been evaluated at breakpoint n but not at breakpoint n - 1.
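A sketch of these updates as a bootstrap particle filter for an assumed linear-Gaussian state-space model, x_n ~ N(x_{n-1}, 1) and y_n ~ N(x_n, 1); because the weights are normalized just before resampling, the common Ẑₙ₋₁ factor is left implicit here:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

ys = [0.5, 1.2, 0.3, 2.1]             # assumed observations y_1 ... y_N
L = 1_000                             # number of particles

X = rng.normal(0.0, 1.0, size=L)      # initial particles from the prior
logW = norm.logpdf(ys[0], loc=X, scale=1.0)
for y in ys[1:]:
    w = np.exp(logW - logW.max())
    w /= w.sum()
    k = rng.choice(L, size=L, p=w)    # resample ancestor indices a^l
    X = rng.normal(X[k], 1.0)         # bootstrap proposal from the prior
    logW = norm.logpdf(y, loc=X, scale=1.0)  # incremental weight: the new observe
```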
4.3.3 Evaluating Proposals
To implement SMC, we will introduce a function PROPOSE(Ⲭₙ₋₁, yₙ).
This function:
- Evaluates the program that truncates at the
observe
expression with addressyₙ
, conditioned on previously sampled valuesⲬₙ₋₁
- Returns a pair (Ⲭₙ, logΛₙ) containing a map Ⲭₙ of values associated with each sample expression, and the log likelihood logΛₙ = log p(Yₙ|Xₙ).

To construct the proposal for the final generation, we call PROPOSE(Ⲭ_{N-1}, nil), which returns a pair (r, logΛ) in which the return value r replaces the values Ⲭ.
Below we define the PROPOSE function and its evaluator.
global ρ, e

function EVAL(e, σ, ρx)
    match e
        case (sample v e')
            d, σ ← EVAL(e', σ, ρx)
            if v ∉ dom(σ(Ⲭ)) then
                σ(Ⲭ(v)) ← SAMPLE(d)
            return σ(Ⲭ(v)), σ
        case (observe v e1 e2)
            d, σ ← EVAL(e1, σ, ρx)
            c, σ ← EVAL(e2, σ, ρx)
            σ(logΛ) ← σ(logΛ) + LOG-PROB(d, c)
            if v = σ(yₙ)
                then throw RESAMPLE-BREAKPOINT()
            return c, σ

function PROPOSE(Ⲭ, y)
    σ ← [yₙ → y, Ⲭ → Ⲭ, logΛ → 0]
    try
        r, σ ← EVAL(e, σ, [])
        return r, σ(logΛ)
    catch RESAMPLE-BREAKPOINT()
        return σ(Ⲭ), σ(logΛ)
- For
sample
expressions, we use previously sampled valuesⲬ(v)
for previously sampled variablesv
and sample from the prior for new variablesv
- For observe expressions, we accumulate log probability into a state variable logΛ. When we reach the observe expression with the specified symbol yₙ, we terminate the program by throwing a special-purpose RESAMPLE-BREAKPOINT exception.
- In
propose
, there is an implicit indexn
for1 ... n
attached toyn
. Forn < N
(N
being the number of unnormalized densities in the sequence of unnormalized densitiesγ1(X1), ... , γN(XN)
), we evaluate a truncated form of the program up to then
‘thobserve
expression, and simply return the sampled variablesⲬ
up until that point. Forn = N
, we finally return the valuer
outputted from the entire program (which represents a sample from the final unnormalized densityγN(XN)
.
Below is the Sequential Monte Carlo algorithm, which produces the weighted particle sets {(Xˡₙ, Wˡₙ)}:

- l in 1 ... L is the index of a particle within a particle set.
- n in 1 ... N is the index of the particle set, corresponding to the n-th unnormalized density in the sequence of truncated unnormalized densities.
function SMC(L, y₁, ..., y_{N-1})
-- Set the log average weight to 0
logẐ₀ ← 0
-- Generate the initial set of L particles and their weights for n = 1
for l in 1, ..., L
Ⲭˡ₁, logΛˡ₁ ← PROPOSE([], y₁)
logWˡ₁ ← logΛˡ₁
-- For n = 2, ... N-1, generate a weighted set {(Xˡₙ, Wˡₙ)} for l = 1, ..., L
-- When having reached n = N, generate a final weighted set {(rˡ, Wˡ)} for l = 1, ..., L
for n in 2, ..., N
logẐₙ₋₁ ← LOG-MEAN-EXP(logW¹ₙ₋₁, ..., logWᴸₙ₋₁)
-- For l = 1, ..., L
for l in 1, ..., L
k ~ Discrete (W¹ₙ₋₁ / Σₗ Wˡₙ₋₁ , . . . , Wᴸₙ₋₁ / Σₗ Wˡₙ₋₁ )
if n < N
(Ⲭˡₙ, logΛˡₙ) ← PROPOSE(Ⲭᵏₙ₋₁, yₙ)
else
(rˡ, logΛˡ_N) ← PROPOSE(Ⲭᵏ_{N-1}, nil)
logWˡₙ ← logΛˡₙ - logΛᵏₙ₋₁ + logẐₙ₋₁
return ((r¹, logW¹_N), ..., (r^L, logW^L_N))
5. Probabilistic Programming with Recursion
The use of infinitely many random variables arises with the introduction of stochastic recursion.
6. Evaluation-Based Inference with HOPPL
The inference algorithms in this chapter use program evaluation as one of their core subroutines; however, to more clearly illustrate how evaluation-based inference can be implemented by extending existing languages, we abandon the definition of inference algorithms in terms of evaluators, and instead favor a more language-agnostic formulation.
We hence define inference methods as non-standard schedulers of HOPPL programs. The guiding intuition in this formulation is that the majority of operations in HOPPL programs are deterministic and referentially transparent with the exception of sample
and observe
which are stochastic and have side-effects. In the evaluators we previously defined for the FOPPL, this is reflected in the fact that only sample
and observe
expressions are algorithm specific.
The abstraction we employ is that of a program as a deterministic computation that can be interrupted at sample and observe expressions. Here, the program cedes control to an inference controller, which implements probabilistic and stochastic operations in an algorithm-specific manner.
6.1 Explicit Separation of Model and Inference Code
We assume a probabilistic program is a deterministic computation that is interrupted at every sample
and observe
expression. Inference is carried out using a controller process. The controller needs to be able to start executions of a program, receive the return value when an execution terminates, and finally control program execution at each sample
and observe
expression.
The inference controller interacts with program executions via a messaging protocol. When a program reaches a sample
/observe
expression, it sends a message back to the controller and waits for a response. This message will typically include an address for the random variable, and a representation of the fully evaluated arguments to sample
and observe
. The controller then performs any operations necessary for inference, and sends a message back to the running program. The message indicates whether the program should continue execution, fork itself and execute multiple times, or halt.
This interface defines an abstraction boundary between program execution and inference.
Example: Likelihood Weighting
In the evaluation-based implementation of likelihood weighting for the FOPPL, we evaluated sample expressions by drawing from the prior, and incremented the log importance weight at every observe expression.
The inference controller for this inference strategy would repeat the following operations:
- The controller starts a new execution of the HOPPL program and initialises its log weight
logW = 0.0
. - The controller repeatedly receives messages from the running program, and dispatches based on type:
- At a
sample d
form, the controller samplesx
from the distributiond
and sends the sampled valuex
back to the program to continue execution. - At an
observe d c
form, the controller incrementslogW
with the log probability ofc
underd
and sends a message to continue execution. - If the program has terminated with value
c
, the controller stores a weighted sample(c, logW)
and exits the loop.
- At a
Messaging Interface
In the above inference algorithm (likelihood weighting), the response which the inference controller sends back is always to continue execution.
To support algorithms such as SMC, the program execution process will additionally need to implement a forking
operation, which starts multiple independent processes that each resume from the same point in the execution.
To support these operations, we will define an interface in which an inference process can send four messages to the execution process.
(start, σ)
: Start a new execution with process idσ
.(continue, σ, c)
: Continue execution for the process with idσ
, usingc
as the argument value.(fork, σ, σ', c)
: Fork the process with idσ
into a new process with idσ'
and continue execution with argumentc
.(kill, σ)
: Terminate the process with idσ
.
We assume that the program execution process can send three types of messages to the inference controller:
(sample, σ, α, d)
: The execution with idσ
has reached asample
expression with addressα
and distributiond
.(observe, σ, α, d, c)
: The execution with idσ
has reached anobserve
expression with addressα
, distributiond
, and valuec
.(return, σ, c)
: The execution with idσ
has terminated with return valuec
.
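A sketch of this protocol using Python generators as interruptible executions; the message shapes, addresses, and model below are illustrative rather than the text's exact interface. The program yields sample/observe messages, and a likelihood-weighting controller replies to each one:

```python
import math
import random

def program():
    # The execution process: yields messages and receives replies.
    x = yield ('sample', 'addr_x', ('normal', 0.0, 1.0))
    yield ('observe', 'addr_y', ('normal', x, 1.0), 0.5)
    return x

def normal_logpdf(v, mu, sigma):
    return (-0.5 * ((v - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def run_likelihood_weighting():
    proc, log_w, reply = program(), 0.0, None
    try:
        while True:
            msg = proc.send(reply)                    # receive the next message
            if msg[0] == 'sample':
                _, _, (_, mu, sigma) = msg
                reply = random.gauss(mu, sigma)       # draw from the prior, continue
            elif msg[0] == 'observe':
                _, _, (_, mu, sigma), c = msg
                log_w += normal_logpdf(c, mu, sigma)  # increment logW, continue
                reply = None
    except StopIteration as ret:                      # the terminated/return message
        return ret.value, log_w
```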
Implementations of Interruption and Forking
To implement this interface, program execution needs to support interruption, resuming, and forking.
- Interruption is relatively straightforward. In the case of the HOPPL, we will assume two primitives
(send µ)
and(receive σ)
. At everysample
andobserve
, we send a messageµ
to the inference process, and then receive a response with process idσ
. The call toreceive
then effectively pauses the execution until a response arrives. - Forking can be implemented in multiple ways.
- Previously in the FOPPL, we wrote evaluators that could be conditioned on a trace of random values to re-execute a program in a deterministic manner (i.e. reusing recently sampled values). This strategy can also be used to implement forking; we could simply re-execute the program from the start, conditioning on values of
sample
expressions that were already evaluated in the parent execution. This implementation is not particularly efficient, since it requires that we re-execute the program once for everyobserve
in the program, resulting in a computational cost that is quadratic in the number of observe expressions, rather than linear.
- Forking becomes much more straightforward when we restrict the modeling language to prohibit mutable state. We have exactly two stateful operations:
sample
andobserve
: all other operations are guaranteed to have no side effects. In languages without mutable state, there is no need to copy the memory associated with a process during forking, since a variable cannot be modified in place once it h as been defined.
- Previously in the FOPPL, we wrote evaluators that could be conditioned on a trace of random values to re-execute a program in a deterministic manner (i.e. reusing recently sampled values). This strategy can also be used to implement forking; we could simply re-execute the program from the start, conditioning on values of
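To illustrate the first strategy, here is a hedged Python sketch of forking by re-execution, reusing the generator convention from the earlier sketch. The parent's recorded trace of sampled values is replayed from the start; once the trace is exhausted, the child execution diverges:

```python
def fork_by_reexecution(run_program, trace):
    """Fork a paused execution by re-running the program from scratch.

    `trace` holds the values drawn at sample expressions by the parent
    execution, in order. Replaying them makes the re-execution
    deterministic up to the fork point; past it, fresh values are drawn.
    """
    execution = run_program()
    replayed, logW, reply = 0, 0.0, None
    while True:
        try:
            msg = execution.send(reply)
        except StopIteration as stop:
            return stop.value, logW
        if msg[0] == "sample":
            _, d = msg
            if replayed < len(trace):
                reply = trace[replayed]   # reuse the parent's value
                replayed += 1
            else:
                reply = d.sample()        # past the fork point: sample fresh
        elif msg[0] == "observe":
            _, d, c = msg
            logW += d.log_prob(c)
            reply = None
```

Because every fork re-runs the program from the beginning, forking at each of a program's N observe points costs O(N^2) total work, which is the inefficiency noted above.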
In the HOPPL, we will implement support for interruption and forking of program executions by way of a transformation to continuation-passing style (CPS), a standard technique for supporting interruption of programs in purely functional languages.
We now describe two source code transformations for the HOPPL. The first is an addressing transformation, which associates a unique address with the messages that need to be sent at each `sample` and `observe` expression. The second converts the HOPPL program to CPS. Unlike the graph compiler and the custom evaluators, these code transformations take HOPPL programs as input and produce HOPPL programs as output; they do not change the language.
6.2 Addressing Transformation
An addressing transformation rewrites the source code of the program into a new program that performs the original computation whilst keeping track of an address, a representation of the current execution point of the program.
This address uniquely identifies any sample
or observe
expression that can be reached in the course of an execution of a program. Since HOPPL can evaluate an unbounded number of these expressions, the transformation used previously to introduce addresses is not applicable here, since that transformation inlines the bodies of all function applications to create an exhaustive list of sample
and observe
statements.
The most familiar notion of an address is a stack trace, which is encountered whenever debugging a program that has prematurely terminated. In functional programming languages, a stack trace effectively provides a unique identifier for the current location in the program execution. This allows us to associate a unique address with each sample
and observe
expression at run-time, which we can then use in our implementations of inference methods.
The addressing transformation follows this design:
- All function calls, `sample` statements, and `observe` statements are modified to take an additional argument which provides the current address. We will use the symbol `α` to refer to the address argument, which must be a fresh variable.
The addressing transformation is the relation

e, α ⇓addr e'

which translates a HOPPL expression `e` and a variable `α` to a new expression which incorporates addresses.
We additionally define a second relation ↓addr that operates on the top-level HOPPL program `q`. This second relation serves to define the top-level outer address, that is, the base of the stack trace.
Variables, procedure names, constants, if
Addresses track the call stack. Hence evaluation of expressions that do not increase the depth of the call stack leaves the address unaffected. This includes constants `c`, variables `v`, and procedure names `f`.

c is a constant value
─────────────────────    ─────────────    ─────────────
    c, α ⇓addr c         v, α ⇓addr v     f, α ⇓addr f
Primitive procedures are transformed to accept an address argument. Since we are not able to “step into” primitive procedure calls, these calls do not increase the depth of the call stack. This means that the address argument to primitive procedure calls can be ignored in the function body.
c is a primitive function with n arguments
────────────────────────────────────────────
c, α ⇓addr (fn [α v1 ... vn] (c v1 ... vn))
User-defined functions are also transformed to accept an address argument, which may be referenced in the function body. The translated expression of the function body, e'
, may contain a free variable α
, which must be a unique symbol that cannot occur anywhere in the original expression e
.
e, α ⇓addr e'
──────────────────────────────────────────────────
(fn [v1 ... vn] e), α ⇓addr (fn [α v1 ... vn] e')
Evaluation of `if` expressions also does not modify the address in our implementation, so we only need to translate each of the sub-expressions.
e1, α ⇓addr e1' e2, α ⇓addr e2' e3, α ⇓addr e3'
──────────────────────────────────────────────────
(if e1 e2 e3), α ⇓addr (if e1' e2' e3')
Functions, sample, and observe
So far, we have simply threaded an address through the entire program, but this address has not been modified in any of the expression forms above. We increase the depth of the call stack at every function call:
ei, α ⇓addr ei' for i = 0 ... n c <- fresh value
───────────────────────────────────────────────────────────────
(e0 e1 ... en), α ⇓addr (e0' (push-addr α c) e1' ... en')
Here, we:
- Translate the expression `e0`, which yields a transformed function `e0'` that now accepts an address argument.
- Update the address argument `α` to reflect that we are now nested one level deeper in the call stack. To do so, we assume a primitive `(push-addr α c)` which creates a new address by combining the current address `α` with some unique identifier `c` generated at translation time. The translated expression will contain a new free variable `α`, since this variable is unbound in the expression `(push-addr α c)`. We will bind `α` to a top-level address using the `↓addr` relation.
If we take the stack trace metaphor literally, then we can think of `α` as a list-like data structure, and `push-addr` as an operation that appends a new unique identifier `c` to the end of this list. Alternatively, `push-addr` could compute some sort of hash of `α` and `c` to yield an address of constant size, regardless of recursion depth.
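Both variants are straightforward to sketch. The following Python implementations are illustrative only; the names and address encodings are not part of the HOPPL:

```python
import hashlib

def push_addr_list(alpha, c):
    """Stack-trace style: append the fresh identifier to a list.
    Addresses grow with call depth but remain human-readable."""
    return alpha + [c]

def push_addr_hash(alpha, c):
    """Constant-size style: hash the current address together with the
    identifier, so addresses stay small regardless of recursion depth."""
    return hashlib.sha256("{}/{}".format(alpha, c).encode()).hexdigest()[:16]

# e.g. push_addr_list(["run", "c0"], "c3")  ->  ["run", "c0", "c3"]
```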
The translation rules for `sample` and `observe` can be thought of as special cases of the rule for general function application.

e, α ⇓addr e'    c <- fresh
───────────────────────────────────────────────
(sample e), α ⇓addr (sample (push-addr α c) e')

e1, α ⇓addr e1'    e2, α ⇓addr e2'    c <- fresh
──────────────────────────────────────────────────────────
(observe e1 e2), α ⇓addr (observe (push-addr α c) e1' e2')
The result of this translation is that each sample
and observe
expression will now have a unique address associated with it. These are constructed dynamically at run time, but are well-defined in the sense that each sample
or observe
will have an address that is fully determined by its call stack.
Top-level addresses and program translation
Translation of function calls introduces an unbound variable `α` into the expression. To associate a top-level address with a program execution, we define a relation that translates the program body `e` and wraps it in a function which accepts an address argument.

Choose a fresh variable α    e, α ⇓addr e'
───────────────────────────────────────────
e ↓addr (fn [α] e')
For programs which include functions that are user-defined at the top level, this relation also inserts the additional address argument into each of the function definitions.
Choose a fresh variable α e, α ⇓addr e' q ↓addr q'
───────────────────────────────────────────────────────────────
(defn f [v1 ... vn] e) q ↓addr (defn f [α v1 ... vn] e') q'
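For concreteness, the rules above can be sketched in Python over HOPPL expressions encoded as nested lists. The encoding, the `fresh` helper, and the omission of the primitive-wrapping rule are our own simplifications:

```python
import itertools

_counter = itertools.count()

def fresh():
    """A unique identifier generated at translation time."""
    return "c{}".format(next(_counter))

def addr_transform(e, alpha="alpha"):
    """Addressing transformation over expressions such as
    ["sample", ["beta", 1, 1]]; desugared HOPPL only."""
    if not isinstance(e, list):          # constants, variables, names
        return e
    if e[0] == "fn":                     # (fn [v...] e)
        _, params, body = e
        return ["fn", [alpha] + params, addr_transform(body, alpha)]
    if e[0] == "if":                     # (if e1 e2 e3)
        return ["if"] + [addr_transform(ei, alpha) for ei in e[1:]]
    if e[0] == "sample":                 # (sample e)
        return ["sample", ["push-addr", alpha, fresh()],
                addr_transform(e[1], alpha)]
    if e[0] == "observe":                # (observe e1 e2)
        return ["observe", ["push-addr", alpha, fresh()],
                addr_transform(e[1], alpha), addr_transform(e[2], alpha)]
    # general application: (e0 e1 ... en) -> (e0' (push-addr α c) e1' ... en')
    return [addr_transform(e[0], alpha), ["push-addr", alpha, fresh()]] + \
           [addr_transform(ei, alpha) for ei in e[1:]]

# ["sample", ["beta", 1, 1]] becomes
# ["sample", ["push-addr", "alpha", "c0"],
#            ["beta", ["push-addr", "alpha", "c1"], 1, 1]]
```

Wrapping the result as `["fn", ["alpha"], ...]` corresponds to the ↓addr rule, which binds the otherwise free address variable.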
6.3 Continuation-Passing-Style Transformation
Now that each function call in the program has been augmented with an address that tracks the location in the program execution, we next need to transform the program into CPS style so that we can pause, resume, and potentially fork it multiple times if needed.
We define the ⇓꜀ relation:
e, κ, σ ⇓꜀ e'
Here, `e` is an expression, `κ` is a continuation, and `e'` is the result of CPS-transforming `e` under the continuation `κ`. For purposes of generality, we incorporate an argument `σ` which serves to store mutable state, or any information that needs to be threaded through the computation.
In Anglican, `σ` holds any state that needs to be tracked by the inference algorithm, and thereby plays a role analogous to that of the state variable `σ` we used in our evaluators.
In the messaging interface that we define in this chapter, all inference state is stored by the controller process; moreover, there is no mutable state in the HOPPL. As a result, the only state that we need to pass to the execution is the process id. We use `σ` to refer to both the CPS state and the process id.
Because CPS-transformed expressions reduce to a continuation call of the form `κ σ v`, at every such call we can choose to either:
- Interrupt the computation and return a tuple `(κ, σ, v)`.
- Evaluate the continuation call `κ σ v` (i.e. `PApp [κ, σ, v]`).
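The second option, taken eagerly everywhere, simply runs the program; the first is what enables interruption. A driver loop, often called a trampoline, mediates between the two by repeatedly evaluating whatever the computation returns until a final value appears. A minimal Python sketch, where interruptions are encoded as zero-argument thunks rather than `(κ, σ, v)` tuples (an equivalent representation):

```python
def trampoline(step):
    """Run a CPS computation in constant stack space: each step returns
    either a zero-argument thunk (continue) or a final result (stop)."""
    while callable(step):
        step = step()            # evaluate the next continuation call
    return step

# Example: a CPS factorial that returns thunks instead of nesting calls.
def fact_cps(n, k):
    if n == 0:
        return lambda: k(1)
    return lambda: fact_cps(n - 1, lambda r: (lambda: k(n * r)))

print(trampoline(fact_cps(10, lambda r: r)))    # prints 3628800
```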
For general (non-probabilistic) expressions, we will always evaluate the continuation call:
- For variables, procedure names, and constants, the transformation takes the form:

  v, κ, σ ⇓꜀ (PApp κ σ v)

  For constants which are primitive functions, we need to transform them to their CPS form: functions which additionally take a continuation and state as parameters (shown here for a two-argument primitive):

  c_cps = (fn [v1 v2 κ σ] (κ σ (c v1 v2)))
  ─────────────────────────────────────────
  c, κ, σ ⇓꜀ (PApp κ σ c_cps)
- For `if` expressions, we transform both branches `e2` and `e3` under the continuation `κ`, giving `e2'` and `e3'`. We then create a continuation which, when applied to the value of `e1`, evaluates the appropriate transformed branch:

  e2, κ, σ ⇓꜀ e2'    e3, κ, σ ⇓꜀ e3'    e1, (fn [σ v] (if v e2' e3')), σ ⇓꜀ e'
  ──────────────────────────────────────────────────────────────────────────────
  (if e1 e2 e3), κ, σ ⇓꜀ e'
- All functions are turned into functions which additionally accept a continuation `κ'` and state `σ`, and whose function body `e` has been CPS-transformed under this continuation `κ'`. We then apply the continuation `κ` to the resulting function:

  e, κ', σ ⇓꜀ e'
  ────────────────────────────────────────────────────────────────
  (fn [v1 ... vn] e), κ, σ ⇓꜀ (PApp κ σ (fn [v1 ... vn κ' σ] e'))
- With application, the applied expression `e0` evaluates to a function value `v0` which takes arguments `v1 = e1, ..., vn = en`; once CPS-transformed, this function additionally takes a continuation `κ` and state `σ` as parameters. The innermost continuation body is hence the call `(v0 v1 ... vn κ σ)`; each argument `ei` is evaluated under a continuation that binds its value `vi` and proceeds with the next (see the sketch after this list):

  en, (fn [σ vn] (v0 v1 ... vn κ σ)), σ ⇓꜀ en'
  ei, (fn [σ vi] e'i+1), σ ⇓꜀ ei'    for i = (n - 1) ... 0
  ─────────────────────────────────────────────────────────
  (e0 e1 ... en), κ, σ ⇓꜀ e0'
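As with addressing, these rules can be sketched in Python over the nested-list encoding; the helper names are assumptions, a fixed state-variable name `sigma` is threaded by shadowing exactly as in the rules, and the primitive-constant rule is elided:

```python
import itertools

_ids = itertools.count()

def gensym(prefix):
    """Fresh variable names for continuations and bound values."""
    return "{}_{}".format(prefix, next(_ids))

def cps(e, k):
    """CPS transformation; `k` is an expression denoting the continuation."""
    if not isinstance(e, list):                  # variables and constants
        return [k, "sigma", e]                   # (PApp κ σ v)
    if e[0] == "fn":                             # (fn [v...] e)
        _, params, body = e
        k2 = gensym("kappa")
        return [k, "sigma",
                ["fn", params + [k2, "sigma"], cps(body, k2)]]
    if e[0] == "if":                             # (if e1 e2 e3)
        _, e1, e2, e3 = e
        v = gensym("v")
        return cps(e1, ["fn", ["sigma", v],
                        ["if", v, cps(e2, k), cps(e3, k)]])
    # application (e0 e1 ... en): evaluate each sub-expression under a
    # continuation binding its value; the innermost body makes the call
    vs = [gensym("v") for _ in e]
    acc = vs + [k, "sigma"]                      # (v0 v1 ... vn κ σ)
    for ei, vi in list(zip(e, vs))[::-1]:
        acc = cps(ei, ["fn", ["sigma", vi], acc])
    return acc
```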
With `observe` and `sample`, we perform a CPS transformation on the entire expression, but also replace the functions `observe` and `sample` with their CPS equivalents, which take two additional parameters `κ` and `σ`.
6.4 Message Interface Implementation
In this interface, an inference controller process starts copies of the probabilistic program, which are interrupted at every sample
and observe
expression. On interruption, each program execution sends a message to the controller, which then carries out any inference operations that need to be performed. These operations can include:
- Sampling a new value
- Reusing a stored value
- Computing the log probabilities
- Resampling program executions
After carrying out these operations, the controller sends messages to the program executions to continue or fork the computation. We use a client-server architecture.
Messages in the Inference Controller
The inference controller (client) can send requests to the execution server (which runs the probabilistic program).
SEND("start", σ)
- start a new process with id σSEND("continue", σ, c)
- continue process σ with argument cSEND("fork", σ, σ', c)
- fork process σ with a new process with id σ’, and continue execution with argument cSEND("kill", σ)
- half execution with process σ
It can also RECEIVE
responses from the server.
Messages in the Execution Server
The execution server can be implemented entirely in the HOPPL (if desired). It must be able to receive requests from the inference controller and return responses. We will assume that primitive functions `receive` and `send` exist for this purpose.
The execution server responds by sending:
- `(send "sample" σ α d)`: The process `σ` has arrived at a `sample` expression with address `α` and distribution `d`.
- `(send "observe" σ α d c)`: The process `σ` has arrived at an `observe` expression with address `α`, distribution `d`, and value `c`.
- `(send "return" σ c)`: The process `σ` has terminated with value `c`.
To implement this messaging architecture, we need to change the behaviour of `sample` and `observe`. We instead make use of their CPS analogues, which additionally take a continuation `κ` and the current CPS state `σ`.
To interrupt the computation of sample
, observe
and return
expressions, we will provide an implementation that returns a tuple containing the relevant expression and its parameters, rather than calling the continuation on the expression itself.
sampleCPS [α d κ σ]
  ("sample", α, d, κ, σ)

observeCPS [α d c κ σ]
  ("observe", α, d, c, κ, σ)

returnCPS [σ c]
  ("return", c, σ)

(Here `returnCPS` is the top-level continuation, so it receives the state `σ` and the final value `c`.)