Introduction To Probabilistic Programming
1.1 Model-based Reasoning
A model is an artificial construct designed to respond in the same way as the system we would like to understand. As computers have evolved, numerical models have come to the forefront and computer simulations have replaced physical models. Numerical models emulate stochasticity, i.e. they use pseudorandom number generators to simulate genuinely random phenomena and other uncertainties. Running a simulator with stochastic value generation leads to an explosion of possible simulation outcomes. Effective stochastic modeling means writing a program that can produce any of these outcomes, each corresponding to a particular setting of the random values. Stochastic numerical simulation aims to computationally encompass the complete distribution of possible outcomes.
When we write “model”, we will generally mean a stochastic simulator and the measurable values it produces. However, this is not the only notion of model that one can adopt.
Models produce values for things we can measure in the real world; we call such measured values observations. What counts as an observation is specific to the model, experiment, and query. Generally one does not observe every detail produced by a model, and sometimes one simply cannot.
Models can be used in various ways. One way is to use them to falsify theories. For this, one needs to encode the theory as a model and then simulate from it many times – if the population distribution of observations generated by the model is not in agreement with observations generated by the real world process, then there is evidence that the theory can be falsified. Models can also be used to make decisions.
All model types have parameters. Fitting these parameters, when few, can sometimes be performed manually, by intensive theory-based reasoning and a priori experimentation, by measuring conditional subcomponents of a simulator, or by simply fiddling with parameters to see which values produce the most realistic outputs.
Automated model fitting describes the process of using algorithms to determine either point or distributional estimates for model parameters and structure.
1.1.1 Model Denotation
To elaborate on what is meant by model denotation, we first look at a simple statistical model and see how it is denoted. Statistical models are typically denoted mathematically, subsequently manipulated algebraically, and then “solved” computationally. By “solved”, we mean that an inference problem, involving conditioning on the values of a subset of the variables in the model, is answered.
A simple model one could write down is a beta-Bernoulli model for generating a coin flip from a potentially biased coin. Such a model is typically denoted:
x ~ Beta(α, β)
y ~ Bernoulli(x)
where α and β are parameters, x is a latent variable (the bias of the coin), and y is the value of the flipped coin.
We can ascribe a meaning to the symbol ~ and the keywords Beta and Bernoulli.
For example, Beta(a, b) means that given the value of arguments a and b, we can construct what is effectively an object with two methods.
- The first method is a probability density (or distribution) function that computes p(x | a, b) = x^(a−1) (1 − x)^(b−1) / B(a, b), where B(a, b) is the beta function.
- The second method is something which can draw exact samples from the said distribution.
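To make this concrete, here is a minimal Python sketch (hypothetical, not tied to any particular library) of such a two-method object for the beta distribution:

import math
import random

class Beta:
    """Minimal sketch of a 'distribution object' for Beta(a, b)."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def pdf(self, x):
        # Beta density: x^(a-1) (1-x)^(b-1) / B(a, b),
        # with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b).
        B = math.gamma(self.a) * math.gamma(self.b) / math.gamma(self.a + self.b)
        return x ** (self.a - 1) * (1 - x) ** (self.b - 1) / B

    def sample(self):
        # Draw an exact sample from Beta(a, b).
        return random.betavariate(self.a, self.b)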
We can also intuit not only that some variables in a model are to be observed (here, for instance, y), but also that there is an inference objective, here for instance to characterize p(x|y).
We will generally focus on conditioning as the goal, namely the characterisation of some conditional distribution given a specification of a model in the form of a joint distribution. This involves the extensive use of Bayes rule:

p(X | Y) = p(Y | X) p(X) / p(Y)
Bayes rule tells us how to derive a conditional probability from a joint probability. Conditioning tells us how to rationally update our beliefs. Updating our beliefs is what learning and inference are all about.
The constituents of Bayes rule are:
- p(Y|X) - the likelihood
- p(X) - the prior
- p(Y) - the marginal likelihood (or evidence)
- p(X|Y) - the posterior
For our purposes, a model is the joint distribution p(Y, X) = p(Y|X)p(X) of the observations Y and the random choices made in the generative model X, known as the latent variables.
An important thing to understand is that conditioning a joint distribution, i.e. the fundamental Bayesian update, describes a huge number of problems succinctly.
We can build some intuition about the power of both programming languages for model denotation and automated conditioning by considering X, Y pairs for which denoting the joint distribution p(X, Y) is realistically only doable in a probabilistic programming language, and for which the posterior distribution p(X|Y) is of interest.
Consider the pair “scene description” and “image”. Visualising what p(X, Y) is, is difficult. However, thinking about p(X) as a distribution over possible scene graphs is not too hard – for example, as writing a simulator that stochastically generates reasonably plausible scene graphs. Given that p(X, Y) = p(Y|X)p(X), all we need in order to get the joint distribution is a way to get from a scene graph to an observable image. There are many kinds of renderers that do this – although they are generally deterministic, this is fine when specifying a joint distribution, because they map from a latent scene description to an observable pixel space, and with the addition of some image-level pixel noise they form a perfectly valid likelihood.
An example of this is considering the image Y to be a Captcha image and the scene description X to include the obscured string. Let us consider alternative ways to solve a Captcha problem. A non-probabilistic programming approach would require gathering a very large number of Captchas, hand-labelling them all, and then designing and training a neural network to regress from the image to a text string. The probabilistic programming approach in contrast, merely requires one to write a program that generates Captchas that are stylistically similar to the type of Captchas one would like to break, i.e. writing a model of Captchas, in a probabilistic programming language. Conditioning such a model on its observable output, the Captcha image, will yield a posterior distribution over text strings. This kind of conditioning is what probabilistic programming evaluators do.
The figure below shows a representation of the output of such a conditioning computation:
Each Captcha and bar-plot pair consists of a held-out Captcha image and a truncated marginal posterior distribution over unique string interpretations (marginal distribution meaning it gives the probabilities of various values of a subset of variables without reference to the values of the other variables). The middle of the bottom row expresses that the noise on the Captcha makes it difficult to distinguish whether the string is “aG8BPY” or “aG8RPY” – the posterior distribution P(X|Y) arrived at by conditioning reflects this uncertainty.
This simple example aims to liberate the idea of what a model is (a joint distribution produced by adding stochastic choice to normal computer programs) and what the output of a conditioning computation can be like. What probabilistic programming languages do is to allow denotation of any such model. This tutorial covers how to develop inference algorithms that allow computational characterization of the posterior distribution of interest.
1.1.2 Conditioning
We aim to demonstrate what the mathematical operations involved in conditioning are like, and why the problem of conditioning is generally hard. Consider the example of coin-flips and let us write out the joint probability density for the distribution of X and Y. Assume that the symbol Y denotes the observed outcome of the coin flip, where heads is encoded as 1 and tails is encoded as 0. We denote the bias of the coin, i.e. the probability it comes up heads, using the symbol x, and encode it as a real number between 0 and 1.
Then using standard definitions for the distributions indicated by the joint denotation in:
x ~ Beta(α, β)
y ~ Bernoulli(x)
we can write:

p(x, y) = p(y | x) p(x) = x^y (1 − x)^(1−y) · x^(α−1) (1 − x)^(β−1) / B(α, β)

and use rules of algebra to simplify this expression to:

p(x, y) = x^(y+α−1) (1 − x)^(β−y) / B(α, β)
Our implicit objective here is not to compute the value of the joint probability of some variables, but to do conditioning instead, for example to compute p(x | y = 1). Using Bayes rule, this is theoretically easy to do. It is just:

p(x | y = 1) = p(x, y = 1) / ∫ p(x, y = 1) dx
In this special case, the rules of algebra and semantics-preserving transformations of integrals can be used to algebraically solve for an analytic form of this posterior distribution.
However, the integral in the denominator is the complicating crux of Bayesian inference. This integral is in general intractable as it involves integrating over the entire space of latent variables. Consider the Captcha example: simply summing over the latent character sequence itself would require an exponential-time operation.
This specific example distribution of coin-flipping has a special property called conjugacy, which means that this integral can be performed by inspection, by identifying that the integrand is the same as the non-constant part of the beta distribution.
…
The result of the conditioning operation is a distribution p(x|y) parameterized by the observed or given quantity. Unfortunately, this distribution will in general not have an analytic form, because we usually won’t be so lucky that the normalizing integral has an algebraic analytic solution, nor that it will be easily calculable.
However, remember that the ~ operator is overloaded to mean two things:
- Evaluation of a probability density (or distribution) function.
- A means of drawing exact samples from the said distribution.
Neither of these are possible in general. The latter can be approximated, often without being able to do the former. For this reason, our focus will be on sampling-based characterisations of conditional distributions in general.
1.1.3 Query
Having a handle on the resulting posterior distribution or a method for drawing samples from it, allows us to ask queries in general. These are best expressed in integral form as well. For instance, we could ask “what is the probability that the bias of the coin is greater than 0.7, given that the coin came up heads?”.
This is mathematically denoted as:

p(x > 0.7 | y = 1) = ∫ I(x > 0.7) p(x | y = 1) dx

where I(·) is an indicator function which evaluates to 1 when its argument is true and 0 otherwise. In this instance, the query can be computed directly using the cumulative distribution function of the beta distribution.
Fortunately, we can still answer queries when we only have the ability to sample from the posterior distribution, owing to the Markov strong law of large numbers, which states that for general distributions p and functions f:

(1/L) Σ_{l=1}^{L} f(X^l) → ∫ f(X) p(X) dX as L → ∞, where X^l ~ p(X)
There are two things to note here:
- The distribution on the RHS is approximated by a set of L samples on the LHS.
- Different functions f can be evaluated at the same sample points chosen to represent the probability distribution p after the samples have been generated.
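As a hedged illustration, the following Python sketch estimates the query above by sampling. It assumes the conjugate posterior Beta(α + 1, β) for a single observed head (a standard conjugacy result, discussed above); the prior parameters here are arbitrary:

import random

alpha, beta = 2.0, 2.0  # hypothetical prior parameters
L = 100_000

# Observing heads (y = 1) under a Beta prior yields the conjugate
# posterior Beta(alpha + 1, beta); sample from it directly.
samples = [random.betavariate(alpha + 1, beta) for _ in range(L)]

# Monte Carlo estimate of E[I(x > 0.7) | y = 1] = p(x > 0.7 | y = 1).
estimate = sum(1.0 for x in samples if x > 0.7) / L
print(estimate)  # approx 0.35 for this prior; checkable against the beta CDF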
1.2 Probabilistic Programming
One view of probabilistic programming is that it is about automating Bayesian inference. In this view, probabilistic programming concerns 1) the development of syntax and semantics for languages that denote conditional inference problems, and 2) the development of corresponding evaluators or solvers that computationally characterize the denoted conditional distribution.
Computer science has largely been about finding ways to efficiently evaluate programs, given parameter values, to produce some output.
The typical computer science programming pipeline is to write a program, specify the values of its arguments, then evaluate the program to produce an output. The typical statistical modeling approach is to start with the output (the observations or data Y), then specify a usually abstract generative model p(X, Y), and finally use algebra and inference techniques to characterize the posterior distribution p(X|Y) of the unknown quantities in the model given the observed quantities. Probabilistic programming is about performing Bayesian inference using the tools of computer science: programming languages for model denotation, and statistical inference algorithms for computing the conditional distribution of program inputs that could have given rise to an observed program output.
For instance, reasoning about the bias of a coin is an example of the kind of inference that probabilistic programming systems can do. Our data is the outcome of a coin flip. Our model, specified in the forward direction, stipulates how a coin and its bias are generated; the coin flip outcome is then observed and analysed under this model.
- One challenge, the writing of the model, is a major focus of statistics research where useful models are painstakingly designed for every new important problem.
- The other challenge is computational, and is what Bayes rule gives us a theoretical framework in which to calculate: to computationally characterize the posterior distribution of the latent quantities (bias of a coin) given the observed quantity (heads or tails)
When performing inference in probabilistic programming systems, what counts as observable are the outputs generated by the forward computation. The inference objective is to computationally characterize the posterior distribution of all of the latent quantities (the random choices made during the forward execution of the program), given that the program produces a particular output.
Probabilistic programs are ordinary functional or imperative programs with two added constructs:
- The ability to draw values at random from distributions.
- The ability to condition values of variables in a program via observations.
The meaning of a probabilistic program is that it simultaneously denotes a joint and conditional distribution, the latter by syntactically indicating where conditioning will occur (what we condition on – i.e. which random variable values will be observed).
Almost all languages have pseudo-random value generators or packages; what they lack in comparison to probabilistic programming languages is syntactic constructs for conditioning and evaluators that implement conditioning. Languages that include these constructs are called probabilistic programming languages. Languages that do not include these constructs, but are used for forward modeling, are called stochastic simulation languages, or more simply, programming languages.
There are many libraries for constructing graphical models and performing inference (graphical models are probabilistic models for which a graph expresses the conditional dependence structure between random variables); this software works by programmatically constructing a data structure which represents a model, and then given observations, runs graphical model inference. What distinguishes between this kind of approach and probabilistic programming is that a program is used to explicitly construct a model as a data structure, rather than considering the “model” that arises implicitly from direct evaluation of the program itself.
In probabilistic programming systems, a model data structure is either:
- Constructed explicitly via a non-standard interpretation of the probabilistic program itself (in which case inference is performed by compiling the model data structure to a probability density function).
- A general Markov model whose state is the evolving evaluation environment generated by the probabilistic programming language evaluator (in which case inference is performed by employing methods that are fundamentally generative).
(On the difference between a density function and a distribution: the distribution is the shape, the density is the function. If you have the density you have the distribution, but the density may not have an analytical form, i.e. it may not be a computable function, so you may want to approximate the distribution. The density refers to the mathematical function, whereas the distribution refers to the shape; in most contexts one can be replaced with the other and things will still make sense.)
2. A Probabilistic Programming Language Without Recursion
We present the key ideas of probabilistic programming using a first-order probabilistic programming language (FOPPL).
The restrictions that we impose are that:
- Functions must be first-order, i.e. functions cannot take other functions as arguments.
- Functions cannot be recursive.
These two restrictions result in a language where models describe distributions over a finite number of random variables.
We can compile any program in the FOPPL to a data structure that represents the corresponding graphical model. This is a very useful property when reasoning about inference, as it allows us to make use of existing theory and algorithms for inference in graphical models. It also means the computation graph of any FOPPL program can be completely determined in advance: although conditional branching might dictate that not all nodes of the computation graph are active, in the sense of being on the control-flow path, every FOPPL program can be unrolled to a computation graph in which all possible control-flow paths are explicitly and completely enumerated at compile time. FOPPL programs hence have static computation graphs.
2.1 Syntax
We define the FOPPL in terms of two sets of production rules: one for expressions e
and another for programs q
.
v ::= variable
c ::= constant value or primitive operation
f ::= procedure
e ::= c | v | (let [v e1] e2) | (if e1 e2 e3)
| (f e1 ... en) | (c e1 ... en)
| (sample e) | (observe e1 e2)
q ::= e | (defn f [v1 ... vn] e) q
The two forms of expressions, sample and observe, are what make the FOPPL a probabilistic programming language.

- A sample form (sample e) represents an unobserved random variable, i.e. a latent variable. We call these latent variables because they are typically variables we sample from a prior distribution to represent some parameter we do not know much about yet. The sample statements hence define a prior p(X) on the random variables. The form accepts a single expression e (which must evaluate to a “distribution object”) and returns a value that is a sample from this distribution. Distributions are constructed using primitives provided by the FOPPL – for example, (normal 0.0 1.0) evaluates to a standard normal distribution.

- An observe form (observe e1 e2) represents an observed random variable. These are observed because we typically use a latent variable to parameterise a distribution, from which we can then directly observe a value. The observe statements hence define a likelihood p(Y|X), the probability of the observed variable Y conditioned on the latent variable X. The form accepts an argument e1, which must evaluate to a distribution, and conditions the distribution on the next argument e2, which is the value of the random variable.
To elaborate on how sample represents a latent variable and observe represents an observed variable, consider the following:

x ~ Beta(α, β)
y ~ Bernoulli(x)

Here, x is a latent variable that we sample from a prior distribution, for example the bias of a coin. Then, y is a variable that we directly observe from a distribution parameterised by our latent variable, for example the value of a flipped coin. We can't observe the bias of the coin, but we can sample an initial guess for it as a latent variable. Using this, we can directly observe values produced from a distribution parameterised by the latent variable, in order to indirectly infer the latent variable.
Let’s now illustrate what a program in the FOPPL looks like. The program below shows a simple univariate Bayesian linear regression model.
(defn observe-data [slope intercept x y]
  (let [fx (+ (* slope x) intercept)]
    (observe (normal fx 1.0) y)))

(let [slope (sample (normal 0.0 10.0))]
  (let [intercept (sample (normal 0.0 10.0))]
    (let [y1 (observe-data slope intercept 1.0 2.1)]
      (let [y2 (observe-data slope intercept 2.0 3.9)]
        (let [y3 (observe-data slope intercept 3.0 5.3)]
          (let [y4 (observe-data slope intercept 4.0 7.7)]
            (let [y5 (observe-data slope intercept 5.0 10.2)]
              [slope intercept])))))))
The function observe-data conditions the generative model given a pair (x, y), by observing the value y from a normal distribution with mean slope · x + intercept. The function returns the observed value, which is ignored in our case: we are not interested in the returned value; we are using observe-data to score/condition our distribution, which is a side-effect of the function call.

The program defines a prior on the slope and intercept, using the primitive normal to create a normal distribution object. After conditioning this prior with 5 observed data points, the program returns a pair [slope intercept], which is a sample from the posterior distribution conditioned on the 5 observed values.
2.2 Syntactic Sugar
foreach

The foreach construct has the following syntax:
(foreach c
[v_1 e_1 ... v_n e_n]
e_1' ... e_k'
)
This desugars into a vector containing c let constructs:
(vector
(let [ v_1 (get e_1 0)
...
v_n (get e_n 0)
]
e_1' ... e_k'
)
...
(let [ v_1 (get e_1 (c - 1))
...
v_n (get e_n (c - 1))
]
e_1' ... e_k'
)
)
The foreach form associates each variable v_i with a sequence e_i, and then maps over the values in these sequences for a total of c steps, returning a vector of results. If the length of any of the bound sequences is less than c, this results in a runtime error. A sketch of these semantics is given below.
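A minimal Python sketch of these semantics (the function name and calling convention are illustrative, not part of the FOPPL):

def foreach(c, seqs, body):
    """Sketch of foreach semantics: bind the i-th element of each
    sequence and evaluate the body, collecting c results."""
    for s in seqs:
        if len(s) < c:
            raise RuntimeError("bound sequence shorter than c")
    return [body(*(s[i] for s in seqs)) for i in range(c)]

# e.g. foreach(3, [[1, 2, 3], [10, 20, 30]], lambda x, y: x + y) == [11, 22, 33]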
Using the foreach form, we can write our linear regression model without having to make use of the function observe-data.
Old code:
(defn observe-data [slope intercept x y]
  (let [fx (+ (* slope x) intercept)]
    (observe (normal fx 1.0) y)))

(let [slope (sample (normal 0.0 10.0))]
  (let [intercept (sample (normal 0.0 10.0))]
    (let [y1 (observe-data slope intercept 1.0 2.1)]
      (let [y2 (observe-data slope intercept 2.0 3.9)]
        (let [y3 (observe-data slope intercept 3.0 5.3)]
          (let [y4 (observe-data slope intercept 4.0 7.7)]
            (let [y5 (observe-data slope intercept 5.0 10.2)]
              [slope intercept])))))))
New code:
(let [y-values [2.1 3.9 5.3 7.7 10.2]
      slope (sample (normal 0.0 10.0))
      intercept (sample (normal 0.0 10.0))]
  (foreach 5
    [x (range 1 6)
     y y-values]
    (let [fx (+ (* slope x) intercept)]
      (observe (normal fx 1.0) y))))
The reason we use a constant c
for the number of loop iterations is so that we can guarantee that the number of iterations is known at compile time.
loop
The loop
construct has the following syntax:
(loop c e f e_1 ... e_n)
Here, c must be a non-negative integer constant and f a function. Desugaring this yields the following, where v_0, ..., v_(c-1) and a_1, ..., a_n are fresh variables:
(let [ a_1 e_1
a_2 e_2
...
a_n e_n
]
(let [ v_0 (f 0 e a_1 ... a_n) ]
(let [ v_1 (f 1 v_0 a_1 ... a_n) ]
(let [ v_2 (f 2 v_1 a_1 ... a_n)]
...
(let [ v_(c-1) (f (c-1) v_(c-2) a_1 ... a_n) ]
v_(c-1)
)
)
)
)
)
Equivalently, in Haskell-style let notation:

let a_1 = e_1
a_2 = e_2
...
a_n = e_n
in (let v_0 = (f 0 e a_1 ... a_n)
in (let v_1 = (f 1 v_0 a_1 ... a_n)
in (let v_2 = (f 2 v_1 a_1 ... a_n)
in ...
(let v_(c-1) = (f (c-1) v_(c-2) a_1 ... a_n)
in v_(c-1)
)
)
)
)
)
We can see that the a's are bound to each of the expressions e_1, ..., e_n, all of which are passed to the function f in each loop iteration. The variables v are bound to the return value of f and then passed as the next value of the parameter e of f; they hence represent recursively accumulated computations. A sketch of these semantics follows.
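The desugaring above behaves like a fold. A minimal Python sketch of the intended semantics (names are illustrative):

def loop(c, e, f, *args):
    """Sketch of loop semantics: thread an accumulator through c calls
    to f, passing the iteration index and the extra arguments each time."""
    acc = e
    for i in range(c):
        acc = f(i, acc, *args)
    return acc

# e.g. summing squares of 0..4:
# loop(5, 0.0, lambda i, acc: acc + i * i)  ->  30.0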
To demonstrate loop, we show a new variant of the linear regression example:
(defn regr-step [n r2 xs ys slope intercept]
  (let [x (get xs n)
        y (get ys n)
        fx (+ (* slope x) intercept)
        r (- y fx)]
    (observe (normal fx 1.0) y)
    (+ r2 (* r r))))
(let [xs [1.0 2.0 3.0 4.0 5.0]
      ys [2.1 3.9 5.3 7.7 10.2]
      slope (sample (normal 0.0 10.0))
      bias (sample (normal 0.0 10.0))
      r2 (loop 5 0.0 regr-step xs ys slope bias)]
  [slope bias r2])
The same program in Haskell-style notation:

(defn regr-step [n r2 xs ys slope intercept]
  (do let x = get xs n
          y = get ys n
          fx = (slope * x) + intercept
          r = y - fx
      observe (normal fx 1.0) y
      return $ r2 + (r * r)))

(let xs = [1.0 2.0 3.0 4.0 5.0]
     ys = [2.1 3.9 5.3 7.7 10.2]
     slope = sample (normal 0.0 10.0)
     bias = sample (normal 0.0 10.0)
     r2 = loop 5 0.0 regr-step xs ys slope bias
 in [slope bias r2])
In this version of the program, we not only observe a sequence of values y_n according to the normal distribution centered around f(x_n), but also compute the sum of squared residuals r² = Σ_n (y_n − f(x_n))². To do this, we define a function regr-step which accepts an argument n (the index of the loop iteration) and an argument r2 which represents the sum of squares for the preceding data points. The value of the entire loop is the value of the final call to regr-step, which is the sum of all the squared residuals.
The difference between loop and foreach is that:

- loop can be used to accumulate a result over the course of iterations.
- foreach provides a much more specific loop type that evaluates a single expression repeatedly with different values for its variables.

From a statistical perspective, we can think of them as:

- loop defines a sequence of dependent variables.
- foreach creates conditionally independent variables.
2.3 Examples
Likelihood: the likelihood function is a model of the data – it can be any stochastic algorithm, even just sampling from a distribution. Remember that the data is fixed; the model parameters are the parameters of the likelihood function.
2.3.1 Gaussian Mixture Model
A Gaussian mixture model is a density estimation model often used for clustering, in which each data point y_n is assigned to a latent class z_n. We will consider the following generative model:

σ_k ~ Gamma(1.0, 1.0) for k = 1, 2, 3
µ_k ~ Normal(0.0, 10.0) for k = 1, 2, 3
π ~ Dirichlet(1.0, 1.0, 1.0)
z_n ~ Discrete(π) for n = 1, ..., 7
y_n | z_n ~ Normal(µ_{z_n}, σ_{z_n})
The following program shows a translation of this generative model to the FOPPL:
(I’m assuming discrete takes a list of n probabilities which sum to 1, and assigns them correspondingly to the values [0 .. n-1].)
(let [ data [1.1 2.1 2.0 1.9 0.0 -0.1 -0.05]
likelihoods (foreach 3 []
(let [ mu (sample (normal 0.0 10.0))
sigma (sample (gamma 1.0 1.0))
]
(normal mu sigma)
)
)
pi (sample (dirichlet [1.0 1.0 1.0]))
z-prior (discrete pi)
]
(foreach 7 [y data]
(let [z (sample z-prior)]
(observe (get likelihoods z) y)
z
)
)
)
The same program in Haskell-style notation:

(let data = [1.1 2.1 2.0 1.9 0.0 -0.1 -0.05]
     likelihoods = (foreach 3 []
         (let mu = sample (normal 0.0 10.0)
              sigma = sample (gamma 1.0 1.0)
          in normal mu sigma))
     pi = sample (dirichlet [1.0 1.0 1.0])
     z-prior = discrete pi
 in (foreach 7 [y in data]
      (do let z = sample z-prior
          observe (get likelihoods z) y
          return z)))
In this model, we first sample the mean µ and standard deviation σ for 3 mixture components, giving us a list of 3 normal distributions.
Sampling from a Dirichlet distribution of 3 dimensions (1.0, 1.0, 1.0) will return a sample such as [0.26592092 0.23521457 0.49886451]. Using this as the parameter of a discrete distribution gives a distribution where the integers [0 1 2] are weighted respectively by [0.26592092 0.23521457 0.49886451]; sampling from this discrete distribution will hence return an index from [0 1 2] with these probabilities.
For each observation y, we then sample a class assignment label z from the prior distribution. The value of this sample will be in [0 1 2] and is used as an index to get one of the likelihood distributions from likelihoods. After this, we observe y according to the likelihood distribution likelihoods[z]. Finally, we return the class assignment index z. The return value of the entire program is the sequence of latent class assignments for each of the 7 data points. We can use this result to ask questions such as “are these two data points similar?”.
2.3.2 Hidden Markov Model
The following program denotes a hidden Markov model (HMM) with known initial-state, transition, and observation distributions, governing 16 sequential observations.
(defn hmm-step [t states data trans-dists likelihoods]
  (let [z (sample (get trans-dists (last states)))]
    (observe (get likelihoods z) (get data t))
    (append states z)))

(let [data [0.9 0.8 0.7 0.0 -0.025 -5.0 -2.0 -0.1
            0.0 0.13 0.45 6 0.2 0.3 -1 -1]
      trans-dists [(discrete [0.10 0.50 0.40])
                   (discrete [0.20 0.20 0.60])
                   (discrete [0.15 0.15 0.70])]
      likelihoods [(normal -1.0 1.0)
                   (normal 1.0 1.0)
                   (normal 0.0 1.0)]
      states [(sample (discrete [0.33 0.33 0.34]))]]
  (loop 16 states hmm-step data trans-dists likelihoods))
The same program in Haskell-style notation:

(defn hmm-step [t states data trans-dists likelihoods]
  (do let z = sample (get trans-dists (last states))
      observe (get likelihoods z) (get data t)
      append states z))

(let data = [0.9 0.8 0.7 0.0 -0.025 -5.0 -2.0 -0.1
             0.0 0.13 0.45 6 0.2 0.3 -1 -1]
     trans-dists = [(discrete [0.10 0.50 0.40]),
                    (discrete [0.20 0.20 0.60]),
                    (discrete [0.15 0.15 0.70])]
     likelihoods = [(normal -1.0 1.0),
                    (normal 1.0 1.0),
                    (normal 0.0 1.0)]
     states = [(sample (discrete [0.33 0.33 0.34]))]
 in loop 16 states hmm-step data trans-dists likelihoods)
Here:

- data is a vector of data points.
- trans-dists is a vector of transition distributions.
- likelihoods is a vector of state likelihoods.

We then loop over the data using the function hmm-step, which at every loop iteration does three things:

- It first samples a new state z from the transition distribution associated with the preceding state (or, in the first iteration, the initial state).
- It then observes the current data point at time t according to the likelihood component of the current state.
- Finally, it appends the state z to the sequence states.
The vector of accumulated latent states is the return value of the program, and thus is the object whose joint posterior distribution is of interest.
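Again for intuition only, a hedged numpy sketch that forward-simulates the HMM (inference would instead condition on the fixed data vector):

import numpy as np

rng = np.random.default_rng(0)
T = 16

# Known model components, mirroring the program above.
trans = np.array([[0.10, 0.50, 0.40],
                  [0.20, 0.20, 0.60],
                  [0.15, 0.15, 0.70]])
like_means = np.array([-1.0, 1.0, 0.0])

# Forward-simulate the latent chain and the observations.
states = [rng.choice(3, p=[0.33, 0.33, 0.34])]
obs = []
for t in range(T):
    z = rng.choice(3, p=trans[states[-1]])      # next state from transition dist
    obs.append(rng.normal(like_means[z], 1.0))  # observation from state likelihood
    states.append(z)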
2.4 A Simple Purely Deterministic Language
The FOPPL can be understood in two different ways:
- A language for specifying graphical-model data structures on which traditional inference algorithms may be run.
- A language that requires a non-standard interpretation in some implementing language to characterize the denoted posterior distribution.
In the case of graphical-model construction, it will be necessary to have a language for purely deterministic expressions. This language will be used to express link functions in the graphical model (a link function provides the relationship between a linear predictor and the mean of the distribution function, where a linear predictor is a linear function whose value is used to predict the outcome of a dependent variable). In contrast to the usual definition of link functions from statistics, the purely deterministic language will encode functions that take the values of parent random variables and produce distribution objects for their children. These link functions cannot have random variables inside them; such a variable would instead be another node in the graphical model.
2.4.1 Deterministic Expressions
Expressions in the FOPPL that do not involve user-defined procedure calls and involve only deterministic computations, such as (+ 17 (/ 2.0 6.0)), will be called “0th-order expressions”.
To identify and work with these deterministic expressions, we define a language with the following grammar:
c ::= constant value or primitive operation
v ::= variable
E ::= c | v | (if E1 E2 E3) | (c E1 ... En)
Note that neither sample nor observe statements appear in this syntax, and that procedure calls are only allowed for primitive operations, not for user-defined procedures. These constraints ensure that expressions E cannot depend on any probabilistic choices or conditioning.
3. Graph-Based Inference
3.1 Compilation to a Graphical Model
Programs written in the FOPPL specify probabilistic models over finitely many random variables. We will make this aspect clear by presenting the translation of these programs into finite graphical models. In later sections, we will show how this translation can be exploited to adapt inference algorithms for graphical models to probabilistic programs.
In the following relation, translation is specified by ⇓.
ρ, φ, e ⇓ G, E
The inputs:
- ρ is an environment, mapping procedure names to their definitions.
- φ is a logical predicate for the flow control context.
- e is an expression we intend to compile.
This expression e is translated to:

- G, a graphical model. Vertices in G represent random variables, and arcs represent dependencies between them. For each random variable in G, we define a probability density or mass function in the graph.
- E, an expression in the deterministic sub-language (described in section 2.4.1). This expression is deterministic in the sense that it involves neither sample nor observe; it describes the return value of the original expression e in terms of the random variables in G.

For observed random variables, we additionally record the observed value, as well as a logical predicate that indicates whether the observe expression is on the control-flow path, conditioned on the values of the latent variables.
Definition of a Graphical Model
We define a graphical model G as a tuple (V, A, ℙ, Ƴ) where:

- V is a set of vertices representing random variables.
- A is a set of directed edges (V × V) that represent conditional dependencies between random variables.
- ℙ is a map from vertices to deterministic expressions that specify the probability density function for each random variable.
- Ƴ is a partial map such that, for each observed random variable, it contains a pair (E, Φ), where E is the deterministic expression for the observed value and Φ is a predicate expression that evaluates to true when this observation is on the control-flow path. (A control-flow path is a path that might be traversed through a program during its execution.)
Before presenting a set of translation rules that can be used to compile any FOPPL program to a graphical model, we will illustrate the intended translation using a simple example:
(let [z (sample (bernoulli 0.5))
      mu (if (= z 0) -1.0 1.0)
      d (normal mu 1.0)
      y 0.5]
  (observe d y)
  z)
The program first samples z from a Bernoulli prior, based on which it sets a likelihood parameter µ to -1.0 or 1.0. It then observes a value y = 0.5 from a normal distribution with mean µ. This program defines a joint distribution p(y = 0.5, z). The inference problem is then to characterize the posterior distribution p(z | y = 0.5).
The below figure shows the graphical model and pure deterministic link functions that correspond to the above program:
In the evaluation relation ρ, φ, e ⇓ G, E:

- e is the source code of the program.
- ρ is the empty map, as there are no procedure definitions.
- φ (the control-flow predicate) is true at the top level of the program.

The graphical model G = (V, A, ℙ, Ƴ) and the result expression E that the program translates to are:

V = {z, y}
A = {(z, y)}
ℙ = [ z → p_bern z 0.5,
      y → p_norm y (if (= z 0) -1.0 1.0) 1.0 ]
Ƴ = [ y → 0.5 ]
E = z
Here:

- V, the vertex set of G, contains two random variables.
- A, the arc set of G, contains a single pair (z, y) to mark the conditional dependence relationship between these two variables.
- ℙ, the mapping of random variables to probability density functions, states that:
  - The probability density for z is defined as the target-language expression p_bern z 0.5. Here, p_bern refers to a function in the target language that implements the probability density function for the Bernoulli distribution.
  - The probability density for y is defined using p_norm, which implements the probability density function for the normal distribution.
- Ƴ, the mapping of observed random variables to pairs (E, Φ), contains a single entry that holds the observed value for y.
Assigning Symbols to Variable Nodes
In the above example, we used the mathematical symbols:
- z to refer to the random variable associated with the expression
(sample (bernoulli 0.5))
. - y to refer to the observed variable with expression
(observe d y)
.
In general, there will be one node in the network for each sample
or observe
expression that is evaluated in a program.
In the above example, there also happens to be a program variable z
that holds the value of the sample
expression for node z, and a program variable y
that holds the observed value for node y. This is of course not necessarily always the case. In programs that have procedures, the same sample
and observe
expressions in the procedure body can be evaluated multiple times.
We will later define a general set of translation rules that compile a FOPPL program to a graphical model, in which we assign each vertex in the graphical model a newly generated unique symbol. Assigning a label to each vertex is a way of assigning a unique address to each and every random variable in the program.
If Expressions in Graphical Models
When compiling a program to a graphical model, if
expressions require special consideration. Consider a simple mixture model, in which only the mean is treated as an unknown variable:
let z = sample (bernoulli 0.5)
mu = sample (normal (if z = 0 then -1.0 else 1.0) 1.0)
d = normal mu 1.0
y = 0.5
in observe d y
z
This is of course a really strange way of writing a mixture model. We define a single likelihood parameter µ, which is distributed according to Normal(−1, 1) when z = 0 and according to Normal(1, 1) when z = 1.
Typically we would think of a mixture model as having two components with parameter µ_0 and µ_1 respectively, where z selects the component. A more natural way to write the model might be:
let z = sample (bernoulli 0.5)
mu_0 = sample (normal -1.0 1.0)
mu_1 = sample (normal 1.0 1.0)
d_0 = normal mu_0 1.0
d_1 = normal mu_1 1.0
y = 0.5
in observe (if z = 0 then d_0 else d_1) y
z
Here we sample parameters µ_0 and µ_1, which then define two component likelihoods d_0 and d_1. The variable z then selects the component likelihood for an observation y.
The first program defines a density on three variables, p(y, µ, z). The second program defines a joint density on four variables, p(y, µ_0, µ_1, z). Nonetheless, it seems intuitive that these programs are equivalent in some sense. The equivalence we would want here is that both programs define the same marginal posterior on z:
p(z | y) = ∫ p(z, µ | y) dµ = ∫∫ p(z, µ_0, µ_1 | y) dµ_0 dµ_1
Recall that the number of nodes in the graphical model depends on the number of sample and observe expressions. The essential difference between these two programs is that:

- In the first program, the if-expression is placed inside the sample expression for mu. This results in a graph with 3 nodes.
- In the second program, the if-expression is placed outside the sample expressions for mu_0 and mu_1. This results in a graph with 4 nodes.
However, the distribution on the return values of the programs should be equivalent.
It also matters how we evaluate if-expressions – lazily or eagerly. Consider the following program:
let z = sample (bernoulli 0.5)
mu = if z = 0 then sample (normal -1.0 1.0) else sample (normal 1.0 1.0)
d = normal mu 1.0
y = 0.5
in observe d y
z
In lazy evaluation, we would first evaluate the predicate z = 0
and then either evaluate the first sample
expression or the second sample
expression, but never both.
In eager evaluation, we first evaluate the predicate and both branches, before returning the value of one of the branches based on the predicate value.
The problem with eager evaluation of observe in if-expressions:
Consider the following code:
let z = sample (bernoulli 0.5)
    mu_0 = sample (normal -1.0 1.0)
    mu_1 = sample (normal 1.0 1.0)
    y = 0.5
in if (z = 0)
   then (observe (normal mu_0 1.0) y)
   else (observe (normal mu_1 1.0) y)
When performing eager evaluation, we would be calling observe on two variables y_0 and y_1, both with value 0.5 – this means we are conditioning the probability distribution on both variables, even though we only want to do so for one of them. When performing lazy evaluation, only one of the two branches is included in the probability density. The lazy interpretation is much more natural here.
The problem with lazy evaluation of sample in if-expressions:
Consider the following code:
let z = sample (bernoulli 0.5)
mu = if (z = 0)
then sample (normal -1.0 1.0)
else sample (normal 1.0 1.0)
d = normal mu 1.0
y = 0.5
in observe d y
z
Lazy evaluation of if-expressions makes it difficult to characterize the support of the probability distribution defined by a program when branches contain sample expressions.
How do we define the probabilities for mu_0 and mu_1? One choice is to set the probability of unevaluated branches to 1: we would include either p(mu_0 | z = 0) or p(mu_1 | z = 1) in the probability, and assume a probability of 1 for the unevaluated branch. This would be equivalent to:

- Defining ℙ(mu_0) as if (z = 0) then p_norm mu_0 -1.0 1.0 else 1.0
- Defining ℙ(mu_1) as if (z != 0) then p_norm mu_1 1.0 1.0 else 1.0
However, we cannot marginalize out the unevaluated branches, i.e. compute ∫ dµ_0 p(µ_0 | z = 1) and ∫ dµ_1 p(µ_1 | z = 0), if we assume p(µ_0 | z = 1) = 1 and p(µ_1 | z = 0) = 1: a constant density of 1 does not integrate to 1 over an unbounded domain.
Conclusion:

We need to understand that observe and sample expressions affect the marginal posterior on a program output in very different ways:

- sample expressions that are not on the control-flow path cannot affect the values of any expressions outside of their branch. This means they can be safely incorporated into the model as auxiliary variables, since they do not affect the marginal posterior on the return value.
- observe expressions do not have this guarantee: they change the posterior distribution on the return value when incorporated into a graphical model.
Based on this intuition, the solution to our problem is:
- We can assign probability 1 to the observed variables that are not on the same control flow path. Since observed variables have constant values, the interpretability of their support for characterizing probability distributions is not an issue in the same way it is with sampled variables.
- Conversely, we assign the same probability to sampled variables, regardless of the branch they occur in. How this is accomplished will be described in later sections.
Translation Rules
ρ, φ, e ⇓ G, E
- Constants and Variables: we translate constants c and variables z to themselves and the empty graphical model.
──────────────────── ────────────────────
ρ, φ, c ⇓ G_empty, c ρ, φ, z ⇓ G_empty, z
- Let: we translate (let [v e_1] e_2) by first translating e_1, then substituting the outcome of this translation for v in e_2, and finally translating the result of this substitution:

ρ, φ, e_1 ⇓ G_1, E_1     ρ, φ, e_2[v := E_1] ⇓ G_2, E_2
──────────────────────────────────────────────────────
ρ, φ, (let [v e_1] e_2) ⇓ G_1 ⊕ G_2, E_2
G_1 ⊕ G_2
is the concatenation of two disjoint graphical models. When G_1 = (V_1, A_1, P_1, Y_1)
and G_2 = (V_2, A_2, P_2, Y_2)
:
G_1 ⊕ G_2 = (V1 ∪ V2, A1 ∪ A2, P1 ⊕ P2, Y1 ⊕ Y2)
where P1 ⊕ P2
and Y1 ⊕ Y2
are the concatenation of two finite maps with disjoint domains (disjoint meaning having no elements in common). This operator ⊕
assumes that the input graphical models G_1
and G_2
use disjoint sets of vertices. This assumption always holds because every graphical model created by our translation uses fresh vertices, which do not appear in other networks generated.
- If: when we translate the sub-expressions for the alternative branches, we conjoin the logical predicate φ with the expression E_1 or its negation – recall that the role of the logical predicate is to include observe statements that are on the current control-flow path, and to exclude observe statements that are off it.

ρ, φ, e_1 ⇓ G_1, E_1     ρ, (φ && E_1), e_2 ⇓ G_2, E_2     ρ, (φ && not E_1), e_3 ⇓ G_3, E_3
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (if e_1 e_2 e_3) ⇓ G_1 ⊕ G_2 ⊕ G_3, (if E_1 E_2 E_3)
The set of rules so far for an expression e does not extend the graphical models of e's sub-expressions with any new vertices. This is because the programming constructs involved in these rules perform deterministic computations, not probabilistic ones, and graphical-model vertices are used to express random variables. The next two rules, for sample and observe, show how graphical models are extended with new random variables.
- Sample:

ρ, φ, e ⇓ (V, A, ℙ, Ƴ), E     Choose a fresh variable v     Z = FREE-VARS(E)     F = SCORE(E, v)
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (sample e) ⇓ (V ∪ {v}, A ∪ {(z, v) | z ∈ Z}, ℙ ⊕ [v → F], Ƴ), v
This rule states that we translate (sample e) in four steps.

- First, we translate the argument e to a graphical model (V, A, ℙ, Ƴ) and a deterministic expression E. Both the argument e and its translation E represent the same distribution, from which (sample e) samples.
- Second, we choose a fresh variable v that will represent the sampled random variable.
- Third, we collect all the free variables in E that are used as random variables of the network, and set Z to the set of these random variables.
- Finally, we convert the expression E (which denotes a distribution) to the probability density function F of the distribution. This conversion is done by calling SCORE on the distribution E and the variable v, which is defined as follows:
SCORE((if E_1 E_2 E_3), v) = (if E_1 F_2 F_3)
    when F_i = SCORE(E_i, v) for i ∈ {2, 3} and neither is ⊥
SCORE((c E_1 ... E_n), v) = (p_c v E_1 ... E_n)
    when c is a constructor for a distribution and p_c is its probability density function
SCORE(E, v) = ⊥
    when E is not one of the above cases (i.e. E is not a distribution)
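A minimal Python sketch of SCORE, representing target-language expressions as nested tuples (the representation and the set of recognized distribution constructors are illustrative assumptions):

def score(E, v):
    """Sketch of SCORE: convert a distribution-valued expression E into
    a density expression in the fresh variable v (None plays the role of ⊥)."""
    if isinstance(E, tuple) and E[0] == 'if':
        _, E1, E2, E3 = E
        F2, F3 = score(E2, v), score(E3, v)
        if F2 is None or F3 is None:
            return None
        return ('if', E1, F2, F3)
    if isinstance(E, tuple) and E[0] in ('normal', 'bernoulli', 'beta'):
        # (c E_1 ... E_n)  ->  (p_c v E_1 ... E_n)
        return ('p_' + E[0], v) + E[1:]
    return None  # E does not denote a distribution

# score(('normal', 'mu', 1.0), 'y') == ('p_normal', 'y', 'mu', 1.0)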
- Observe: the rule for observe expressions is analogous to the rule for sample expressions, but we additionally need to account for the observed value e_2 and the predicate φ.
ρ, φ, e_1 ⇓ G_1, E_1 ρ, φ, e_2 ⇓ G_2, E_2
(V, A, ℙ, Ƴ) = G_1 ⊕ G_2 Choose a fresh variable v
F_1 = SCORE(E_1, v) ≠ ⊥ F = if φ then F_1 else 1
Z = FREE-VARS(F_1) \ {v} FREE-VARS(E_2) ∩ V = ∅
B = {(z, v) : z ∈ Z}
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, observe e_1 e_2 ⇓ (V ∪ {v}, A ∪ B, ℙ ⊕ [v → F], Ƴ ⊕ [v → E_2]), E_2
The rule states that:

- First, we translate the sub-expressions e_1 and e_2 into the graphical models G_1 and G_2 and deterministic expressions E_1 and E_2, where E_1 is the distribution and E_2 is the value of the observed variable.
- Next, we construct a graphical network (V, A, ℙ, Ƴ) by merging the networks of the sub-expressions.
- Then, we pick a fresh variable v that will represent the observed random variable.
- We use SCORE to construct an expression F_1 that represents the probability density function of the observed variable v under the distribution E_1. (As with sample, the expression e_1 which translates to E_1 must be a distribution.)
- We then construct a new expression F = if φ then F_1 else 1 to ensure that the probability of the observed variable evaluates to 1 if the observe expression occurs in a branch that was not taken. The free variables in this expression F are the union of the free variables in E_1 and φ, plus the new variable v.
- We add a set of arcs B to the graphical network, consisting of edges from all the free variables in F (i.e. the free variables in the probability density function of v under distribution E_1), excluding v itself, to the observed random variable v.
In order for this notion of an observed random variable to make sense, the expression E_2
for the observed random variable must be fully deterministic. For this reason we require that FREE-VARS(E_2) ∩ V = ∅
, which ensures that E_2
cannot reference any other random variables in the graphical model G_1 ⊕ G_2
.
An important consequence of E_2
being a value is that although the return value of an observe
(being the value of the observed variable) may be used in subsequent computation, no graphical model edges will be generated with the observed random variable as a parent. An alternative rule could return a null
value in place of E_2
, which may be safer in terms of ensuring clarity to the programmer, so that there is no way to imagine that an edge could be created where one was not.
- Procedure Call: the remaining two cases are those for a user-defined procedure
f
and a primitive functionc
.
For primitive functions, we first translate the arguments e_1 to e_n. Then we translate the expression for the call by substituting the translated arguments into the original expression and merging the graphs for the arguments.

ρ, φ, e_i ⇓ G_i, E_i     forall i ∈ [1 ... n]
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (c e_1 ... e_n) ⇓ G_1 ⊕ ... ⊕ G_n, (c E_1 ... E_n)
For user-defined procedures, we first translate the arguments e_1 to e_n. Then we transform the procedure body by replacing all instances of each parameter v_i with the expression for the corresponding argument E_i, and translate the transformed body; see the reconstructed rule below.
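Written out in the same style as the other rules, the user-defined procedure case would look roughly as follows (a reconstruction, not quoted from the source):

ρ, φ, e_i ⇓ G_i, E_i     forall i ∈ [1 ... n]
ρ(f) = (defn f [v_1 ... v_n] e')     ρ, φ, e'[v_1 := E_1, ..., v_n := E_n] ⇓ G, E
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (f e_1 ... e_n) ⇓ G_1 ⊕ ... ⊕ G_n ⊕ G, E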
3.2 Evaluating the Probability Density
We make explicit how we can use this representation of a probabilistic program to evaluate the probability of a particular setting of the variables in V
. The Bayesian network G = (V, A, ℙ, Ƴ)
that we construct by compiling a FOPPL
program is a mathematical representation of a directed graphical model. Like any graphical model, G
defines a probability density on its variables V
.
In a directed graphical model, each node v ∈ V
has a set of parents:
parents(v) := {u : (u, v) ∈ A}
The joint probability of all variables can be expressed as a product over conditional probabilities.
p(V) = ∏ p(v | parents(v))
v∈V
In our graph G, each term p(v | parents(v)) is represented as a deterministic expression ℙ(v) = (c v E_1 ... E_n), where:

- c is either a probability density function (for continuous variables) or a probability mass function (for discrete variables).
- E_1 ... E_n are expressions that evaluate to the parameters θ_1 ... θ_n of that probability density/mass function.
Implicit in this notation is the fact that each expression has some set of free variables. In order to evaluate an expression to a value, we must specify values for each of these free variables. In other words, we can think of each expression E_i as a mapping from values of its free variables to a parameter value. By construction, the set of parents parents(v) is nothing but the free variables of ℙ(v) other than v itself; we can read ℙ(v) as the probability of v conditioned on its ancestor/parent random variables in the graph.
parents(v) = FREE-VARS(ℙ(v)) \ {v}
Recall that ℙ
is a map from vertices to deterministic expressions that specify the probability density or mass function for each random variable. The expression ℙ(v)
can hence be thought of as a function that maps the random variable v
and its parents parents(v)
to a probability or a probability density that is the probability of v
conditioned on its parents. Therefore, we will treat these two as equivalent:
p(v | parents(v)) ≡ ℙ(v)
This means that the conditional probability of random variable v
given its parent random variables can be looked up in ℙ
by doing ℙ(v)
.
We can decompose the joint probability p(V) into a prior and a likelihood term.

- Recall that Ƴ is a mapping from random variables to deterministic expressions representing their values. In the translation rule for observe, we require that the expression Ƴ(v) for an observed value cannot have free variables. Each expression Ƴ(v) will hence simplify to a constant when we perform partial evaluation.
- We will use Y to refer to all the nodes in V that correspond to observed random variables, which is to say Y = dom(Ƴ). Observed random variables are those produced by observe expressions.
- We will use X to refer to all the nodes in V that correspond to unobserved random variables, which is to say X = V \ Y. Unobserved random variables are those produced by sample expressions.
Since the observed nodes y ∈ Y
cannot have any children, we can re-express the joint probability p(V)
as:
p(V) = p(Y, X) = p(Y | X)p(X)
where
p(Y | X) = ∏ p(y | parents(y)) p(X) = ∏ p(x | parents(x))
y∈Y x∈X
In this manner, a probabilistic program defines a joint distribution p(Y, X)
.
The goal of probabilistic program inference is to characterize the posterior distribution:
p(X | Y) = p(X, Y) / p(Y),    where    p(Y) = ∫ p(X, Y) dX
3.2.1 Conditioning with Factors
Not all inference problems for probabilistic programs target a posterior p(X | Y)
that is defined in terms of unobserved and observed random variables. There are inference problems in which there is no notion of observed data Y
but it is possible to define some notion of loss, reward, or fitness, given a choice of an unobserved variable X
.
In a probabilistic program:

- The sample statements define a prior p(X) on the random variables.
- The observe statements define a likelihood p(Y | X) on the random variables.
To support a more general notion of soft constraints, we can replace the likelihood p(Y | X)
with a strictly positive potential ψ(X)
to define an unnormalized density γ(X)
:
γ(X) = ψ(X)p(X)
In this more general setting, the goal of inference is to characterize a target density π(x)
, which is defined as:
π(X) = γ(X) / Z = ψ(X) p(X) / Z,    where    Z = ∫ γ(X) dX
Here:

- The target density π(X) is analogous to the posterior p(X | Y).
- The strictly positive potential ψ(X) is analogous to the likelihood p(Y | X).
- The unnormalized density γ(X) is analogous to the joint distribution p(Y, X).
- The normalizing constant Z is analogous to the marginal likelihood p(Y).
From a language design perspective, we can now ask how the FOPPL would need to be extended in order to support this more general form of soft constraint.
For a probabilistic program in the FOPPL, the potential function is a product over terms:
ψ(X) = ∏ ψ_y(X_y)
y∈Y
where
ψ_y(X_y) := p(y = Ƴ(y) | parents(y)) ≡ ℙ(y)[y := Ƴ(y)]
X_y := FREE-VARS(ℙ(y)) \ {y} = parents(y)
Here, ψ_y(X_y) states that the potential p(y = Ƴ(y) | parents(y)) – the probability of random variable y taking the value Ƴ(y) (where Ƴ is the mapping of random variables to values) given the parent nodes of y – is equal to ℙ(y)[y := Ƴ(y)], i.e. the probability density function of y evaluated by substituting the value Ƴ(y) for y, whose remaining free variables are X_y.
One way to support arbitrary potential functions is to provide a special form (factor log-p) that takes an arbitrary log probability log-p as an argument. The factor expression lets the user manually set a probability for a new random variable, as opposed to observe, which sets the probability based on a distribution and the value of an observed variable.
We can then define a translation rule for the (factor log-p) expression, which:

- Inserts a new node v with probability ℙ(v) = exp log-p (applying exp to a log probability converts it back to a non-log probability).
- Inserts an observed value nil for the random variable v into the graph.
ρ, φ, e ⇓ (V, A, ℙ, Ƴ), E     F = (if φ then (exp E) else 1)
Choose a fresh variable v
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (factor e) ⇓ (V ∪ {v}, A, ℙ ⊕ [v → F], Ƴ ⊕ [v → nil]), nil
Comparing observe with factor: recall that the observe expression represents the likelihood p(Y|X), and the potential function ψ(X) is analogous to the likelihood p(Y|X). Hence factor is analogous to observe – the difference is that observe conditions using both observed and unobserved random variables, whereas factor conditions using only unobserved random variables. This is why we set the value of v in Ƴ to nil – we haven't observed a value. However, we set the probability density of the value of v to be the exponential of the log probability e. We can imagine that (factor e) is hence manually setting the probability of a random variable.
In practice, we don’t need to provide separate special forms for factor and observe, since each can be implemented as a special case of the other.
One way of doing so is to define (factor log-p) as a procedure which calls observe on a pseudo distribution (factor-dist log-p) parameterized by log-p, with a nil value:

(defn factor [log-p]
  (observe (factor-dist log-p) nil))

where factor-dist is a constructor for a pseudo distribution object with corresponding potential:

p_factor-dist(y ; log-p) := if y = nil
                            then exp log-p
                            else 0
Another way of doing this is by defining observe [dist v]
as a procedure which calls factor
with the log probability log-prob dist v
, where log-prob
is a primitive procedure that returns the log probability density for a value v
under a distribution dist
:
defn observe [dist v]
factor (log-prob dist v)
v
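As a rough illustration of this duality, here is a Python sketch in which the inference state is modeled as a running log weight; the names log_W, factor, and observe here are our own, not FOPPL primitives:

```python
from scipy.stats import norm

log_W = 0.0  # inference state: running log of the unnormalized weight

def factor(log_p):
    # Manually add an arbitrary log potential to the unnormalized density.
    global log_W
    log_W += log_p

def observe(dist, v):
    # Conditioning on value v under dist, as a special case of factor.
    factor(dist.logpdf(v))
    return v

observe(norm(0.0, 1.0), 0.5)  # adds log p(y = 0.5) under Normal(0, 1)
factor(-2.3)                  # a soft constraint with log potential -2.3
```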
3.2.2 Partial Evaluation
A necessary optimization for our compilation procedure is to perform a partial evaluation step. The partial evaluation step pre-evaluates expressions E
in the target language that do not contain any free variables, which means that they take on the same value in every execution of the program. Partial evaluation of these expressions is necessary to avoid the appearance of false edges between variables that are in fact not connected, in the sense that the value of the parent does not affect the conditional density of the child.
Our target language is very simple, so we only need to update the rules for if
-expressions and procedure calls to include a partial evaluation step.
- If-expressions
ρ, φ, e_1 ⇓ G_1, E_1
ρ, EVAL(φ && E_1), e_2 ⇓ G_2, E_2 ρ, EVAL(φ && not E_1), e_3 ⇓ G_3, E_3
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (if e_1 then e_2 else e_3) ⇓ (G_1 ⊕ G_2 ⊕ G_3), EVAL(if E_1 then E_2 else E_3)
- Procedure calls
ρ, φ, e_i ⇓ G_i, E_i forall i ∈ [1 ... n]
───────────────────────────────────────────────────────────────────────────────────────
ρ, φ, (c e_1 ... e_n) ⇓ (G_1 ⊕ ... ⊕ G_n), EVAL(c E_1 ... E_n)
The expression EVAL(e)
is the partial evaluation operation, and can incorporate any number of rules for simplifying expressions. We will begin with these rules:
- EVAL
EVAL(if c_1 then E_2 else E_3) = E_2
when c_1 = true
EVAL(if c_1 then E_2 else E_3) = E_3
when c_1 = false
EVAL(c c_1 ... c_n) = c'
when calling c with arguments c_1 ... c_n evaluates to c'
EVAL(E) = E
in all other cases
These rules state:

- An if-expression (if E_1 then E_2 else E_3) can be simplified when E_1 can be fully evaluated to a constant c_1, by selecting the expression for the appropriate branch.
- A primitive procedure call can be evaluated when all of its arguments can be fully evaluated.
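A minimal sketch of these rules in Python, under our own assumed encoding of target-language expressions as tuples (constants fold through if-expressions and through primitive calls whose arguments are all known):

```python
# Expressions: constants, variable names (strings), ('if', e1, e2, e3),
# or ('call', op, *args) for primitive procedure calls.
PRIMITIVES = {'+': lambda a, b: a + b, 'not': lambda a: not a}

def is_const(e):
    return not isinstance(e, (tuple, str))

def partial_eval(e):
    if isinstance(e, tuple) and e[0] == 'if':
        _, e1, e2, e3 = e
        e1, e2, e3 = partial_eval(e1), partial_eval(e2), partial_eval(e3)
        if is_const(e1):                        # rule: branch on a known condition
            return e2 if e1 else e3
        return ('if', e1, e2, e3)
    if isinstance(e, tuple) and e[0] == 'call':
        _, op, *args = e
        args = [partial_eval(a) for a in args]
        if all(is_const(a) for a in args):      # rule: fold fully-constant calls
            return PRIMITIVES[op](*args)
        return ('call', op, *args)
    return e                                    # rule: leave everything else alone

# The false-edge example discussed below: the reference to v_2 disappears.
assert partial_eval(('if', True, 'v_1', 'v_2')) == 'v_1'
```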
We additionally need to modify the definition of the SCORE
function.
Recall that SCORE
converts an expression representing a distribution into its corresponding probability density function:
SCORE(if E_1 then E_2 else E_3, v) = if E_1 then F_2 else F_3
    when F_i = SCORE(E_i, v) for i ∈ {2, 3} and neither is ⊥
SCORE(c E_1 ... E_n, v) = (p_c v E_1 ... E_n)
    when c is a constructor for a distribution and p_c is its pdf or pmf
SCORE(E, v) = ⊥
    when E is not one of the above cases
We need to add one more rule:
SCORE(c, v) = p_c v
when c is a distribution and p_c is its pdf or pmf
To see how partial evaluation also lets us reduce the number of edges in the network, let’s consider the expression if true then v_1 else v_2
. This expression references two random variables v_1
and v_2
by name. After partial evaluation, this expression simplifies to v_1
, which eliminates the false dependence on v_2
.
Another practical advantage of partial evaluation is that it gives us a simple way to identify expressions in a program that are fully deterministic, since deterministic expressions will be partially evaluated to constants. This is useful when translating observe e1 e2
statements, in which the expression e_2
must be deterministic as it is the value of a random variable. In programs that use loop c v e e_1 ... e_n
, we can now substitute any fully deterministic expression for the number of loop iterations c
.
Lists, Vectors, and Hash Maps
Eliminating false edges/dependencies in the dependency graph becomes particularly important in programs that make use of data structures.
Consider the following example, which defines three time steps of a two-state Markov chain.
let A = [[0.9 0.1]
[0.1 0.9]]
x_1 = sample (discrete [0.5 0.5])
x_2 = sample (discrete (get A x_1))
x_3 = sample (discrete (get A x_2))
in [x_1, x_2, x_3]
Compiling this to a network will yield three variable nodes. If we refer to these nodes as v_1
, v_2
, and v_3
, then there will be arcs from v_1
to v_2
, and from v_2
to v_3.
Suppose we now rewrite this program using the loop
construct:
defn markov-step [n xs A]
let k = last xs
A_k = get A k
in append xs (sample (discrete A_k))
let A = [[0.9 0.1]
[0.1 0.9]]
x_1 = sample (discrete [0.5 0.5])
in loop 2 markov-step [x_1] A
In this version, each call to markov-step
:
- Accepts a vector of states
xs
- Gets the most recent state
k = last xs
(which will be either 0 or 1) - Then gets
A_k
, which is thek
‘th array fromA
- Uses
sample (discrete A_k)
to sample a value from a discrete distribution of values[0, 1]
with corresponding probabilities found in arrayA_k
- Appends this sampled value as the new state to the vector of states
xs
This program version generates the same sequence of random variables as the previous program, and has the advantage of allowing us to generalize to sequences of arbitrary length by varying the loop iteration constant. However, under the partial evaluation rules that we have specified so far, we would obtain a different set of edges. Just like the previous version of the program, this version will evaluate three sample
statements.
- For the first sample statement, sample (discrete [0.5 0.5]), there will be no arcs created (only the initial node v_1).
- For the second sample statement, sample (discrete A_k), there will be an arc from v_1 to v_2, since the expression for A_k expands to: get [[0.9 0.1] [0.1 0.9]] (last [v_1])
- However, for the third sample statement, there will be arcs from both v_1 and v_2 to v_3, since the expression for A_k expands to: get [[0.9 0.1] [0.1 0.9]] (last (append [v_1] v_2))
The extra arc from v_1
to v_3
is of course not necessary here, since the expression last (append [v_1] v_2)
will always evaluate to v_2
. What’s more, if we run this program to generate more than 3 states, the node v_n
for the n
th state will have incoming arcs from all preceding variables v_1 ... v_{n-1}
, whereas the only real arc in the Bayesian network is the one from v_{n-1}
.
We can eliminate these false arcs by implementing an additional set of partial evaluation rules for data structures vector
and hash-map
:
EVAL(vector E_1 ... E_n)
= [E_1 ... E_n]
EVAL(hash-map c_1 E_1 ... c_n E_n)
= {c_1 E_1 ... c_n E_n}
These rules ensure that expressions which construct data structures are partially evaluated to data structures containing expressions. We can similarly partially evaluate functions that add, get, or replace entries.
EVAL(append [E_1 ... E_n] E_{n+1})
= [E_1 ... E_n E_{n+1}]
EVAL(put {c_1 E_1 ... c_n E_n} c_k E'_k)
= {c_1 E_1 ... c_k E'_k ... c_n E_n}
EVAL(last [E_1 ... E_n]) = E_n
EVAL(get [E_1 ... E_n] k) = E_k
EVAL(get {c_1 E_1 ... c_n E_n} c_k) = E_k
In the Markov chain example, the expression for A_k
in the third sample statement then can be simplified:
get [[0.9 0.1]
[0.1 0.9]] (last (append [v_1] v_2))
~>
get [[0.9 0.1]
[0.1 0.9]] (last [v_1 v_2])
~>
get [[0.9 0.1]
[0.1 0.9]] v_2
This yields the correct dependency structure for the Bayesian network.
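Continuing the partial-evaluator sketch from earlier (the tuple encoding is again our own assumption), the data-structure rules become cases that operate on already-known vectors and maps:

```python
def partial_eval_ds(e):
    if not isinstance(e, tuple):
        return e
    op, *args = e
    args = [partial_eval_ds(a) for a in args]
    if op == 'append' and isinstance(args[0], list):
        return args[0] + [args[1]]               # rule for append
    if op == 'last' and isinstance(args[0], list):
        return args[0][-1]                       # rule for last
    if op == 'get' and isinstance(args[0], (list, dict)) \
            and not isinstance(args[1], (tuple, str)):
        return args[0][args[1]]                  # rule for get with a known key
    return (op, *args)

# The third Markov-chain lookup simplifies so that only v_2 is referenced:
A = [[0.9, 0.1], [0.1, 0.9]]
e = ('get', A, ('last', ('append', ['v_1'], 'v_2')))
assert partial_eval_ds(e) == ('get', A, 'v_2')
```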
3.3 Gibbs Sampling
So far, we have just defined a way to translate probabilistic programs into a data structure for finite graphical models. One important reason for doing so is that many existing inference algorithms are defined explicitly in terms of finite graphical models, and can now be applied directly to probabilistic programs written in the FOPPL.
We will consider such inference algorithms now, starting with a general family of Markov chain Monte Carlo (MCMC) algorithms. MCMC algorithms perform Bayesian inference by drawing samples from the posterior distribution P(X | Y)
, that is, the conditional distribution of the latent variables X ⊆ V
given the observed variables Y ⊆ V
. This is accomplished by simulating from a Markov chain whose transition function is defined such that the stationary distribution is the target posterior P(X | Y)
. These samples are then used to characterize the distribution of the return value r(X)
.
Procedurally, MCMC algorithms begin by initializing the latent variables to some value X_0, and repeatedly sample from a Markov transition density to produce a dependent sequence of samples X_1, ..., X_S. We will describe how these algorithms can be applied to perform inference in graphs produced by compiling a FOPPL program.
Metropolis-Hastings
The Metropolis-Hastings algorithm provides a general recipe for producing appropriate MCMC transition functions, by combining a proposal step with an accept/reject step. Given some proposal distribution q(X' | V)
, the MH algorithm simulates a candidate sample latent value X'
from q(X' | V)
conditioned on the value of the current sample X
, and then evaluates the acceptance probability by considering the candidate sample X'
and the current sample X
:
α(X', X) = min { 1, p(Y, X') q(X | V') / (p(Y, X) q(X' | V)) }
With probability α(X', X)
we accept the transition X → X'
. With probability 1 - α(X', X)
we reject the transition and retain the current sample X → X
.
When we repeatedly apply this transition function, we obtain a Markov process:

X' ∼ q(X' | V_{s-1}),    u ~ Uniform(0, 1)
X_s = X'        if u ≤ α(X', X_{s-1})
X_s = X_{s-1}   if u > α(X', X_{s-1})
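A minimal sketch of this transition function in Python, assuming a one-dimensional unnormalized target of our own choosing; with a symmetric Gaussian random-walk proposal, the q terms cancel in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gamma(x):
    return -0.5 * (x - 3.0) ** 2  # an unnormalized N(3, 1) target (assumed)

def metropolis_hastings(steps=10_000, step_size=1.0):
    x, samples = 0.0, []
    for _ in range(steps):
        x_prop = x + step_size * rng.normal()            # X' ~ q(X' | X)
        log_alpha = log_gamma(x_prop) - log_gamma(x)     # log acceptance ratio
        if rng.uniform() < np.exp(min(0.0, log_alpha)):  # accept w.p. min(1, alpha)
            x = x_prop
        samples.append(x)                                # X_s is X' or X_{s-1}
    return samples
```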
Gibbs Sampling
The Gibbs sampling algorithms are an important special case of Metropolis-Hastings which cycle through all the latent variables in the model and iteratively sample from the full conditional distributions. The full conditional distribution is the probability distribution of a variable (node) in a probabilistic graphical model conditioned on the value of all the other variables in the probabilistic graphical model.
p(x | Y, X \ {x}) = p(x | V \ {x})
Alternative Explanation:

Gibbs sampling is a special case of Metropolis-Hastings in which the newly proposed state is always accepted with probability one. Consider a D-dimensional posterior with parameters θ = {θ_1, ..., θ_D}. The basic idea of Gibbs sampling is that for each time step t, we sample θ by iteratively sampling a proposal for each of its dimensions θ_d from the conditional distribution P(θ_d | X, θ \ {θ_d}), where θ \ {θ_d} is θ without the d-th parameter:

Gibbs sampling:

for t = 1 ... T do
    θ^(t+1)_1 ∼ P(θ_1 | X, θ^(t)_2, θ^(t)_3, ..., θ^(t)_D)
    θ^(t+1)_2 ∼ P(θ_2 | X, θ^(t+1)_1, θ^(t)_3, ..., θ^(t)_D)
    ⋮
    θ^(t+1)_d ∼ P(θ_d | X, θ^(t+1)_1, ..., θ^(t+1)_{d-1}, θ^(t)_{d+1}, ..., θ^(t)_D)
    ⋮
    θ^(t+1)_D ∼ P(θ_D | X, θ^(t+1)_1, θ^(t+1)_2, ..., θ^(t+1)_{D-1})
In some special cases, these conditional distributions can be derived analytically and sampled from exactly. However, this is not possible in general, so as a general-purpose solution one turns to Metropolis-within-Gibbs algorithms, which instead apply a Metropolis-Hastings transition function that targets p(x | V \ {x}).
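As a concrete case where the full conditionals are available in closed form, here is a sketch of Gibbs sampling for a standard bivariate normal with correlation ρ (a model of our own choosing): each conditional is θ_1 | θ_2 ~ N(ρ θ_2, 1 - ρ²), and symmetrically for θ_2.

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.8            # correlation of the assumed bivariate normal target
theta = np.zeros(2)  # initialize (theta_1, theta_2)
samples = []
for t in range(5_000):
    # Sample each dimension from its full conditional given the other:
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho ** 2))
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho ** 2))
    samples.append(theta.copy())
```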
From an implementation perspective, given our compiled graph (V, A, ℙ, Ƴ)
we can compute the acceptance probability α(X', X)
by evaluating the expressions ℙ(v)
for each v ∈ V
, and substituting the values for the current sample X
and the proposal X'
. More precisely, if we use X
to refer to the set of unobserved variables, and χ
to refer to the map from unobserved variables to their values,
X = (x_1, ..., x_N) χ = [ x_1 → c_1, ..., x_N → c_N]
then we can use Ѵ = χ ⊕ Ƴ
to refer to the values of all variables (both unobserved and observed) and express the joint probability over the variables V
as:
p(V = Ѵ) = ∏ EVAL( ℙ(v)[V := Ѵ] )
v∈V
When we update a single variable x
using a kernel q(x | V)
, we are proposing a new mapping Ѵ' = Ѵ[x → c']
, where c'
is the candidate value proposed for x
.
The acceptance probability for changing the value of x
from c
to c'
then takes the form:
α(Ѵ', Ѵ) = min {1, p(V = Ѵ') q(x = c | V = Ѵ') / (p(V = Ѵ) q(x = c' | V = Ѵ))}
The important thing to note is that many terms in this ratio will cancel out:
- The joint probabilities p(V = Ѵ) are composed of a product of conditional density terms ∏ p(v | parents(v)).
- The individual expressions p(v | parents(v)) ≡ ℙ(v) depend on the value c, or its proposed alternative c', of the node x only if v = x or x ∈ parents(v), in other words, if x ∈ FREE-VARS(ℙ(v)).
If we define V_x to be the set of variables whose probability densities depend on x:

V_x := {v : x ∈ FREE-VARS(ℙ(v))} = {x} ∪ {v : x ∈ parents(v)}
Then we can decompose the joint distribution p(V) of all random variables into terms that depend on the variable x that we propose changing and terms that do not:

p(V) = ∏_{w ∈ V\V_x} p(w | parents(w)) · ∏_{v ∈ V_x} p(v | parents(v))
We now note that all terms w ∈ V \ V_x in the acceptance ratio cancel, taking the same value in both numerator and denominator. Denoting the values of a variable v as c_v and c'_v for the maps Ѵ and Ѵ' respectively, we can simplify the acceptance probability α to:

α(Ѵ', Ѵ) = min {1, [∏_{v ∈ V_x} p(v = c'_v | parents(v)) · q(x = c | V = Ѵ')] / [∏_{v ∈ V_x} p(v = c_v | parents(v)) · q(x = c' | V = Ѵ)]}
To implement a Gibbs sampler, we additionally need to specify some form of proposal. We will here assume a map ℚ from unobserved variables x to expressions in the target language.
ℚ := [ x_1 → E_1 , ... , x_N → E_N ]
For each variable x
, the expression E
defines a distribution, which can in principle depend on other unobserved variables X
.
We can then use this distribution to both generate samples, and evaluate the forward and reverse proposal densities q(x = c' | V = Ѵ)
and q(x = c' | V = Ѵ')
. To do so, we first evaluate the expression to a distribution:
d = EVAL(ℚ(x)[V := Ѵ])
We then assume that we have implementations of the functions:

- SAMPLE, which allows us to generate samples
- LOG-PROB, which allows us to evaluate the log density function for the distribution

SAMPLE(d) = c'
LOG-PROB(d, c') = log q(x = c' | V)
The below algorithm shows pseudo-code for a Gibbs sampler with this type of proposal.
global V, A, X, Y, ℙ, Ƴ -- A directed graphical model
global ℚ -- A map of proposal expressions
function ACCEPT(x, χ', χ)
    d <- EVAL(ℚ(x)[X := χ])
    d' <- EVAL(ℚ(x)[X := χ'])
    logα <- LOG-PROB(d', χ(x)) - LOG-PROB(d, χ'(x))
    V_x <- {v : x ∈ FREE-VARS(ℙ(v))}
    for v in V_x do
        logα <- logα + log EVAL(ℙ(v)[V := χ' ⊕ Ƴ])
        logα <- logα - log EVAL(ℙ(v)[V := χ ⊕ Ƴ])
    return exp(logα)
function GIBBS-STEP(χ)
for x in X do
d <- EVAL(ℚ(x)[X := χ])
χ' <- χ
χ'(x) <- SAMPLE(d)
α <- ACCEPT(x, χ', χ)
u ~ Uniform(0, 1)
if u < α
then χ <- χ'
return χ
function GIBBS(χ^0, T)
for t in 1 ... T do
χ^t <- GIBBS-STEP(χ^(t-1))
return χ^1, ..., χ^T
- V is the set of random variables
- A is the set of arcs between random variables
- X is the set of unobserved random variables
- Y is the set of observed random variables
- ℙ is the mapping of random variables V to their probability density function
- Ƴ is the mapping of observed random variables to their value
- ℚ is the mapping of unobserved random variables to their proposal probability distribution
- χ is the mapping of unobserved variables to their value
- Ѵ is the mapping of all random variables to their value (Ѵ = χ ⊕ Ƴ)
The function ACCEPT computes the acceptance ratio for one of the unobserved variables x in X, given the current value found in χ and the new proposed value found in χ':
- It takes one of the unobserved variables x in X, the current mapping χ of unobserved variables X to values, and the new proposed mapping χ' of unobserved variables X to values.
- Get the proposal-distribution expression for x (with the values of the unobserved variables X set to those found in the map χ), and evaluate it to a distribution d.
- Get the proposal-distribution expression for x again, this time with the values set to those found in the new map χ', and evaluate it to a distribution d'.
- Find the difference between the log probability of drawing the current value of x from the reverse-proposal distribution d' and the log probability of drawing the proposed value of x from the forward-proposal distribution d.
- Let V_x be the set of variables whose densities depend on x; accumulate into the log ratio the log density of each v ∈ V_x under the proposed values minus its log density under the current values, and return the exponential of the result.
The function GIBBS-STEP updates the values of the unobserved variables X at time t:
- It takes a mapping χ of unobserved variables to values.
- For each unobserved variable x in χ:
    - Get the proposal-distribution expression for x from the map ℚ (with the values of the unobserved variables X set to the values found in the map χ) and evaluate it to a distribution d.
    - Copy the current mapping χ of unobserved variables to values, yielding χ', which will represent the new proposal mapping.
    - Sample from the distribution d and set this as the value of the unobserved variable x in the new proposal mapping χ'.
    - Compute the acceptance ratio α by calling ACCEPT on the unobserved variable x, the current mapping χ, and the new proposal mapping χ'.
    - Sample a value u from a uniform distribution on [0, 1].
    - If u is less than the acceptance ratio α, use the new proposed value for x by keeping the new proposal mapping χ'. Otherwise, retain the current value for x by keeping the current mapping χ.
The function GIBBS computes the values of the unobserved variables χ for T steps, i.e. T states of a Markov chain:
- It takes the initial mapping χ^0 of unobserved variables to values at time t = 0, and the number of time steps T.
- For each time step t in 1 ... T, we compute the t-th set of values χ^t for the unobserved variables.
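The pseudo-code above translates fairly directly into Python. In the sketch below, log_dens, blanket, and propose are our own stand-ins for ℙ, V_x, and ℚ (not the text's exact machinery), and proposal distributions are assumed to expose sample() and log_prob():

```python
import math
import random

def gibbs_step(chi, latents, log_dens, blanket, propose):
    """One Metropolis-within-Gibbs sweep over the latent variables.

    chi      -- dict mapping latent variable names to current values
    log_dens -- log_dens[v](vals) = log p(v | parents(v)) evaluated at vals
    blanket  -- blanket[x] = V_x, the variables whose density depends on x
    propose  -- propose(x, vals) returns a proposal distribution for x
    """
    for x in latents:
        d = propose(x, chi)                   # forward proposal from Q(x)
        chi_new = dict(chi)
        chi_new[x] = d.sample()
        d_rev = propose(x, chi_new)           # reverse proposal
        log_alpha = d_rev.log_prob(chi[x]) - d.log_prob(chi_new[x])
        for v in blanket[x]:                  # only the V_x terms survive
            log_alpha += log_dens[v](chi_new) - log_dens[v](chi)
        if random.random() < math.exp(min(0.0, log_alpha)):
            chi = chi_new                     # accept; otherwise keep chi
    return chi
```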
Note: each sample
statement in our original program corresponds to sampling a parameter from the prior. The values (samples) produced from Gibbs inference are the same parameters but sampled from the posterior.
4. Evaluation-Based Inference I
In the previous chapter on Graph-Based Inference, our inference algorithms operated on a graph representation of a probabilistic model, which we created through a compilation of a program in our first-order probabilistic programming language. The construction of this graph is performed at compile time – we refer to graphs that can be constructed at compile time as having static support.
Many models exist in which the graph of conditional dependencies is dynamic, in the sense that it cannot be constructed prior to performing inference. One way these graphs arise is when the number of random variables is itself not known at compile time. These types of models are referred to as having dynamic support.
There are two basic strategies that we can employ to represent models with dynamic support.
The strategy we will use is to implement inference methods that dynamically instantiate random variables. For example, at each time step, an inference algorithm could decide whether any new objects have appeared in the field of view, and then create random variables for the positions of these objects as needed.
A particular strategy for dynamic instantiation of variables is to generate values for variables by simply running a program. We refer to such strategies as evaluation-based inference methods.
Evaluation-based methods differ from their compilation-based counterparts in that they do not require a representation of the dependency graph to be known prior to execution. Rather, the graph is either built up dynamically, or never explicitly constructed at all. This means that many evaluation-based strategies can be applied to models that can in principle instantiate an unbounded number of random variables.
One of the main things we will need to change in evaluation-based methods is how we deal with if
-expressions. In Graph-Based Inference, we realized that if
-expressions required special consideration in probabilistic programs: the question that we identified was whether lazy or eager evaluation should be used in if
expressions that contain sample
and/or observe
expressions. We showed that lazy evaluation is necessary for observe
expressions, since these expressions affect the posterior distribution on the program output as a side effect. However, for sample
expressions, we have a choice between evaluation strategies, because we can always treat variables in unused branches as auxiliary variables. Hence, we adopted an eager evaluation strategy in which both branches of the if
-expression are evaluated, but a symbolic flow control predicate determines when observe
expressions need to be incorporated into the likelihood.
The language we previously introduced was designed to ensure that programs always evaluate a bounded set of sample
and observe
expressions – because of this, programs that are written in the FOPPL can be safely eagerly evaluated. It is very easy to create a language in which this is no longer the case. For example, if we let function definitions be recursive:
defn sample-geometric [alpha]
let p = sample (bernoulli alpha)
in if (p == 1)
then 1
else (1 + sample-geometric alpha)
let alpha = sample (uniform 0 1)
k = sample-geometric alpha
in observe (poisson k) 15
alpha
Then eager evaluation of if-expressions would result in infinite recursion. The expression sample (bernoulli alpha) can in principle be evaluated an unbounded number of times, implying that the number of random variables in the graph is unbounded as well.
Although we can’t compile the above program to a static graph, it turns out that we can still perform inference in order to characterize the posterior on the program output. To do so, we rely on the fact that we can always simply run a program (using lazy evaluation for if
expressions) to generate a sample from the prior. Even though we might not be able to characterize the support of a probabilistic program, we can still generate a sample that by construction is guaranteed to be part of the support.
If we additionally keep track of the probabilities associated with each of the observe expressions evaluated in a program, then we can implement sampling algorithms that either evaluate a Metropolis-Hastings acceptance ratio, or assign an importance weight to each sample.
In this chapter, we will assume that programs are defined using the FOPPL previously described, but that a lazy evaluation strategy is used for if
-expressions.
4.1 Likelihood Weighting
The simplest evaluation-based method is likelihood weighting, which is a form of importance sampling in which the proposal distribution Q
matches the prior distribution. The idea is to weight the samples produced during inference, where the weights reflect the probability that a sample would not be rejected. This is an alternative to rejecting samples.
4.1.1 Background: Importance Sampling
Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results – during inference, these are used to approximate the posterior distribution p(X|Y)
with a set of (weighted) samples.
Importance sampling is one of these Monte Carlo techniques. The trick that importance sampling methods rely upon is that we can replace an expectation over the posterior p(X|Y)
, which is generally hard to sample from, with an expectation over a proposal distribution q(X)
, which is chosen to be easy to sample from.
- 𝔼[X] denotes the expectation of the random variable X.
- If X is a discrete random variable with outcomes x_1, x_2, ..., x_k occurring with probabilities p_1, p_2, ..., p_k, then the expectation of X is defined as 𝔼[X] = Σᵢ xᵢ pᵢ.
- If X is a continuous random variable with probability density function f(x), then the expected value is defined as 𝔼[X] = ∫ x f(x) dx, also written 𝔼_{f(X)}[X] = ∫ x f(x) dx.
Below shows how the expectation of our returned random variable r(X)
over our posterior p(X|Y)
can be replaced with the expectation of a different random variable r(X) * p(X|Y)/q(X)
over a proposal distribution q(X)
.
𝔼_p(X|Y)[r(X)] = ∫ p(X|Y)r(X) dX
= ∫ q(X) p(X|Y)/q(X) r(X) dX
= 𝔼_q(X) [p(X|Y)/q(X) * r(X)]
What is
r(X)
? Recall that in MCMC, Bayesian inference is performed by drawing samples from the posterior distribution, which is accomplished by simulating from a Markov chain whose stationary distribution is the target distributionπ
, being the posterior in this case. These samples are then used to characterize the distribution of the return valuer(X)
(i.e. the samples produced by inference).
What is Importance Sampling/Likelihood Weighting
The idea behind importance sampling is that certain values of random variables X
in a simulation have more impact on the parameter being estimated than others. If these important values are emphasized by sampling more frequently, then the estimator variance can be reduced. Hence, the basic methodology in importance sampling is to choose a “bias” distribution which encourages the important values. In order to prevent the use of the “bias” distribution resulting in a biased estimator, the simulation outputs are weighted for the correct use of the biased distribution.
Instead of creating a sample and then rejecting it, it is possible to mix sampling with inference to reason about the probability that a sample would be rejected. In importance sampling methods, each sample has a weight, and the sample average is computed using the weighted average of samples. Likelihood weighting is a form of importance sampling where the variables are sampled in the order defined by a belief network, and evidence is used to update the weights. The weights reflect the probability that a sample would not be rejected.
General Procedure of Likelihood Weighting

If we:

- Draw samples X^l ~ q(X) from the proposal distribution q(X)
- Define importance weights as w^l := p(X^l|Y)/q(X^l)

Then we can express our Monte Carlo estimate as an average over the weighted samples {(w^l, X^l)} for l = 1, ..., L:
𝔼_q(X) [p(X|Y)/q(X) * r(X)] ≃ 1/L Σ (w^l * r(X^l))
Unfortunately, we cannot calculate the importance weights w^l := p(X^l|Y)/q(X^l)
(i.e. the importance ratio p(X|Y)/q(X)
) because this requires evaluating the posterior p(X|Y)
which is what we did not know how to do in the first place. However, we can evaluate the joint distribution p(Y, X)
which allows us to define an unnormalized weight. The unnormalized weights W^l
are quantities that we can calculate directly, unlike the normalized weights w^l
.
W^l := p(Y, X^l) / q(X^l)
:= p(Y) * w^l
We still have a remaining problem: we don’t know how to calculate the normalization constant p(Y). However, we can derive an approximation to p(Y) using the same unnormalized weights W^l by considering the special case r(X) = 1:

p(Y) = 𝔼_q(X) [p(Y, X)/q(X) * 1] ≃ 1/L Σₗ W^l
Substituting w^l = W^l / p(Y), and replacing p(Y) with the estimate (1/L) Σₖ W^k, turns the weighted average into a self-normalized estimator.
To summarize, as long as we can evaluate the joint p(Y, X^l)
for a sample X^l ~ q(X)
, then we can perform importance sampling using unnormalized weights W^l
.
𝔼_q(X) [p(X|Y)/q(X) * r(X)] ≃ Σₗ (W^l / Σₖ W^k) * r(X^l)
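A sketch of the whole pipeline for a small model of our own choosing, x ~ Beta(2, 2) and y ~ Normal(x, 0.5) with y = 0.9 observed; proposing from the prior makes the unnormalized weight exactly the likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

L = 10_000
x = rng.beta(2.0, 2.0, size=L)             # X^l ~ q(X) = p(X), the prior
logW = norm.logpdf(0.9, loc=x, scale=0.5)  # log W^l = log p(y | x^l)

w = np.exp(logW - logW.max())
w /= w.sum()                               # self-normalized weights W^l / sum_k W^k

posterior_mean = np.sum(w * x)             # estimate of E_{p(x|y)}[r(X)], r(X) = X
log_p_y = np.log(np.mean(np.exp(logW)))    # estimate of log p(Y)
```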
4.1.2 Graph-based Implementation
Suppose that we compiled our program to a graphical model. We could then implement likelihood weighting using the following steps:
- For each sampled random variable
x ∈ X
, sample from the priorx^l ~ p(x | parents(x))
(i.e. its corresponding distribution). - For each observed random variable
y ∈ Y
, calculate the weightsW^l_y = p(y | parents(y)).
(i.e. the probability of observing it from its corresponding distribution). - Return the weighted set of return values
r(X^l)
.Σ (W^l/Σ(W^k)) * δ_r(X^l) where W^l := ∏ W^l_y
- Sampling from the prior for each x ∈ X is more or less trivial. The only thing we need to make sure of is that we sample all parents parents(x) before sampling x, which is to say that we need to loop over the nodes x ∈ X according to their topological order.
- The terms W^l_y can be calculated by evaluating the target language expression ℙ(y)[y := Ƴ(y)], substituting the sampled value x^l for any x ∈ parents(y).
4.1.3 Evaluation-based Implementation
The basic idea in implementing the same algorithm using an evaluation-based strategy, is that:
- We can generate samples by simply running the program. More precisely, we will sample a value
x ~ d
whenever we encounter an expressionsample d
; by definition, this will generate samples from the prior. - We can then calculate the likelihood
p(Y|X)
as a side-effect of running the program. To do so, we initialise a state variableσ
with a single entrylog W = 0
, which tracks the log of the unnormalized importance weight. Each time we encounter an expressionobserve d y
, we calculate the log likelihoodlog_p(d, y)
and update the log weight tolog W <- log W + log_p(d, y)
, ensuring thatlog W = log p(Y|X)
at the end of the execution.
To define this method more formally, let us specify what we mean by “running” the program.
Previously, we defined a program q
as:
q ::= e | (defn f [x1 ... xn] e) q
And our language contained eight expression types.
e ::= c | v | (let [v e1] e2) | (if e1 e2 e3) | (f e1 ... en)
| (c e1 ... en) | (sample e) | (observe e1 e2)
In order to “run” a FOPPL program, we will define a function that evaluates an expression e
to a value c
(as opposed to graph-based implementation that evaluates an expression e
to a graph G
and a deterministic expression E
).
global ρf
function EVAL(e, σ, ρx)
match e
case (sample d)
> algorithm specific
case (observe d y)
> algorithm specific
case c
return c, σ
case (let [v1 e1] e2)
c1, σ ← EVAL(e1, σ, ρx)
return EVAL(e2, σ, ρx[v1 → c1])
case (if e1 e2 e3)
c1, σ ← EVAL(e1, σ, ρx)
if c1
then return EVAL(e2, σ, ρx)
else return EVAL(e3, σ, ρx)
case (e0 e1 ... en)
for i in [1, ..., n] do
ci, σ ← EVAL(ei, σ, ρx)
match e0
case f
(v1, ..., vn), e0' ← ρf(f)
return EVAL(e0', σ, ρx[v1 → c1, ..., vn → cn])
case c
return (c (c1, ..., cn), σ)
The meaning of the variables used:
ρf
- a global environment for user-defined procedures.ρx
- a local environment mapping variablesv
to constantsc
.σ
- a mapping of inference state variables to the log of the unnormalized importance weightW
. This allows us to store variables needed for inference, which are computed as a side-effect of the computation.
It remains to define evaluation for sample
and observe
forms. As we described at a high-level, these evaluation rules are algorithm-dependent. For likelihood weighting, we want to draw from the prior when evaluating sample
expressions, and update the importance weight when evaluating observe
expressions. Below, we show pseudo-code for an implementation of these operations for likelihood weighting.
global ρf, e
function EVAL(e, σ, ρx)
match e
case (sample e')
d, σ ← EVAL(e', σ, ρx)
return (SAMPLE(d), σ)
case (observe e1 e2)
d1, σ ← EVAL(e1, σ, ρx)
c2, σ ← EVAL(e2, σ, ρx)
σ(log W) ← σ(log W) + LOG-PROB(d1, c2)
return c2, σ
function LIKELIHOOD-WEIGHTING(L)
    for l in 1, ..., L do
        σ ← [logW → 0]
        r^l, σ^l ← EVAL(e, σ, [])
        logW^l ← σ^l(logW)
    return ((r^1, logW^1), ..., (r^L, logW^L))
This algorithm generates a sequence of weighted samples by simply running the program repeatedly. Unlike the algorithms that were defined in the graphical inference chapter, this algorithm does not require any explicit representation of the graph of conditional dependencies between variables. This implementation of likelihood weighting does not even track how many sample
and observe
statements a program evaluates. Instead, it draws from the prior as needed, and accumulates log probabilities when evaluating observe
expressions.
Aside 1: Relationship between Evaluation and Inference Rules
In order to evaluate an expression e, we first evaluate its sub-expressions and then compute the value of the expression from the values of its sub-expressions. When defining the translation (inference) rules for graphical models, we implicitly followed the same pattern.
Inference rules do not only formally specify how our translation should behave, but also give us a recipe for how to implement a recursive TRANSLATE
operation for each expression type.
We can similarly use inference rules to define our evaluation-based likelihood weighting model.
Aside 2: Side Effects and Referential Transparency
If we do not include sample and observe in our syntax, then our FOPPL is not only deterministic, but also pure in a functional sense, i.e. there are no side effects. An implication of this is that it would be referentially transparent. Once we incorporate sample and observe into our language, our language is no longer functionally pure.
A sample
expression does not always evaluate to the same value, and is therefore referentially opaque; by extension, any expression containing a sample
form as a sub-expression is also opaque.
An observe
expression observe e1 e2
always evaluates to the same value as long as e2
is referentially transparent. However, observe
expressions have a side effect which is that they increment the log weight stored in the inference state σ(logW)
.
The distinction between referentially transparent and opaque expressions also implicitly showed up in our compilation procedure for graphical models – there, we translate an opaque program into a set of target-language expressions for conditional probabilities which were referentially transparent. In these target-language expressions, each sub-expression corresponding to sample
or observe
was replaced with a free variable v
. If a translated expression has no free variables, then the original untranslated expression is referentially transparent. Partial evaluation exploited this property to replace all target-language expressions without free variables with their values. We also relied on this property to ensure that observe
forms observe e1 e2
always contained a referentially transparent expression for the observed value e2.
4.2 Metropolis-Hastings
Previously, we used evaluation to generate samples from the program prior while calculating the likelihood associated with these samples as a side-effect of computation. We can use this same strategy to define MCMC algorithms. With Gibbs sampling, we explicitly made use of the conditional dependency graph in order to identify the minimal set of variables needed to compute the acceptance ratio.
Metropolis-Hastings methods generate a Markov chain of program return values by accepting or rejecting a newly proposed sample according to the following algorithm:
- Initialise the current sample
X
. ReturnX^1
←X
. - For each subsequent sample
s = 2, ..., S
:- Generate a proposal
X' ~ q(X' | X)
- Calculate the acceptance ratio
α = p(Y', X') q(X | X') / (p(Y, X) q(X' | X))
- Update the current sample X ← X' with probability min(1, α); otherwise keep X. Return X^s ← X.
- Generate a proposal
An evaluation-based MH
sampler needs to do two things.
- It needs to be able to run the program to generate a proposal, conditioned on the values
Ⲭ
ofsample
expressions that were previously evaluated. - It needs to be able to compute the acceptance ratio
α
as a side effect.
Let's consider a simplified version of this algorithm: independent proposals from the prior.
Suppose that we defined q(X'|X) = p(X')
, in other words, the probability of the proposed latent variable X'
given the previous latent variable X
according to the proposal distribution q
is equal to the probability of the proposed latent variable X'
according to the prior distribution. In other words, at each step we generate a proposal sample X' ~ p(X)
from the program prior, which is independent of the previous sample X
.
The acceptance ratio simplifies to:
α = p(Y' | X') / p(Y | X)
In other words, when we propose X'
from the prior, the acceptance ratio is simply the ratio of the likelihoods. Since our likelihood weighting algorithm computes σ(log W) = log p(Y|X)
as a side effect, we can re-use the evaluator previously implemented, and simply evaluate the acceptance ratio as p(Y'|X')/p(Y|X)
, where p(Y'|X')
is the likelihood of the proposal and p(Y|X)
is the likelihood of the previous sample.
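For the same assumed beta-normal model as in the likelihood-weighting sketch above, independent proposals from the prior give exactly this likelihood-ratio acceptance rule:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_likelihood(x):
    return norm.logpdf(0.9, loc=x, scale=0.5)  # log p(y | x), y = 0.9 observed

x = rng.beta(2.0, 2.0)                         # initialize from the prior
chain = []
for s in range(10_000):
    x_prop = rng.beta(2.0, 2.0)                # X' ~ p(X), independent of X
    log_alpha = log_likelihood(x_prop) - log_likelihood(x)
    if rng.uniform() < np.exp(min(0.0, log_alpha)):
        x = x_prop
    chain.append(x)
```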
4.2.1 Single-site Proposals
The previous algorithm was so simple because we have side-stepped the difficult operations in the more general MH algorithm, and assumed that q(X'|X) = p(X')
.
In order to generate a proposal, we have to:
- Run our program in a manner that generates a sample
X' ~ q(X'|X)
which is conditioned on the values associated with our previous sample. - Calculate the probability of the reverse proposal
q(X|X')
in order to evaluate the acceptance ratio.
Both of these operations are complicated by the fact that X
and X'
potentially refer to different subsets of sample expressions.
let z = sample ( bernoulli 0.5)
µ = if (z == 0)
then sample (normal -1.0 1.0)
else sample (normal 1.0 1.0)
d = normal µ 1.0
y = 0.5
in observe d y
z
With graphical models, we would compile this model to a Bayesian network containing:
- X = {z, µ0, µ1}: three latent variables
- Y = {y}: one observed variable
If we now evaluate if
-expressions lazily in evaluation-based inference, this means that we will either sample µ0
when z == 0
or µ1
when z == 1
, but not both. This introduces a complication: what happens if we update z = 0
to z = 1
in the proposal? This now implies that X
contains a variable µ0
which is not defined for X'
, so q(X'|X)
cannot be evaluated. Conversely, X'
needs to instantiate a value for the variable µ1
which was not defined for X
.
In order to define an evaluation-based algorithm for constructing a proposal, we will need to construct a map σ(Ⲭ)
, such that Ⲭ(x)
refers to the value of a latent variable x
.
In order to calculate the acceptance ratio, we will also need to construct a map σ(logℙ)
. With graphical-based inference, we used a map ℙ(v)
that evaluates the density for each random variable v ∈ X ∪ Y
. In our evaluation-based algorithm, we will use σ(logℙ(v))
to store the log density σ(logℙ(x)) = LOG-PROB(d, Ⲭ(x))
for each sample expression (sample d)
and to store the log density σ(logℙ(y)) = LOG-PROB(d, c)
for each observe expression (observe d c)
.
Let's now define the most commonly used evaluation-based proposal for probabilistic programming systems: the single-site Metropolis-Hastings update.
In this algorithm, we change the value of one variable x0, keeping the values of other variables fixed whenever possible. To do so, we sample x0 from the program prior (as well as any variables which aren’t yet in the current Ⲭ, i.e. haven’t been instantiated). For all other latent variables, we reuse the values Ⲭ(x).
This strategy can be summarized in the following algorithm:

- Pick a variable x0 ∈ Ⲭ at random from the current sample.
- Construct a proposal Ⲭ', ℙ' by rerunning the program:
    - For expressions sample d with variable x:
        - If x = x0 or x ∉ dom(Ⲭ), then sample Ⲭ'(x) ~ d. Otherwise, reuse the value Ⲭ'(x) ← Ⲭ(x).
        - Calculate the probability ℙ'(x) ← PROB(d, Ⲭ'(x)).
    - For expressions observe d c with variable y:
        - Calculate the probability ℙ'(y) ← PROB(d, c).
The acceptance ratio is then given by:

α = p(Y', X') q(X | X') / (p(Y, X) q(X' | X))
  = [p(Y', X') / q(X' | X, x0)] * [q(X | X', x0) / p(Y, X)] * [q(x0 | X') / q(x0 | X)]
- The ratio q(x0 | X') / q(x0 | X) accounts for the relative probability of selecting the initial site. Since x0 is chosen uniformly at random, this ratio is the size of the set X divided by the size of the set X':

q(x0 | X') / q(x0 | X) = |X| / |X'|
- The joint probability p(Y', X') is simply the product of the probabilities of the latent and observed variables found in ℙ':

p(Y', X') = ∏_{y∈Y'} ℙ'(y) ∏_{x∈X'} ℙ'(x)

where X' = dom(Ⲭ') and Y' = dom(ℙ') \ X'.
- To calculate the probability q(X' | X, x0), we decompose the set of variables X' = X'sampled ∪ X'reused into the set of sampled variables X'sampled and the set of reused variables X'reused. The set of sampled variables is given by X'sampled = {x0} ∪ (dom(Ⲭ') \ dom(Ⲭ)), i.e. the randomly chosen variable x0 that we resample, and any latent variables in Ⲭ' that aren't in the domain of Ⲭ. Conversely, Xsampled = {x0} ∪ (dom(Ⲭ) \ dom(Ⲭ')).
Since all variables in X'sampled were sampled from the program prior, the proposal probability is:

q(X' | X, x0) = ∏_{x ∈ X'sampled} ℙ'(x)
- The ratio p(Y', X') / q(X' | X, x0) simplifies to:

p(Y', X') / q(X' | X, x0) = [∏_{y∈Y'} ℙ'(y) ∏_{x∈X'} ℙ'(x)] / ∏_{x∈X'sampled} ℙ'(x)
                          = ∏_{y∈Y'} ℙ'(y) ∏_{x∈X'reused} ℙ'(x)
The ratio p(Y, X) / q(X | X', x0) is similarly:

p(Y, X) / q(X | X', x0) = [∏_{y∈Y} ℙ(y) ∏_{x∈X} ℙ(x)] / ∏_{x∈Xsampled} ℙ(x)
                        = ∏_{y∈Y} ℙ(y) ∏_{x∈Xreused} ℙ(x)
The set of reused variables Xreused is identical to X'reused:

X'reused = (dom(Ⲭ') ∩ dom(Ⲭ)) \ {x0} = Xreused
- Putting all the terms together, the acceptance ratio becomes:

α = min {1, [|dom(Ⲭ)| ∏_{y∈Y'} ℙ'(y) ∏_{x∈X'reused} ℙ'(x)] / [|dom(Ⲭ')| ∏_{y∈Y} ℙ(y) ∏_{x∈Xreused} ℙ(x)]}
Addressing Transformation
In defining the acceptance ratio, we have assumed that we can associate a variable x
or y
with each sample
or observe
expression. We did this in our compilation of a graphical model. In the context of evaluation-based methods, this type of unique identifier for a sample
/observe
expression is commonly referred to as an address.
If needed, unique addresses can be constructed dynamically at run-time. For programs in the FOPPL, we can create addresses using a source code transformation that is similar to our compilation in graph-based inference. In this transformation, we replace each expression (sample e) with (sample v e), in which v is a newly created variable. Similarly, we replace (observe e1 e2) with (observe v e1 e2).
4.3 Sequential Monte Carlo
SMC:
sample(..)
sample(..)
observe(..) -- y_0
sample(..)
sample(..)
sample(..)
observe(..) -- y_1
sample(..)
...
observe(..) -- y_n
We create N truncations of the program, one at each observe expression yₙ.
For n = 1 ... N
:
For l = 1 ... L
:
- We run the program L
times to generate L
sets of samples up to the n
th observe expression y_n
, (For sample expressions preceding y_n-1
, we reuse sample values from the k
th particle set of the previous run n-1
, where k
is selected from a discrete distribution weighted by each of the L
particle sets' log probability. For sample expressions between y_n-1
and y_n
, we sample new values from the prior.). This creates {(Xˡₙ)}ᴸₗ₌₁
.
- We compute the log probability at the statement y_n
, which is logΛˡₙ - logΛᵏₙ₋₁
(the total log probability up to observe state y_n
minus the total log probability up to observe statement y_n-1
, reusing the sampled values from the k
th particle set up to y_n-1
).
- The importance weight logWˡₙ
is then calculated as logWˡₙ = logΛˡₙ - logΛᵏₙ₋₁ + logẐₙ₋₁
, where logẐₙ₋₁
is log (1/L Σₗ Wˡₙ₋₁)
(the log mean of the weights of the previous particle sets). This creates {(Wˡₙ)}ᴸₗ₌₁
.
- This also gives a running estimate of the log normalizing constant up to observe statement y_n: logẐₙ = log (1/L Σₗ Wˡₙ), i.e. the log mean weight of the n-th particle set (the Ẑₙ₋₁ factor is already folded into the weights Wˡₙ).
Previously, in Metropolis-Hastings, we used a likelihood-weighting algorithm where in each run of the program we guess by sampling an X^l from the program prior, and then check whether this is in fact a good proposal by calculating a weight W^l = p(Y|X^l) according to the probabilities of the observe expressions in the program. Unfortunately, this is not necessarily efficient. In order to end up with highly weighted samples, we have to generate reasonable values for all random variables X. Even if we are always improving our proposed sample at each program run, it may take a very long time to end up with samples that are weighted highly enough to consider them good, because the frequency with which we generate good proposals decreases exponentially with the number of sample statements.
Sequential Monte Carlo (SMC) methods solve this problem by turning a sampling problem for high dimensional distributions into a sequence of sampling problems for lower dimensional distributions. In their most general form, SMC methods consider a sequence of N
unnormalized densities γ1(X1), ..., γN(XN)
.
Recall that a FOPPL program defines an unnormalized density:

γ(X) = ψ(X)p(X)
     = p(Y, X)

where ψ(X) is a strictly positive potential which replaces the likelihood p(Y|X).

Here, γ1(X1) is typically a low-dimensional distribution for which it is easy to perform importance sampling, whereas γN(XN) is a high-dimensional distribution, for which we want to generate samples. Each γn(Xn) in between increases in dimensionality to interpolate between these two distributions. For a FOPPL program, we can define γN(XN) (the density from which we want to generate samples) as γ(X) = p(Y, X), i.e. the joint density associated with the program.
Given a set of unnormalized densities γn(Xn)
, SMC sequentially generates L
weighted samples:
{(Xˡₙ, Wˡₙ)}
by performing importance sampling for each of the N
normalized densities:
πₙ(Xₙ) = γₙ(Xₙ)/Zₙ
To do this, we:
-
Initialise a weighted set of particles
{(Xˡₙ, Wˡₙ)}ᴸₗ₌₁
using importance sampling:For
l = 1, ..., L
whereL
is the number of particles:Xˡ₁ ~ q₁(X₁) Wˡ₁ := γ₁(Xˡ₁)/q₁(Xˡ₁)
where the weight is the unnormalized density divided by the proposal density.
-
For each subsequent generation
n = 2, ..., N
:For
l = 1, ..., L
whereL
is the number of particles:- Resampling Step: Select a particle
Xᵏₙ₋₁
from the preceding set, where the indexk
is found by sampling an ancestor indexk = aˡₙ₋₁
with probability proportional toWᵏₙ₋₁
. In other words, given the previous set of weighted samples{(Xˡₙ₋₁, Wˡₙ₋₁)}ᴸₗ₌₁
, select a sampleXᵏₙ₋₁
from the set with probability according to its weight in the set.
Xᵏₙ₋₁, where k = aˡₙ₋₁ ~ Discrete (W¹ₙ₋₁ / Σₗ Wˡₙ₋₁ , . . . , Wᴸₙ₋₁ / Σₗ Wˡₙ₋₁ )
- Construct New Proposals: Generate a proposal
Xˡₙ
conditioned on the selected particleXᵏₙ₋₁
from the preceding set:
Xˡₙ ~ qₙ(Xˡₙ | Xᵏₙ₋₁)
And define the importance weights:

Wˡₙ = [γₙ(Xˡₙ) / ( γₙ₋₁(Xᵏₙ₋₁) * qₙ(Xˡₙ | Xᵏₙ₋₁) )] * Ẑₙ₋₁
γₙ(Xˡₙ)
is the unnormalized density of the proposal Xˡₙ.γₙ₋₁(Xᵏₙ₋₁)
is the unnormalized density of the selected particleXᵏₙ₋₁
from the preceding set.qₙ(Xˡₙ | Xᵏₙ₋₁)
is the proposal density of the proposalXˡₙ
conditioned on the selected particleXᵏₙ₋₁
from the preceding set.Ẑₙ₋₁
is the average weight of the preceding set of particles,1/L ΣWˡₙ₋₁
- Resampling Step: Select a particle
We can think of a resampling step as performing natural selection on the particle set: particles Xᵏₙ₋₁ with a high weight Wᵏₙ₋₁ will be used more often to construct proposals. In other words, SMC uses the weight of a sample at generation n - 1 as a heuristic for the weight that proposals constructed from it will attain at generation n.
4.3.1 Defining Intermediate Densities with Breakpoints
A FOPPL program defines an unnormalized distribution γ(X) = p(Y, X)
.
When inference is performed with SMC, we define the final density as γN (XN) = γ(X)
.
In order to define intermediate densities γₙ(Xₙ) = p(Yₙ, Xₙ)
, we consider a sequence of truncated programs that evaluate successively larger subsets of sample
and observe
expressions.
X1 ⊆ X2 ⊆ . . . ⊆ XN = X
Y1 ⊆ Y2 ⊆ . . . ⊆ YN = Y
A truncated program is a program that halts at a breakpoint. Breakpoints can be specified explicitly by the user, constructed using program analysis, or even defined dynamically at run-time. The sequence of breakpoints needs to satisfy the following two properties:
- The breakpoint for generation
n
must always occur after the breakpoint for generationn - 1
. - Each breakpoint needs to occur at an expression that is evaluated in every execution of a program. This means that breakpoints should not be associated with expressions inside branches of
if
-expressions.
In this section, we will assume that we first apply the addressing transformation (assigning fresh variables to sample and observe statements) to the FOPPL program during desugaring. We then assume that the user identifies a sequence of symbols y₁, ..., y_{N-1} for observe expressions that satisfy the two properties above.
4.3.2 Calculating the Importance Weight
So far we have defined a notion of intermediate densities γₙ(Xₙ)
for FOPPL programs.
We now need to specify a mechanism for generating proposals from a distribution qₙ(Xₙ|Xₙ₋₁). The SMC analogue of likelihood weighting is to simply sample from the program prior p(Xₙ|Xₙ₋₁), which is sometimes known as a bootstrap proposal. Hence, we set:

qₙ(Xₙ|Xₙ₋₁) = p(Xₙ|Xₙ₋₁)

so that proposing Xₙ ~ qₙ(Xₙ|Xₙ₋₁) amounts to sampling Xₙ ~ p(Xₙ|Xₙ₋₁) from the prior.
We know that the unnormalized density is γₙ(Xₙ) = p(Yₙ, Xₙ)
. For this proposal, we can express the unnormalized density γₙ(Xₙ)
in terms of γₙ₋₁(Xₙ₋₁)
:
γₙ(Xₙ) = p(Yₙ, Xₙ)
= p(Yₙ | Yₙ₋₁, Xₙ) * p(Xₙ | Xₙ₋₁) * p(Yₙ₋₁, Xₙ₋₁)
= p(Yₙ | Yₙ₋₁, Xₙ) * p(Xₙ | Xₙ₋₁) * γₙ₋₁(Xₙ₋₁)
If we substitute this expression into the equation for calculating the importance weights for a new proposal for particle l
at particle set n
:
Wˡₙ = [γₙ(Xˡₙ) / ( γₙ₋₁(Xᵏₙ₋₁) * qₙ(Xˡₙ | Xᵏₙ₋₁) )] * Ẑₙ₋₁
    = [p(Yₙ | Xˡₙ) / p(Yₙ₋₁ | Xᵏₙ₋₁)] * Ẑₙ₋₁
    = Ẑₙ₋₁ * ∏_{y ∈ Yₙ\Yₙ₋₁} p(y | Xˡₙ)
In other words, up to the factor Ẑₙ₋₁, the importance weight at each generation is the joint probability of the observe expressions that have been evaluated at breakpoint n but not at breakpoint n - 1.
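A sketch of these updates as a bootstrap particle filter for an assumed linear-Gaussian state-space model, x_n ~ N(x_{n-1}, 1) and y_n ~ N(x_n, 1); because the weights are normalized just before resampling, the common Ẑₙ₋₁ factor is left implicit here:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

ys = [0.5, 1.2, 0.3, 2.1]             # assumed observations y_1 ... y_N
L = 1_000                             # number of particles

X = rng.normal(0.0, 1.0, size=L)      # initial particles from the prior
logW = norm.logpdf(ys[0], loc=X, scale=1.0)
for y in ys[1:]:
    w = np.exp(logW - logW.max())
    w /= w.sum()
    k = rng.choice(L, size=L, p=w)    # resample ancestor indices a^l
    X = rng.normal(X[k], 1.0)         # bootstrap proposal from the prior
    logW = norm.logpdf(y, loc=X, scale=1.0)  # incremental weight: the new observe
```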
4.3.3 Evaluating Proposals
To implement SMC, we will introduce a function PROPOSE(Ⲭₙ₋₁, yₙ).
This function:
- Evaluates the program that truncates at the
observe
expression with addressyₙ
, conditioned on previously sampled valuesⲬₙ₋₁
- Returns a pair (Ⲭₙ, logΛₙ) containing a map Ⲭₙ of values associated with each sample expression, and the log likelihood logΛₙ = log p(Yₙ|Xₙ).

To construct the proposal for the final generation, we call PROPOSE(Ⲭ_{N-1}, nil), which returns a pair (r, logΛ) in which the return value r replaces the values Ⲭ.
Below we define the PROPOSE function and its evaluator.
global ρ, e

function EVAL(e, σ, ρx)
    match e
        case (sample v e')
            d, σ ← EVAL(e', σ, ρx)
            if v ∉ dom(σ(Ⲭ)) then
                σ(Ⲭ(v)) ← SAMPLE(d)
            return σ(Ⲭ(v)), σ
        case (observe v e1 e2)
            d, σ ← EVAL(e1, σ, ρx)
            c, σ ← EVAL(e2, σ, ρx)
            σ(logΛ) ← σ(logΛ) + LOG-PROB(d, c)
            if v = σ(yₙ)
                then throw RESAMPLE-BREAKPOINT()
            return c, σ

function PROPOSE(Ⲭ, y)
    σ ← [yₙ → y, Ⲭ → Ⲭ, logΛ → 0]
    try
        r, σ ← EVAL(e, σ, [])
        return r, σ(logΛ)
    catch RESAMPLE-BREAKPOINT()
        return σ(Ⲭ), σ(logΛ)
- For
sample
expressions, we use previously sampled valuesⲬ(v)
for previously sampled variablesv
and sample from the prior for new variablesv
- For observe expressions, we accumulate log probability into a state variable logΛ. When we reach the observe expression with the specified symbol yₙ, we terminate the program by throwing a special-purpose RESAMPLE-BREAKPOINT exception.
- In
propose
, there is an implicit indexn
for1 ... n
attached toyn
. Forn < N
(N
being the number of unnormalized densities in the sequence of unnormalized densitiesγ1(X1), ... , γN(XN)
), we evaluate a truncated form of the program up to then
‘thobserve
expression, and simply return the sampled variablesⲬ
up until that point. Forn = N
, we finally return the valuer
outputted from the entire program (which represents a sample from the final unnormalized densityγN(XN)
.
Below is the Sequential Monte Carlo algorithm, which produces the weighted particle sets {(Xˡₙ, Wˡₙ)}:

- l in 1 ... L is the index of a particle within a particle set.
- n in 1 ... N is the index of the particle set, corresponding to the n-th unnormalized density in the sequence of truncated unnormalized densities.
function SMC(L, y₁, ..., y_{N-1})
-- Set the log average weight to 0
logẐ₀ ← 0
-- Generate the initial set of L particles and their weights for n = 1
for l in 1, ..., L
Ⲭˡ₁, logΛˡ₁ ← PROPOSE([], y₁)
logWˡ₁ ← logΛˡ₁
-- For n = 2, ... N-1, generate a weighted set {(Xˡₙ, Wˡₙ)} for l = 1, ..., L
-- When having reached n = N, generate a final weighted set {(rˡ, Wˡ)} for l = 1, ..., L
for n in 2, ..., N
logẐₙ₋₁ ← LOG-MEAN-EXP(logW¹ₙ₋₁, ..., logWᴸₙ₋₁)
-- For l = 1, ..., L
for l in 1, ..., L
k ~ Discrete (W¹ₙ₋₁ / Σₗ Wˡₙ₋₁ , . . . , Wᴸₙ₋₁ / Σₗ Wˡₙ₋₁ )
if n < N
(Ⲭˡₙ, logΛˡₙ) ← PROPOSE(Ⲭᵏₙ₋₁, yₙ)
else
(rˡ, logΛˡ_N) ← PROPOSE(Ⲭᵏ_{N-1}, nil)
logWˡₙ ← logΛˡₙ - logΛᵏₙ₋₁ + logẐₙ₋₁
return ((r¹, logW¹_N), ..., (r^L, logW^L_N))
5. Probabilistic Programming with Recursion
The use of infinitely many random variables arises with the introduction of stochastic recursion.
6. Evaluation-Based Inference with HOPPL
The inference algorithms in this chapter use program evaluation as one of their core subroutines; however, to more clearly illustrate how evaluation-based inference can be implemented by extending existing languages, we abandon the definition of inference algorithms in terms of evaluators, and instead favor a more language-agnostic formulation.
We hence define inference methods as non-standard schedulers of HOPPL programs. The guiding intuition in this formulation is that the majority of operations in HOPPL programs are deterministic and referentially transparent with the exception of sample
and observe
which are stochastic and have side-effects. In the evaluators we previously defined for the FOPPL, this is reflected in the fact that only sample
and observe
expressions are algorithm specific.
The abstraction we employ is that of a program as a deterministic computation that can be interrupted at sample and observe expressions. Here, the program cedes control to an inference controller, which implements probabilistic and stochastic operations in an algorithm-specific manner.
6.1 Explicit Separation of Model and Inference Code
We assume a probabilistic program is a deterministic computation that is interrupted at every sample
and observe
expression. Inference is carried out using a controller process. The controller needs to be able to start executions of a program, receive the return value when an execution terminates, and finally control program execution at each sample
and observe
expression.
The inference controller interacts with program executions via a messaging protocol. When a program reaches a sample
/observe
expression, it sends a message back to the controller and waits for a response. This message will typically include an address for the random variable, and a representation of the fully evaluated arguments to sample
and observe
. The controller then performs any operations necessary for inference, and sends a message back to the running program. The message indicates whether the program should continue execution, fork itself and execute multiple times, or halt.
This interface defines an abstraction boundary between program execution and inference.
Example: Likelihood Weighting
In the evaluation-based implementation of likelihood weighting for the FOPPL, we evaluated sample expressions by drawing from the prior, and incremented the log importance weight at every observe expression.
The inference controller for this inference strategy would repeat the following operations:
- The controller starts a new execution of the HOPPL program and initialises its log weight
logW = 0.0
. - The controller repeatedly receives messages from the running program, and dispatches based on type:
- At a
sample d
form, the controller samplesx
from the distributiond
and sends the sampled valuex
back to the program to continue execution. - At an
observe d c
form, the controller incrementslogW
with the log probability ofc
underd
and sends a message to continue execution. - If the program has terminated with value
c
, the controller stores a weighted sample(c, logW)
and exits the loop.
- At a
Messaging Interface
In the above inference algorithm (likelihood weighting), the response which the inference controller sends back is always to continue execution.
To support algorithms such as SMC, the program execution process will additionally need to implement a forking
operation, which starts multiple independent processes that each resume from the same point in the execution.
To support these operations, we will define an interface in which an inference process can send four messages to the execution process.
(start, σ)
: Start a new execution with process idσ
.(continue, σ, c)
: Continue execution for the process with idσ
, usingc
as the argument value.(fork, σ, σ', c)
: Fork the process with idσ
into a new process with idσ'
and continue execution with argumentc
.(kill, σ)
: Terminate the process with idσ
.
We assume that the program execution process can send three types of messages to the inference controller:
(sample, σ, α, d)
: The execution with idσ
has reached asample
expression with addressα
and distributiond
.(observe, σ, α, d, c)
: The execution with idσ
has reached anobserve
expression with addressα
, distributiond
, and valuec
.(return, σ, c)
: The execution with idσ
has terminated with return valuec
.
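A sketch of this protocol using Python generators as interruptible executions; the message shapes, addresses, and model below are illustrative rather than the text's exact interface. The program yields sample/observe messages, and a likelihood-weighting controller replies to each one:

```python
import math
import random

def program():
    # The execution process: yields messages and receives replies.
    x = yield ('sample', 'addr_x', ('normal', 0.0, 1.0))
    yield ('observe', 'addr_y', ('normal', x, 1.0), 0.5)
    return x

def normal_logpdf(v, mu, sigma):
    return (-0.5 * ((v - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def run_likelihood_weighting():
    proc, log_w, reply = program(), 0.0, None
    try:
        while True:
            msg = proc.send(reply)                    # receive the next message
            if msg[0] == 'sample':
                _, _, (_, mu, sigma) = msg
                reply = random.gauss(mu, sigma)       # draw from the prior, continue
            elif msg[0] == 'observe':
                _, _, (_, mu, sigma), c = msg
                log_w += normal_logpdf(c, mu, sigma)  # increment logW, continue
                reply = None
    except StopIteration as ret:                      # the terminated/return message
        return ret.value, log_w
```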
Implementations of Interruption and Forking
To implement this interface, program execution needs to support interruption, resuming, and forking.
- Interruption is relatively straightforward. In the case of the HOPPL, we will assume two primitives
(send µ)
and(receive σ)
. At everysample
andobserve
, we send a messageµ
to the inference process, and then receive a response with process idσ
. The call toreceive
then effectively pauses the execution until a response arrives. - Forking can be implemented in multiple ways.
- Previously in the FOPPL, we wrote evaluators that could be conditioned on a trace of random values to re-execute a program in a deterministic manner (i.e. reusing recently sampled values). This strategy can also be used to implement forking; we could simply re-execute the program from the start, conditioning on values of
sample
expressions that were already evaluated in the parent execution. This implementation is not particularly efficient, since it requires that we re-execute the program once for everyobserve
in the program, resulting in a computational cost that is quadratic in the number of observe expressions, rather than linear.
- Forking becomes much more straightforward when we restrict the modeling language to prohibit mutable state. We have exactly two stateful operations:
sample
andobserve
: all other operations are guaranteed to have no side effects. In languages without mutable state, there is no need to copy the memory associated with a process during forking, since a variable cannot be modified in place once it h as been defined.
- Previously in the FOPPL, we wrote evaluators that could be conditioned on a trace of random values to re-execute a program in a deterministic manner (i.e. reusing recently sampled values). This strategy can also be used to implement forking; we could simply re-execute the program from the start, conditioning on values of
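To illustrate the first strategy, here is a hedged Python sketch of forking by re-execution, reusing the generator convention from the earlier sketch. The parent's recorded trace of sampled values is replayed from the start; once the trace is exhausted, the child execution diverges:

```python
def fork_by_reexecution(run_program, trace):
    """Fork a paused execution by re-running the program from scratch.

    `trace` holds the values drawn at sample expressions by the parent
    execution, in order. Replaying them makes the re-execution
    deterministic up to the fork point; past it, fresh values are drawn.
    """
    execution = run_program()
    replayed, logW, reply = 0, 0.0, None
    while True:
        try:
            msg = execution.send(reply)
        except StopIteration as stop:
            return stop.value, logW
        if msg[0] == "sample":
            _, d = msg
            if replayed < len(trace):
                reply = trace[replayed]   # reuse the parent's value
                replayed += 1
            else:
                reply = d.sample()        # past the fork point: sample fresh
        elif msg[0] == "observe":
            _, d, c = msg
            logW += d.log_prob(c)
            reply = None
```

Because every fork re-runs the program from the beginning, forking at each of a program's N observe points costs O(N^2) total work, which is the inefficiency noted above.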
In the HOPPL, we will implement support for interruption and forking of program executions by way of a transformation to continuation-passing style (CPS), a standard technique for supporting interruption of programs in purely functional languages.
We now describe two source code transformations for the HOPPL. The first is an addressing transformation, which associates a unique address with the messages that need to be sent at each `sample` and `observe` expression. The second converts the HOPPL program to CPS. Unlike the graph compiler and the custom evaluators, these code transformations take HOPPL programs as input and produce HOPPL programs as output; they do not change the language.
6.2 Addressing Transformation
An addressing transformation rewrites the source code of the program into a new program that performs the original computation whilst keeping track of an address, a representation of the current execution point of the program.
This address uniquely identifies any sample
or observe
expression that can be reached in the course of an execution of a program. Since HOPPL can evaluate an unbounded number of these expressions, the transformation used previously to introduce addresses is not applicable here, since that transformation inlines the bodies of all function applications to create an exhaustive list of sample
and observe
statements.
The most familiar notion of an address is a stack trace, which is encountered whenever debugging a program that has prematurely terminated. In functional programming languages, a stack trace effectively provides a unique identifier for the current location in the program execution. This allows us to associate a unique address with each sample
and observe
expression at run-time, which we can then use in our implementations of inference methods.
The addressing transformation follows this design:
- All function calls, `sample` statements, and `observe` statements are modified to take an additional argument which provides the current address. We will use the symbol `α` to refer to the address argument, which must be a fresh variable.
The addressing transformation is the relation

e, α ⇓addr e'

which translates a HOPPL expression `e` and a variable `α` to a new expression which incorporates addresses.
We additionally define a second relation ↓addr that operates on the top-level HOPPL program `q`. This second relation serves to define the top-level outer address, that is, the base of the stack trace.
Variables, procedure names, constants, if
Addresses track the call stack. Hence evaluation of expressions that do not increase the depth of the call stack leaves the address unaffected. This includes constants `c`, variables `v`, and procedure names `f`.

c is a constant value
─────────────────────    ─────────────    ─────────────
    c, α ⇓addr c         v, α ⇓addr v     f, α ⇓addr f
Primitive procedures are transformed to accept an address argument. Since we are not able to “step into” primitive procedure calls, these calls do not increase the depth of the call stack. This means that the address argument to primitive procedure calls can be ignored in the function body.
c is a primitive function with n arguments
────────────────────────────────────────────
c, α ⇓addr (fn [α v1 ... vn] (c v1 ... vn))
User-defined functions are also transformed to accept an address argument, which may be referenced in the function body. The translated expression of the function body, e'
, may contain a free variable α
, which must be a unique symbol that cannot occur anywhere in the original expression e
.
e, α ⇓addr e'
──────────────────────────────────────────────────
(fn [v1 ... vn] e), α ⇓addr (fn [α v1 ... vn] e')
Evaluation of `if` expressions also does not modify the address in our implementation, so we only need to translate each of the sub-expressions.
e1, α ⇓addr e1' e2, α ⇓addr e2' e3, α ⇓addr e3'
──────────────────────────────────────────────────
(if e1 e2 e3), α ⇓addr (if e1' e2' e3')
Functions, sample, and observe
So far, we have simply threaded an address through the entire program, but this address has not been modified in any of the expression forms above. We increase the depth of the call stack at every function call:
ei, α ⇓addr ei' for i = 0 ... n c <- fresh value
───────────────────────────────────────────────────────────────
(e0 e1 ... en), α ⇓addr (e0' (push-addr α c) e1' ... en')
Here, we:
- Translate the expression `e0`, which yields a transformed function `e0'` that now accepts an address argument.
- Update the address argument `α` to reflect that we are now nested one level deeper in the call stack. To do so, we assume a primitive `(push-addr α c)` which creates a new address by combining the current address `α` with some unique identifier `c` generated at translation time. The translated expression will contain a new free variable `α`, since this variable is unbound in the expression `(push-addr α c)`. We will bind `α` to a top-level address using the `↓addr` relation.
If we take the stack trace metaphor literally, then we can think of `α` as a list-like data structure, and `push-addr` as an operation that appends a new unique identifier `c` to the end of this list. Alternatively, `push-addr` could compute some sort of hash of `α` and `c` to yield an address of constant size, regardless of recursion depth.
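Both variants are straightforward to sketch. The following Python implementations are illustrative only; the names and address encodings are not part of the HOPPL:

```python
import hashlib

def push_addr_list(alpha, c):
    """Stack-trace style: append the fresh identifier to a list.
    Addresses grow with call depth but remain human-readable."""
    return alpha + [c]

def push_addr_hash(alpha, c):
    """Constant-size style: hash the current address together with the
    identifier, so addresses stay small regardless of recursion depth."""
    return hashlib.sha256("{}/{}".format(alpha, c).encode()).hexdigest()[:16]

# e.g. push_addr_list(["run", "c0"], "c3")  ->  ["run", "c0", "c3"]
```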
The translation rules for `sample` and `observe` can be thought of as special cases of the rule for general function application.

e, α ⇓addr e'    c <- fresh
───────────────────────────────────────────────
(sample e), α ⇓addr (sample (push-addr α c) e')

e1, α ⇓addr e1'    e2, α ⇓addr e2'    c <- fresh
──────────────────────────────────────────────────────────
(observe e1 e2), α ⇓addr (observe (push-addr α c) e1' e2')
The result of this translation is that each sample
and observe
expression will now have a unique address associated with it. These are constructed dynamically at run time, but are well-defined in the sense that each sample
or observe
will have an address that is fully determined by its call stack.
Top-level addresses and program translation
Translation of function calls introduces an unbound variable `α` into the expression. To associate a top-level address with a program execution, we define a relation that translates the program body `e` and wraps it in a function which accepts an address argument.

Choose a fresh variable α    e, α ⇓addr e'
───────────────────────────────────────────
e ↓addr (fn [α] e')
For programs which include functions that are user-defined at the top level, this relation also inserts the additional address argument into each of the function definitions.
Choose a fresh variable α e, α ⇓addr e' q ↓addr q'
───────────────────────────────────────────────────────────────
(defn f [v1 ... vn] e) q ↓addr (defn f [α v1 ... vn] e') q'
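For concreteness, the rules above can be sketched in Python over HOPPL expressions encoded as nested lists. The encoding, the `fresh` helper, and the omission of the primitive-wrapping rule are our own simplifications:

```python
import itertools

_counter = itertools.count()

def fresh():
    """A unique identifier generated at translation time."""
    return "c{}".format(next(_counter))

def addr_transform(e, alpha="alpha"):
    """Addressing transformation over expressions such as
    ["sample", ["beta", 1, 1]]; desugared HOPPL only."""
    if not isinstance(e, list):          # constants, variables, names
        return e
    if e[0] == "fn":                     # (fn [v...] e)
        _, params, body = e
        return ["fn", [alpha] + params, addr_transform(body, alpha)]
    if e[0] == "if":                     # (if e1 e2 e3)
        return ["if"] + [addr_transform(ei, alpha) for ei in e[1:]]
    if e[0] == "sample":                 # (sample e)
        return ["sample", ["push-addr", alpha, fresh()],
                addr_transform(e[1], alpha)]
    if e[0] == "observe":                # (observe e1 e2)
        return ["observe", ["push-addr", alpha, fresh()],
                addr_transform(e[1], alpha), addr_transform(e[2], alpha)]
    # general application: (e0 e1 ... en) -> (e0' (push-addr α c) e1' ... en')
    return [addr_transform(e[0], alpha), ["push-addr", alpha, fresh()]] + \
           [addr_transform(ei, alpha) for ei in e[1:]]

# ["sample", ["beta", 1, 1]] becomes
# ["sample", ["push-addr", "alpha", "c0"],
#            ["beta", ["push-addr", "alpha", "c1"], 1, 1]]
```

Wrapping the result as `["fn", ["alpha"], ...]` corresponds to the ↓addr rule, which binds the otherwise free address variable.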
6.3 Continuation-Passing-Style Transformation
Now that each function call in the program has been augmented with an address that tracks the location in the program execution, we next need to transform the program into CPS style so that we can pause, resume, and potentially fork it multiple times if needed.
We define the ⇓꜀ relation:
e, κ, σ ⇓꜀ e'
Here, `e` is an expression, `κ` is a continuation, and `e'` is the result of CPS-transforming `e` under the continuation `κ`. For purposes of generality, we incorporate an argument `σ` which serves to store mutable state, or any information that needs to be threaded through the computation.
In Anglican, `σ` holds any state that needs to be tracked by the inference algorithm, and thereby plays a role analogous to that of the state variable `σ` we used in our evaluators.
In the messaging interface that we define in this chapter, all inference state is stored by the controller process; moreover, there is no mutable state in the HOPPL. As a result, the only state that we need to pass to the execution is the process id. We use `σ` to refer to both the CPS state and the process id.
Because CPS-transformed expressions reduce to a continuation call of the form `κ σ v`, at every such call we can choose to either:
- Interrupt the computation and return a tuple `(κ, σ, v)`.
- Evaluate the continuation call `κ σ v` (i.e. `PApp [κ, σ, v]`).
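The second option, taken eagerly everywhere, simply runs the program; the first is what enables interruption. A driver loop, often called a trampoline, mediates between the two by repeatedly evaluating whatever the computation returns until a final value appears. A minimal Python sketch, where interruptions are encoded as zero-argument thunks rather than `(κ, σ, v)` tuples (an equivalent representation):

```python
def trampoline(step):
    """Run a CPS computation in constant stack space: each step returns
    either a zero-argument thunk (continue) or a final result (stop)."""
    while callable(step):
        step = step()            # evaluate the next continuation call
    return step

# Example: a CPS factorial that returns thunks instead of nesting calls.
def fact_cps(n, k):
    if n == 0:
        return lambda: k(1)
    return lambda: fact_cps(n - 1, lambda r: (lambda: k(n * r)))

print(trampoline(fact_cps(10, lambda r: r)))    # prints 3628800
```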
For general (non-probabilistic) expressions, we will always evaluate the continuation call:
- For variables, procedure names, and constants, the transformation takes the form:

  v, κ, σ ⇓꜀ (PApp κ σ v)

  For constants which are primitive functions, we need to transform them to their CPS form: functions which additionally take a continuation and state as parameters (shown here for a two-argument primitive):

  c_cps = (fn [v1 v2 κ σ] (κ σ (c v1 v2)))
  ─────────────────────────────────────────
  c, κ, σ ⇓꜀ (PApp κ σ c_cps)
- For `if` expressions, we transform both branches `e2` and `e3` under the continuation `κ`, giving `e2'` and `e3'`. We then create a continuation which, when applied to the value of `e1`, evaluates the appropriate transformed branch:

  e2, κ, σ ⇓꜀ e2'    e3, κ, σ ⇓꜀ e3'    e1, (fn [σ v] (if v e2' e3')), σ ⇓꜀ e'
  ──────────────────────────────────────────────────────────────────────────────
  (if e1 e2 e3), κ, σ ⇓꜀ e'
- All functions are turned into functions which additionally accept a continuation `κ'` and state `σ`, and whose function body `e` has been CPS-transformed under this continuation `κ'`. We then apply the continuation `κ` to the resulting function:

  e, κ', σ ⇓꜀ e'
  ────────────────────────────────────────────────────────────────
  (fn [v1 ... vn] e), κ, σ ⇓꜀ (PApp κ σ (fn [v1 ... vn κ' σ] e'))
- With application, the applied expression `e0` evaluates to a function value `v0` which takes arguments `v1 = e1, ..., vn = en`; once CPS-transformed, this function additionally takes a continuation `κ` and state `σ` as parameters. The innermost continuation body is hence the call `(v0 v1 ... vn κ σ)`; each argument `ei` is evaluated under a continuation that binds its value `vi` and proceeds with the next (see the sketch after this list):

  en, (fn [σ vn] (v0 v1 ... vn κ σ)), σ ⇓꜀ en'
  ei, (fn [σ vi] e'i+1), σ ⇓꜀ ei'    for i = (n - 1) ... 0
  ─────────────────────────────────────────────────────────
  (e0 e1 ... en), κ, σ ⇓꜀ e0'
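As with addressing, these rules can be sketched in Python over the nested-list encoding; the helper names are assumptions, a fixed state-variable name `sigma` is threaded by shadowing exactly as in the rules, and the primitive-constant rule is elided:

```python
import itertools

_ids = itertools.count()

def gensym(prefix):
    """Fresh variable names for continuations and bound values."""
    return "{}_{}".format(prefix, next(_ids))

def cps(e, k):
    """CPS transformation; `k` is an expression denoting the continuation."""
    if not isinstance(e, list):                  # variables and constants
        return [k, "sigma", e]                   # (PApp κ σ v)
    if e[0] == "fn":                             # (fn [v...] e)
        _, params, body = e
        k2 = gensym("kappa")
        return [k, "sigma",
                ["fn", params + [k2, "sigma"], cps(body, k2)]]
    if e[0] == "if":                             # (if e1 e2 e3)
        _, e1, e2, e3 = e
        v = gensym("v")
        return cps(e1, ["fn", ["sigma", v],
                        ["if", v, cps(e2, k), cps(e3, k)]])
    # application (e0 e1 ... en): evaluate each sub-expression under a
    # continuation binding its value; the innermost body makes the call
    vs = [gensym("v") for _ in e]
    acc = vs + [k, "sigma"]                      # (v0 v1 ... vn κ σ)
    for ei, vi in list(zip(e, vs))[::-1]:
        acc = cps(ei, ["fn", ["sigma", vi], acc])
    return acc
```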
With `observe` and `sample`, we perform a CPS transformation on the entire expression, but also replace the functions `observe` and `sample` with their CPS equivalents, which take two additional parameters `κ` and `σ`.
6.4 Message Interface Implementation
In this interface, an inference controller process starts copies of the probabilistic program, which are interrupted at every sample
and observe
expression. On interruption, each program execution sends a message to the controller, which then carries out any inference operations that need to be performed. These operations can include:
- Sampling a new value
- Reusing a stored value
- Computing the log probabilities
- Resampling program executions
After carrying out these operations, the controller sends messages to the program executions to continue or fork the computation. We use a client-server architecture.
Messages in the Inference Controller
The inference controller (client) can send requests to the execution server (which runs the probabilistic program).
SEND("start", σ)
- start a new process with id σSEND("continue", σ, c)
- continue process σ with argument cSEND("fork", σ, σ', c)
- fork process σ with a new process with id σ’, and continue execution with argument cSEND("kill", σ)
- half execution with process σ
It can also RECEIVE
responses from the server.
Messages in the Execution Server
The execution server can be implemented entirely in the HOPPL (if desired). It must be able to receive requests from the inference controller and return responses. We will assume that primitive functions `receive` and `send` exist for this purpose.
The execution server responds by sending:
- `(send "sample" σ α d)`: The process `σ` has arrived at a `sample` expression with address `α` and distribution `d`.
- `(send "observe" σ α d c)`: The process `σ` has arrived at an `observe` expression with address `α`, distribution `d`, and value `c`.
- `(send "return" σ c)`: The process `σ` has terminated with value `c`.
To implement this messaging architecture, we need to change the behaviour of `sample` and `observe`. We instead make use of their CPS analogues, which additionally take a continuation `κ` and the current CPS state `σ`.
To interrupt the computation of sample
, observe
and return
expressions, we will provide an implementation that returns a tuple containing the relevant expression and its parameters, rather than calling the continuation on the expression itself.
sampleCPS [α d κ σ]
  ("sample", α, d, κ, σ)

observeCPS [α d c κ σ]
  ("observe", α, d, c, κ, σ)

returnCPS [σ c]
  ("return", c, σ)

(Here `returnCPS` is the top-level continuation, so it receives the state `σ` and the final value `c`.)