MCMC and MH
Monte Carlo
Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. In Bayesian inference, they are used to approximate the posterior distribution p(X | Y) with a set of (weighted) samples.
These methods vary, but tend to follow a particular pattern:
- Define a domain of possible inputs
- Generate inputs randomly from a probability distribution over the domain
- Perform a deterministic computation on the inputs
- Aggregate the results
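As a concrete illustration of this pattern (an added sketch, not part of the original notes), here is a minimal Python example that estimates π by sampling points uniformly in the unit square and counting how many fall inside the quarter circle:

```python
import numpy as np

def estimate_pi(n_samples: int = 1_000_000, seed: int = 0) -> float:
    """Monte Carlo estimate of pi, following the four-step pattern above."""
    rng = np.random.default_rng(seed)
    # 1. Domain: the unit square [0, 1] x [0, 1].
    # 2. Generate inputs uniformly at random over the domain.
    x = rng.random(n_samples)
    y = rng.random(n_samples)
    # 3. Deterministic computation: is each point inside the quarter circle?
    inside = x**2 + y**2 <= 1.0
    # 4. Aggregate: the fraction of points inside approximates pi / 4.
    return 4.0 * inside.mean()

print(estimate_pi())  # roughly 3.1416; accuracy improves with more samples
```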
Markov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC) methods belong to a class of techniques known as approximate inference: a range of algorithms for computing posteriors when exact computation is intractable (typically because the normalising constant of the posterior cannot be evaluated).
MCMC comprises a class of algorithms for sampling from a probability distribution, based on constructing a Markov Chain that has the desired distribution as its stationary distribution. A stationary distribution of a Markov Chain is a probability distribution that remains unchanged as the chain progresses. The state of the chain after a number of steps is then used as a sample of the desired distribution, and the quality of the sample improves as a function of the number of steps.
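To make the idea of a stationary distribution concrete, here is a small Python sketch (an added illustration; the two-state chain and its transition probabilities are arbitrary choices) showing that the long-run state frequencies of a Markov Chain converge to its stationary distribution:

```python
import numpy as np

# Transition matrix of a two-state Markov Chain (each row sums to 1).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state = 0
counts = np.zeros(2)
n_steps = 100_000

for _ in range(n_steps):
    # Move to the next state according to the current state's row of P.
    state = rng.choice(2, p=P[state])
    counts[state] += 1

# The empirical frequencies approach the stationary distribution [5/6, 1/6],
# i.e. the solution of pi = pi @ P, regardless of the starting state.
print(counts / n_steps)  # roughly [0.833, 0.167]
```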
For example:
Suppose we want to sample from a Beta distribution, and treat direct sampling from it as difficult or intractable. MCMC methods provide algorithms that can construct a Markov Chain which has the Beta distribution as its stationary distribution, given only the ability to sample from a uniform distribution.
If we start from a random state and repeatedly traverse to the next state according to such an algorithm, we end up simulating a Markov Chain whose stationary distribution is the Beta distribution. The states we reach after a long time can then be used as samples from the Beta distribution.
One such MCMC algorithm is the Metropolis-Hastings Algorithm.
Metropolis-Hastings Algorithm
Metropolis-Hastings is an algorithm from the MCMC class of algorithms for obtaining a sequence of random samples from a probability distribution from which direct sampling is difficult. As more and more sample values are produced, their distribution more closely approximates the desired distribution. In Bayesian inference, these sample values are draws of the parameters of the distribution we are approximating.
Algorithm
The distribution we are trying to approximate is p(θ). However, we only know this distribution up to some normalising constant: what we can evaluate is a function g(θ) that is proportional to p(θ).
- Initialisation:
  - Choose an arbitrary initial sample value θ_0 to be the first sample.
  - Choose an arbitrary proposal distribution q, where q(θ_{t+1} | θ_t) suggests a candidate for the next sample value θ_{t+1} given the current sample value θ_t.
- For each iteration t (see the sketch after this list):
  - Generate a candidate θ_{t+1} for the next sample by drawing from the proposal distribution q given the current value θ_t:
    θ_{t+1} ~ q(θ_{t+1} | θ_t)
  - Calculate the acceptance ratio α, which will be used to decide whether to accept or reject the candidate θ_{t+1}. Because g is proportional to the density of p, we have:
    α = g(θ_{t+1}) ⁄ g(θ_t) = p(θ_{t+1}) ⁄ p(θ_t)
    In the Bayesian setting, where the target p(θ | data) is known up to a constant as prior × likelihood, this becomes:
    α = (prior(θ_{t+1}) × p(data | θ_{t+1})) ⁄ (prior(θ_t) × p(data | θ_t))
    Note that this ratio assumes a symmetric proposal distribution q (the original Metropolis algorithm); for an asymmetric q, the full Metropolis-Hastings ratio is multiplied by the correction factor q(θ_t | θ_{t+1}) ⁄ q(θ_{t+1} | θ_t).
  - Accept or reject: generate a uniform random number u ∈ [0, 1], then
    if α < u: reject θ_{t+1} and retain θ_t as the current θ
    if α ≥ u: accept θ_{t+1} as the current θ
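Putting these steps together, the following is a minimal Python sketch of the algorithm, assuming a symmetric Gaussian random-walk proposal (so the q correction factor cancels) and using the unnormalised density of a Beta(2, 5) distribution as g, tying back to the earlier example; the function names, the Beta(2, 5) target, and the step size are illustrative choices, not prescribed by the notes above:

```python
import numpy as np

def g(theta: float, a: float = 2.0, b: float = 5.0) -> float:
    """Unnormalised Beta(a, b) density: proportional to p(theta), constant dropped."""
    if 0.0 < theta < 1.0:
        return theta ** (a - 1) * (1.0 - theta) ** (b - 1)
    return 0.0  # zero density outside the support (0, 1)

def metropolis(n_samples: int = 50_000, step: float = 0.2, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    theta = 0.5  # arbitrary initial sample value theta_0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        # Generate a candidate from a symmetric (Gaussian) random-walk proposal.
        candidate = theta + step * rng.normal()
        # Acceptance ratio: alpha = g(candidate) / g(theta).
        alpha = g(candidate) / g(theta)
        # Accept with probability min(1, alpha); otherwise retain the current theta.
        if rng.random() < alpha:
            theta = candidate
        samples[t] = theta
    return samples

samples = metropolis()
print(samples.mean())  # approximately 2 / (2 + 5) ≈ 0.286, the mean of Beta(2, 5)
```

In practice the first portion of the chain (the burn-in) is usually discarded, since early samples still reflect the arbitrary starting value rather than the stationary distribution.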