Bayesian Model Comparison

Min Jean Cho

This article is a supplement to the previous article, ’Bayesian Inference,’ which focused mainly on Bayesian parameter estimation. In this article, I will discuss model comparison. As with parameter estimation, there are two approaches to model comparison: the frequentist approach and the Bayesian approach. These two approaches are fundamentally different, but both use a ratio of probabilities. Suppose we have two models (denoted by \(M_0\) and \(M_1\)). The frequentist approach uses the likelihood ratio,

\[\text{Likelihood ratio} = \frac{P(Data|M_1)}{P(Data|M_0)}\]

and the Bayesian approach uses the posterior ratio,

\[\text{Posterior ratio} = \frac{P(M_1|Data)}{P(M_0|Data)}\]

One widely used method of frequentist model comparison is the likelihood ratio test (LRT). Since parameters are associated with the models being compared, the LRT uses the maximized likelihood values under the two models: \(L_i = P(D|M_i, \hat{\theta}_i^{ML})\). The LRT test statistic is as follows:

\[2\text{log}(\frac{L_1}{L_0}) = 2 \Delta l = 2(l_1-l_0)\]

\(l_i\) denotes the log likelihood of model \(M_i\), and \(\Delta l\) denotes the log likelihood ratio. Using the fact that, for nested models, \(2\Delta l\) asymptotically follows a chi-squared distribution under the null model (with degrees of freedom equal to the difference in the number of free parameters), the LRT determines whether \(M_0\) is rejected or not. There is far more theory behind frequentist model comparison (e.g., type I and II errors and the p-value), which is elegant and very thoughtful, but I won’t explain it here as it deserves its own separate articles.
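As a concrete sketch with made-up numbers, consider a nested pair of binomial models: \(M_0\) fixes the success probability at 0.5, while \(M_1\) leaves it free. Since \(M_1\) has one extra free parameter, \(2\Delta l\) is compared against a chi-squared distribution with 1 degree of freedom, whose survival function is \(\mathrm{erfc}(\sqrt{x/2})\):

```python
import math

# Toy data (assumed for illustration): 60 heads in 100 flips.
n, k = 100, 60

def binom_loglik(p):
    # Log likelihood of k heads in n flips; the binomial coefficient is
    # omitted because it cancels in the likelihood ratio.
    return k * math.log(p) + (n - k) * math.log(1 - p)

l0 = binom_loglik(0.5)    # M0: p fixed at 0.5 (no free parameters)
l1 = binom_loglik(k / n)  # M1: p at its MLE, p_hat = k/n

stat = 2 * (l1 - l0)      # LRT statistic 2*(l1 - l0)

# Under M0, stat asymptotically follows chi-squared with 1 df;
# the chi2(1) survival function is erfc(sqrt(x/2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 3), round(p_value, 4))
```

Here the statistic is about 4.03 and the p-value falls just under 0.05, so \(M_0\) would be rejected at the conventional 5% level.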

To describe Bayesian model comparison, I will use a simple computational biology example related to Bayesian change point analysis. Eukaryotic genomes contain many incorporated viral sequences. If the base compositions of the viral sequences and the eukaryotic genome are different, the genomic regions that contain viral sequences can easily be detected using Bayesian model comparison. First, let’s define a DNA sequence \(\mathbf{x}\) of length \(n\) as a string of \(n\) letters \(a \in \mathscr{A} = \{A,T,G,C\}\), where each sequence position is an IID random variable.

\[\mathbf{x} = [x_1, x_2, ..., x_n], \quad x_i = a \in \mathscr{A}\]

And let \(V\) be the viral genome model and \(E\) the eukaryotic genome model. We assume that we already know the base compositions of viral and eukaryotic genomes, that is, we know the following likelihood values:

\[p_a^V = P(x=a|V)\]

\[p_a^E = P(x=a|E)\]

Thus, the likelihoods of sequence \(\mathbf{x}\) under the two models are as follows, where \(n_a\) is the number of occurrences of letter \(a\) in \(\mathbf{x}\):

\[P(\mathbf{x}|V) = \prod_{i=1}^n P(x_i|V) = \prod_{a\in \mathscr{A}} (p_a^V)^{n_a}\]

\[P(\mathbf{x}|E) = \prod_{i=1}^n P(x_i|E) = \prod_{a\in \mathscr{A}} (p_a^E)^{n_a}\]

Then, the posterior ratio is as follows:

\[\begin{equation} \begin{aligned} \frac{P(E|\mathbf{x})}{P(V|\mathbf{x})} &= \frac{P(\mathbf{x}|E)P(E)}{P(\mathbf{x}|V)P(V)} = \frac{\prod_{a \in \mathscr{A}}(p_a^E)^{n_a}}{\prod_{a \in \mathscr{A}}(p_a^V)^{n_a}} \times \frac{P(E)}{P(V)} \\ &= \prod_{a \in \mathscr{A}}(\frac{p_a^E}{p_a^V})^{n_a} \times \frac{P(E)}{P(V)} \end{aligned} \end{equation}\]

Note that we don’t need to calculate the marginal likelihood \(P(\mathbf{x})\), since it cancels in the ratio. Additionally, if we don’t have any prior information on the sequence \(\mathbf{x}\), we can use a prior ratio of 1 (= 0.5/0.5). If some information is available, say that 0.1% of the eukaryotic genome consists of viral sequences, then we can use the prior ratio \(P(V)/P(E) = 0.001/0.999 \approx 1/999\). Using this model comparison method, a genomic sequence can be scanned with a fixed-size window (e.g., 100 bases) to find viral sequences.

This Bayesian model comparison can easily be applied to other sequence models. If model comparison requires a long input, the likelihood ratio is the product of many values smaller than 1, but numerical underflow can be avoided by working with the log posterior ratio.

We can also compare a random model and a non-random model. Let \(M\) be the (non-random) model and \(R\) the random model.

\[\begin{equation} \begin{aligned} \text{log}\frac{P(M|\mathbf{x})}{P(R|\mathbf{x})} &= \text{log}[\prod_{i=1}^n (\frac{P(x_i|M)}{P(x_i|R)}) \times \frac{P(M)}{P(R)}] \\ &= [\sum_{i=1}^n \text{log} \frac{P(x_i|M)}{P(x_i|R)}] + \text{log}\frac{P(M)}{P(R)} \end{aligned} \end{equation}\]
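As an illustration, here is a toy log-odds score of this form for a length-4 motif model \(M\) against a uniform random model \(R\) with \(P(x_i|R) = 0.25\). All probabilities are made up, and the prior ratio is taken as 1 so its log term vanishes:

```python
import math

# Assumed position-specific probabilities P(x_i|M) for a length-4 motif.
M = [
    {'A': 0.7, 'T': 0.1, 'G': 0.1, 'C': 0.1},
    {'A': 0.1, 'T': 0.7, 'G': 0.1, 'C': 0.1},
    {'A': 0.1, 'T': 0.1, 'G': 0.7, 'C': 0.1},
    {'A': 0.1, 'T': 0.1, 'G': 0.1, 'C': 0.7},
]

def log_odds(x, log_prior_ratio=0.0):
    # sum_i log P(x_i|M)/P(x_i|R) + log P(M)/P(R), with P(x_i|R) = 0.25.
    return sum(math.log(M[i][b] / 0.25) for i, b in enumerate(x)) + log_prior_ratio

print(round(log_odds('ATGC'), 3))  # matches the motif: positive score
print(round(log_odds('CCCC'), 3))  # does not match: negative score
```

This sum-of-log-ratios score is the same quantity that appears (without the prior term) in sequence alignment and motif scanning scores.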

Now, let’s write the posterior ratio again. The posterior ratio is the product of the likelihood ratio (the Bayes factor) and the prior ratio.

\[\frac{P(M_1|D)}{P(M_0|D)} = \frac{P(D|M_1)}{P(D|M_0)} \times \frac{P(M_1)}{P(M_0)} = BF_{1,0} \times \frac{P(M_1)}{P(M_0)}\]

In Bayesian model comparison, the likelihood ratio is called the Bayes factor (BF). However, it is a slightly different quantity from the one used in the likelihood ratio test. The likelihood values in the LRT are evaluated at the MLEs of the parameters, whereas the likelihood values for the BF are obtained by marginalizing over the parameters:

\[P(D|M_i) = \int_{\theta_i} P(D|M_i, \theta_i)P(\theta_i|M_i)\,d\theta_i\]
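For simple models this marginal likelihood can be checked numerically. A sketch with assumed binomial data and a uniform Beta(1,1) prior on \(\theta\): in that case the marginal likelihood of \(k\) successes in \(n\) trials has the closed form \(1/(n+1)\), which a crude midpoint-rule integration reproduces:

```python
import math

# Assumed data: k successes in n trials under a binomial model with
# unknown theta and a uniform prior P(theta) = 1 on (0, 1).
n, k = 100, 60

def likelihood(theta):
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

# Midpoint rule for P(D|M1) = integral of likelihood * prior over theta.
steps = 10_000
marginal = sum(likelihood((i + 0.5) / steps) for i in range(steps)) / steps

print(marginal)     # numerical estimate
print(1 / (n + 1))  # closed form under the uniform prior
```

Note that the maximized likelihood \(P(D|\hat{\theta}^{ML})\) is much larger than this marginal likelihood; the marginalization automatically penalizes the model’s parameter flexibility.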

If the prior ratio is 1, the posterior \(P(M_1|D)\) can be calculated from the BF alone, which makes the BF directly interpretable.

\[\begin{equation} \begin{aligned} P(M_1|D) &= \frac{P(D|M_1)P(M_1)}{P(D|M_1)P(M_1) + P(D|M_0)P(M_0)} \\ &= \frac{1}{1 + \frac{P(D|M_0)P(M_0)}{P(D|M_1)P(M_1)}}\\ &= \frac{1}{1 + \frac{P(D|M_0)}{P(D|M_1)}} \\ &= \frac{1}{1+ (BF_{1,0})^{-1}} \end{aligned} \end{equation}\]
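This mapping from Bayes factor to posterior probability is a one-liner; the sketch below also allows an unequal prior on \(M_1\):

```python
def posterior_m1(bf_10, prior_m1=0.5):
    # P(M1|D) = 1 / (1 + 1/(BF * prior ratio)), where the prior ratio
    # is P(M1)/P(M0) = prior_m1 / (1 - prior_m1).
    prior_ratio = prior_m1 / (1 - prior_m1)
    return 1.0 / (1.0 + 1.0 / (bf_10 * prior_ratio))

print(posterior_m1(1.0))   # BF = 1 with equal priors
print(posterior_m1(9.0))   # BF = 9 favors M1
print(posterior_m1(0.25))  # BF < 1 favors M0
```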

Using the above equation (which assumes a prior ratio of 1), the BF can be interpreted as follows:

\[\text{If}\ BF_{1,0}>1, P(M_1|D)>0.5\]

\[\text{If}\ BF_{1,0}=1, P(M_1|D)=0.5\]

\[\text{If}\ BF_{1,0}<1, P(M_1|D)<0.5\]

Bayesians say the data provide evidence for model \(M_1\) if \(BF_{1,0}>1\) and evidence against \(M_1\) if \(BF_{1,0}<1\).