OBAMA -- amino acid model

25 November 2020 by Remco Bouckaert

OBAMA is a BEAST package for Bayesian amino-acid model averaging. It has the benefit of not having to commit to a particular model, and integrating out site model uncertainty during an MCMC run.

Similar to bModelTest for nucleotide sequences, it averages over a number of substitution models, taking gamma rate heterogeneity and proportion invariable in account. So, if you have an amino-acid alignment and the site model is not of much interest to your analysis, it is just a matter of selecting OBAMA in the site model panel in BEAUti, and concentrate on more interesting details of the model.

For nucleotide models, there is a hierarchy of parametric substitution models, like HKY, GTR, TN93, etc. For amino acids, these do not exist, but there are a lot of empirical models based on large databases of previously sequenced data, such as WAG, LG, BLOSSOM, etc. The OBAMA model jumps between up to 15 of these models during the MCMC.

BEAUti

Furthermore, it jumps between having frequencies estimated during the MCMC or using the model frequencies that are part of the empirical model being selected. For estimated frequencies, the prior is Dirichlet(4,4,…,4), which has a shape very close to that of the distribution of frequencies from the empirical models (Bouckaert, 2020). Though proposals for frequencies are computationally expensive since the complete treelikelihood needs to be recalculated, I found that using an AVMN operator (Baele et al, 2017) helps considerably with convergence.

OBAMA also jumps between having gamma rate heterogeneity or not, and having a proportion invariable sites or not.

From the few datasets I looked at, OBAMA seems to very quickly settle on 1 or a small number of substitution models, and often frequencies estimated from the alignment fit a lot better than those that come with an empirical model.

Identifiability of Gamma + I

The Gamma+I model (i.e. both having gamma rate heterogeneity and a proportion invariable sites) has been criticised for leading to lack of identifiability (Yang, 2014, Bouckaert & Drummond, 2017). For values of the shape parameter for gamma rate heterogeneity less than 0.1, the rate of the slowest categories becomes very small, and behaves pretty much like a category of invariable sites, which causes this unidentifiability. For OBAMA, the default prior on shape parameter is lower bounded and cannot be less than 0.1. This turns out to solve the identifiability issue (see Bouckaert (2020) for a well calibrated simulation study to substantiate the claim).

Analysing results

OBAMA comes with an app for analysing the posterior from the trace log.

appstore

It shows the distribution of the various substitution models in the analysis, as well as the proportion of time external frequencies are used. It should look something like this:

appstore

How to install OBAMA

How to use OBAMA setting up an analysis and analysing results

Trouble shooting

References

Baele G, Lemey P, Rambaut A, Suchard MA. 2017. Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST. Bioinformatics 33(12):1798-1805 DOI

Bouckaert RR. 2020. OBAMA: OBAMA for Bayesian amino-acid model averaging. PeerJ 8, e9460

Bouckaert RR, Drummond AJ. 2017. bModelTest: Bayesian phylogenetic site model averaging and model comparison. BMC Evolutionary Biology 17(1):42

Yang Z. 2014. Molecular evolution: a statistical approach. Oxford University Press.