1 November 2024 by Remco Bouckaert
Be default, the stationary frequencies that come with a substitution model are used to represent the probability a root state is observed. Felsenstein’s peeling algorithm uses these frequencies by multiplying the partials at the root with the stationary frequencies, and sum over the states to obtain the site probability. However, it is possible there is prior information available for the root states that differs from the stationary frequencies. Here, we consider a few scenarios where this can happen.
Root frequencies differ from stationary frequencies
If root frequencies are known to differ, but are the same for every site, the rootFrequencies
of the (Threaded)TreeLikelihood
can be set.
To edit the XML, add a Frequencies
object in the rootFrequencies
element of the (Threaded)TreeLiklihood
, for example like so
<distribution id="treeLikelihood.dna" spec="ThreadedTreeLikelihood" data="@dna" tree="@Tree.t:dna">
<rootFrequencies id="rootFreqs" spec="Frequencies" frequencies="0.1 0.2 0.3 0.4"/>
<siteModel id="SiteModel.s:dna" spec="SiteModel">
...
In the example above, the root frequencies are fixed at 0.1 0.2 0.3 0.4
, so edit as appropriate.
If the frequencies need to be estimated, a bit more is required: a RealParameter
needs to be added to the state, a prior and operator added and potentially a logger as well.
Then the parameter needs to be referred to from the frequencies
attribute of the rootFrequencies
element.
Root frequencies differ between rate categories
If every rate category has its own frequencies and frequencies are shared among all sites, the 5rf model (Waddell & Bouckaert, 2024) can be used. The 5rf package needs to be installed (see here for installation instructions).
The XML needs to be edited to make it work: see the annotated example for details.
Root sequence is known exactly
If the root state is know exactly and differs at each site, this implies a root frequencies distribution where for each site one of the states has probability 1, while all others have probability 0. The rootfreqs package can be used for this.
To edit an XML
- Replace
spec="TreeLikelihood"
withspec="rootfreqs.TreeLikelihood"
, or if you have the threaded version, replacespec="ThreadedTreeLikelihood"
withspec="rootfreqs.ThreadedTreeLikelihood"
- Add a sequence representing the root state inside the tree likelihood as a
rootfreqseq
element, so it looks something like this:
<distribution id="treeLikelihood.dna" spec="rootfreqs.TreeLikelihood" data="@dna" tree="@Tree.t:dna">
<!-- specify root frequencies that differ from substitution model frequencies -->
<rootfreqseq id="seq_Root" spec="Sequence" taxon="Root"
totalcount="4" value="TTTTTTTTTTTTTTTTTTTTT..."/>
...
Here the sequence is all T
, but you need to set it to the appropriate value for your analysis.
Make sure to use the same sequence length as the alignment.
Root sequence is know with uncertainty
If the root state is know with some uncertainty and differs at each site, this implies a root frequencies distribution where for each site there is a different state probability. The rootfreqs package can be used for this situation as well. However, instead of using a fixed sequence, an uncertain sequence needs to be used. Uncertain sequences represent a distribution for each site by a semi-colon separated list of distributions, where each distribution is a comma separated list of probabilities.
To make a sequence uncertain, set the uncertain
attribute to true
.
The example above modified for uncertain root sequence can look something like this:
<distribution id="treeLikelihood.dna" spec="rootfreqs.TreeLikelihood" data="@dna" tree="@Tree.t:dna">
<!-- specify root frequencies that differ from substitution model frequencies -->
<rootfreqseq id="seq_Root" spec="Sequence" taxon="Root" uncertain="true"
totalcount="4" value="0.0025,0.00071,0.032,2e-04; 0.0021,0.0021,0.027,0.00059; 0.028,0.00062,0.00017,4.8e-05; ..."/>
...
Here, the root frequencies for the first site are 0.0025,0.00071,0.032,2e-04;
, for the second site 0.0021,0.0021,0.027,0.00059;
and for the third site 0.028,0.00062,0.00017,4.8e-05;
.
Note that each of these frequencies add up to 1.0, so make a proper distribution.
Edit as appropriate, and make sure there are as many root frequency distributions as there are sites in the alignment.
References
Peter J. Waddell, Remco Bouckaert. An independent base composition of each rate class for improved likelihood-based phylogeny estimation; the 5rf model bioRxiv 2024.09.03.610719