# Ascertainment correction/Sampling bias

18 July 2019 by Remco Bouckaert

Sampling bias can happen when data for the alignment is collected in such a way that some of the sites have lower probability than other sites. For example SNP data is typically selected from among the sites that have some variation among the taxa that you want to analyses, but there will be a very large set of constant sites, which never end up in your sample. Though this probably does not affect the topology of the tree very much, it can have substantial effect on the timing, so you may want to correct for sampling bias through ascertainment correction.

Say, we leave out all constant sites then where we usually calculate `P(D|T)` for alignment data `D` and tree `T` (leaving out other parameters to not clutter the notation too much), what we actually calculate is `P(D,where D not constant | T)`. Through application of Bayes rule, this can easily be shown to be equal to

``````    P(D | T)
---------------
1-P(D constant)
``````

So, we have `P(D|T)` in the numerator, which we calculate as usual, but the term `P(D constant)` in the denominator is what we need to specify.

## Ascertainment correction in XML

The way this is implemented in BEAST is by adding sites for constant characters to the alignment, and identifying these columns in the XML. Say, we have this (very short) alignment obtained by selecting non-constant sites:

## Step 1: add constant sites

There is currently no BEAUti support for changing ascertainment correction, so it requires some XML editing. The basics are quite simple: alignments have

It is most convenient to add constant sites at the start of the alignment. Since there are four possible nucleotide characters, we add four columns, and obtain this alignment

## Step 2: specify sites to exclude from `P(D|T)`

The `Alignment` has three attributes to specify the set of columns to take in account for ascertainment correction:

• excludefrom (Integer): first site to condition on, default 0 (optional, default: 0)
• excludeto (Integer): last site to condition on (but excluding this site), default 0 (optional, default: 0)
• excludeevery (Integer): interval between sites to condition on (default 1) (optional, default: 1)

Since we want to exclude the first four columns, we specify `excludefrom="0"` and `excludeto="4"`. We could specify `excludeevery="1"`, but since that is the default value, we can just leave it out, giving:

## Step 3: Setting the `ascertained` flag

For the treelikelihood to use the correction, the `ascertained` flag has to be set to `true` on the alignment, like so:

# Ascertainment tips and tricks

## SNAPP analyses

SNAPP has its own ascertainment correction scheme. The above only applies to cases where the `TreeLikelihood` are used. The `ThreadedTreeLiklihood` does not support ascertainment correction (see more below).

## Cognate data for languages

For language cognate data, it is common that the cognates have been selected in such a way that at least one of the languages has the cognate present. In this case you would ascertain on at least one `1` in a cognate column, e.g., like so:

## Missing data

Especially for cognate data for language alignments, it is possible there is data missing for a language for a particular cognate class. Then it cannot be determined whether a site is constant or not, since there is no conclusive way to say what value the missing taxon would have. In this case, you should not ascertain on the language having a `0` but on it being unknown, i.e., has a `?` in the ascertainment column, like so:

## More missing data

A more complex situation is where data was obtained only if there is variation between two (known) clades, but not when there is only variation inside a clade. This means, for cognate data there is at least one `1` in clade A (say, Germanic including Dutch, English and German), and at least one `1` in clade B (say Italic, including Spanish and French). In other words, clade A is never all `0` and neither is `B`, so we could ascertain like so:

but this is wrong, since both columns contain the all `0` site. So, we should subtract the probability of that column once (but more if there are more clades). To do this, there are three extra attributes to specify such columns:

• includefrom (Integer): first site to condition on, default 0 (optional, default: 0)
• includeto (Integer): last site to condition on, default 0 (optional, default: 0)
• includeevery (Integer): interval between sites to condition on (default 1) (optional, default: 1)

Adding an all zero column and `includefrom` and `inlcudeto` attributes gives:

This calculates

``````          P(D | T)
-------------------------------
1-P(D excluded) + P(D included)
``````

were `D excluded` is specified by the `exclude*` attributes and `D included` by the `include*` attributes. A more complex example can be found in (Robbeets & Bouckaert, 2018).

## Sub sets of alignments

It is possible to split up alignments into sub-alignments, especially when there are blocks of missing data as is common with language data. This can be done with the `FilteredAlignment`. Any `FilteredAlignment` takes an `Alignment` as `data` attribute and a filter specification that specifies which of the sites in the input alignment should be selected First site is 1. Filter specs are comma separated, either a singleton, a range [from]-[to] or iteration [from]:[to]:[step]; 1-100 defines a range, 1-100 or 1:100:3 defines every third in range 1-100, 1::3,2::3 removes every third site. Default for range [1]-[last site], default for iterator [1]:[last site]:[1].

For an alignment with `id="pny2016"` we can select the cognates for the word `and` as follows:

## Note:

``````The first site in filters is 1, but the ascertainment correction uses 0 for first site. We are aware this is confusing, but also hard to change while keeping backward compatibility of XML files, so it is what it is.
``````

It would be possible, but confusing to specify the ascertainment correction attributes directly on this filtered alignment. Less error prone is to just assume the first column is ascertained on and wrap the filtered alignment in another like so:

You can find a more elaborate example from (Bouckaert et al, 2018) here.

As mentioned above, the `ThreadedTreeLiklihood` does not support ascertainment correction (and warns so with the message `Note, can only use single thread per alignment because the alignment is ascertained` before crashing with a `NullPointerException` if you try it with multiple threads). One way around is to split up the alignment into multiple sections using the `FilteredAlignment` as outlined in the previous partitions, and having one `TreeLiklihood` per partitions, each with its own ascertainment correction on the same columns as the single threaded version would use.
Make sure to set `useThreads="true"` on the `CompoundDistribution` containing the `TreeLiklihood`s – usually, this is the element with `id="likelihood"`.