Simulating transmission trees

1 December 2024 by Remco Bouckaert

Transmission trees are phylogenies that represent infections spreading through a population; their transmission events mark one host infecting another. The BREATH package allows simulation of such trees under the transmission likelihood (Colijn et al., 2024), which makes it possible to test models.

The transmission tree simulator is available as the TransmissionTreeSimulator app in the BREATH package for BEAST 2.

The parameters that determine the shape and size of the tree are endTime, popSize, and the sampling and transmission hazards. The sampling hazard consists of a gamma distribution and a multiplier that can be interpreted as the probability of a host being sampled. The transmission hazard likewise consists of a gamma distribution and a multiplier; here the multiplier represents the average number of other hosts infected by a host, which sets the scale of the tree.
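As a rough illustration of the hazard shape described above, the sketch below evaluates a gamma density scaled by a constant multiplier. It assumes the hazard is simply the multiplier times a gamma density in shape/rate form, mirroring the parameter names sampleShape, sampleRate and sampleConstant. The function names are hypothetical and not part of BREATH, and the numeric values are illustrative only.

```python
import math

def gamma_pdf(t, shape, rate):
    """Density of a gamma distribution (shape/rate form) at time t."""
    if t <= 0:
        return 0.0
    log_density = (shape * math.log(rate) + (shape - 1) * math.log(t)
                   - rate * t - math.lgamma(shape))
    return math.exp(log_density)

def hazard(t, shape, rate, constant):
    """Assumed hazard form: constant multiplier times the gamma density."""
    return constant * gamma_pdf(t, shape, rate)

def integrated_hazard(shape, rate, constant, upper, steps=20000):
    """Midpoint-rule integral of the hazard from 0 to upper."""
    dt = upper / steps
    return dt * sum(hazard((i + 0.5) * dt, shape, rate, constant)
                    for i in range(steps))

# Illustrative values only: a transmission-like hazard with multiplier 1.5.
print(hazard(2.0, 2.0, 1.0, 1.5))
print(integrated_hazard(2.0, 1.0, 1.5, upper=60.0))
```

Under this assumed form the hazard integrates to the multiplier over a long enough time window, which is consistent with reading the transmission multiplier as an average number of secondary infections.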

Note that not all hosts will be sampled: some hosts remain unsampled and do not end up in the output tree.


transmissiontree


Simulated tree from time 0 to time te indicated on the x-axis. Left: small simulated tree where blue boxes indicate hosts, the red tree is the tree ending in samples, red+green branches are branches generated by the simulator as within-host coalescent trees, and red+green+black branches form the underlying phylogeny. Right: tree output by the simulator. Hosts D to H are not sampled, so these are removed from the simulator output. Host E becomes an unsampled host infected by A and infecting C. Hosts G and H form a block of size 2 of unknown hosts, while hosts D and F leave no trace and remain unknown unknowns.

The simulator can be conditioned on producing trees with a fixed number of taxa, but by default trees with a mixture of taxon counts will be produced. Note that it is not uncommon for most trees to have 1 taxon, depending on parameter settings. Take care when setting parameter values: when conditioning on a large number of taxa in particular, it may take a long time to generate such trees if the parameters are not compatible with trees of that size.

Installing BREATH

  • Start BEAUti
  • Select the File => Manage packages menu item.
  • Select BREATH in the list of packages and click the Install button. If BREATH is not in the list of packages, you must add a package repository first: in the package manager, click the Package repositories button, then click Add URL in the window that pops up, and enter https://raw.githubusercontent.com/CompEvol/CBAN/master/packages-extra-2.7.xml in the text field. Then return to the package manager window, where the BREATH package should appear.
  • Close BEAUti – it needs to restart to pick up the new packages.

Using the simulator

To use the command line version of the simulator, use the applauncher application (which is part of the BEAST 2 distribution) from a terminal/command prompt. Any of the options listed under Simulator options can be used.
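A command line for the simulator could be assembled as in the sketch below. It assumes that applauncher takes the app name followed by options written as -name value, which is an assumption here rather than something this post confirms, so check the app's own usage message before relying on it. The helper build_command, the option values and the file name trees.nwk are hypothetical.

```python
import shlex

def build_command(options, launcher="applauncher", app="TransmissionTreeSimulator"):
    """Assemble an argument list of the assumed form: launcher app -name value ..."""
    args = [launcher, app]
    for name, value in options.items():
        # Booleans are written as lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["-" + name, str(value)]
    return args

# Illustrative values only; option names are taken from the simulator's option list.
options = {
    "endTime": 10.0,
    "popSize": 0.1,
    "sampleShape": 3.0,
    "sampleRate": 1.0,
    "sampleConstant": 0.8,
    "transmissionShape": 2.0,
    "transmissionRate": 1.0,
    "transmissionConstant": 2.0,
    "treeCount": 100,
    "out": "trees.nwk",
    "quiet": True,
}

command = build_command(options)
print(shlex.join(command))
```

Keeping the options in a dictionary makes it easy to rerun the same configuration with a single value changed, for example a different seed.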

Alternatively, start BEAUti (which is also part of the BEAST 2 distribution), select the File/Launch apps menu, and select TransmissionTreeSimulator from the list of applications. Click the launch button to start a GUI version of the simulator, which looks like so:

transmissionTreeSimulator

Simulator options

The simulator has the following options:

  • endTime (real number): end time of the study. This determines the length in time of the outbreak. Any hosts not sampled by the end time will be pruned from the tree.
  • popSize (real number): population size governing the coalescent process that determines coalescent times within a single host.
  • sampleShape (real number): shape parameter of the sampling intensity function
  • sampleRate (real number): rate parameter of the sampling intensity function
  • sampleConstant (real number): constant multiplier of the sampling intensity function
  • transmissionShape (real number): shape parameter of the transmission intensity function
  • transmissionRate (real number): rate parameter of the transmission intensity function
  • transmissionConstant (real number): constant multiplier of the transmission intensity function
  • out (file name): output file for trees in Newick format. Print to stdout if not specified (optional)
  • trace (file name): trace output file with end time, tree heights and tree lengths, or stdout if not specified (optional)
  • seed (long): random number seed used to initialise the random number generator (optional)
  • maxAttempts (integer): maximum number of attempts to generate coalescent sub-trees (default: 1000)
  • taxonCount (integer): generate tree with taxonCount number of taxa. Ignored if negative, so different numbers of taxa can be expected in different trees (default: -1)
  • maxTaxonCount (integer): reject any tree with more than this number of taxa. Ignored if negative (default: -1).
  • treeCount (integer): generate treeCount number of trees (default: 1)
  • directOnly (true false): consider direct infections only; if false, block counts are ignored (default: true)
  • quiet (true false): suppress some screen output like statistics on how many trees have a certain taxon count (default: false)
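Since the taxon count can differ from tree to tree, it can be useful to tally the counts in the output file. The sketch below assumes the out file holds one Newick tree per line, each ending in a semicolon, and that taxon labels contain no commas; these are assumptions about the output layout, not documented behaviour. The function names are hypothetical.

```python
import re
from collections import Counter

def count_taxa(newick):
    """Count the leaves in a single Newick string."""
    # Drop bracketed annotations such as [&key=value,...], which may contain commas.
    stripped = re.sub(r"\[[^\]]*\]", "", newick)
    # A node with k children contributes k-1 commas, so leaves = commas + 1.
    return stripped.count(",") + 1

def taxon_count_histogram(lines):
    """Map taxon count -> number of trees, skipping lines that are not trees."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.endswith(";"):
            counts[count_taxa(line)] += 1
    return dict(sorted(counts.items()))

# Made-up example trees; in practice the lines would be read from the out file.
example = ["A:1.0;", "((A:1,B:1):1,C:2);", "B:0.5;", "(A:1,B:1);"]
print(taxon_count_histogram(example))
```

Counting commas avoids a full Newick parser and still handles single-child nodes, which add no commas.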

How to choose simulation parameters

Not all combinations of parameters lead to sensible trees. It is quite possible that only single taxon trees are generated. Even when choosing sensible parameter combinations, one of the modes of the taxon count distribution will be near 1.

  • Choose transmissionConstant in [1, 4]. This sets the mean number of transmission events per host and determines the scale of the tree.
  • Choose sampleConstant in (0.5, 1), to sample enough cases that person-to-person transmission inference is likely to be a reasonable task.
  • Choose transmissionShape/transmissionRate to set the mean inter-infection time (ignoring sampling).
  • Choose sampleShape/sampleRate > transmissionShape/transmissionRate so that sampling occurs after the mean generation time, on average. Otherwise it seems likely that the transmission chains will die out quickly.
  • Choose endTime to allow for the approximate number of transmission generations you want. Keep in mind that if the mean time to sampling is considerably greater than the mean time to infection, and transmissionConstant is high, the number of infections could grow very large.
  • Choose popSize in such a way that the probability that lineages will coalesce in the required time is high, for example popSize < -transmissionRate/transmissionShape log(0.95).
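Some of the rules of thumb above can be checked mechanically before starting a long simulation. The sketch below covers only the two multipliers and the comparison of the two gamma means (shape/rate); it does not check endTime or popSize. The function check_parameters is hypothetical and not part of BREATH, and the ranges are the ones suggested in the list, not hard limits of the simulator.

```python
def check_parameters(sample_shape, sample_rate, sample_constant,
                     transmission_shape, transmission_rate, transmission_constant):
    """Return a list of warnings; an empty list means the checked rules hold."""
    warnings = []
    if not 1 <= transmission_constant <= 4:
        warnings.append("transmissionConstant is outside [1, 4]")
    if not 0.5 < sample_constant < 1:
        warnings.append("sampleConstant is outside (0.5, 1)")
    # The mean of a gamma distribution in shape/rate form is shape / rate.
    mean_sampling = sample_shape / sample_rate
    mean_transmission = transmission_shape / transmission_rate
    if mean_sampling <= mean_transmission:
        warnings.append("mean time to sampling does not exceed mean time to transmission")
    return warnings

# Illustrative values only.
for message in check_parameters(3.0, 1.0, 0.8, 2.0, 1.0, 2.0):
    print("warning:", message)
```

Returning messages instead of raising an error keeps the check advisory, which fits rules of thumb that may be broken deliberately.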

After choosing the hazard function parameters, a quick sanity check is to plot the gamma distribution densities of the sampling and transmission hazard in the same plot. This plot shows how likely it is for a transmission to happen at a given time and how likely it is for a host to be sampled. For an exponentially growing process, the mean of the sampling hazard should be larger than that of the transmission hazard.
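One way to prepare that plot is to tabulate both densities on a common time grid and hand the rows to any plotting tool. The sketch below does this with the standard library only; the function names, the grid and the parameter values are illustrative assumptions, and the densities are plain gamma densities without the constant multipliers.

```python
import math

def gamma_density(t, shape, rate):
    """Gamma density in shape/rate form for t > 0 (moderate shape values)."""
    return rate ** shape * t ** (shape - 1) * math.exp(-rate * t) / math.gamma(shape)

def density_table(sample_shape, sample_rate,
                  transmission_shape, transmission_rate, t_max, n=50):
    """Rows of (time, sampling density, transmission density) on an even grid."""
    rows = []
    for i in range(1, n + 1):
        t = t_max * i / n  # grid starts above 0 to avoid the t = 0 edge case
        rows.append((t,
                     gamma_density(t, sample_shape, sample_rate),
                     gamma_density(t, transmission_shape, transmission_rate)))
    return rows

# Illustrative values only: print tab-separated rows ready for plotting.
print("time\tsampling\ttransmission")
for t, sampling, transmission in density_table(3.0, 1.0, 2.0, 1.0, t_max=10.0, n=20):
    print(f"{t:.2f}\t{sampling:.4f}\t{transmission:.4f}")
```

If the sampling column peaks later than the transmission column, the means are ordered as recommended for a growing process.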

Reducing transmissionConstant will make a big (nonlinear) difference to the size of the process.

Changing sampleConstant will not make much difference to transmission or the size of the process (though it will change the number of sampled cases, in a linear way), because if sampling happens, it’s most likely to happen after the peak in transmission anyway.

Troubleshooting

  • Do the transmission chains not take off? For example, there are no more cases to simulate, but the maximum sampling times are much less than endTime.
    Solution: Increase sampleShape/sampleRate (delay sampling until longer after transmission), increase transmissionConstant, or if the transmission process is taking off but there are too few generations, increase endTime.
  • Is the number of cases exploding exponentially and there are too many?
    Solution: See above, but do the opposite.
  • Within-host coalescent trees are rejected because they don’t coalesce in time?
    Solution: Decrease popSize.
  • Within-host coalescent trees have very short branch lengths? (This could be OK)
    Solution: Increase popSize.

References

Colijn C, Hall MD, Bouckaert R. Taking a BREATH (Bayesian Reconstruction and Evolutionary Analysis of Transmission Histories) to simultaneously infer phylogenetic and transmission trees for partially sampled outbreaks. bioRxiv. 2024:2024-07. doi:10.1101/2024.07.11.603095