06 October 2020 by Veronika Bošková

# How to efficiently analyze large data sets containing duplicates?

A recurrent problem that users of Bayesian phylogenetic methods, such as BEAST 2, encounter is analysis of large data sets. With the sequencing itself getting quite cheap, a large number of sequences can easily be produced, thus making very large alignments containing thousands of sequences a usual, rather than occasional, happening. This represents a serious hurdle for Bayesian phylogenetics methods, since these methods apply computationally intense MCMC sampling procedure in order to approximate the posterior distribution of phylogenetic and phylodynamic model parameters. For large parameter space, many MCMC samples are necessary to provide a good approximation of the posterior distribution. As the tree space to be explored grows very fast with the number of sequences in the alignment, classic analysis of large data sets in Bayesian framework may not always be possible in a reasonable runtime.

Scientist thus resort to down-sampling procedures in hopes to reduce the runtime necessary to obtain parameter estimates from (part of) their data set. In a recent publication, we have shown that all possible down-sampling procedures have their downfalls (see Boskova and Stadler, 2020). For data sets where all sequences are unique, the only possible down-sampling procedure is the random sub-sampling of the data. Such down-sampling of the data does not bias the parameter estimates, but yields estimates with much larger confidence intervals. For data sets which contain many duplicate sequences, selecting unique sequences only offers a very attractive alternative for data set size reduction; however, this down-sampling procedure leads to biased parameter estimates and is thus not a recommended solution.

Since none of these two down-sampling approaches offers an ideal solution, we designed a new method for faster analysis of large datasets (described in Boskova and Stadler, 2020). The method reduces the runtime for those alignments that contain many duplicate sequences. The method uses the complete data set, so no down-sampling is necessary. It assumes that each sequence arose only once during evolution of the population which gave rise to the sequences in the alignment. This forces the identical sequences to always cluster tightly together in a tree - a phenomenon that we also observed when we simulated data sets with duplicates. Through analysis of simulated data we have shown that our method works very well. The analysis runtime highly depends on the total number of sequences and the proportion of unique sequences in the data set, but is in general shorter than the runtime of the classic analysis methods that use the complete data set. In our simulations, we observed 1.7-times speed-up as compared to the classic methods for a dataset containing 300 homochronously sampled sequences, of which 20 were unique. The largest data set we were able to analyze with our method within a 2-week runtime contained 21,000 sequences, of which ~20 were unique.

We have implemented the method as a BEAST 2 package, called PIQMEE. It is currently available to download through BEAUti. For those interested in the source code, please, refer to the GutHub page https://github.com/boskovav/piqmee. We have also written up a short tutorial on how to set up analysis with PIQMEE, which can be found at https://github.com/boskovav/PIQMEE-Tutorial and will shortly also be listed at the Taming the BEAST webpage (https://taming-the-beast.org/tutorials/).

# References

Veronika Boskova, Tanja Stadler, PIQMEE: Bayesian phylodynamic method for analysis of large datasets with duplicate sequences, Molecular Biology and Evolution, 2020, msaa136, https://doi.org/10.1093/molbev/msaa136