Developing efficient BEAST operators

24 May 2021 by Remco Bouckaert

When developing new operators for BEAST there are two goals:

  • making correct proposals (see developer guide on how to test them)
  • making them efficient.

Here we consider efficiency of proposals, for which there are three factors to take into account:

  • the boldness of the proposal: bigger jumps may mean more efficient exploration of the state space.
  • acceptance probability: higher acceptance may mean more efficient exploration, but high acceptance is often only achievable with less bold proposals.
  • the computational cost in generating a proposal and then calculating the updated posterior.

Bolder proposals

Bactrian proposals are a way to avoid making proposals that change the state in small steps: the proposal kernel is bimodal with very little mass near zero, so Bactrian proposals are bolder than proposals based on uniform or Gaussian random values.
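
A minimal sketch of the idea in Java (illustrative only, not the actual BactrianNodeOperator code): the kernel is a mixture of two normals centred at +m and -m, with m = 0.95 a commonly used value, so draws close to zero are rare.

import java.util.Random;

// Illustrative sketch of a Bactrian random walk step (not BEAST code).
public class BactrianSketch {
    private static final Random RNG = new Random();

    // Draw from a Bactrian(m) kernel: a 50/50 mixture of N(+m, 1-m^2) and N(-m, 1-m^2).
    static double bactrianDraw(double m) {
        double z = RNG.nextGaussian() * Math.sqrt(1.0 - m * m);
        return RNG.nextBoolean() ? z + m : z - m;
    }

    // Symmetric random walk proposal: x' = x + windowSize * epsilon.
    static double propose(double x, double windowSize) {
        return x + windowSize * bactrianDraw(0.95);
    }
}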

Optimise acceptance through auto optimisation

Some balance between boldness and acceptance can be achieved by automatic optimisation. If the operator has a tunable parameter that can be set through the get/setCoercableParameterValue() methods, this tunable parameter can be (re)stored via the state file. After every MCMC step, the operator is notified whether its proposal was accepted or rejected through the accept() and reject() methods, which update the m_nNrRejected, m_nNrAccepted, m_nNrRejectedForCorrection, and m_nNrAcceptedForCorrection members. Furthermore, the optimize() method is called, which you can use to update the coercable parameter. For an example of how to update a scale factor, see the ScaleOperator class, and for an example of a window size update, see the RealRandomWalkOperator class.
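
As a rough sketch (modelled on the window-size pattern in RealRandomWalkOperator; consult that class and ScaleOperator for the real code), the coercion hooks inside your Operator subclass could look along these lines:

// Fragment of an Operator subclass with a tunable window size (illustrative sketch).
double windowSize = 1.0;    // the tunable parameter

@Override
public double getCoercableParameterValue() {
    return windowSize;      // written to the state file
}

@Override
public void setCoercableParameterValue(double value) {
    windowSize = value;     // restored when resuming from a state file
}

@Override
public void optimize(double logAlpha) {
    // calcDelta() from the Operator base class turns the acceptance outcome into a
    // small step towards the target acceptance probability; tuning in log-space
    // keeps windowSize positive.
    double delta = calcDelta(logAlpha);
    windowSize = Math.exp(delta + Math.log(windowSize));
}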

The getTargetAcceptanceProbability() method in Operator lets you set the level of acceptance that is aimed for by the optimisation: higher acceptance targets usually lead to less bold moves, while lower targets lead to bolder moves. On balance, a level of 0.234 seems to be optimal for a wide range of classic operators, while for Bactrian operators a higher level (0.3-0.4) can be optimal.
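
If your operator mixes best at a different acceptance rate, override the target in your Operator subclass; the value below is illustrative only:

@Override
public double getTargetAcceptanceProbability() {
    return 0.35;    // e.g. a Bactrian-style operator may do better with 0.3-0.4 than with 0.234
}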

Selecting the best operator

If you have multiple candidate operators, or a new candidate to replace an existing operator, you can determine which one is best by running multiple analyses and comparing ESSs and run times. However, this requires careful setting of weights and other operator parameters, which can be tedious and time consuming. The process can to some extent be automated using the AdaptableOperatorSampler from the ORC package (Douglas et al, 2021). The AdaptableOperatorSampler automatically measures boldness, acceptance probability and calculation time for each proposal and uses these to determine operator weights. Here is an example with three operators: a BactrianNodeOperator which moves node heights between parent and children, a SubtreeSlide which moves nodes along branches, and a fictitious new candidate for moving internal nodes:

<operator id="AdaptableOperatorSampler" spec="orc.operators.AdaptableOperatorSampler" tree="@Tree" weight="45.0">
   <operator id="UniformOperator" spec="BactrianNodeOperator" tree="@Tree" weight="45.0"/>
   <operator id="SubtreeSlide" spec="SubtreeSlide" tree="@Tree" weight="30.0" size="0.03" optimise="false"/>
    <operator id="NewMove" spec="my.package.operators.NewMove" tree="@Tree" weight="10.0" scaleFactor="0.15"/>
</operator>

Note that the weights of the operators inside the AdaptableOperatorSampler are ignored – initially, operators will be sampled uniformly. The weight of the AdaptableOperatorSampler itself however is used as normal, and perhaps a good value is the sum of weights of the operators inside it.

The AdaptableOperatorSampler waits for a while (determined by the burnin input) before it starts measuring boldness, acceptance and run time for the operators. By default, boldness is measured as the sum of squared differences in parameter values or node heights, but if you specify a tree metric (through the metric input) you can use other tree distances such as Robinson-Foulds or RNNI. After a further learnin steps it starts adjusting the operator weights (a sketch of how the learned weights drive operator selection follows the input list below). It has the following inputs:

  • parameter (Function): list of parameters to compare before and after the proposal. If the tree heights are a parameter then include the tree under ‘tree’ (optional)
  • tree (Tree): tree containing node heights to compare before and after the proposal (optional)
  • operator (Operator): list of operators to select from
  • burnin (Integer): number of operator calls until the learning process begins (default: 1000)
  • learnin (Integer): number of operator calls after learning begins (i.e. at burnin) but before this operator starts to use what it has learned (default: 100 x number of operators) (optional)
  • uniformp (Double): the probability that operators are sampled uniformly at random instead of using the trained parameters (default 0.01 in ORC v1.0.2 and earlier, but 0.1 in later versions)
  • metric (TreeMetric): A function for computing the distance between trees. If left empty, then tree distances will not be compared (optional)
  • weight (Double): weight with which this operator is selected (required)
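
To make the weight learning and the uniformp input concrete, here is an illustrative sketch of how a sub-operator could be selected once weights have been learned (this is not the ORC implementation, just the behaviour described above):

import java.util.Random;

// Illustrative sketch: with probability uniformp pick a sub-operator uniformly at
// random, otherwise pick in proportion to the learned weights.
class AdaptiveChoiceSketch {
    final Random rng = new Random();
    final double uniformp = 0.1;    // cf. the uniformp input above

    int pickOperator(double[] learnedWeights) {
        if (rng.nextDouble() < uniformp) {
            return rng.nextInt(learnedWeights.length);
        }
        double sum = 0;
        for (double w : learnedWeights) {
            sum += w;
        }
        double u = rng.nextDouble() * sum;
        for (int i = 0; i < learnedWeights.length; i++) {
            u -= learnedWeights[i];
            if (u <= 0) {
                return i;
            }
        }
        return learnedWeights.length - 1;    // guard against rounding error
    }
}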

The AdaptableOperatorSampler stores information in the state file (at the end, where operator information is kept). For the XML fragment above it contains, among other things, the following:

Tip: run egrep "{|}" beast.xml.state > beast.json (replacing beast.xml.state with the name of your state file) and open beast.json in a web browser for much nicer formatting than the raw state file provides.

{
  "param_mean_sum": "[0.1500565421442715]",
  "param_mean_SS": "[0.02267449706602849]",
  "numAccepts": "[751953, 366999, 31868]",
  "numProposals": "[2512475, 1214120, 297648]",
  "operator_mean_runtimes": "[0.4886742948722717, 0.3574444165274914, 0.7862141018905646]",
  "mean_SS": "[[0.0013286448308873302], [4.1795419527996787E-4], [4.395970191816695E-4]]",
  "weights": "[0.6631638370649471, 0.28804870042162256, 0.04878746251343036]",
  "operators": [
    { "id": "UniformOperator", "p": 3.0161224156080126,
      "accept": 752174, "reject": 1760640, "acceptFC": 751149, "rejectFC": 1760114,
      "rejectIv": 0, "rejectOp": 0 },
    { "id": "SubtreeSlide", "p": 0.03,
      "accept": 367130, "reject": 847321, "acceptFC": 366176, "rejectFC": 846183,
      "rejectIv": 314368, "rejectOp": 314368 },
    { "id": "NewMove", "p": 0.2502998822070235,
      "accept": 31885, "reject": 266092, "acceptFC": 31850, "rejectFC": 265602,
      "rejectIv": 59080, "rejectOp": 59080 }
  ]
}

The important ones to look at are:

  • operator_mean_runtimes shows the average run time of each operator in milliseconds. It is an array with one entry per operator, in the same order as the operators entry.
  • mean_SS is the mean sum of squared changes in the state, effectively the boldness of each operator. The mean_SS is updated only when a proposal is accepted.
  • numAccepts and numProposals show how often each operator was accepted and how often it was proposed after burnin; their ratio is the acceptance rate.

Together these determine the new weights for the operators. In the example above, 66% goes to the uniform operator, 29% to the sub-tree slide and just 5% to the new operator. The new operator does not fare well because it is slow: operator_mean_runtimes shows it takes more than twice as long as the sub-tree slide and over 50% longer than the uniform operator. Furthermore, mean_SS shows its change in node heights is only a third of that of the uniform operator, though slightly better than the sub-tree slide. Looking at the numAccepts/numProposals entries, we see that the new operator is accepted only about 10% of the time, while the other two have acceptance rates of about 30%. Together, this results in a large down-weighting of the new operator. So, back to the drawing board for this operator.
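
As a reading aid, the learned weights above can be reproduced from the state-file entries by scoring each operator as mean_SS times its acceptance rate divided by its mean runtime, that is, the expected squared jump per millisecond, and then normalising. This matches the reported weights for this run, but treat it as an interpretation of the numbers rather than ORC's exact internal formula:

// Reading aid: reproduce the learned weights from the state-file entries above.
public class WeightCheck {
    public static void main(String[] args) {
        double[] meanSS   = {0.0013286448308873302, 4.1795419527996787E-4, 4.395970191816695E-4};
        double[] runtime  = {0.4886742948722717, 0.3574444165274914, 0.7862141018905646};  // ms
        double[] accepts  = {751953, 366999, 31868};
        double[] proposed = {2512475, 1214120, 297648};

        double[] score = new double[3];
        double sum = 0;
        for (int i = 0; i < 3; i++) {
            // squared jump per accepted proposal * acceptance rate / runtime
            score[i] = meanSS[i] * (accepts[i] / proposed[i]) / runtime[i];
            sum += score[i];
        }
        for (int i = 0; i < 3; i++) {
            System.out.printf("operator %d: weight %.4f%n", i, score[i] / sum);
        }
        // prints approximately 0.6632, 0.2880, 0.0488, matching the weights entry above
    }
}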

Note 1: The AdaptableOperatorSampler still picks one of the operators uniformly at random a small fraction of the time (the uniformp input: 1% by default in older ORC versions, 10% in later ones). This explains why the accept and reject counts keep going up even for operators with extremely low weights.

Note 2: You should not put operators together that work on different parts of the tree. For example, adding both a tree scaler and a root height scaler to the same AdaptableOperatorSampler probably leads to a lot of root height scaling, since it is cheap to calculate and the root has fewer restrictions than the complete tree when scaling. Consequently, most of the weight will be assigned to the root scaler and taken away from the tree scaler, which can result in bad mixing.

Acknowledgements

Thanks to Jordan Douglas for helpful comments and suggestions.

References

Douglas J, Zhang R, Bouckaert R. Adaptive dating and fast proposals: Revisiting the phylogenetic relaxed clock model. PLoS Computational Biology. 2021;17(2):e1008322. doi:10.1371/journal.pcbi.1008322