Tuesday 10 January 2017

A Primer on the Bias-Variance Trade-off in ecology

This blog post is adapted from the Appendix of Rankin RW. 2016. EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture. bioRxiv. doi: 10.1101/052266.


I will use simulations to illustrate the ``bias-variance trade-off'' for ecologists. The trade-off is one of the most important ideas in statistical modelling of the past half-century. The goal is to show how the AICc and another model-selection/model-averaging technique called "boosting" each negotiate the trade-off in order to improve estimation; specifically, in order to minimize the expected error of estimating survival phi over T capture periods. The trade-off is fundamental to understanding the optimality of Frequentist shrinkage estimators and AIC model-selection. The illustrations are inspired by Figure 6.5 in Murphy 2012, but are adapted to a type of capture-mark-recapture model called the Cormack-Jolly-Seber model.

The trade-off is an old idea without a citable origin (Geman et al. 1992 is often considered the definitive reference, but the phenomenon was clearly discussed as early as 1970 by Hoerl in the context of Ridge Regression, and likely earlier). Despite being an old and fundamental concept of statistical estimation, I have noticed that it is poorly understood among academics and government scientists. In particular, it is my experience that ecologists are unduly wedded to the idea of being unbiased (in estimation), such that when they are presented with visual and quantitative evidence about the optimality of biased shrinkage estimators, they recoil at the sight of systematic bias and ignore the crucial role of variance. Of course, bias is not desirable in and of itself, but so long as the bias goes to zero at a rate proportional to that of the variance, we may be able to improve our overall estimation performance by incurring a little bias.

In the following simulations, the goal is to minimize the Expected Error of estimating survival, as quantified by the Mean Square Error (MSE). The MSE is an abstract, population-level quantity that can only be measured in simulations, when we know the ``true'' process. It is Frequentist in the sense that we hope to minimize the error over all possible data-sets that one might sample from the true population Y. These multiple realizations are shown as grey lines in Figures 1 and 2. Of course, an analyst only has one dataset, and his goal is to get his estimates as close as possible to the truth.

Figure 1: Decomposing the error of estimation (MSE) into its bias and variance components. An estimation procedure will negotiate the bias and variance so as to minimize the MSE. Top, a simulation of a true survival process (red line). Each grey line represents one dataset sampled from the population and an analyst's attempt to estimate survival using multi-model inference procedures, such as boosting. The dashed black line is the mean estimate over all 30 independent grey lines. Middle, a visualization of the variance component, showing the variability of point-wise estimates due to randomness in the sampled data and a procedure's sensitivity to such differences. Bottom, a visualization of the bias: the expected difference between the truth and the procedure's estimates, over all realizations of the data.


The bias-variance trade-off arises from a classic decomposition of the expected error: MSE = (E[phi_hat] - phi_true)^2 + Var[phi_hat] + constant. Figure 1 also shows this decomposition. The first term is the squared bias, where the bias is the expected difference between an estimate and the true value. This difference is visualized as the red polygon in Figure 1. In the same figure, the bias manifests as shrinkage from the true red line towards a flat global mean. Quantifying the bias requires knowledge of the truth phi_true, and is therefore inaccessible in real-life situations. The second term is the variance, and it does not depend on knowledge of the truth. Rather, it arises due to the vagaries of random sampling as well as the complexity of the estimation procedure: overly complex models which ``over-fit'' one dataset will vary wildly when fitted to a new dataset sampled from the same population. The variance can be visualized as the spread of the grey lines, or the green polygon in Figure 1.
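As a sanity check on this decomposition, here is a minimal Python sketch (not from the original article): it simulates many replicate estimates of a single survival value from a hypothetical, slightly-shrunken estimator, and confirms that the mean squared error equals the squared bias plus the variance. All numbers here are made up for illustration.

```python
# Toy numerical check of the MSE decomposition (illustrative values only):
# simulate many replicate estimates of a single true survival value phi_true,
# then verify that MSE = bias^2 + variance across replicates.
import numpy as np

rng = np.random.default_rng(1)
phi_true = 0.80                      # the "true" survival value (assumed)
n_replicates = 10000                 # independent realizations of data + analysis

# pretend each analysis returns a noisy, slightly-shrunken estimate of phi_true
phi_hat = 0.95 * phi_true + 0.05 * 0.75 + rng.normal(0, 0.03, n_replicates)

mse      = np.mean((phi_hat - phi_true) ** 2)   # expected squared error
bias_sq  = (np.mean(phi_hat) - phi_true) ** 2   # squared bias
variance = np.var(phi_hat)                      # sampling variance of the estimator

print(mse, bias_sq + variance)   # identical, up to floating-point rounding
```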

On its face, the MSE decomposition suggests something straightforward: to improve our estimation performance, we should reduce the bias and/or the variance. Clearly, most ecologists see the value of tackling either of these two terms. But the nature of the trade-off is more elusive: we cannot, in general, minimize both terms for a given sample size, and we may deliberately increase one term in order to decrease the other. Shrinkage estimators incur a little bias in exchange for lower variance (i.e., the red polygon is bigger but the green polygon is smaller); this strategy results in much smaller MSE values than complex unbiased estimators. In contrast, the MLEs of the complex full-model are unbiased but typically have very high variance; this strategy is often worse at minimizing the MSE for small-to-moderate sample sizes.
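To make the shrinkage strategy concrete, here is a toy Python sketch (not CJSboost; the noise level and the shrinkage weight `lam` are assumptions of mine): unbiased but noisy per-period estimates are pulled toward their overall mean, and the resulting biased estimates typically achieve a lower MSE than the raw estimates.

```python
# Toy illustration of the shrinkage strategy (assumed values, not from the paper):
# noisy unbiased per-period estimates vs. the same estimates shrunk toward their mean.
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(1, 11)
phi_true = np.cos((t - 2.3) / 1.2) / 11 + 0.75     # true time-varying survival

n_datasets, lam = 1000, 0.6                        # lam = amount of shrinkage (0 = none)
mse_raw = mse_shrunk = 0.0
for _ in range(n_datasets):
    raw = phi_true + rng.normal(0, 0.08, t.size)   # "unbiased" but high-variance estimates
    shrunk = (1 - lam) * raw + lam * raw.mean()    # biased, lower-variance shrinkage estimates
    mse_raw    += np.mean((raw - phi_true) ** 2)
    mse_shrunk += np.mean((shrunk - phi_true) ** 2)

print(mse_raw / n_datasets, mse_shrunk / n_datasets)   # shrinkage typically wins here
```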

The following simulations show how different statistical methods negotiate the bias-variance trade-off differently. Imagine an analyst confronted with four different methods to estimate survival. The first is estimation by Maximum Likelihood using the full-model p(t)phi(t). The second method is AICc model-selection, and the third is AICc model-averaging; both use the following fixed-effects models: p(.)phi(.), p(t)phi(.), p(.)phi(t), and p(t)phi(t), with the obvious constraints on the final terms. The fourth method is called CJSboost, from statistical boosting. It uses submodels called "base-learners" that are equivalent to the aforementioned fixed-effect models (but without the previous constraints). The AICc methods should theoretically do best, because they are fundamentally motivated by minimizing an objective function that is very closely related to the MSE, called the KL-loss (see Akaike 1979 and Akaike 1998). Likewise, CJSboost tries to minimize a related generalization error called the negative Expected log-Likelihood, which is approximated by bootstrapping the capture-histories.

The simulated data-sets were generated as follows. The time-varying survival values were:
phi_t = cos((t-2.3)/1.2)/11 + 0.75. The time-varying capture-probabilities p_t were drawn from a Beta distribution with shape parameters A=12 and B=12, resulting in an average capture-probability of 0.5. The p_t values were the same for all simulations. The MLE and AICc analyses were run in Program MARK. For CJSboost, a ten-times 70-fold bootstrap-validation exercise was run per dataset to tune the CJSboost regularization parameters. The CJSboost program is available on my Github site. The simulations and analyses were repeated 40 times for three scenarios pertaining to the number of capture-histories n = {50, 200, 800}.
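For readers who want to reproduce a version of this set-up, the following is a minimal Python sketch of the data-generating process described above. It is not the author's simulation code, and the simplification that every individual is first marked at the first occasion is my own assumption.

```python
# Minimal sketch of the simulated Cormack-Jolly-Seber data-generating process
# described above (not the author's actual simulation code).
import numpy as np

rng = np.random.default_rng(3)
T = 10                                             # number of capture periods
t = np.arange(1, T + 1)
phi = np.cos((t - 2.3) / 1.2) / 11 + 0.75          # true time-varying survival
p = rng.beta(12, 12, size=T)                       # capture probabilities, mean ~0.5
                                                   # (held fixed across all simulations)

def simulate_capture_histories(n, phi, p, rng):
    """Simulate n capture histories; every individual first marked at period 1 (assumption)."""
    T = len(p)
    ch = np.zeros((n, T), dtype=int)
    ch[:, 0] = 1                                   # release at the first occasion
    alive = np.ones(n, dtype=bool)
    for j in range(1, T):
        alive &= rng.random(n) < phi[j - 1]        # survive the interval (j-1) -> j
        ch[:, j] = alive & (rng.random(n) < p[j])  # detected only if still alive
    return ch

ch = simulate_capture_histories(200, phi, p, rng)  # one dataset of n = 200 histories
print(ch[:5])
```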

Figure 2: Visualizing the bias-variance trade-off and the error of estimating survival in a Cormack-Jolly-Seber analysis, using four procedures (panel rows): i) the shrinkage estimator CJSboost; ii) AICc model-averaging based on four fixed-effect models of time-varying vs. time-constant survival and capture-probabilities; iii) the best AICc model; and iv) the Maximum Likelihood Estimate using the full-model p(t)phi(t). Panel columns are different sample sizes (number of capture-histories) over T=10 primary periods. The red lines show the true survival. Each grey line is an independently sampled dataset and an analyst's attempt to estimate survival. The dashed lines represent each procedure's average estimate over the 40 simulated data-sets and analyses. The best estimation procedure has the lowest MSE (turquoise for emphasis). Each procedure may have a high/low bias or low/high variance, but generally cannot succeed at minimizing both. The bias is the difference between the red and dashed lines. The variance is represented by the dispersion among grey lines. At small sample sizes, the AICc methods and boosting are very biased but have better MSE.

The results clearly show the trade-off (Figure 2). At high sample sizes (n=800), the shrinkage estimator CJSboost has the lowest MSE and therefore wins at estimating survival. However, it has the highest bias. How can it be considered a better estimator than the other methods when it is biased? The answer is obvious when looking at the grey lines in Figure 2, where each line is an estimate of phi from an independent realization of data: compared to the other methods, each grey line from CJSboost is much more likely to be closer to the truth, despite systematic bias. In contrast, using the MLEs, one can only claim to be unbiased over all possible realizations of the data as shown by the closeness of the dashed black line to the true red line. But, for any one realization (a single grey line) the MLEs can be very far away from the truth due to much higher variance.

At smaller sample sizes, we see that the bias becomes much more extreme for both AICc methods and CJSboost. In the case of the AICc methods, the model with the most support is often phi(.), in which case the estimates are a single flat line. This is also the case in CJSboost, where shrinkage is so extreme as to force a flat line. Therefore, at low sample sizes, we are much better off, in terms of MSE, to use the flat-lined phi(.) estimates rather than the full-model MLEs, which vary so wildly as to be useless.

This primer is meant to illustrate the role of bias and variance in estimation errors. Simulations show how shrinkage estimators (CJSboost) and model-selection (by AICc) each negotiate the trade-off between bias and variance to try to minimize the Expected Error. CJSboost does particularly well by incurring a little bias.

The original article can be accessed at: http://dx.doi.org/10.1101/052266. Original blog post at mucru.org/a-primer-on-the-bias-variance-trade-off-in-ecology/.
