Tuesday 10 May 2016

New pre-print: EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture

Rankin RW (2016) "EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture", bioRxiv pre-print, doi:10.1101/052266, URL: http://github.com/faraway1nspace/HMMboost


A new pre-print article is available online at http://dx.doi.org/10.1101/052266. The study proposes a new way to fit capture-recapture models based on boosting: a machine-learning method that iteratively fits simple, non-parametric learners and combines them into a strong prediction function. This ensemble method is a type of "multi-model inference" and "model-parsimony" technique, not unlike model-averaging by AICc. The new method, called CJSboost, is motivated by several deficiencies in the AICc model-selection approach, such as over-fitting the data and a tendency to produce ridiculous estimates, such as 100% survival. In contrast, boosting has the following benefits:

  • automatic variable selection and step-wise multimodel inference (without the sometimes-impossible task of fitting all possible fixed-effects models, as in AIC-based model averaging);

  • regularization and sparse estimation, which deflates the influence of unimportant covariates (especially important during the current crisis of reproducibility);

  • shrinkage of estimates away from extreme and inadmissible values (e.g., survival = 100%);

  • highly extensible (see the wide variety of base-learners available in the "mboost" R package, and the sketch after this list);

  • inference based on predictive performance.


One disadvantage of the machine-learning paradigm for capture-recapture is the computational burden: finding the optimal regularization parameters that minimize the generalization error requires running 50 to 100 bootstrap-validation replicates.
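To make the "highly extensible" point concrete, here is a minimal sketch of component-wise boosting with the "mboost" R package on a simulated toy dataset. This is generic mboost usage, not the CJSboost interface itself (see the tutorial link below for that), and the data and covariate names are invented for illustration.

    ## Minimal sketch: component-wise boosting with mboost on toy data.
    ## Illustrates the variety of base-learners (linear, spline); it is NOT
    ## the CJSboost capture-recapture interface, and all names are hypothetical.
    library(mboost)

    set.seed(1)
    dat <- data.frame(x1 = rnorm(200), x2 = runif(200), x3 = rnorm(200))
    dat$y <- 2 * dat$x1 + sin(4 * pi * dat$x2) + rnorm(200, sd = 0.5)  # x3 is frivolous

    ## bols() is a linear base-learner; bbs() is a P-spline base-learner
    ## (tree base-learners are also available via btree()).
    mod <- gamboost(y ~ bols(x1) + bbs(x2) + bols(x3),
                    data    = dat,
                    family  = Gaussian(),
                    control = boost_control(mstop = 200, nu = 0.1))

    table(selected(mod))  # how often each base-learner was chosen per iteration

At each of the mstop iterations, every base-learner is fit to the current residuals but only the best-fitting one is added to the ensemble; this is how component-wise boosting performs variable selection automatically.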

The Prediction Perspective

Capture-recapture practitioners are perennially obsessed with model-selection and model-parsimony: the challenge of revealing important patterns in the data without being tricked by frivolous false discoveries, i.e., without over-fitting models. The machine-learning community has addressed this challenge through the idea of generalization error: minimizing the cost of bad predictions on new, unseen data. Models that are too simple (e.g., the classic phi(.)p(.) model) are unlikely to make good predictions, and overly complex models are likewise unlikely to generalize to new data.

The boosting framework controls the complexity of the model (called regularization) by trying to minimize the generalization error (a.k.a. expected loss), and hence avoid over-fitting the data. Interestingly, most capture-recapture practitioners have been implicitly using a prediction criterion, the AIC, to rank and select models. Few ecologists seem to recognize the connection between AIC-selection and prediction, and this manuscript tries to make the connection more obvious. However, whereas the AIC was developed in the context of General Linear Models with Gaussian error, the CJSboost method proposed in this manuscript is crafted specifically for capture-recapture models.
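To make the regularization step tangible, the sketch below continues the toy mboost model from the earlier sketch: cvrisk() estimates the out-of-sample empirical risk across boosting iterations, and the stopping iteration (mstop) with minimal risk is the prediction-optimal amount of regularization. The choice of B = 25 bootstrap resamples is an arbitrary illustration.

    ## Sketch: tune the main regularization parameter, the stopping iteration
    ## mstop, by minimizing bootstrapped out-of-sample risk (an empirical
    ## stand-in for generalization error). Continues the toy model 'mod' above.
    cv_folds <- cv(model.weights(mod), type = "bootstrap", B = 25)
    cvr      <- cvrisk(mod, folds = cv_folds)

    plot(cvr)                   # risk curve: few iterations under-fit, many over-fit
    mstop(cvr)                  # prediction-optimal stopping iteration
    mod_opt <- mod[mstop(cvr)]  # truncate the boosted model at that iteration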

Sparsity and Prediction

One of the philosophical divisions between prediction-optimal regularizers, like boosting or the AICc, and other criteria, like the BIC/SBC, is belief in a sparse "true model": whether there is a small number of truly influential variables with large or medium effects, or whether there are an infinite number of influential variables, each with a decreasingly important influence on our data. This division is rooted in whether we want to minimize our risk of making bad numerical predictions, or whether we want to recover the "truth". The AIC and CJSboost adopt the former perspective: they both reduce the influence of unimportant variables, but some positive weight is still placed on unimportant variables (this is the consequence of being prediction-optimal!).

If prediction is not one's focus, and one is interested in finding a sparse set of truly influential variables, then CJSboosted coefficients may be "hard-thresholded": discard unimportant covariates and rerun the algorithm. But how do we judge which covariates are important? In the manuscript, I describe the technique of "stability selection" (Meinshausen and Bühlmann, 2010) for estimating (approximate) inclusion probabilities, based on repeatedly resampling the data and re-running the CJSboost algorithm. Probabilities lead to straightforward inference: covariates with high inclusion probabilities are probably important; covariates with low probabilities are probably not. An interesting property of L1-regularizers, like boosting algorithms, is that by hard-thresholding the covariates with stability-selection probabilities, we can transform a prediction-optimal procedure into a "model-selection consistent" procedure. This may not be exactly true for capture-recapture models, but the manuscript explores the issue with simulations.
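For readers who want to experiment, the "stabs" and "mboost" R packages implement stability selection for generic boosting models; the sketch below applies it to the toy model from the earlier sketches. The cutoff and per-family error rate (PFER) are illustrative assumptions, not values from the manuscript.

    ## Sketch: stability selection (Meinshausen & Buhlmann 2010) on the toy
    ## mboost model 'mod' above; cutoff and PFER are illustrative choices.
    library(stabs)

    stab <- stabsel(mod, cutoff = 0.75, PFER = 1)
    stab$max                    # maximum selection frequency per base-learner
    stab$selected               # base-learners whose frequency exceeds the cutoff
    plot(stab, type = "paths")  # stability paths, like the animation below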

The animation below shows 30 simulations and their stability-selection pathways. Each slide is a different simulated dataset with 3 truly influential covariates (in red) and 18 frivolous covariates included to try to trick the algorithm. The pathways show that as the regularization gets weaker (larger m, to the right), there is a larger probability that any covariate will be selected by the algorithm. At strong regularization (small m, left), only the 3 most influential covariates (in red) enter the algorithm. Visually, this is an interesting way to discriminate between important and unimportant covariates, and it can also help one achieve model-selection consistency.


Simulation: how to discriminate between important covariates (in red) and unimportant covariates (in grey) with stability-selection curves. By repeatedly re-sampling the capture-histories (bootstrapping) and re-training the CJSboost model, we can estimate the (approximate) probability that a covariate is included in the model.

Online Tutorial

The manuscript includes an online tutorial in R at http://github.com/faraway1nspace/HMMboost, where interested readers can step through the analyses in the manuscript.

Peer-review

Readers should note that the manuscript is a preprint only: it is currently under peer review.

    @article{Rankin052266,
      author    = {Rankin, Robert William},
      title     = {EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture},
      year      = {2016},
      doi       = {10.1101/052266},
      publisher = {Cold Spring Harbor Labs Journals},
      abstract  = {This study presents a new boosting method for capture-recapture models, rooted in predictive-performance and machine-learning. The regularization algorithm combines Expectation-Maximization and boosting to yield a type of multimodel inference, including automatic variable selection and control of model complexity. By analyzing simulations and a real dataset, this study shows the qualitatively similar estimates between AICc model-averaging and boosted capture-recapture for the CJS model. I discuss a number of benefits of boosting for capture-recapture, including: i) ability to fit non-linear patterns (regression-trees, splines); ii) sparser, simpler models that are less prone to over-fitting, singularities or boundary-value estimates than conventional methods; iii) an inference paradigm that is rooted in predictive-performance and free of p-values or 95\% confidence intervals; and v) estimates that are slightly biased, but are more stable over multiple realizations of the data. Finally, I discuss some philosophical considerations to help practitioners motivate the use of either prediction-optimal methods (AIC, boosting) or model-consistent methods. The boosted capture-recapture framework is highly extensible and could provide a rich, unified framework for addressing many topics in capture-recapture, such as spatial capture-recapture, individual heterogeneity, and non-linear effects.},
      URL       = {http://dx.doi.org/10.1101/052266},
      eprint    = {http://www.biorxiv.org/content/early/2016/05/09/052266.full.pdf},
      journal   = {bioRxiv}
    }