8 Section 8. Model checking & Cross-validation

2021-10-31

8.1 Resources

8.2 Notes

8.2.1 Chapter 6 reading instructions

  • Replicates vs. future observation
    • predictive \(\tilde{y}\): the next yet unobserved possible observation
    • \(y^\text{rep}\): replicating the whole experiment (with same values of \(x\)) and obtaining as many replicated observations as in the original data
  • Posterior predictive p-values
    • p-values are not recommended any more, especially in the form of hypothesis testing
  • Prior predictive checking
    • using just the prior predictive distributions for assessing the sensibility of the model and priors before observing any data
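
A minimal sketch of a prior predictive simulation in Python (numpy), assuming a toy Poisson count model with an exponential(0.2) prior on the rate (the same model used in the Stan example later in these notes); the simulated datasets are inspected against domain knowledge before any data are used:
import numpy as np

rng = np.random.default_rng(seed=1)

# prior predictive simulation: draw lambda from the prior,
# then draw a whole fake dataset y from p(y | lambda)
n_sims, n_obs = 100, 50
lam = rng.exponential(scale=1 / 0.2, size=n_sims)          # lambda ~ Exponential(rate = 0.2)
y_prior = rng.poisson(lam[:, None], size=(n_sims, n_obs))  # one row per simulated dataset

# sanity checks: are these counts plausible for the application at hand?
print("range of simulated counts:", y_prior.min(), y_prior.max())
print("typical dataset means:", np.quantile(y_prior.mean(axis=1), [0.05, 0.5, 0.95]))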

8.2.2 Chapter 6. Model checking

6.1 The place of model checking in applied Bayesian statistics

  • must assess the fit of a model to the data and to our substantive domain knowledge
Sensitivity analysis and model improvement
  • sensitivity analysis: “how much do posterior inferences change when other reasonable probability models are used in place of the present model?” (pg. 141)
Judging model flaws by their practical implications
  • not interested in whether the model is true or false - it will likely always be false
  • more interested in the question: “Do the model’s deficiencies have a noticeable effect on the substantive inferences?” (pg. 142)
    • keep focus on the more important parts of the model, too

6.2 Do the inferences from the model make sense?

  • there will be knowledge that is not included in the model
    • if the additional information suggests that posterior inferences are false, this suggests an option for improving the model’s accuracy
External validation
  • external validation: “using the model to make predictions about future data, and then collecting those data and comparing to their predictions” (pg. 142)

6.3 Posterior predictive checking

  • if the model fits, then generated replicate data should look like the observed data
    • “the observed data should look plausible under the posterior predictive distribution” (pg. 143)
    • is a self-consistency check
  • important to choose test quantities that are relevant to the goals of the model
    • the model may be inaccurate in some respects, but what matters is whether those inaccuracies affect the quantities of interest
  • need not worry about adjusting for multiple comparisons:
    • “We are not concerned with ‘Type I error’ rate… because we use the checks not to accept or reject a model but rather to understand the limits of its applicability in realistic replications.” (pg. 150)
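
A minimal sketch of a posterior predictive check with a test quantity, assuming observed counts y and an S × n matrix y_rep of replicated datasets are already available (for example from the Stan model in the lecture notes below); y, y_rep, and the zero-proportion statistic are placeholders:
import numpy as np

def posterior_predictive_check(y, y_rep, test_stat=np.max):
    """Compare a test quantity T(y) with its distribution over replicated datasets T(y_rep)."""
    t_obs = test_stat(y)                              # test quantity on the observed data
    t_rep = np.apply_along_axis(test_stat, 1, y_rep)  # test quantity on each replicated dataset
    # posterior predictive p-value: used to understand the limits of the model,
    # not to accept or reject it
    p_value = np.mean(t_rep >= t_obs)
    return t_obs, t_rep, p_value

# e.g. check whether the model reproduces the proportion of zeros:
# posterior_predictive_check(y, y_rep, test_stat=lambda d: np.mean(d == 0))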

6.4 Graphical posterior predictive checks

  • three types of graphical display to start a posterior predictive check:
    1. direct display of all the data
    • may need to get clever with how to do this effectively
    2. display of data summaries or parameter inferences
    • useful when the dataset is very large or to focus on a particular part of the model
    3. graphs of residuals or other measures of discrepancy between the model and data
    • description of how to do this effectively for discrete data
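
A small matplotlib sketch of the first type of display (the observed data shown next to a handful of replicated datasets), again assuming y and y_rep from a fitted model; the bayesplot R package provides ready-made versions of these displays:
import matplotlib.pyplot as plt
import numpy as np

def plot_replicates(y, y_rep, n_panels=8):
    """Histogram of the observed data alongside histograms of a few replicated datasets."""
    fig, axes = plt.subplots(3, 3, figsize=(9, 9), sharex=True, sharey=True)
    axes = axes.flatten()
    bins = np.histogram_bin_edges(y, bins="auto")
    axes[0].hist(y, bins=bins, color="black")
    axes[0].set_title("observed y")
    for ax, rep in zip(axes[1:], y_rep[:n_panels]):
        ax.hist(rep, bins=bins, color="steelblue")
        ax.set_title("replicated y")
    fig.tight_layout()
    return fig

# e.g. plot_replicates(y, y_rep)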

6.5 Model checking for the educational testing example

  • check that posterior parameter values and predictions are reasonable
  • compare summary statistics of real data and predictive distributions
    • min. and max. values, averages, skewness
  • sensitivity analysis can assuage concerns that the outcome was driven by specific choices of priors

8.2.3 Chapter 6. Lecture notes

Lecture 8.1. Model Checking

  • model checking overview:
    • Sensibility with respect to additional information not used in modeling
      • e.g., if posterior would claim that hazardous chemical decreases probability of death
    • External validation
      • compare predictions to completely new observations
      • compare to theoretical values
        • e.g., relativity theory predictions on the speed of light (not based on model optimized to data)
    • Internal validation
      • posterior predictive checking
      • cross-validation predictive checking
  • example of posterior checks with air quality model
  • examples with binary data and logistic regression
  • get posterior predictive distribution in Stan:
data {
  int<lower=1> N;               // number of observations
  int<lower=0> y[N];            // observed counts
}
parameters {
  real<lower=0> lambda;         // Poisson rate
}
model {
  lambda ~ exponential(0.2);    // prior on the rate
  y ~ poisson(lambda);          // likelihood
}
generated quantities {
  real log_lik[N];              // pointwise log-likelihood (for LOO / WAIC)
  int y_rep[N];                 // replicated data (for posterior predictive checks)
  for (n in 1:N) {
    y_rep[n] = poisson_rng(lambda);
    log_lik[n] = poisson_lpmf(y[n] | lambda);
  }
}
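
One possible way to fit this model from Python and pull out the replicated data and pointwise log-likelihood, assuming the program above is saved as poisson_ppc.stan and CmdStanPy is installed (the file name and toy data are placeholders):
import numpy as np
from cmdstanpy import CmdStanModel

y = np.array([3, 1, 4, 1, 5, 9, 2, 6])                    # toy counts; replace with real data
model = CmdStanModel(stan_file="poisson_ppc.stan")
fit = model.sample(data={"N": len(y), "y": y.tolist()})

y_rep = fit.stan_variable("y_rep")      # shape (draws, N): replicated datasets
log_lik = fit.stan_variable("log_lik")  # shape (draws, N): pointwise log-likelihood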

8.2.4 Chapter 7 reading instructions

8.2.5 Chapter 7. Evaluating, comparing, and expanding models

  • goal of this chapter is not to check model fit but to compare models and explore directions for improvement

7.1 Measures of predictive accuracy

  • can use predictive accuracy for comparing models
  • log predictive density: the logarithmic score for predictions is the log predictive density \(\log p(y|\theta)\)
    • expected log predictive density as a measure of overall model fit
  • external validation: ideally, would check a model’s fit on out-of-sample (new) data
Averaging over the distribution of future data
  • expected log predictive density (elpd) for a new data point:
    • where \(f\) is the true data-generating process and \(p_\text{post}(\tilde{y}_i)\) is the posterior predictive density of \(\tilde{y}_i\)
    • \(f\) is usually unknown

\[ \text{elpd} = \text{E}_f(\log p_\text{post}(\tilde{y}_i)) = \int (\log p_\text{post}(\tilde{y}_i)) f(\tilde{y}_i) d\tilde{y}_i \]

  • for a new dataset (instead of a single point) of \(n\) data points
    • kept as pointwise so can be related to cross-validation

\[ \text{elppd} = \text{expected log pointwise predictive density} = \sum_{i=1}^{n} \text{E}_f (\log p_\text{post}(\tilde{y}_i)) \]

Evaluating predictive accuracy for a fitted model
  • in practice, we do not know \(\theta\) so cannot know the log predictive density \(\log p(y|\theta)\)
  • want to use the posterior distribution \(p_\text{post}(\theta) = p(\theta|y)\) and summarize the predictive accuracy of a fitted model to data:

\[ \begin{aligned} \text{lppd} &= \text{log pointwise predictive density} \\ &= \log \prod_{i=1}^{n} p_\text{post}(y_i) \\ &= \sum_{i=1}^{n} \log \int p(y_i|\theta) p_\text{post}(\theta) d\theta \end{aligned} \]

  • to actually compute lppd, evaluate the expectation using the draws from \(p_\text{post}(\theta)\), \(\theta^s\):
    • lppd computed on the observed data is a biased (over-)estimate of elppd, so it needs a correction (the information criteria in the next section)

\[ \begin{aligned} \text{computed lppd} &= \text{computed log pointwise predictive density} \\ &= \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p(y_i|\theta^s) \right) \end{aligned} \]
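
A minimal numpy sketch of the computed lppd, assuming log_lik is an S × n matrix of pointwise log-likelihood draws \(\log p(y_i|\theta^s)\) (for example from the generated quantities block in the lecture notes above); logsumexp keeps the average over draws numerically stable:
import numpy as np
from scipy.special import logsumexp

def lppd(log_lik):
    """Computed log pointwise predictive density from an (S, n) matrix of log p(y_i | theta^s)."""
    S = log_lik.shape[0]
    # sum over i of log( (1/S) * sum_s p(y_i | theta^s) )
    return np.sum(logsumexp(log_lik, axis=0) - np.log(S))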

7.2 Information criteria and cross-validation

  • historically, measures of predictive accuracy are referred to as information criteria
    • are typically defined based on the deviance
Estimating out-of-sample predictive accuracy using available data
  • estimate the expected predictive accuracy without waiting for new data
  • some reasonable approximations for out-of-sample predictive accuracy
    • within-sample predictive accuracy: “naive estimate of the expected log predictive density for new data is the log predictive density for existing data” (pg 170)
    • adjusted within-sample predictive accuracy: information criteria such as WAIC
    • cross-validation: estimate out-of-sample prediction error by fitting to training data and evaluating predictive accuracy on the held-out data
  • descriptions of Akaike information criterion (AIC), deviance information criterion (DIC), and Watanabe-Akaike information criterion (or widely applicable information criterion; WAIC)
    • DIC and WAIC try to adjust for the effective number of parameters
    • WAIC is the best for Bayesian because it uses the full posterior distributions, not point estimates
      • there are actually two formulations for WAIC, and Gelman et al. recommend the second form they describe
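
A minimal sketch of the second (recommended) WAIC formulation, under the same assumptions as the lppd computation above: elpd_waic = lppd - p_waic, where p_waic sums the pointwise posterior variances of the log-likelihood:
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC from an (S, n) matrix of pointwise log-likelihood draws (second formulation)."""
    S = log_lik.shape[0]
    lppd_i = logsumexp(log_lik, axis=0) - np.log(S)   # pointwise lppd
    p_waic_i = np.var(log_lik, axis=0, ddof=1)        # pointwise effective number of parameters
    elpd_waic = np.sum(lppd_i - p_waic_i)
    return {"elpd_waic": elpd_waic, "p_waic": np.sum(p_waic_i), "waic": -2 * elpd_waic}
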
Effective number of parameters as a random variable
  • DIC and WAIC adjust the effective number of parameters according to the model structure and the data
    • the latter seems unintuitive
  • example: imagine a simple model: \(y_1, \dots, y_n \sim N(\theta, 1)\) with \(n\) large and \(\theta \sim U(0, \infty)\)
    • \(\theta\) is constrained to be positive but is otherwise uninformed
    • if the measurement of \(y\) is near 0, then the effective number of parameters \(p\) is effectively \(\frac{1}{2}\) because the prior removes all negative values
    • if the measurement of \(y\) is a large positive value, then the constraint from the prior is unimportant and the effective number of parameters \(p\) is essentially 1
  • Bayesian information criterion (BIC) is not comparable to AIC, DIC, and WAIC as it serves a different purpose
    • more discussion on pg. 175

7.3 Model comparison based on predictive performance

  • in comparing “nested” models, the larger model typically fits better but is more difficult to understand and compute
    • “nested” models are where one contains the structure of the other and a little more
      • either broader priors or additional parameters
    • key questions of model comparison:
      1. Is the improvement in fit large enough to justify the additional difficulty of fitting?
      2. Is the prior distribution on the additional parameters reasonable?
  • for non-nested models, not typically interested in choosing one over the other, but more interested in seeing the differences between them
    • ideally, could construct a single model containing both of the non-nested models
  • authors recommend using LOO-CV where possible, and WAIC otherwise (pg. 182)

7.4 Model comparison using Bayes factors

  • generally not recommended (pg. 182 for reasons why)

7.5 Continuous model expansion

  • posterior distributions of model parameters can either overestimate or underestimate different aspects of the “true” posterior uncertainty
    • overestimate uncertainty because the model usually does not contain all of one’s substantive knowledge
    • underestimate uncertainty because:
      • the model is almost always wrong (i.e. imperfect) - the reason for posterior predictive checking
      • other reasonable models could have fit the data equally well - the reason for sensitivity analysis
Adding parameters to a model
  • reasons to expand a model:
    1. add new parameters if the model does not fit the data or prior knowledge in some important way
    2. the class of models can be broadened if some modeling assumption was unfounded
    3. if two different models are under consideration, they can be combined into a larger model with a continuous parameterization that includes both models as special cases
    • e.g. complete-pooling and no-pooling can be combined into a hierarchical model
    4. expanding a model to include new data

Practical advice for model checking and expansion

  • examine posterior distributions of substantively important parameters and predicted quantities
    • e.g. number of zeros in a count model
    • maximum and minimum predicted values
  • compare posterior distributions and posterior predictions with substantive knowledge
    • this includes the observed data

8.2.6 Additional Reading

Visualization in Bayesian workflow

(pdf, link)

  • the phases of statistical workflow:
    1. exploratory data analysis
    • aid in setting up an initial model
    2. computational model checks using fake-data simulation and the prior predictive distribution
    3. computational checks to ensure the inference algorithm works reliably
    4. posterior predictive checks and other juxtapositions of data and predictions under the fitted model
    5. model comparison via tools such as cross-validation
3. Fake data can be almost as valuable as real data for building your model
  • visualize simulations from the prior marginal distribution of the data to assess the consistency of the chosen priors with domain knowledge
  • weakly informative prior: if draws from the prior data generating distribution \(p(y)\) could represent any dataset that could plausibly be observed
    • this prior predictive distribution for the data has at least some mass around extreme but plausible data sets
    • there should be no mass on completely implausible data sets
    • generate a “flip book” of simulated datasets that can be used to investigate the variability and multivariate structure of the distribution
4. Graphical Markov chain Monte Carlo diagnostics: moving beyond trace plots
  • catching divergent draws heuristically is a powerful feature of HMC
    • divergences are sometimes falsely flagged, so one must check whether the flagged draws were in fact outside of the typical set
    • two additional plots for diagnosing troublesome areas of the parameter space
      1. bivariate scatterplots that highlight divergent transitions
      • bad sign: the divergent transitions cluster in some region of parameter space
      2. parallel coordinates plot
      • bad sign: the divergent transitions share a similar structure

(figure: MCMC diagnostics plots)
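
A possible ArviZ-based sketch of these two displays, assuming a fitted model that has been converted to an InferenceData object (az.from_cmdstanpy and the idata name are just one way to get there):
import arviz as az

def divergence_plots(idata):
    """Two displays for locating divergent transitions in parameter space."""
    # 1. bivariate scatterplots with divergent transitions highlighted
    az.plot_pair(idata, divergences=True)
    # 2. parallel coordinates plot; look for divergences that share a common structure
    az.plot_parallel(idata)

# e.g. divergence_plots(az.from_cmdstanpy(fit))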

5. How did we do? Posterior predictive checks are vital for model evaluation

  • “The idea behind posterior predictive checking is simple: if a model is a good fit we should be able to use it to generate data that resemble the data we observed”
    • “can also perform similar checks within levels of a grouping variable”
  • check that predictions are calibrated using LOO-CV predictive cumulative distribution function values, which should be uniform (for continuous data)

6. Pointwise plots for predictive model comparison

  • identify unusual points in the data
    • these are either outliers or points with high leverage
    • indicate how the model can be modified to better fit the data
  • main tool for this analysis is the LOO predictive distribution \(p(y_i|y_{-i})\)
    • examine LOO log-predictive density values to find observations that are difficult to predict
    • can be used for model comparison by checking which model best captures each held-out data point
  • also compare the full data log-posterior predictive density against each LOO log-predictive density to see which data points are difficult to model but not very influential
    • PSIS-LOO automatically provides a related influence diagnostic for each observation, the Pareto shape parameter \(\hat{k}\)

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

(pdf, link)

Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations.

Introduction
  • exact CV requires re-fitting the model with every new training set
  • can approximate this with LOO-CV using importance sampling, but results can be very noisy
  • use Pareto smoothed importance sampling (PSIS) to calculate a more accurate and stable estimate
    • fit a Pareto distribution to the upper tail of the distribution of the importance weights
  • this paper demonstrates that PSIS-LOO is better than WAIC in the finite case
    • also provide diagnostics for which method, WAIC or PSIS-LOO, is better or whether k-fold CV should be used instead
Estimating out-of-sample pointwise predictive accuracy using posterior simulations
  • posterior predictive distribution: \(p(\tilde{y}_i|y) = \int p(\tilde{y}_i|\theta) p(\theta|y) d\theta\)
  • expected log pointwise predictive density (ELPD)
    • measure of predictive accuracy for the \(n\) data points in a dataset
    • \(p_t(\tilde{y}_i)\): distribution of the true data-generating process for \(\tilde{y}_i\)
    • \(p_t(\tilde{y}_i)\) is usually unknown so CV and WAIC are used to approximate ELPD

\[ \text{elpd} = \sum_{i=1}^n \int p_t(\tilde{y}_i) \log p(\tilde{y}_i|y) d\tilde{y}_i \]

  • useful quantity is the log pointwise predictive density (LPD)
    • “The LPD of observed data \(y\) is an overestimate of the ELPD for future data.”

\[ \text{lpd} = \sum_{i=1}^n \log p(y_i|y) = \sum_{i=1}^n \log \int p(y_i|\theta) p(\theta|y) d\theta \]

  • to compute LPD in practice, evaluate the expectation using draws from the posterior \(p_\text{post}(\theta)\), \(\theta^s\) for \(s = 1, \dots, S\)
    • \(\widehat{\text{lpd}}\): computed log pointwise predictive density

\[ \widehat{\text{lpd}} = \sum_{i=1}^n \log \left( \frac{1}{S} \sum_{s=1}^S p(y_i|\theta^s) \right) \]

Pareto smoothed importance sampling
  • fit a generalized Pareto distribution to the right tail of the importance weights and use it to smooth the largest weights
  • The estimated shape parameter \(\hat{k}\) of the generalized Pareto distribution can be used to assess the reliability of the estimate:
    • \(k < \frac{1}{2}\): the variance of the raw importance ratios is finite, the central limit theorem holds, and the estimate converges quickly
    • \(\frac{1}{2} < k < 1\): the variance of the raw importance ratios is infinite but the mean exists, the generalized central limit theorem for stable distributions holds, and the convergence of the estimate is slower
      • the variance of the PSIS estimate is finite but may be large
    • \(k > 1\): the variance and the mean of the raw ratios distribution do not exist
      • the variance of the PSIS estimate is finite but may be large
  • “If the estimated tail shape parameter \(\hat{k}\) exceeds 0.5, the user should be warned, although in practice we have observed good performance for values of \(\hat{k}\) up to 0.7.”
  • even if the PSIS estimate has a finite variance, when \(\hat{k} > 0.7\) the user should consider:
    1. sampling directly from \(p(\theta^s|y_{-i})\) for the problematic \(i\),
    2. using k-fold CV, or
    3. using a more robust model
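
A minimal sketch of plain (unsmoothed) importance-sampling LOO from the same (S, n) log-likelihood matrix used above; PSIS additionally fits a generalized Pareto distribution to the upper tail of each observation's weights and replaces the largest raw weights with smoothed values, which is what the loo R package and ArviZ's az.loo implement along with the \(\hat{k}\) diagnostic:
import numpy as np
from scipy.special import logsumexp

def is_loo_elpd(log_lik):
    """Plain importance-sampling LOO estimate of elpd from log p(y_i | theta^s) draws.

    Raw importance ratios are r_i^s = 1 / p(y_i | theta^s); PSIS would smooth the
    upper tail of these log-weights before the final step below.
    """
    log_weights = -log_lik                         # log of the raw ratios, up to a constant
    log_weights -= logsumexp(log_weights, axis=0)  # self-normalize the weights per observation
    # elpd_loo_i = log( sum_s w_i^s * p(y_i | theta^s) ), summed over observations
    return np.sum(logsumexp(log_weights + log_lik, axis=0))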

Model assessment, selection and inference after selection

(link)

Cross-validation FAQ

(link)

8.2.7 Chapter 7. Lecture notes

Lecture 8.2. Cross-Validation (part 1)

  • predictive performance
    • true predictive performance can be found by making predictions on new data and comparing them to the true observations
      • external validation
    • expected predictive performance as an approximation
  • calculating the posterior predictive density for a data point:
    • generate a posterior predictive distribution for the data point
    • find the density of the distribution at the actual observed value
  • good walk-through of calculating posterior predictive distributions and LOO analyses
  • some discussion of making predictions in parts of \(x\) not in the data set
    • e.g. predictions in the future of time series models
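
For contrast with the approximations above, a conceptual sketch of exact LOO-CV by refitting, using the Poisson example and a hypothetical fit_posterior_draws(y_train) helper that returns posterior draws of \(\lambda\) given the training data (e.g. by re-running the Stan model):
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

def exact_loo_elpd(y, fit_posterior_draws):
    """Exact LOO-CV: refit the model n times, each time leaving one observation out."""
    elpd_i = []
    for i in range(len(y)):
        y_train = np.delete(y, i)
        lam_draws = fit_posterior_draws(y_train)   # hypothetical helper; returns draws of lambda
        # log predictive density of the held-out point, averaged over the posterior draws
        log_dens = poisson.logpmf(y[i], lam_draws)
        elpd_i.append(logsumexp(log_dens) - np.log(len(lam_draws)))
    return np.sum(elpd_i)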

References

Navarro, Danielle J. 2019. “Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection.” Computational Brain & Behavior 2 (1): 28–34. https://doi.org/10.1007/s42113-018-0019-z.
Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Stat. Comput. 27 (3): 711–35. https://doi.org/10.1007/s11222-016-9649-y.
Sivula, Tuomas, Måns Magnusson, and Aki Vehtari. 2020. “Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison,” August. https://arxiv.org/abs/2008.10296.
Vehtari, Aki, Jonah Gabry, Mans Magnusson, Yuling Yao, Paul-Christian Bürkner, Topi Paananen, and Andrew Gelman. 2020. Loo: Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models. https://CRAN.R-project.org/package=loo.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017a. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27: 1413–32. https://doi.org/10.1007/s11222-016-9696-4.
———. 2017b. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Stat. Comput. 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.
Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model Assessment, Selection and Comparison.” Statistics Surveys 6: 142–228. https://doi.org/10.1214/12-SS102.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2017. “Using Stacking to Average Bayesian Predictive Distributions.” Bayesian Analysis. https://doi.org/10.1214/17-BA1091.