18 Section 20. Notes on ‘Ch 22. Finite mixture models’
2022-01-18
These are just notes on a single chapter of BDA3 that were not part of the course.
18.1 Chapter 22. Finite mixture models
- “when measurements of a random variable are taken under two different conditions” (pg. 519)
- or where the data contain multiple subpopulations, each with a different, relatively simple model
- basic mixture modeling principle is to introduce unobserved indicators \(z\) to specify the mixture component for an observation
- can think of a mixture indicator as missing data
22.1 Setting up and interpreting mixture models
Finite mixtures
- want to model the distribution of \(y = (y_1, \dots, y_n)\) or \(y|x\) as a mixture of \(H\) components
- for each component \(h \in \{1, \dots, H\}\), the distribution \(f_h(y_i | \theta_h)\) depends on a parameter vector \(\theta_h\)
- \(\lambda_h\) denotes the proportion of the population in component \(h\)
- \(\sum_{h=1}^{H} \lambda_h = 1\)
- common to assume all mixture components have the same parametric form
- thus, the sampling distribution of \(y\) is:
\[ p(y_i | \theta, \lambda) = \lambda_1 f(y_i | \theta_1) + \dots + \lambda_H f(y_i | \theta_H) \]
- can think of the mixture distribution probabilities \(\lambda\) as priors over the parameters \(\theta_h\)
- or as a description of the variation in \(\theta\) in a population
- akin to a hierarchical model
- introduce the indicator variables \(z_{ih}\) where \(z_{ih} = 1\) if the \(i\)th data point is drawn from component \(h\) and 0 otherwise
- the \(\lambda\) values are used to determine \(z\)
- can think of \(\lambda\) as a hyperprior over \(z\)
- the joint distribution of the observed data \(y\) and the unobserved indicators \(z\), conditional on the model parameters:
- only one \(z_{ih}\) can be 1 for each \(i\)
\[ \begin{aligned} p(y, z | \theta, \lambda) &= p(z | \lambda) p(y | z, \theta) \\ &= \prod_{i=1}^n \prod_{h=1}^H (\lambda_h f(y_i | \theta_h))^{z_{i,h}} \end{aligned} \]
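As a concrete sketch of this generative process (a two-component normal mixture is assumed, and all parameter values below are made up for illustration): draw each indicator \(z_i\) from \(\lambda\), draw \(y_i\) from its assigned component, and evaluate the marginal mixture density by summing over components.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical two-component normal mixture; lambda, means, and sds are made up.
lam = np.array([0.3, 0.7])    # mixture proportions, sum to 1
mu = np.array([-2.0, 3.0])    # component means (theta_h)
sigma = np.array([1.0, 1.5])  # component sds (theta_h)

n = 1000
# Draw the latent indicators z_i ~ Categorical(lambda) ...
z = rng.choice(len(lam), size=n, p=lam)
# ... then draw y_i from its assigned component f(y_i | theta_{z_i}).
y = rng.normal(mu[z], sigma[z])

# The marginal sampling distribution sums over components:
# p(y_i | theta, lambda) = sum_h lambda_h f(y_i | theta_h)
mix_density = (lam * norm.pdf(y[:, None], mu, sigma)).sum(axis=1)
print(mix_density[:5])
```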
Continuous mixtures
- generalize the finite mixture by allowing the probability that an observation belongs to a given class to vary continuously
- hierarchical models are a form of continuous mixture model
- each observed value \(y_i\) is modeled as coming from a mixture of models defined by the probability of values for \(\theta\)
- in the book, the focus is on finite mixtures; only “minor modifications” are generally required to handle continuous mixtures
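A classic concrete case is the scale mixture of normals: if each observation’s precision is drawn from a \(\text{Gamma}(\nu/2, \nu/2)\) distribution and \(y_i\) is normal given that precision, the marginal distribution of \(y\) is Student-\(t\) with \(\nu\) degrees of freedom. A minimal simulation (the choice \(\nu = 4\) is arbitrary) checks this:

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(1)

nu = 4.0  # degrees of freedom (arbitrary choice for illustration)
n = 100_000

# Continuous mixture: each observation gets its own precision drawn from
# Gamma(nu/2, rate=nu/2), i.e. scale = 2/nu ...
precision = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
# ... and y_i | precision_i is normal; marginally y is Student-t with nu df.
y = rng.normal(0.0, 1.0 / np.sqrt(precision))

# Compare the simulated tail probability to the Student-t tail.
print(np.mean(y > 2), 1 - student_t.cdf(2, df=nu))
```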
Identifiability of the mixture likelihood
- all finite mixture models are nonidentifiable: the distribution is unchanged if the group labels are permuted
- in many cases, purposeful, informative priors can solve the issue
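The label-switching nonidentifiability is easy to verify numerically: applying the same permutation to \(\lambda\) and \(\theta\) leaves the likelihood unchanged. A quick check with arbitrary illustrative parameters:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(size=50)  # any data will do for this check

def mixture_loglik(y, lam, mu, sigma):
    """Log likelihood of a normal mixture: sum_i log sum_h lam_h N(y_i | mu_h, sigma_h)."""
    dens = (lam * norm.pdf(y[:, None], mu, sigma)).sum(axis=1)
    return np.log(dens).sum()

# Arbitrary illustrative parameters and their label-permuted copy.
lam, mu, sigma = np.array([0.3, 0.7]), np.array([-2.0, 3.0]), np.array([1.0, 1.5])
perm = [1, 0]
print(mixture_loglik(y, lam, mu, sigma))
print(mixture_loglik(y, lam[perm], mu[perm], sigma[perm]))  # identical
```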
Prior distributions
- the prior for a finite mixture model’s parameters \(\theta\) and \(\lambda\) is usually the product of independent priors on each
- because the vector of mixture indicators \(z_i = (z_{i,1}, \dots, z_{i,H})\) is multinomial with parameter \(\lambda\), a common prior for \(\lambda\) is the Dirichlet
- \(\lambda \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_H)\)
- \(\theta = (\theta_1, \dots, \theta_H)\) is the vector of parameters for each component’s sub-model
- some can be shared across components (e.g. equal variance for a group of normal distributions)
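A minimal sketch of this prior structure (the \(\alpha\) values and \(H\) below are arbitrary choices): draw \(\lambda\) from a Dirichlet, then draw each indicator vector \(z_i\) as a single multinomial trial with parameter \(\lambda\).

```python
import numpy as np

rng = np.random.default_rng(3)

H = 3
alpha = np.ones(H)  # symmetric Dirichlet; the alpha values are an arbitrary choice

# lambda ~ Dirichlet(alpha_1, ..., alpha_H)
lam = rng.dirichlet(alpha)

# Each z_i = (z_{i,1}, ..., z_{i,H}) is multinomial(1, lambda):
# a one-hot vector marking the component of observation i.
n = 10
z = rng.multinomial(1, lam, size=n)
print(lam)
print(z)
```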
Number of mixture components
- can model \(H\) as unknown, but doing so is computationally expensive
- usually can just build models with different \(H\) and compare their goodness of fit
- compare the posterior predictive distributions with a “suitably chosen” test quantity
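The book’s comparison is via posterior predictive distributions; as a lightweight non-Bayesian stand-in, maximum-likelihood mixtures can be fit for a range of \(H\) and compared with an information criterion that penalizes the extra parameters. A sketch using scikit-learn on simulated data (all values made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Simulated data from a two-component mixture (parameters are made up).
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)]).reshape(-1, 1)

# Fit maximum-likelihood mixtures for several H and compare BIC,
# which penalizes the extra parameters of larger H (lower is better).
for H in range(1, 6):
    gm = GaussianMixture(n_components=H, random_state=0).fit(y)
    print(H, round(gm.bic(y), 1))
```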
Mixtures as true models or approximating distributions
- two schools of thought:
- theoretical: a mixture model is “a realistic characterization of the true data-generating mechanism” (pg. 522)
- pragmatic: “trying to infer latent subpopulations is an intrinsically ill-defined statistical problem, but finite mixture models are nonetheless useful” (pg. 523)
22.4 Unspecified number of mixture components
- can assign a Poisson distribution as a prior on \(H\) (the number of groups/components in the mixture model)
- computationally intensive
- more common to just fit the model with different \(H\) and compare with some statistic and a penalty for model complexity
- WAIC is theoretically justified, but ignores the uncertainty over \(H\)
- LOO-CV may be even better
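A full Bayesian LOO-CV or WAIC computation needs posterior draws (e.g., via arviz.loo on a fitted model); as a rough illustration of the cross-validation idea only, held-out log density under maximum-likelihood fits can be compared across \(H\). A sketch, again with scikit-learn and simulated data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)]).reshape(-1, 1)

# K-fold stand-in for leave-one-out: average held-out log density per point.
for H in range(1, 6):
    scores = []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(y):
        gm = GaussianMixture(n_components=H, random_state=0).fit(y[train])
        scores.append(gm.score(y[test]))  # mean log-likelihood on the held-out fold
    print(H, round(np.mean(scores), 3))
```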