18 Section 20. Notes on ‘Ch 22. Finite mixture models’
2022-01-18
These are just notes on a single chapter of BDA3 that were not part of the course.
18.1 Chapter 22. Finite mixture models
- “when measurements of a random variable are taken under two different conditions” (pg. 519)
- or where the data contain multiple subpopulations, each with a different, relatively simple model
- basic mixture modeling principle is to introduce unobserved indicators \(z\) to specify the mixture component for an observation
- can think of a mixture indicator as missing data
22.1 Setting up and interpreting mixture models
Finite mixtures
- want to model the distribution of \(y = (y_1, \dots, y_n)\) or \(y|x\) as a mixture of \(H\) components
- for each component \(h \in \{1, \dots, H\}\), the distribution \(f_h(y_i | \theta_h)\) depends on a parameter vector \(\theta_h\)
- \(\lambda_h\) denotes the proportion of the population in component \(h\)
- \(\sum_{h=1}^{H} \lambda_h = 1\)
- common to assume all mixture components have the same parametric form
- thus, the sampling distribution of \(y\) is:
\[ p(y_i | \theta, \lambda) = \lambda_1 f(y_i | \theta_1) + \dots + \lambda_H f(y_i | \theta_H) \]
- can think of the mixture distribution probabilities \(\lambda\) as priors over the parameters \(\theta_h\)
- or as a description of the variation in \(\theta\) in a population
- akin to a hierarchical model
- introduce the indicator variables \(z_{ih}\) where \(z_{ih} = 1\) if the \(i\)th data point is drawn from component \(h\) and 0 otherwise
- the \(\lambda\) values are used to determine \(z\)
- can think of \(\lambda\) as a hyperprior over \(z\)
- the joint distribution of the observed data \(y\) and the unobserved indicators \(z\), conditional on the model parameters:
- only one \(z_{ih}\) can be 1 for each \(i\)
\[ \begin{aligned} p(y, z | \theta, \lambda) &= p(z | \lambda) p(y | z, \theta) \\ &= \prod_{i=1}^n \prod_{h=1}^H (\lambda_h f(y_i | \theta_h))^{z_{i,h}} \end{aligned} \]
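As a concrete sketch of this generative process (a two-component normal mixture is assumed, and all parameter values below are made up for illustration): draw each indicator \(z_i\) from \(\lambda\), draw \(y_i\) from its assigned component, and evaluate the marginal mixture density by summing over components.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical two-component normal mixture; lambda, means, and sds are made up.
lam = np.array([0.3, 0.7])    # mixture proportions, sum to 1
mu = np.array([-2.0, 3.0])    # component means (theta_h)
sigma = np.array([1.0, 1.5])  # component sds (theta_h)

n = 1000
# Draw the latent indicators z_i ~ Categorical(lambda) ...
z = rng.choice(len(lam), size=n, p=lam)
# ... then draw y_i from its assigned component f(y_i | theta_{z_i}).
y = rng.normal(mu[z], sigma[z])

# The marginal sampling distribution sums over components:
# p(y_i | theta, lambda) = sum_h lambda_h f(y_i | theta_h)
mix_density = (lam * norm.pdf(y[:, None], mu, sigma)).sum(axis=1)
print(mix_density[:5])
```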
Continuous mixtures
- generalize the finite mixture by allowing the probability that an observation belongs to a given class to vary continuously
- hierarchical models are a form of continuous mixture model
- each observed value \(y_i\) is modeled as coming from a mixture of models defined by the probability of values for \(\theta\)
- in the book, the focus is on finite mixtures; only “minor modifications” are generally required to handle continuous mixtures
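A classic concrete case is the scale mixture of normals: if each observation’s precision is drawn from a \(\text{Gamma}(\nu/2, \nu/2)\) distribution and \(y_i\) is normal given that precision, the marginal distribution of \(y\) is Student-\(t\) with \(\nu\) degrees of freedom. A minimal simulation (the choice \(\nu = 4\) is arbitrary) checks this:

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(1)

nu = 4.0  # degrees of freedom (arbitrary choice for illustration)
n = 100_000

# Continuous mixture: each observation gets its own precision drawn from
# Gamma(nu/2, rate=nu/2), i.e. scale = 2/nu ...
precision = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
# ... and y_i | precision_i is normal; marginally y is Student-t with nu df.
y = rng.normal(0.0, 1.0 / np.sqrt(precision))

# Compare the simulated tail probability to the Student-t tail.
print(np.mean(y > 2), 1 - student_t.cdf(2, df=nu))
```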
Identifiability of the mixture likelihood
- all finite mixture models are nonidentifiable: the distribution is unchanged if the group labels are permuted
- in many cases, purposeful, informative priors can solve the issue
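The label-switching nonidentifiability is easy to verify numerically: applying the same permutation to \(\lambda\) and \(\theta\) leaves the likelihood unchanged. A quick check with arbitrary illustrative parameters:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(size=50)  # any data will do for this check

def mixture_loglik(y, lam, mu, sigma):
    """Log likelihood of a normal mixture: sum_i log sum_h lam_h N(y_i | mu_h, sigma_h)."""
    dens = (lam * norm.pdf(y[:, None], mu, sigma)).sum(axis=1)
    return np.log(dens).sum()

# Arbitrary illustrative parameters and their label-permuted copy.
lam, mu, sigma = np.array([0.3, 0.7]), np.array([-2.0, 3.0]), np.array([1.0, 1.5])
perm = [1, 0]
print(mixture_loglik(y, lam, mu, sigma))
print(mixture_loglik(y, lam[perm], mu[perm], sigma[perm]))  # identical
```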
Prior distributions
- the prior for a finite mixture model’s parameters \(\theta\) and \(\lambda\) is usually the product of independent priors on each
- because the vector of mixture indicators \(z_i = (z_{i,1}, \dots, z_{i,H})\) is multinomial with parameter \(\lambda\), a common prior for \(\lambda\) is the Dirichlet
- \(\lambda \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_H)\)
- \(\theta = (\theta_1, \dots, \theta_H)\) is the vector of parameters for each component’s sub-model
- some can be shared across components (e.g. equal variance for a group of normal distributions)
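A minimal sketch of this prior structure (the \(\alpha\) values and \(H\) below are arbitrary choices): draw \(\lambda\) from a Dirichlet, then draw each indicator vector \(z_i\) as a single multinomial trial with parameter \(\lambda\).

```python
import numpy as np

rng = np.random.default_rng(3)

H = 3
alpha = np.ones(H)  # symmetric Dirichlet; the alpha values are an arbitrary choice

# lambda ~ Dirichlet(alpha_1, ..., alpha_H)
lam = rng.dirichlet(alpha)

# Each z_i = (z_{i,1}, ..., z_{i,H}) is multinomial(1, lambda):
# a one-hot vector marking the component of observation i.
n = 10
z = rng.multinomial(1, lam, size=n)
print(lam)
print(z)
```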
Number of mixture components
- can model \(H\) as unknown, but doing so is computationally expensive
- usually can just build models with different \(H\) and compare their goodness of fit
- compare the posterior predictive distributions with a “suitably chosen” test quantity
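The book’s comparison is via posterior predictive distributions; as a lightweight non-Bayesian stand-in, maximum-likelihood mixtures can be fit for a range of \(H\) and compared with an information criterion that penalizes the extra parameters. A sketch using scikit-learn on simulated data (all values made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Simulated data from a two-component mixture (parameters are made up).
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)]).reshape(-1, 1)

# Fit maximum-likelihood mixtures for several H and compare BIC,
# which penalizes the extra parameters of larger H (lower is better).
for H in range(1, 6):
    gm = GaussianMixture(n_components=H, random_state=0).fit(y)
    print(H, round(gm.bic(y), 1))
```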
Mixtures as true models or approximating distributions
- two schools of thought:
- theoretical: a mixture model is “a realistic characterization of the true data-generating mechanism” (pg. 522)
- pragmatic: “trying to infer latent subpopulations is an intrinsically ill-defined statistical problem, but finite mixture models are nonetheless useful” (pg. 523)
22.4 Unspecified number of mixture components
- can assign a Poisson distribution as a prior on \(H\) (the number of groups/components in the mixture model)
- computationally intensive
- more common to just fit the model with different \(H\) and compare with some statistic and a penalty for model complexity
- WAIC is theoretically justified, but ignores the uncertainty over \(H\)
- LOO-CV may be even better
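A full Bayesian LOO-CV or WAIC computation needs posterior draws (e.g., via arviz.loo on a fitted model); as a rough illustration of the cross-validation idea only, held-out log density under maximum-likelihood fits can be compared across \(H\). A sketch, again with scikit-learn and simulated data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)]).reshape(-1, 1)

# K-fold stand-in for leave-one-out: average held-out log density per point.
for H in range(1, 6):
    scores = []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(y):
        gm = GaussianMixture(n_components=H, random_state=0).fit(y[train])
        scores.append(gm.score(y[test]))  # mean log-likelihood on the held-out fold
    print(H, round(np.mean(scores), 3))
```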