2 Section 2. Basics of Bayesian inference

2021-08-21

2.2 Notes

2.2.1 Chapter instructions

  • recommendations about weakly informative priors have changed a bit
    • updated recommendations: Prior Choice Recommendations
    • “5 levels of priors”:
      1. Flat prior (not usually recommended)
      2. Super-vague but proper prior: \(N(0, 10^6)\) (not usually recommended)
      3. Weakly informative prior: very weak; \(N(0, 10)\)
      4. Generic weakly informative prior: \(N(0, 1)\)
      5. Specific informative prior: \(N(0.4, 0.2)\) or whatever; can sometimes be expressed as a scaling followed by a generic prior: \(\theta = 0.4 + 0.2z,\ z \sim N(0, 1)\) (see the sketch after this list)
    • “flat and super-vague priors are not usually recommended”
    • even a seemingly weakly informative prior could be informative
      • e.g. a prior of \(N(0, 1)\) could put substantial weight on unrealistically large values if even a large effect size would only be on the scale of 0.1
    • def. weakly informative: “if there’s a reasonably large amount of data, the likelihood will dominate, and the prior will not be important”
    • section on General Principles; some stand-outs copied here:
      • “Computational goal in Stan: reducing instability which can typically arise from bad geometry in the posterior”
      • “Weakly informative prior should contain enough information to regularize: the idea is that the prior rules out unreasonable parameter values but is not so strong as to rule out values that might make sense”
      • “When using informative priors, be explicit about every choice; write a sentence about each parameter in the model.”
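
A minimal sketch (my own, not from the course materials) of the level-5 reparameterization mentioned above: draw \(z\) from the generic \(N(0, 1)\) prior and shift/scale it into the specific informative prior \(N(0.4, 0.2)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic weakly informative prior: z ~ N(0, 1)
z = rng.normal(0.0, 1.0, size=100_000)

# Specific informative prior N(0.4, 0.2) written as a shift and scale of z:
#   theta = 0.4 + 0.2 * z
theta = 0.4 + 0.2 * z

print(theta.mean(), theta.std())  # approximately 0.4 and 0.2
```

This is essentially the same trick as the non-centered parameterization often used in Stan models: the sampler works with the standardized \(z\), and the informative location and scale are applied as a deterministic transformation.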

2.2.2 Chapter 2. Single-parameter models

I took notes in the book, so below are just some main points.

2.2 Posterior as compromise between data and prior information

  • “general feature of Bayesian inference: the posterior distribution is centered at a point that represents a compromise between the prior information and the data”
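
As a concrete (standard, not note-specific) instance of this compromise: with a \(\text{Beta}(\alpha, \beta)\) prior and \(y\) successes in \(n\) binomial trials, the posterior mean is a weighted average of the prior mean \(\alpha/(\alpha+\beta)\) and the sample proportion \(y/n\), with weights proportional to the prior "sample size" \(\alpha + \beta\) and the data sample size \(n\):

\[ E[\theta | y] = \frac{\alpha + y}{\alpha + \beta + n} = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{y}{n} \]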

2.3 Estimating a probability from binomial data

  • a key benefit of Bayesian modeling is the flexibility of summarizing posterior probabilities
    • can be used to answer the key research questions
  • commonly used summary statistics
    • centrality: mean, median, mode
    • variation: standard deviation, interquartile range, highest posterior density
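
A short sketch (my own, with made-up posterior draws) of computing these summaries from posterior samples; `hpd_interval` here is a simple shortest-interval estimate, not a library function.

```python
import numpy as np

def hpd_interval(draws, prob=0.9):
    """Shortest interval containing `prob` of the draws (a simple HPD estimate)."""
    x = np.sort(draws)
    n = len(x)
    m = int(np.ceil(prob * n))
    widths = x[m - 1:] - x[: n - m + 1]
    i = np.argmin(widths)
    return x[i], x[i + m - 1]

# Hypothetical posterior draws, e.g. from a Beta(3, 9) posterior
rng = np.random.default_rng(1)
draws = rng.beta(3, 9, size=20_000)

print("mean   :", draws.mean())
print("median :", np.median(draws))
print("sd     :", draws.std(ddof=1))
print("IQR    :", np.subtract(*np.percentile(draws, [75, 25])))
print("90% HPD:", hpd_interval(draws, 0.9))
```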

2.4 Informative prior distributions

  • hyperparameter: parameter of a prior distribution
  • conjugacy: “the property that the posterior distribution follows the same parametric form as the prior distribution”
    • e.g. the beta prior is a conjugate family for the binomial likelihood
    • e.g. the gamma prior is a conjugate family for the Poisson likelihood
    • convenient because the posterior follows a known parametric family
    • formal definition of conjugacy: if \(\mathcal{F}\) is a class of sampling distributions \(p(y | \theta)\) and \(\mathcal{P}\) is a class of prior distributions for \(\theta\), then \(\mathcal{P}\) is conjugate for \(\mathcal{F}\) if

\[ p(\theta | y) \in \mathcal{P} \text{ for all } p(\cdot | \theta) \in \mathcal{F} \text{ and } p(\cdot) \in \mathcal{P} \]
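
A small numerical check (hypothetical numbers, my own sketch) of the beta-binomial case: the normalized prior-times-likelihood matches the analytic \(\text{Beta}(\alpha + y, \beta + n - y)\) posterior.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y successes in n binomial trials, Beta(a, b) prior
n, y = 20, 7
a, b = 2.0, 2.0

# Conjugacy: the posterior is again a beta distribution, Beta(a + y, b + n - y)
posterior = stats.beta(a + y, b + n - y)

# Grid check: normalize prior * likelihood and compare to the analytic posterior density
theta = np.linspace(0.0005, 0.9995, 1000)
unnorm = stats.beta.pdf(theta, a, b) * stats.binom.pmf(y, n, theta)
grid_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

print(np.abs(grid_post - posterior.pdf(theta)).max())  # close to 0
print("posterior mean:", posterior.mean())
```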

2.5 Normal distribution with known variance

  • precision (when discussing normal distributions): the inverse of the variance, e.g. the prior precision \(\frac{1}{\tau_0^2}\) or the data precision \(\frac{1}{\sigma^2}\)
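
For reference (a standard result, written here in BDA-style notation rather than copied from the notes): with a \(N(\mu_0, \tau_0^2)\) prior on \(\theta\), known data variance \(\sigma^2\), and \(n\) observations with mean \(\bar{y}\), the posterior is \(N(\mu_n, \tau_n^2)\), where precisions add and the posterior mean is a precision-weighted average:

\[ \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \]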

2.6 Other standard single-parameter models

  • Poisson model for count data
    • data \(y\) is the number of positive events
    • unknown rate of the events \(\theta\)
    • conjugate prior is the gamma distribution (update shown below)
    • section 2.7 is a good example of a hierarchical Poisson model
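
For reference, the standard conjugate update (using the gamma distribution's rate parameterization): with prior \(\theta \sim \text{Gamma}(\alpha, \beta)\) and counts \(y_1, \dots, y_n\) i.i.d. \(\text{Poisson}(\theta)\),

\[ \theta | y \sim \text{Gamma}\left(\alpha + \textstyle\sum_{i=1}^{n} y_i,\ \beta + n\right) \]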

2.8 Noninformative prior distributions

See more information in the notes from the chapter instructions.

  • rationale: let the data speak for themselves; inferences are unaffected by external information/bias
  • problems:
    • can cause the posterior to become improper
    • computationally, makes it harder to sample from the posterior

2.9 Weakly informative prior distributions

See more information in the notes from the chapter instructions.

  • weakly informative: the prior is proper, but intentionally weaker than whatever actual prior knowledge is available
  • “in general, any problem has some natural constraints that would allow a weakly informative model”
  • use a small amount of real-world knowledge to ensure the posterior makes sense
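
A small sketch (my own, not from the notes) of why a natural constraint matters: a super-vague \(N(0, 10)\) prior on a log-odds parameter concentrates most of its mass on probabilities near 0 or 1, while a weakly informative \(N(0, 1.5)\) prior does not.

```python
import numpy as np
from scipy.special import expit  # inverse logit

rng = np.random.default_rng(3)

# Priors on a log-odds parameter, pushed through the inverse logit to the probability scale
wide = expit(rng.normal(0.0, 10.0, size=100_000))  # "super-vague" prior
weak = expit(rng.normal(0.0, 1.5, size=100_000))   # weakly informative prior

def extreme(p):
    """Fraction of prior mass on near-deterministic probabilities (< 0.01 or > 0.99)."""
    return np.mean((p < 0.01) | (p > 0.99))

print(extreme(wide), extreme(weak))  # the wide prior piles up near 0 and 1
```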

2.2.3 Lecture notes

2.1 Basics of Bayesian inference, observation model, likelihood, posterior and binomial model, predictive distribution and benefit of integration

  • predictive distribution
    • “integrate over uncertainties”
    • for the example of pulling red or yellow chips out of a bag:
      • want a predictive distribution for some data point \(\tilde{y} = 1\):
      • if we know \(\theta\) then it is easy: \(p(\tilde{y} = 1 | \theta, y, n, M)\)
        • where \(n\) is the number of draws, \(y\) is the number of successes (red chips), and \(M\) is the model
      • since we don’t know \(\theta\), we weight the probability of the new data for a given \(\theta\) by the posterior probability of that value of \(\theta\)
        • sum (integrate) over all possible values for \(\theta\) (“integrate out the uncertainty of \(\theta\)”)
        • \(p(\tilde{y}=1|y, n, M) = \int_0^1 p(\tilde{y} = 1 | \theta, y, n, M) p(\theta | y, n, M) d\theta\)
      • now the prediction is not conditioned on \(\theta\), just on what was observed
  • prior predictive: predictions before seeing any data
    • \(p(\tilde{y}=1|M) = \int_0^1 p(\tilde{y}=1 | \theta, M)\, p(\theta|M)\, d\theta\)
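
A numerical version of the posterior predictive integral (my own sketch, using the chip example with hypothetical numbers and a uniform \(\text{Beta}(1, 1)\) prior): the grid approximation of \(\int_0^1 \theta\, p(\theta | y, n, M)\, d\theta\) matches the closed-form posterior mean.

```python
import numpy as np
from scipy import stats

# Hypothetical chip example: y red chips (successes) in n draws, uniform Beta(1, 1) prior
n, y = 10, 3
a, b = 1.0, 1.0
post_a, post_b = a + y, b + n - y          # posterior is Beta(a + y, b + n - y)

# Posterior predictive p(ytilde = 1 | y, n, M):
# integrate p(ytilde = 1 | theta) * p(theta | y, n, M) over theta on a grid
theta = np.linspace(0.0005, 0.9995, 1000)
post_pdf = stats.beta.pdf(theta, post_a, post_b)
pred_grid = np.sum(theta * post_pdf) * (theta[1] - theta[0])

# Closed form for the beta posterior: E[theta | y] = post_a / (post_a + post_b)
print(pred_grid, post_a / (post_a + post_b))  # both approximately 0.33
```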

2.2 Priors and prior information, and one parameter normal model

  • proper prior: \(\int p(\theta)\, d\theta = 1\)
    • better to use proper priors
  • improper prior density does not have a finite integral
    • the posterior can sometimes still be proper, though
    • uniform distributions over an unbounded range are improper
  • a weak prior is not non-informative
    • could give a lot of weight to very unlikely (or impossible) values
    • make sure to check prior values against knowable values
  • sufficient statistic: the quantity \(t(y)\) is a sufficient statistic for \(\theta\) because the likelihood for \(\theta\) depends on the data \(y\) only through the value of \(t(y)\)
    • smaller dimensional data that fully summarizes the full data
    • e.g. the sample mean and s.d. fully summarize the data for a Gaussian likelihood (demonstrated below)
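
A quick check (my own sketch with made-up data) that the normal likelihood really depends on the data only through \(n\), the sample mean, and the sample s.d.: two different datasets engineered to share these statistics give identical likelihood values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two very different datasets forced to have the same n, sample mean, and sample sd
y1 = rng.normal(2.0, 1.5, size=30)
z = rng.exponential(1.0, size=30)
y2 = (z - z.mean()) / z.std(ddof=1) * y1.std(ddof=1) + y1.mean()

def loglik(mu, sigma, y):
    """Normal log-likelihood of the data y at parameters (mu, sigma)."""
    return stats.norm.logpdf(y, loc=mu, scale=sigma).sum()

# Same value for any (mu, sigma), because (n, ybar, s) are sufficient for the normal model
print(np.isclose(loglik(1.8, 1.2, y1), loglik(1.8, 1.2, y2)))  # True
```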

Extras: likelihood, normalization term, density, and conditioning on model M

Predictive distribution and benefit of integration

  • predictive dist.
    • effect of integration
      • predictive dist of new \(\hat{y}\) (discrete) with model \(M\):
        • if we know \(\theta\): \(p(\hat{y} = 1| y, n, M) = p(\hat{y} = 1 | \theta, y, n, M)\)
        • if we don’t know \(\theta\): \(p(\hat{y} = 1 | \theta, y, n, M) p(\theta| y, n, M)\)
          • weight by the probability of the value for \(\theta\)
      • integrate over all possible values of \(\theta\): \(p(\hat{y} = 1|y, n, M) = \int_0^1 p(\hat{y}=1| \theta, y, n, M) p(\theta|y, n, M)d\theta\)
        • “integrate out the uncertainty over \(\theta\)”
  • prior predictive for new data \(\hat{y}\): \(p(\hat{y} = 1|M) = \int_0^1 p(\hat{y}=1|\theta, M)\, p(\theta|M)\, d\theta\)

Priors and prior information

  • conjugate priors do not result in any computational benefits in HMC or NUTS
    • they can be useful for analytically reducing the size of a model beforehand, though