or, Another Confusing Example of the Differences Between Bayesian Statistics and What You Were Taught in School
2019-06-14
Probably entirely for my own edification, this short note is one way to look at a difference between frequentist and Bayesian statistics. I don’t have any particular bias or stake in the various approaches, and my motivation for writing this is to help myself understand some widely used terminology. So, to someone familiar with the field, the description here is probably an unfair generalization.
The setting, dramatically simplified: you have taken a single measurement of something from the real world, call it \(x\). You’ve made some assumptions about the unknown, real-world process that generated \(x\), which include that the process can be described by a probability density function (pdf), and that this pdf belongs to a known parametric family \(F = \{f_\theta\}\), indexed by an unknown parameter \(\theta\).
So: we have a piece of observed data \(x\), and we have a model that says that \(x\) was generated from a pdf \(f_\theta(x)\). But there are an infinite number of such pdfs, parametrized by \(\theta\).
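To make things concrete later on, I’ll use a running example that is entirely my own invention and not part of the setup above: the family of exponential densities with rate parameter \(\theta > 0\), \[f_\theta(x) = \theta e^{-\theta x}, \qquad x > 0.\] Each value of \(\theta\) picks out one density from this family.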
What is the most reasonable value of \(\theta\) based on the evidence?
The frequentist tradition uses an argument called maximum likelihood estimation.
We imagine that \(\theta\) is some fixed but unknown value. In practice, it might be impossible for any experiment carried out by humans to definitively, accurately measure \(\theta\). But the Devil is omniscient, and They know \(\theta\) perfectly; it is a real thing in the universe.
For any particular value of \(\theta\), I could look at the probability density value \(f_\theta(x)\). If my choice of \(\theta\) is “compatible” with \(x\), then the density should be high; if my choice is incompatible, then the density will be relatively lower. So I should choose the value of \(\theta\) that results in the largest value of \(f_\theta(x)\). Said another way, I should choose the value of \(\theta\) under which observing \(x\) would be most probable (strictly speaking, most dense), were I to repeat the measurement.
We could generalize this a little bit by defining a likelihood function: \[\mathcal{L}(\theta \mid x) = f_\theta(x)\]
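For the running exponential example, the likelihood of the single observation is \[\mathcal{L}(\theta \mid x) = \theta e^{-\theta x},\] which we now read as a function of \(\theta\), with the observed \(x\) held fixed.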
This standard notation for the likelihood function might be confusing for a couple of reasons. The \(\mid\) in \(\mathcal{L}(\theta \mid x)\) looks like conditioning, but the likelihood is not a conditional distribution of \(\theta\); as a function of \(\theta\), it doesn’t even integrate to 1. And although we treat \(\mathcal{L}\) as a function of \(\theta\), its value at any particular \(\theta\) is just the density of the data \(x\) under that parameter.
To find the most reasonable value of \(\theta\), we just need to maximize \(\mathcal{L}\), using whatever methods are available and tractable. For some probability models \(F\), we can get a closed-form solution for the maximum likelihood estimate, but most often we’ll need to apply some kind of numerical procedure to try to approximate it.
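Here is a minimal sketch of what the numerical route can look like, using the exponential family from my running example (the observation, the bounds, and the SciPy optimizer choice are mine, purely for illustration). We minimize the negative log-likelihood, which is the same as maximizing \(\mathcal{L}\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# A single made-up measurement, assumed to come from an Exponential(theta)
# distribution with density f_theta(x) = theta * exp(-theta * x).
x = 2.5

def neg_log_likelihood(theta):
    # Maximizing L(theta | x) is the same as minimizing -log L(theta | x),
    # and the log is numerically better behaved.
    return -(np.log(theta) - theta * x)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print("numerical MLE:   ", result.x)   # approximately 0.4
print("closed form 1/x: ", 1.0 / x)    # setting d/dtheta log L = 0 gives theta = 1/x
```

In this toy model the closed form \(\hat{\theta} = 1/x\) is available by setting the derivative of the log-likelihood to zero; the optimizer is only standing in for the cases where it isn’t.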
Maximum likelihood estimation is plausible, widely used, and works well for many problems. However, if we were to try to justify it from an axiomatic foundation, we would probably end up with a different kind of argument, called maximum a posteriori (MAP) estimation.

In a Bayesian setting, we assume there is a full joint distribution between the data and the model parameters: \[p(x, \theta)\] Here \(p\) is a probability distribution, and we model the data-generating process by making an assumption about \(p(x \mid \theta)\), the conditional distribution of the data given the parameter.

To find the most plausible value of \(\theta\), we’ll use Bayes’ rule: \[p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}\] The term on the left-hand side, called the posterior, is the conditional distribution of the unknown parameter given the observed data; we think of it as a function of \(\theta\) parametrized by the fixed data \(x\). The posterior is an actual probability distribution (a density function), and it measures how probable the various values of \(\theta\) are. In a real application we usually imagine that \(\theta\) does have some definite fixed value, but the Bayesian approach recognizes that we can’t ever know this value with certainty, and uses a probability distribution over \(\theta\) to model our subjective degree of confidence in each value it might take.
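Continuing the running exponential example, and adding a prior of my own choosing (a Gamma density with shape \(a\) and rate \(b\), so \(p(\theta) \propto \theta^{a-1} e^{-b\theta}\)), Bayes’ rule gives, up to a constant that doesn’t depend on \(\theta\), \[p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) \propto \theta e^{-\theta x} \cdot \theta^{a-1} e^{-b\theta} = \theta^{a} e^{-(b + x)\theta},\] which is itself a Gamma density, with shape \(a + 1\) and rate \(b + x\).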
A nice feature of the Bayesian approach is that the posterior is a full distribution, so it summarizes much more information than just a point estimate of \(\theta\). But, if we want a single most reasonable value of \(\theta\), then we just take the maximum (the mode) of the posterior; that is the MAP estimate.
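Here is the same kind of sketch for the MAP estimate, keeping my running exponential example and the purely illustrative Gamma prior from above. The posterior mode has the closed form \(a/(b+x)\), which the optimizer should reproduce:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Same made-up measurement and Exponential(theta) likelihood as before,
# now with a Gamma(a, b) prior on theta (shape a, rate b), chosen for illustration.
x, a, b = 2.5, 2.0, 1.0

def neg_log_posterior(theta):
    log_likelihood = np.log(theta) - theta * x        # log p(x | theta)
    log_prior = (a - 1) * np.log(theta) - b * theta   # log p(theta), up to a constant
    # p(x) does not depend on theta, so it can be dropped when locating the mode.
    return -(log_likelihood + log_prior)

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 100.0), method="bounded")
print("numerical MAP:         ", result.x)      # approximately 0.571
print("closed-form a/(b + x): ", a / (b + x))   # mode of the Gamma(a + 1, b + x) posterior
```

Note that \(p(x)\), the denominator of Bayes’ rule, never appears in the code: it doesn’t depend on \(\theta\), so the unnormalized log-posterior is enough for finding the mode.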
It’s easy (for me) to get mixed up about all these quantities and how frequentist versus Bayesian methods actually differ. It helps me to keep some things in mind: