or, Another Confusing Example of the Differences Between Bayesian Statistics and What You Were Taught in School
2019-06-14
Probably entirely for my own edification, this short note is one way to look at a difference between frequentist and Bayesian statistics. I don’t have any particular bias or stake in the various approaches, and my motivation for writing this is to help myself understand some widely used terminology. So, to someone familiar with the field, the description here is probably an unfair generalization.
The setting, dramatically simplified: you have taken a single measurement of something from the real world, call it \(x\). You’ve made some assumptions about the unknown, real-world process that generated \(x\), which include that the process can be described by a probability density function (pdf), and that this pdf belongs to a known parametric family \(F = \{f_\theta\}\), indexed by an unknown parameter \(\theta\).
So: we have a piece of observed data \(x\), and we have a model that says that \(x\) was generated from a pdf \(f_\theta(x)\). But there are an infinite number of such pdfs, parametrized by \(\theta\).
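To make things concrete later on, I’ll use a running example that is entirely my own invention and not part of the setup above: the family of exponential densities with rate parameter \(\theta > 0\), \[f_\theta(x) = \theta e^{-\theta x}, \qquad x > 0.\] Each value of \(\theta\) picks out one density from this family.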
What is the most reasonable value of \(\theta\) based on the evidence?
The frequentist tradition uses an argument called maximum likelihood estimation.
We imagine that \(\theta\) is some fixed but unknown value. In practice, it might be impossible for any experiment carried out by humans to definitively, accurately measure \(\theta\). But the Devil is omniscient, and They know \(\theta\) perfectly; it is a real thing in the universe.
For any particular value of \(\theta\), I could look at the probability density value \(f_\theta(x)\). If my choice of \(\theta\) is “compatible” with \(x\), then the density should be high; if my choice is incompatible, then the density will be relatively lower. So I should choose the value of \(\theta\) that results in the largest value of \(f_\theta(x)\). Said another way, I should choose the value of \(\theta\) under which observing \(x\) would be most probable (strictly speaking, most dense), were I to repeat the measurement.
We could generalize this a little bit by defining a likelihood function: \[\mathcal{L}(\theta \mid x) = f_\theta(x)\]
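For the running exponential example, the likelihood of the single observation is \[\mathcal{L}(\theta \mid x) = \theta e^{-\theta x},\] which we now read as a function of \(\theta\), with the observed \(x\) held fixed.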
This standard notation for the likelihood function might be confusing for a couple of reasons. The \(\mid\) in \(\mathcal{L}(\theta \mid x)\) looks like conditioning, but the likelihood is not a conditional distribution of \(\theta\); as a function of \(\theta\), it doesn’t even integrate to 1. And although we treat \(\mathcal{L}\) as a function of \(\theta\), its value at any particular \(\theta\) is just the density of the data \(x\) under that parameter.
To find the most reasonable value of \(\theta\), we just need to maximize \(\mathcal{L}\), using whatever methods are available and tractable. For some probability models \(F\), we can get a closed-form solution for the maximum likelihood estimate, but most often we’ll need to apply some kind of numerical procedure to try to approximate it.
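Here is a minimal sketch of what the numerical route can look like, using the exponential family from my running example (the observation, the bounds, and the SciPy optimizer choice are mine, purely for illustration). We minimize the negative log-likelihood, which is the same as maximizing \(\mathcal{L}\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# A single made-up measurement, assumed to come from an Exponential(theta)
# distribution with density f_theta(x) = theta * exp(-theta * x).
x = 2.5

def neg_log_likelihood(theta):
    # Maximizing L(theta | x) is the same as minimizing -log L(theta | x),
    # and the log is numerically better behaved.
    return -(np.log(theta) - theta * x)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print("numerical MLE:   ", result.x)   # approximately 0.4
print("closed form 1/x: ", 1.0 / x)    # setting d/dtheta log L = 0 gives theta = 1/x
```

In this toy model the closed form \(\hat{\theta} = 1/x\) is available by setting the derivative of the log-likelihood to zero; the optimizer is only standing in for the cases where it isn’t.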
Maximum likelihood estimation is plausible, widely used, and works well for many problems. However, if we were to try to justify it from an axiomatic foundation, we would probably end up with a different kind of argument, called maximum a posteriori (MAP) estimation.

In a Bayesian setting, we assume there is a full joint distribution between the data and the model parameters: \[p(x, \theta)\] Here \(p\) is a probability distribution, and we model the data-generating process by making an assumption about \(p(x \mid \theta)\), the conditional distribution of the data given the parameter.

To find the most plausible value of \(\theta\), we’ll use Bayes’ rule: \[p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}\] The term on the left-hand side, called the posterior, is the conditional distribution of the unknown parameter given the observed data; we think of it as a function of \(\theta\) parametrized by the fixed data \(x\). The posterior is an actual probability distribution (a density function), and it measures how probable the various values of \(\theta\) are. In a real application we usually imagine that \(\theta\) does have some definite fixed value, but the Bayesian approach recognizes that we can’t ever know this value with certainty, and uses a probability distribution over \(\theta\) to model our subjective degree of confidence in each value it might take.
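Continuing the running exponential example, and adding a prior of my own choosing (a Gamma density with shape \(a\) and rate \(b\), so \(p(\theta) \propto \theta^{a-1} e^{-b\theta}\)), Bayes’ rule gives, up to a constant that doesn’t depend on \(\theta\), \[p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) \propto \theta e^{-\theta x} \cdot \theta^{a-1} e^{-b\theta} = \theta^{a} e^{-(b + x)\theta},\] which is itself a Gamma density, with shape \(a + 1\) and rate \(b + x\).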
A nice feature of the Bayesian approach is that the posterior is a full distribution, so it summarizes much more information than just a point estimate of \(\theta\). But, if we want a single most reasonable value of \(\theta\), then we just take the maximum (the mode) of the posterior; that is the MAP estimate.
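Here is the same kind of sketch for the MAP estimate, keeping my running exponential example and the purely illustrative Gamma prior from above. The posterior mode has the closed form \(a/(b+x)\), which the optimizer should reproduce:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Same made-up measurement and Exponential(theta) likelihood as before,
# now with a Gamma(a, b) prior on theta (shape a, rate b), chosen for illustration.
x, a, b = 2.5, 2.0, 1.0

def neg_log_posterior(theta):
    log_likelihood = np.log(theta) - theta * x        # log p(x | theta)
    log_prior = (a - 1) * np.log(theta) - b * theta   # log p(theta), up to a constant
    # p(x) does not depend on theta, so it can be dropped when locating the mode.
    return -(log_likelihood + log_prior)

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 100.0), method="bounded")
print("numerical MAP:         ", result.x)      # approximately 0.571
print("closed-form a/(b + x): ", a / (b + x))   # mode of the Gamma(a + 1, b + x) posterior
```

Note that \(p(x)\), the denominator of Bayes’ rule, never appears in the code: it doesn’t depend on \(\theta\), so the unnormalized log-posterior is enough for finding the mode.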
It’s easy (for me) to get mixed up about all these quantities and how frequentist versus Bayesian methods actually differ. It helps me to keep some things in mind: