2020-05-18
See also Probability; How to estimate statistical parameters
Spam filtering is a good example to start with.* Assume the presence or absence of a specific word is a good indicator of whether an email is spam or not. Given a new instance, the thing we want to know is \(\Pr(\text{spam} \mid \text{word})\), which is just \[\Pr(\text{spam} \mid \text{word}) = \frac{\Pr(\text{word} \mid \text{spam}) \Pr(\text{spam})}{\Pr(\text{word})}.\] We can calculate everything on the right-hand side using some labeled data.
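To make the single-word version concrete, here's a minimal sketch in Python (my own illustration, not from the source; the toy data and function name are made up) that estimates everything on the right-hand side from labeled data and applies Bayes' rule:

```python
# Single-word spam example: estimate Pr(spam), Pr(word), and Pr(word | spam)
# from labeled data, then apply Bayes' rule.

def posterior_spam_given_word(emails):
    """emails: list of (contains_word, is_spam) boolean pairs."""
    n = len(emails)
    n_spam = sum(1 for _, is_spam in emails if is_spam)
    n_word = sum(1 for has_word, _ in emails if has_word)
    n_both = sum(1 for has_word, is_spam in emails if has_word and is_spam)

    p_spam = n_spam / n                  # Pr(spam)
    p_word = n_word / n                  # Pr(word)
    p_word_given_spam = n_both / n_spam  # Pr(word | spam)

    # Bayes' rule: Pr(spam | word) = Pr(word | spam) Pr(spam) / Pr(word)
    return p_word_given_spam * p_spam / p_word

emails = [(True, True), (True, True), (False, True),
          (True, False), (False, False), (False, False)]
print(posterior_spam_given_word(emails))  # (2/3 * 1/2) / (1/2) = 2/3
```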
Generalizing, we figure that many words are good indicators of spam. We’ll imagine a bag-of-words feature vector: high dimensional, each entry is binary indicating the presence or absence of a particular word in an email. The “naive” part of the model is the assumption that words do not interact at all. So we assume there is a generative model that looks like (remember \(x_j\) is either \(0\) or \(1\))
\[p(\vec{x} \mid c) = \prod_j \theta_{jc}^{x_j} (1-\theta_{jc})^{(1-x_j)}\]
In other words, the presence of any single word is just a single Bernoulli trial that depends on the class \(c\) and a parameter \(\theta_{jc}\) to be fitted from our data: we don't count word occurrences, and we assume words don't interact at all.
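Fitting that per-word Bernoulli model just comes down to counting how often each word shows up in each class. A rough sketch, assuming a binary document-word matrix; the Laplace-smoothing terms (+1, +2) are my addition to keep the estimates away from exactly 0 and 1:

```python
# Fit theta_jc by counting, assuming X is a binary (n_emails x n_words) array
# and y is an array of class labels.
import numpy as np

def fit_bernoulli_theta(X, y, classes):
    n_words = X.shape[1]
    theta = np.zeros((n_words, len(classes)))
    for c_idx, c in enumerate(classes):
        X_c = X[y == c]  # emails in class c
        # Smoothed fraction of class-c emails containing each word.
        theta[:, c_idx] = (X_c.sum(axis=0) + 1) / (len(X_c) + 2)
    return theta

def log_class_conditional(x, theta_c):
    # log p(x | c) = sum_j [ x_j log theta_jc + (1 - x_j) log(1 - theta_jc) ]
    return np.sum(x * np.log(theta_c) + (1 - x) * np.log(1 - theta_c))
```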
Following, sort of roughly, Kevin Patrick Murphy,† the basic idea is that there is some full joint probability distribution for the features of an item \(\vec{x}\), the item's class \(y\), and some parameters \(\vec{\theta}\):
\[p(\vec{x}, y, \vec{\theta})\]
So we assume this exists, and although we could marginalize or condition this distribution, instead we’ll tackle it a different way: in the “generative model” approach, we want to write down a model that generates the data we observe, conditioned on particular values for the model parameters. In the case of a classification problem we want to write down a model that would generate a data vector \(\vec{x}\), assuming \(\vec{x}\) came from a particular class and we had selected particular model parameters \(\vec{\theta}\). In other words we want to create a model for the “class conditional distribution” \[p(\vec{x} \mid y, \vec{\theta}).\]
The “naive” assumption is that all the components of \(\vec{x}\) are conditionally independent given the class label, meaning we can write the class-conditional density as
\[p(\vec{x} \mid y=c, \vec{\theta}) = \prod_{i=1}^D p(x_i \mid y = c, \theta_{ic})\]
In other words, if there are actually \(K\) different classes, then the equation above corresponds to \(K\) "different" probability models, and each model is the product of \(D\) independent distributions. Each of the \(K \cdot D\) distributions has its own parameter \(\theta_{ic}\), which might itself be a vector of parameters.
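Putting the pieces together, classification amounts to evaluating each of the \(K\) factorized models at \(\vec{x}\), adding the log prior, and taking the argmax. A sketch, assuming \(\theta\) is a \(D \times K\) matrix of fitted Bernoulli parameters like the one in the earlier sketch:

```python
# Prediction under the factorized model: for each of the K classes, sum the D
# per-feature Bernoulli log-likelihoods, add the log prior, and take the
# argmax. Assumes log_priors is a length-K array of log p(y = c).
import numpy as np

def predict(x, theta, log_priors):
    # log p(x | y=c, theta) = sum_i [ x_i log theta_ic + (1 - x_i) log(1 - theta_ic) ]
    log_lik = x @ np.log(theta) + (1 - x) @ np.log(1 - theta)  # length-K vector
    return np.argmax(log_lik + log_priors)
```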
Trevor Hastie, Robert Tibshirani, and Jerome Friedman‡ present naive Bayes in the context of kernel smoothing and density estimation. I think this is one angle for explaining why it works well: you can use Bayes' rule with separately estimated densities for each class to come up with a decision procedure, but this starts to break down in high dimensions or with too little data. If all you care about is the decision and not the density estimates themselves, you can simplify to naive Bayes, which still does a good job of finding the decision boundary.