2020-05-18
See also Probability; How to estimate statistical parameters
Spam filtering is a good example to start with.* Assume the presence or absence of a specific word is a good indicator of whether an email is spam or not. Given a new instance, the thing we want to know is \(\Pr(\text{spam} \mid \text{word})\), which is just \[\Pr(\text{spam} \mid \text{word}) = \frac{\Pr(\text{word} \mid \text{spam}) \Pr(\text{spam})}{\Pr(\text{word})}.\] We can calculate everything on the right-hand side using some labeled data.
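To make the single-word version concrete, here's a minimal sketch in Python (my own illustration, not from the source; the toy data and function name are made up) that estimates everything on the right-hand side from labeled data and applies Bayes' rule:

```python
# Single-word spam example: estimate Pr(spam), Pr(word), and Pr(word | spam)
# from labeled data, then apply Bayes' rule.

def posterior_spam_given_word(emails):
    """emails: list of (contains_word, is_spam) boolean pairs."""
    n = len(emails)
    n_spam = sum(1 for _, is_spam in emails if is_spam)
    n_word = sum(1 for has_word, _ in emails if has_word)
    n_both = sum(1 for has_word, is_spam in emails if has_word and is_spam)

    p_spam = n_spam / n                  # Pr(spam)
    p_word = n_word / n                  # Pr(word)
    p_word_given_spam = n_both / n_spam  # Pr(word | spam)

    # Bayes' rule: Pr(spam | word) = Pr(word | spam) Pr(spam) / Pr(word)
    return p_word_given_spam * p_spam / p_word

emails = [(True, True), (True, True), (False, True),
          (True, False), (False, False), (False, False)]
print(posterior_spam_given_word(emails))  # (2/3 * 1/2) / (1/2) = 2/3
```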
Generalizing, we figure that many words are good indicators of spam. We’ll imagine a bag-of-words feature vector: high dimensional, each entry is binary indicating the presence or absence of a particular word in an email. The “naive” part of the model is the assumption that words do not interact at all. So we assume there is a generative model that looks like (remember \(x_j\) is either \(0\) or \(1\))
\[p(\vec{x} \mid c) = \prod_j \theta_{jc}^{x_j} (1-\theta_{jc})^{(1-x_j)}\]
In other words, the presence of any single word is just a single Bernoulli trial that depends on the class \(c\) and a parameter \(\theta_{jc}\) to be fitted from our data: we don't count word occurrences, and we assume words don't interact at all.
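Fitting that per-word Bernoulli model just comes down to counting how often each word shows up in each class. A rough sketch, assuming a binary document-word matrix; the Laplace-smoothing terms (+1, +2) are my addition to keep the estimates away from exactly 0 and 1:

```python
# Fit theta_jc by counting, assuming X is a binary (n_emails x n_words) array
# and y is an array of class labels.
import numpy as np

def fit_bernoulli_theta(X, y, classes):
    n_words = X.shape[1]
    theta = np.zeros((n_words, len(classes)))
    for c_idx, c in enumerate(classes):
        X_c = X[y == c]  # emails in class c
        # Smoothed fraction of class-c emails containing each word.
        theta[:, c_idx] = (X_c.sum(axis=0) + 1) / (len(X_c) + 2)
    return theta

def log_class_conditional(x, theta_c):
    # log p(x | c) = sum_j [ x_j log theta_jc + (1 - x_j) log(1 - theta_jc) ]
    return np.sum(x * np.log(theta_c) + (1 - x) * np.log(1 - theta_c))
```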
Following, sort of roughly, Kevin Patrick Murphy,† the basic idea is that there is some full joint probability distribution for the features of an item \(\vec{x}\), the item's class \(y\), and some parameters \(\vec{\theta}\):
\[p(\vec{x}, y, \vec{\theta})\]
So we assume this exists, and although we could marginalize or condition this distribution, instead we’ll tackle it a different way: in the “generative model” approach, we want to write down a model that generates the data we observe, conditioned on particular values for the model parameters. In the case of a classification problem we want to write down a model that would generate a data vector \(\vec{x}\), assuming \(\vec{x}\) came from a particular class and we had selected particular model parameters \(\vec{\theta}\). In other words we want to create a model for the “class conditional distribution” \[p(\vec{x} \mid y, \vec{\theta}).\]
The “naive” assumption is that all the components of \(\vec{x}\) are conditionally independent given the class label, meaning we can write the class-conditional density as
\[p(\vec{x} \mid y=c, \vec{\theta}) = \prod_{i=1}^D p(x_i \mid y = c, \theta_{ic})\]
In other words, if there are actually \(K\) different classes, then the equation above corresponds to \(K\) "different" probability models, and each model is the product of \(D\) independent distributions. Each of the \(K \cdot D\) distributions has its own parameter \(\theta_{ic}\), which might itself be a vector of parameters.
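Putting the pieces together, classification amounts to evaluating each of the \(K\) factorized models at \(\vec{x}\), adding the log prior, and taking the argmax. A sketch, assuming \(\theta\) is a \(D \times K\) matrix of fitted Bernoulli parameters like the one in the earlier sketch:

```python
# Prediction under the factorized model: for each of the K classes, sum the D
# per-feature Bernoulli log-likelihoods, add the log prior, and take the
# argmax. Assumes log_priors is a length-K array of log p(y = c).
import numpy as np

def predict(x, theta, log_priors):
    # log p(x | y=c, theta) = sum_i [ x_i log theta_ic + (1 - x_i) log(1 - theta_ic) ]
    log_lik = x @ np.log(theta) + (1 - x) @ np.log(1 - theta)  # length-K vector
    return np.argmax(log_lik + log_priors)
```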
Trevor Hastie, Robert Tibshirani, and Jerome Friedman‡ present naive Bayes in the context of kernel smoothing and density estimation. I think this is one angle for explaining why it works well: you can use Bayes' rule with separately estimated densities for each class to come up with a decision procedure, but this starts to break down in high dimensions or with too little data. If all you care about is the decision and not the density estimates themselves, you can simplify to naive Bayes, which still does a good job of finding the decision boundary.