
ML Foundations: The Naive Bayes Classifier

Introduction

In our journey through machine learning algorithms, we’ve seen models that find lines (Linear Regression), create S-curves (Logistic Regression), and ask questions (Decision Trees). Now we turn to a model that thinks in terms of probabilities: the Naive Bayes classifier.

Naive Bayes is a simple but surprisingly powerful probabilistic classifier based on Bayes’ Theorem. It’s particularly famous for its use in text classification, such as spam filtering and document categorization. Its elegance lies in its simplicity and efficiency, especially on datasets with a large number of features.

In this guide, we’ll demystify Naive Bayes by exploring:

  1. The core concept of Bayes’ Theorem.
  2. The “naive” assumption that makes the algorithm so fast.
  3. An intuitive example of how it’s used for spam filtering.
  4. The different types of Naive Bayes classifiers.
  5. A practical implementation with Scikit-Learn.

1. The Core Idea: Bayes’ Theorem

At its heart, Naive Bayes is an application of Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event.

The formula is:

P(A|B) = [P(B|A) * P(A)] / P(B)

Let’s translate this into a classification context:

P(Class | Features) = [P(Features | Class) * P(Class)] / P(Features)

  • P(Class | Features): This is the posterior probability, the probability of a data point belonging to a certain class given its features. This is what we want to calculate. For example, “What is the probability this email is spam, given that it contains the word ‘viagra’?”
  • P(Features | Class): This is the likelihood. It’s the probability of observing these features given that the data point belongs to a certain class. For example, “How often does the word ‘viagra’ appear in emails that are known to be spam?”
  • P(Class): This is the prior probability of the class. It’s the overall probability of this class in the dataset. For example, “What percentage of all emails are spam?”
  • P(Features): This is the probability of observing the features, sometimes called the evidence. It acts as a normalizing constant, and because it is the same for every class, we can ignore it when comparing classes; we only care about which class scores highest.

The algorithm calculates the posterior probability for every class and then picks the class with the highest probability.
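
To make this concrete, here is a minimal Python sketch with made-up numbers for a single feature (the word "viagra"). Because the denominator P(Features) is the same for both classes, we only compare the numerators.

# Bayes' Theorem with made-up numbers: P(Class | "viagra") is proportional
# to P("viagra" | Class) * P(Class), so the shared denominator can be skipped.
prior = {"spam": 0.25, "not_spam": 0.75}          # P(Class), assumed values
likelihood = {"spam": 0.10, "not_spam": 0.01}     # P("viagra" | Class), assumed values

scores = {cls: likelihood[cls] * prior[cls] for cls in prior}
print(scores)                       # {'spam': 0.025, 'not_spam': 0.0075}
print(max(scores, key=scores.get))  # -> 'spam'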

2. The “Naive” Assumption: A Clever Simplification

Calculating P(Features | Class) directly is computationally difficult because "Features" is really a vector of many individual features (e.g., thousands of words in an email). Estimating the joint probability of all of those features together would require an enormous amount of training data.

This is where the “naive” assumption comes in. The algorithm naively assumes that all features are conditionally independent of each other, given the class.

In the context of spam filtering, this means the algorithm assumes that the presence of the word “viagra” in an email has no effect on the presence of the word “sale,” given that the email is spam.

This assumption is almost always wrong in the real world (the words “viagra” and “sale” are very likely to appear together in spam emails). However, this simplification is what makes Naive Bayes so efficient, and it works remarkably well in practice, especially for text classification.

With this assumption, the likelihood P(Features | Class) becomes the simple product of the individual probabilities of each feature:

P(Features | Class) ≈ P(feature₁ | Class) * P(feature₂ | Class) * ...

This is much, much easier to calculate!
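
As a quick sketch with assumed per-word probabilities, the calculation is just a running product. (Real implementations usually sum log-probabilities instead, because multiplying many small numbers can underflow to zero.)

import math

# Assumed values for P(word | Spam); any numbers would do for illustration.
word_probs = {"viagra": 0.10, "sale": 0.20, "today": 0.05}

# Naive independence: P(words | Spam) ≈ product of the individual probabilities
joint = 1.0
for p in word_probs.values():
    joint *= p
print(joint)  # ≈ 0.001

# Numerically safer equivalent: sum the log-probabilities
log_joint = sum(math.log(p) for p in word_probs.values())
print(math.exp(log_joint))  # ≈ 0.001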

3. Intuitive Example: Is This Email Spam?

Let’s say we want to classify a new email with the subject line “Big sale today”.

Our Goal: Is P(Spam | "Big sale today") greater than P(Not Spam | "Big sale today")?

1. Calculate Priors: First, we look at our training data.

  • Let’s say 25% of our emails are spam and 75% are not.
  • P(Spam) = 0.25
  • P(Not Spam) = 0.75

2. Calculate Likelihoods: We look at the frequency of each word in spam and not-spam emails from our training data.

  • P("Big" | Spam) = 0.1 (10% of spam emails contain “Big”)
  • P("sale" | Spam) = 0.2 (20% of spam emails contain “sale”)
  • P("today" | Spam) = 0.05 (5% of spam emails contain “today”)

  • P("Big" | Not Spam) = 0.01 (1% of non-spam emails contain “Big”)
  • P("sale" | Not Spam) = 0.01 (1% of non-spam emails contain “sale”)
  • P("today" | Not Spam) = 0.1 (10% of non-spam emails contain “today”)

3. Calculate the Posterior “Scores”:

  • Spam Score: P(Spam) * P("Big"|Spam) * P("sale"|Spam) * P("today"|Spam) = 0.25 * 0.1 * 0.2 * 0.05 = 0.00025

  • Not Spam Score: P(Not Spam) * P("Big"|Not Spam) * P("sale"|Not Spam) * P("today"|Not Spam) = 0.75 * 0.01 * 0.01 * 0.1 = 0.0000075

4. Compare and Classify (a short code recap follows this step):

  • The Spam Score (0.00025) is much higher than the Not Spam Score (0.0000075).
  • The algorithm classifies the email as Spam.
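
Putting the four steps together, here is a small Python recap using the exact numbers from the example above:

# Which class gives "Big sale today" the higher score?
priors = {"spam": 0.25, "not_spam": 0.75}

likelihoods = {
    "spam":     {"big": 0.10, "sale": 0.20, "today": 0.05},
    "not_spam": {"big": 0.01, "sale": 0.01, "today": 0.10},
}

email_words = ["big", "sale", "today"]

scores = {}
for cls in priors:
    score = priors[cls]
    for word in email_words:
        score *= likelihoods[cls][word]
    scores[cls] = score

print(scores)                       # ≈ {'spam': 0.00025, 'not_spam': 7.5e-06}
print(max(scores, key=scores.get))  # -> 'spam'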

Note: A problem arises if a word in the new email never appeared in the training data for a certain class (e.g., P("winner" | Not Spam) = 0). This would make the entire score zero. To solve this, we use a smoothing technique called Laplace (or Additive) Smoothing, which adds a small value to the count of each word.
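
Here is a minimal sketch of Laplace smoothing for the multinomial (word-count) case, with made-up counts and an assumed vocabulary size; adding alpha to every count guarantees that an unseen word never produces an exact zero:

# Laplace (additive) smoothing: (count + alpha) / (total + alpha * vocab_size)
word_counts_in_spam = {"big": 30, "sale": 45, "winner": 0}  # "winner" never seen in spam
total_words_in_spam = 10_000  # total token count over all spam training emails (made up)
vocab_size = 5_000            # assumed vocabulary size
alpha = 1.0                   # alpha = 1 is classic Laplace smoothing

def smoothed_likelihood(word):
    count = word_counts_in_spam.get(word, 0)
    return (count + alpha) / (total_words_in_spam + alpha * vocab_size)

for word in ["big", "sale", "winner"]:
    print(word, smoothed_likelihood(word))  # "winner" gets a small but non-zero probability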

4. Types of Naive Bayes

There are three main types of Naive Bayes classifiers, each suited to a different kind of data (a short Scikit-Learn sketch follows this list):

  1. Gaussian Naive Bayes: Used for features that are continuous and follow a Gaussian (normal) distribution. It models each feature’s distribution by calculating its mean and standard deviation for each class.
  2. Multinomial Naive Bayes: Used for discrete counts. It’s the most common type for text classification, where the features are the frequency of each word in a document (a “bag-of-words” model).
  3. Bernoulli Naive Bayes: Used for binary features (e.g., a word either appears in a document or it doesn’t).
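
As a rough sketch of how the three variants map to Scikit-Learn (the class names are real; the data below is random and purely illustrative):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=100)               # binary labels

X_continuous = rng.normal(size=(100, 4))       # continuous features -> GaussianNB
X_counts = rng.integers(0, 10, size=(100, 4))  # count features      -> MultinomialNB
X_binary = rng.integers(0, 2, size=(100, 4))   # binary features     -> BernoulliNB

print(GaussianNB().fit(X_continuous, y).score(X_continuous, y))
print(MultinomialNB().fit(X_counts, y).score(X_counts, y))
print(BernoulliNB().fit(X_binary, y).score(X_binary, y))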

5. Practical Example with Scikit-Learn

Let’s build a simple text classifier using Scikit-Learn’s MultinomialNB. We’ll use a CountVectorizer to convert our text documents into a matrix of token counts.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Sample data: simple reviews and their sentiment (1=positive, 0=negative)
reviews = [
    "this was an amazing movie",
    "I really liked this product",
    "what a great experience",
    "this is my favorite restaurant",
    "I did not like this at all",
    "this was a terrible waste of time",
    "I am very disappointed",
    "the product was broken and bad"
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# 1. Split data
X_train, X_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.25, random_state=42)

# 2. Vectorize the text data
# This converts the text into a matrix of word counts
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# 3. Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train_counts, y_train)

# 4. Make predictions
y_pred = model.predict(X_test_counts)

# 5. Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Let's test it with a new sentence
new_sentence = ["I loved the experience, it was great"]
new_sentence_counts = vectorizer.transform(new_sentence)
prediction = model.predict(new_sentence_counts)
print(f"Prediction for new sentence: {'Positive' if prediction[0] == 1 else 'Negative'}")

Conclusion

Naive Bayes is a testament to the power of simplicity. Despite its “naive” assumption of feature independence—which is rarely true—it remains a highly effective and efficient algorithm, especially for text-based problems.

  • It’s built on the foundation of Bayes’ Theorem, allowing it to calculate the probability of a class given a set of features.
  • Its “naive” assumption makes it computationally fast and requires less training data than more complex models.
  • It’s a fantastic baseline model for any text classification task. If a more complex model like a transformer can’t beat Naive Bayes, you might be over-engineering your solution!

Suggested Reading

  • “Data Science from Scratch” by Joel Grus: Provides a great first-principles implementation of Naive Bayes.
  • Scikit-Learn Documentation: The user guide on Naive Bayes is a practical resource for implementation details.
  • “Speech and Language Processing” by Dan Jurafsky and James H. Martin: For a deep, academic dive into Naive Bayes for natural language processing.