Naïve Bayes Algorithm (2024)

Exploring Naive Bayes: Mathematics, How it works, Pros & Cons, and Applications

Bassant Gamal

What is Naïve Bayes Algorithm?

Naive Bayes is a classification technique based on Bayes’ Theorem, with the assumption that all the features predicting the target value are independent of each other. It calculates the probability of each class and then picks the one with the highest probability. It has been used successfully for many purposes, but it works particularly well with natural language processing (NLP) problems.

Bayes’ Theorem describes the probability of an event based on prior knowledge of conditions that might be related to that event.

    P(A | B) = P(B | A) · P(A) / P(B)

What makes Naive Bayes a “Naive” algorithm?

The Naive Bayes classifier assumes that the features we use to predict the target are independent and do not affect each other. In real-life data, features usually do depend on each other in determining the target, but the Naive Bayes classifier ignores this.

Though the independence assumption is never exactly correct in real-world data, it often works well in practice. That is why the algorithm is called “Naive”.

Math behind the Naive Bayes Algorithm

Given a feature vector X = (x1, x2, …, xn) and a class variable y, Bayes’ Theorem states that:

    P(y | X) = P(X | y) · P(y) / P(X)

We’re interested in calculating the posterior probability P(y | X) from the likelihood P(X | y), the prior P(y), and the evidence P(X).

Using the chain rule, the likelihood P(X ∣ y) can be decomposed as:

    P(x1, x2, …, xn | y) = P(x1 | x2, …, xn, y) · P(x2 | x3, …, xn, y) · … · P(xn | y)

But because of Naive Bayes’ conditional independence assumption, each conditional probability depends only on the class:

    P(xi | x(i+1), …, xn, y) = P(xi | y)

Thus, by conditional independence, we have:

    P(y | x1, …, xn) = P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y) / P(x1, …, xn)

And since the denominator remains constant for all class values, the posterior probability can be written as:

    P(y | x1, …, xn) ∝ P(y) · ∏i P(xi | y)

The Naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that’s most probable; this is known as the maximum a posteriori or MAP decision rule.

    ŷ = argmax over y of  P(y) · ∏i P(xi | y)
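The MAP decision rule can be sketched in a few lines of Python. The class names, priors, and likelihood values below are hypothetical, purely to exercise the rule; summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow when many small factors are involved.

```python
import math

def map_classify(features, priors, likelihoods):
    """Pick the class y maximizing log P(y) + sum_i log P(x_i | y).

    priors: {class: P(y)}; likelihoods: {class: {feature: P(x | y)}}.
    """
    best_class, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)
        for x in features:
            score += math.log(likelihoods[y][x])
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Hypothetical toy numbers, just to exercise the decision rule:
priors = {"spam": 0.4, "not_spam": 0.6}
likelihoods = {
    "spam": {"money": 0.5, "friend": 0.1},
    "not_spam": {"money": 0.05, "friend": 0.4},
}
print(map_classify(["money", "money"], priors, likelihoods))  # prints "spam"
```

Working in log space also turns the product in the MAP rule into a sum, which is how most library implementations compute it.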

How Naive Bayes really works

Let’s explain it using an example to make things clear:

Assume we have a bunch of emails that we want to classify as spam or not spam.

Our dataset has 15 Not Spam emails and 10 Spam emails. Some analysis has been done, and the frequency of each word has been recorded as shown below:

[Table: per-word frequency counts for each class. For example, “Dear” appears 8 times across the Not Spam emails and 3 times across the Spam emails, out of 34 Not Spam words and 47 Spam words in total.]

Note: stop words like “the”, “a”, “on”, “is”, “all” have been removed, as they do not carry important meaning and are usually removed from texts. The same applies to numbers and punctuation.

Exploring some probabilities:

  • P(Dear|Not Spam) = 8/34
  • P(Visit|Not Spam) = 2/34
  • P(Dear|Spam) = 3/47
  • P(Visit|Spam) = 6/47

and so on.
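These conditional probabilities are just per-class word counts divided by per-class word totals. A small sketch, using only the counts the article gives (“Dear” and “Visit”) plus the class totals of 34 and 47 words implied by the denominators:

```python
from fractions import Fraction

# Partial word-count table from the article; the full table has more words.
counts = {
    "Not Spam": {"Dear": 8, "Visit": 2},
    "Spam":     {"Dear": 3, "Visit": 6},
}
totals = {"Not Spam": 34, "Spam": 47}  # total words per class

def p_word_given_class(word, cls):
    """Maximum-likelihood estimate P(word | class) as an exact fraction."""
    return Fraction(counts[cls].get(word, 0), totals[cls])

print(p_word_given_class("Dear", "Not Spam"))  # 8/34, reduced to 4/17
print(p_word_given_class("Visit", "Spam"))     # 6/47
```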

Now assume we have the message “Hello friend” and we want to know whether it is spam or not.

So, using Bayes’ Theorem:

    P(Not Spam | Hello friend) = P(Hello friend | Not Spam) · P(Not Spam) / P(Hello friend)

Ignoring the denominator:

    P(Not Spam | Hello friend) ∝ P(Hello friend | Not Spam) · P(Not Spam)

But P(Hello friend | Not Spam) = 0, because the exact phrase “Hello friend” doesn’t exist in our dataset: we recorded single words, not whole sentences. For the same reason, P(Hello friend | Spam) is zero as well, which in turn makes the probabilities of both spam and not spam zero, and that tells us nothing!

But wait! We said that Naive Bayes assumes that `the features we use to predict the target are independent`.

so,

    P(Not Spam | Hello friend) ∝ P(Hello | Not Spam) · P(friend | Not Spam) · P(Not Spam)

Now let’s calculate the probability of being spam using the same procedure:

    P(Spam | Hello friend) ∝ P(Hello | Spam) · P(friend | Spam) · P(Spam)

So, the message “Hello friend” is not spam.
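Numerically, the comparison looks like the sketch below. The article does not give the counts for “Hello” and “friend”, so the per-word probabilities here are hypothetical; only the priors (15 of 25 emails Not Spam, 10 of 25 Spam) and the class word totals (34 and 47) come from the text.

```python
from fractions import Fraction

# prior * P(Hello | class) * P(friend | class); the word counts 5, 7, 2, 1
# are made up for illustration.
p_not_spam = Fraction(15, 25) * Fraction(5, 34) * Fraction(7, 34)
p_spam     = Fraction(10, 25) * Fraction(2, 47) * Fraction(1, 47)

print(p_not_spam > p_spam)  # True: "Hello friend" is classified as Not Spam
```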

Now, let’s try another example:

Assume the message “dear visit dinner money money money”. It’s obviously spam, but let’s see what Naive Bayes says.

    P(Spam | dear visit dinner money money money) ∝ P(dear | Spam) · P(visit | Spam) · P(dinner | Spam) · P(money | Spam)³ · P(Spam) = 0

    P(Not Spam | dear visit dinner money money money) ∝ P(dear | Not Spam) · P(visit | Not Spam) · P(dinner | Not Spam) · P(money | Not Spam)³ · P(Not Spam) > 0

Oops! Naive Bayes says that this message is not spam?!

This happened because the word “dinner” does not appear in the spam emails, so P(dinner | Spam) = 0, and this single zero factor wipes out the effect of all the other probabilities.

This is called the Zero-Frequency Problem.

To solve this, we can use Laplace smoothing.

Laplace smoothing is a technique for smoothing categorical data. A small-sample correction, or pseudo-count, is incorporated into every probability estimate, so no probability is ever zero. This is a way of regularizing Naive Bayes.

Given an observation x = (x1, …, xd) from a multinomial distribution with N trials and parameter vector θ = (θ1, …, θd), a “smoothed” version of the data gives the estimator:

    θ̂i = (xi + α) / (N + α·d),   for i = 1, …, d

where the pseudo-count α > 0 is the smoothing parameter (α = 0 corresponds to no smoothing).
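The smoothed estimate is easy to compute directly. A minimal sketch with α = 1, using the article’s class totals (47 Spam words, 34 Not Spam words) and d = 10 unique words:

```python
from fractions import Fraction

def smoothed(count, total, d, alpha=1):
    """Additive-smoothing estimate (count + alpha) / (total + alpha * d)."""
    return Fraction(count + alpha, total + alpha * d)

# 47 total Spam words, 34 total Not Spam words, d = 10 unique words:
print(smoothed(0, 47, 10))  # P(dinner | Spam): 0 becomes 1/57
print(smoothed(8, 34, 10))  # P(Dear | Not Spam): 8/34 becomes 9/44
```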

Read more about Additive Smoothing here.

Back to our problem: we pick α = 1, and d is the number of unique words in the dataset, which is 10 in our case.

    P(dear | Spam)   = (3 + 1) / (47 + 10) = 4/57
    P(visit | Spam)  = (6 + 1) / (47 + 10) = 7/57
    P(dinner | Spam) = (0 + 1) / (47 + 10) = 1/57

and similarly for the Not Spam estimates, with denominator 34 + 10 = 44. With no zero factor left, the two posteriors can be compared meaningfully.

Now the message is correctly classified as spam.

Types of Naive Bayes Classifiers

  • Multinomial: feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution, for example how often each word occurs in the document. This is the event model typically used for document classification.
  • Bernoulli: like the multinomial model, this model is popular for document classification tasks, but it uses binary term-occurrence features (whether a word occurs in a document or not) rather than term frequencies (how often a word occurs in the document).
  • Gaussian: used for classification with continuous features; it assumes that the features follow a normal distribution.
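For the Gaussian variant, each feature’s likelihood P(x | y) is a normal density evaluated with that class’s estimated mean and variance. A minimal sketch (the mean and variance values below are hypothetical):

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density used as the per-feature likelihood P(x | y)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Likelihood of a continuous feature value 4.9 under a class whose
# training data had mean 5.0 and variance 0.04 (hypothetical numbers):
print(gaussian_pdf(4.9, 5.0, 0.04))
```

A Gaussian Naive Bayes classifier plugs these densities into the same MAP product used above, one factor per feature.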

Pros and Cons for Naive Bayes

Pros:

  • Requires a small amount of training data, so training takes less time.
  • Handles continuous and discrete data, and it is not sensitive to irrelevant features.
  • Very simple, fast, and easy to implement.
  • Can be used for both binary and multi-class classification problems.
  • Highly scalable: its cost grows linearly with the number of predictor features and data points.
  • When the Naive Bayes conditional independence assumption holds true, it will converge quicker than discriminative models like logistic regression.

Cons:

  • The assumption of independent predictors/features. Naive Bayes implicitly assumes that all the attributes are mutually independent, which is almost never true in real-world data.
  • If a categorical variable has a value that appears in the test dataset but was not observed in the training dataset, the model assigns it zero probability and cannot make a prediction. This is the “Zero-Frequency problem”, and it can be solved using smoothing techniques.

Applications of Naive Bayes Algorithm

  • Real-time Prediction.
  • Multi-class Prediction.
  • Text classification/ Spam Filtering/ Sentiment Analysis.
  • Recommendation Systems.
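Putting the whole article together (word counts, Laplace smoothing, log-probability MAP scoring), a toy spam filter fits in a few dozen lines. The training emails below are hypothetical:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label). Returns priors, counts, totals, vocab."""
    priors, word_counts, totals = Counter(), {}, Counter()
    for tokens, label in docs:
        priors[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        totals[label] += len(tokens)
    vocab = {w for c in word_counts.values() for w in c}
    return priors, word_counts, totals, vocab

def classify(tokens, priors, word_counts, totals, vocab, alpha=1):
    """MAP decision with Laplace smoothing, scored in log space."""
    n = sum(priors.values())
    def score(label):
        s = math.log(priors[label] / n)
        for w in tokens:
            s += math.log((word_counts[label][w] + alpha)
                          / (totals[label] + alpha * len(vocab)))
        return s
    return max(priors, key=score)

docs = [(["money", "money", "visit"], "spam"),
        (["dear", "money", "visit"], "spam"),
        (["dear", "friend", "lunch"], "ham"),
        (["friend", "lunch", "dear"], "ham")]
model = train(docs)
print(classify(["money", "visit"], *model))  # prints "spam"
```

Thanks to the smoothing, even a message containing words never seen in one class (like “dinner” earlier) still gets a nonzero score for every class.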

