Building a Spam Filter with Naive Bayes


We are going to build a spam filter using the Naive Bayes algorithm and a dataset from the UCI Machine Learning Repository. The goal is to build a practical SMS spam filter.

Exploring the Dataset


The dataset has a total of 5,572 rows with two columns, label and content. About 86.6% of the messages are ham and the other 13.4% are spam.
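A minimal sketch of how the exploration step could look with pandas. It assumes the data is a tab-separated file with no header row; the helper name `load_sms` and the four sample messages are made up for illustration, so the printed numbers will not match the real dataset.

```python
import io

import pandas as pd


def load_sms(source):
    """Read tab-separated (label, content) rows that have no header line."""
    return pd.read_csv(source, sep="\t", header=None, names=["label", "content"])


# Tiny made-up sample standing in for the real file (assumption).
sample = io.StringIO(
    "ham\tOk, see you at 5\n"
    "spam\tWINNER!! Claim your free prize now\n"
    "ham\tCan you call me later?\n"
    "ham\tLunch tomorrow?\n"
)

sms = load_sms(sample)
print(sms.shape)

# Share of each label, in percent.
label_share = sms["label"].value_counts(normalize=True) * 100
print(label_share)
```

With the real file, a path can be passed to `load_sms` in place of the in-memory sample.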

Training and Test Set


We are going to keep 80% of the data set for training and 20% for testing. The goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. The held-out 20% of the data will be used to measure that accuracy.


With the randomized sample, both data sets end up with a very similar ratio of ham to spam.
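One way the 80/20 split could be done is to shuffle the rows and cut at the 80% mark. This is a sketch under assumptions: the function name `split_train_test`, the seed value, and the ten toy rows are illustrative and not taken from the original notebook.

```python
import pandas as pd


def split_train_test(df, train_frac=0.8, seed=1):
    """Shuffle all rows, then cut the shuffled frame into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = round(len(shuffled) * train_frac)
    train = shuffled.iloc[:cut].reset_index(drop=True)
    test = shuffled.iloc[cut:].reset_index(drop=True)
    return train, test


# Toy data (assumption): 8 ham and 2 spam messages.
sms = pd.DataFrame({
    "label": ["ham"] * 8 + ["spam"] * 2,
    "content": [f"message {i}" for i in range(10)],
})

train, test = split_train_test(sms)
print(len(train), len(test))

# Compare the label ratio in each part with the ratio in the full data.
print(train["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```

Fixing the seed keeps the split reproducible, so the ratios can be checked again on a later run.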

Letter Case and Punctuation


We mentioned that we are going to use the Naive Bayes algorithm to classify whether a message is spam or ham. To do that, we need to count how often each word occurs, which means we first need to clean the data a bit: remove punctuation and convert every letter to lowercase.
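The cleaning step can be sketched as follows, assuming that "punctuation" means every non-word character and that replacing it with a space is acceptable. The helper name `clean_messages` and the sample messages are illustrative.

```python
import pandas as pd


def clean_messages(messages):
    """Replace non-word characters with spaces, then lower-case the text."""
    return messages.str.replace(r"\W", " ", regex=True).str.lower()


raw = pd.Series(["WINNER!! Claim your prize.", "Ok, see you at 5"])
cleaned = clean_messages(raw)

# Splitting on whitespace turns each cleaned message into a list of words.
tokens = cleaned.str.split()
print(tokens.tolist())
```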

Creating the Vocabulary


To apply Naive Bayes, we need a vocabulary, that is, the set of unique words in the training messages, so that we can count the occurrences of each word.
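A small sketch of building the vocabulary from messages that have already been cleaned and split into words. The function name `build_vocabulary` and the two sample messages are assumptions for illustration.

```python
def build_vocabulary(tokenized_messages):
    """Collect every unique word that appears in the tokenized messages."""
    vocabulary = set()
    for words in tokenized_messages:
        vocabulary.update(words)
    # Sorting is optional; it only makes the result stable between runs.
    return sorted(vocabulary)


messages = ["free prize now", "see you now"]
tokenized = [message.split() for message in messages]

vocabulary = build_vocabulary(tokenized)
print(vocabulary)
```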

The Final Training Set
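One plausible shape for the final training set is a table with the label plus one column per vocabulary word, where each cell holds the number of times that word occurs in the message. The sketch below assumes that layout; `count_words` and the toy messages are hypothetical names and data.

```python
import pandas as pd


def count_words(tokenized_messages, vocabulary):
    """Build one count column per vocabulary word, one row per message."""
    counts = {word: [0] * len(tokenized_messages) for word in vocabulary}
    for row, words in enumerate(tokenized_messages):
        for word in words:
            counts[word][row] += 1
    return pd.DataFrame(counts)


labels = ["spam", "ham"]
tokenized = [["free", "prize", "free"], ["see", "you"]]
vocabulary = sorted({word for words in tokenized for word in words})

word_counts = count_words(tokenized, vocabulary)
training_set = pd.concat([pd.DataFrame({"label": labels}), word_counts], axis=1)
print(training_set)
```

Note that a vocabulary word that happens to be "label" would clash with the label column in this layout, so a real implementation may want a less common column name.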


Calculating Constants First


To classify new messages, the Naive Bayes algorithm needs the probability values given by the two equations below:
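For a message made of the words w_1, w_2, ..., w_n, the textbook form of the two Naive Bayes scoring rules is shown here. The notation is an assumption chosen to match the P(wi|Spam) and P(wi|Ham) terms used in the text.

```latex
P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
```

The message is labeled with whichever class gives the larger value.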


Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, recall that we need to use these equations:
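With additive smoothing, the per-word probabilities are commonly written as below. The symbol names are assumptions: N_{w_i|Spam} is the number of times the word w_i occurs in spam messages, N_Spam is the total number of words in all spam messages (likewise for Ham), N_Vocabulary is the number of unique words, and α is the smoothing parameter.

```latex
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}

P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```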


The values we need to find out are:

  • @@2@@
  • @@3@@
  • @@4@@

It is also worth mentioning that we are going to use Laplace smoothing and set @@5@@.
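The quantities that stay the same for every new message can be computed once up front. The sketch below assumes these are the class priors, the total word count per class, and the vocabulary size, and it assumes a smoothing value of 1, the conventional choice for Laplace smoothing. All variable names and the toy data are illustrative.

```python
import pandas as pd

# Toy training data (assumption): each message is already a list of words.
training_set = pd.DataFrame({
    "label": ["spam", "ham", "ham", "ham"],
    "words": [
        ["free", "prize", "free"],
        ["see", "you"],
        ["call", "you"],
        ["see", "you", "soon"],
    ],
})
vocabulary = sorted({word for words in training_set["words"] for word in words})

spam = training_set[training_set["label"] == "spam"]
ham = training_set[training_set["label"] == "ham"]

# Class priors: the share of training messages in each class.
p_spam = len(spam) / len(training_set)
p_ham = len(ham) / len(training_set)

# Total number of words (not unique words) in each class.
n_spam = spam["words"].apply(len).sum()
n_ham = ham["words"].apply(len).sum()

# Number of unique words across the whole training set.
n_vocabulary = len(vocabulary)

# Smoothing parameter (assumed value).
alpha = 1

print(p_spam, p_ham, n_spam, n_ham, n_vocabulary)
```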

Calculating Parameters
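The parameters are the smoothed probabilities P(wi|Spam) and P(wi|Ham) for every word in the vocabulary. A minimal sketch, assuming additive smoothing with a value of 1 and messages that are already tokenized; `word_probabilities` and the sample messages are hypothetical.

```python
from collections import Counter


def word_probabilities(tokenized_messages, vocabulary, alpha=1):
    """Return the smoothed P(word | class) for each vocabulary word."""
    counts = Counter(word for words in tokenized_messages for word in words)
    total_words = sum(counts.values())
    denominator = total_words + alpha * len(vocabulary)
    # Counter returns 0 for words never seen in this class.
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


spam_messages = [["free", "prize", "free"]]
ham_messages = [["see", "you"], ["call", "you"]]
vocabulary = sorted({w for m in spam_messages + ham_messages for w in m})

p_word_given_spam = word_probabilities(spam_messages, vocabulary)
p_word_given_ham = word_probabilities(ham_messages, vocabulary)

print(p_word_given_spam["free"])
print(p_word_given_ham["free"])
```

Because of the smoothing, a word that never appears in one class still gets a small non-zero probability there, so a single unseen word cannot force the whole product to zero.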


Classifying a New Message
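Classification multiplies the class prior by the probability of each word in the message and picks the class with the larger product. The sketch below is one way to write it, with several assumptions: words outside the vocabulary are skipped, a tie is reported separately, and the probability tables are hard-coded toy values. The function name `classify` is illustrative.

```python
import re


def classify(message, p_spam, p_ham, p_word_given_spam, p_word_given_ham):
    """Label a message as spam or ham by comparing the two Naive Bayes scores."""
    # Same cleaning as for the training data: drop punctuation, lower-case.
    words = re.sub(r"\W", " ", message).lower().split()

    score_spam = p_spam
    score_ham = p_ham
    for word in words:
        # Words that are not in the vocabulary are ignored (assumption).
        if word in p_word_given_spam:
            score_spam *= p_word_given_spam[word]
            score_ham *= p_word_given_ham[word]

    if score_spam > score_ham:
        return "spam"
    if score_ham > score_spam:
        return "ham"
    return "needs human classification"


# Toy parameters (assumption); both tables cover the same vocabulary.
p_word_given_spam = {"call": 0.125, "free": 0.375, "prize": 0.25, "see": 0.125, "you": 0.125}
p_word_given_ham = {"call": 2 / 9, "free": 1 / 9, "prize": 1 / 9, "see": 2 / 9, "you": 3 / 9}

print(classify("FREE prize!!", 0.25, 0.75, p_word_given_spam, p_word_given_ham))
print(classify("See you", 0.25, 0.75, p_word_given_spam, p_word_given_ham))
```

Short SMS messages keep these products in a safe numeric range; for longer texts, summing logarithms of the probabilities is a common alternative that avoids underflow.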


Measuring the Spam Filter's Accuracy




From the result above (about 96.8% accuracy), the Naive Bayes algorithm is highly accurate on the test set, well above our 80% goal, even though the steps involved are relatively simple.
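Accuracy here means the share of test messages whose predicted label matches the actual label. A minimal sketch with made-up labels; the function name `accuracy` and the five sample predictions are illustrative and unrelated to the 96.8% figure.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels (assumption): one of the five predictions is wrong.
actual = ["ham", "spam", "ham", "ham", "spam"]
predicted = ["ham", "spam", "ham", "spam", "spam"]

score = accuracy(predicted, actual)
print(f"{score:.1%}")
```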