Building a Spam Filter with Naive Bayes


We are going to build a spam filter using the Naive Bayes algorithm and a data set from the UCI Machine Learning Repository. The goal is to build a practical SMS spam filter.

Exploring the Dataset


From the result above, we have a total of 5,572 rows with a label column and a content column. About 86.6% of the messages are ham and the other 13.4% are spam.
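
The loading step that produced these numbers is not shown here, so the following is only a minimal sketch in plain Python. It assumes the data is a tab-separated file with one label and one message per line; the helper names `load_messages` and `label_shares` and the four sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def load_messages(handle):
    """Read (label, content) pairs from a tab-separated file-like object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) == 2]


def label_shares(messages):
    """Return the fraction of messages carrying each label."""
    counts = Counter(label for label, _ in messages)
    total = len(messages)
    return {label: count / total for label, count in counts.items()}


# Tiny invented sample standing in for the real file (an assumption).
sample = io.StringIO(
    "ham\tSee you at lunch?\n"
    "spam\tWIN a FREE prize now!!\n"
    "ham\tOk, call me later\n"
    "ham\tOn my way\n"
)
messages = load_messages(sample)
shares = label_shares(messages)
```

With the real file, `load_messages` would be given an opened file handle instead of the `io.StringIO` sample, and `shares` would hold the ham and spam proportions reported above.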

Training and Test Set


We are going to keep 80% of the data set for training and the remaining 20% for testing. The goal is to create a spam filter that classifies new messages with an accuracy greater than 80%; the 20% test set will be used to measure that.


With the randomized sample, both data sets have a very similar ratio of ham to spam.
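
A randomized 80/20 split can be sketched as follows. This is an illustration under assumptions, not the code that produced the results above: the function names, the fixed seed, and the synthetic messages are all hypothetical.

```python
import random


def train_test_split(messages, train_fraction=0.8, seed=1):
    """Shuffle a copy of the messages and cut it into training and test parts."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def spam_share(messages):
    """Fraction of (label, content) pairs labelled 'spam'."""
    return sum(1 for label, _ in messages if label == "spam") / len(messages)


# Synthetic stand-in data: 20 spam and 80 ham messages (an assumption).
messages = [("spam", f"offer {i}") for i in range(20)]
messages += [("ham", f"note {i}") for i in range(80)]

train, test = train_test_split(messages)
```

Comparing `spam_share(train)` with `spam_share(test)` is one way to check that the shuffle left both parts with a similar class balance.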

Letter Case and Punctuation


We mentioned that we are going to use the Naive Bayes algorithm to classify whether a message is spam or ham. To do that, we need to count the number of occurrences of each word, so we first need to clean the data a bit by removing punctuation and converting every letter to lowercase.
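
One way to do this cleaning, sketched with the standard `re` module: every non-word character is replaced by a space and the result is lowercased. The helper names `clean_message` and `tokenize` are hypothetical, and the exact cleaning rule used for the results in this write-up may differ.

```python
import re


def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r"\W", " ", text).lower()


def tokenize(text):
    """Split a cleaned message into a list of words."""
    return clean_message(text).split()


tokens = tokenize("WIN a FREE prize, now!!")
# tokens == ['win', 'a', 'free', 'prize', 'now']
```

Note that `\W` keeps digits and underscores, so a token such as `2nd` survives cleaning unchanged.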

Creating the Vocabulary


To apply Naive Bayes, we need a vocabulary, that is, the set of unique words in the training messages, so that we can count the occurrences of each word.
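
As a rough sketch, the vocabulary and the per-message word counts can be built like this. It assumes the messages are already cleaned and split into word lists; `build_vocabulary`, `count_words`, and the two sample messages are made up for illustration.

```python
from collections import Counter


def build_vocabulary(tokenized_messages):
    """Collect every unique word across all messages, in sorted order."""
    vocabulary = set()
    for tokens in tokenized_messages:
        vocabulary.update(tokens)
    return sorted(vocabulary)


def count_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return {word: counts[word] for word in vocabulary}


# Two invented, already-cleaned training messages (an assumption).
training_tokens = [["free", "prize", "free"], ["see", "you", "later"]]
vocabulary = build_vocabulary(training_tokens)
first_row = count_words(training_tokens[0], vocabulary)
```

Calling `count_words` once per message gives one row of word counts per message, which is the shape a final training set needs.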

The Final Training Set


Calculating Constants First


The Naive Bayes algorithm will need to know the probability values of the two equations below to classify new messages:

@@0@@

Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, recall that we need to use these equations:

@@1@@

The values we need to find out are:

  • @@2@@
  • @@3@@
  • @@4@@

It is also worth mentioning that we are going to use Laplace smoothing and set @@5@@.
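
The constants and the per-word probabilities can be computed together, as in the sketch below. It assumes the usual multinomial Naive Bayes estimate with additive smoothing, P(w|label) = (N_w|label + alpha) / (N_label + alpha * N_vocabulary), and a smoothing value of `alpha = 1`; both are assumptions made for this example, as are the function name `fit_naive_bayes` and the toy training set.

```python
from collections import Counter

ALPHA = 1  # assumed Laplace smoothing value


def fit_naive_bayes(training_set, alpha=ALPHA):
    """Estimate priors and smoothed word probabilities.

    training_set is a list of (label, tokens) pairs, label being 'spam' or 'ham'.
    """
    vocabulary = sorted({word for _, tokens in training_set for word in tokens})
    n_vocabulary = len(vocabulary)
    n_messages = len(training_set)

    priors = {}
    word_probs = {}
    for label in ("spam", "ham"):
        docs = [tokens for doc_label, tokens in training_set if doc_label == label]
        priors[label] = len(docs) / n_messages  # P(Spam) or P(Ham)
        word_counts = Counter(word for tokens in docs for word in tokens)
        n_label = sum(word_counts.values())  # total words in this class
        word_probs[label] = {
            word: (word_counts[word] + alpha) / (n_label + alpha * n_vocabulary)
            for word in vocabulary
        }
    return priors, word_probs


# Invented toy training set (an assumption).
training_set = [
    ("spam", ["free", "prize", "free"]),
    ("ham", ["see", "you", "later"]),
    ("ham", ["free", "later"]),
]
priors, word_probs = fit_naive_bayes(training_set)
```

Because of the smoothing term, a word that never occurs in one class still gets a small non-zero probability there, so a single unseen word cannot force a whole product to zero.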

Calculating Parameters


Classifying a New Message
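
A hedged sketch of the classification step: for each class, multiply the class prior by the probability of every word in the message, then pick the class with the larger product. Skipping words that are absent from the vocabulary and returning `"undecided"` on a tie are design choices assumed for this example, and the probability tables are invented.

```python
def classify(tokens, priors, word_probs):
    """Return 'spam', 'ham', or 'undecided' for a tokenized message."""
    scores = {}
    for label in priors:
        score = priors[label]
        for word in tokens:
            # Words outside the vocabulary carry no information and are skipped.
            if word in word_probs[label]:
                score *= word_probs[label][word]
        scores[label] = score

    if scores["spam"] > scores["ham"]:
        return "spam"
    if scores["ham"] > scores["spam"]:
        return "ham"
    return "undecided"


# Invented priors and word probabilities (assumptions for illustration).
priors = {"spam": 0.25, "ham": 0.75}
word_probs = {
    "spam": {"free": 0.5, "prize": 0.4, "later": 0.1},
    "ham": {"free": 0.1, "prize": 0.1, "later": 0.8},
}

first = classify(["free", "prize"], priors, word_probs)
second = classify(["later"], priors, word_probs)
```

For long messages the product of many small probabilities can underflow; summing logarithms instead of multiplying is a common alternative, though short SMS texts rarely need it.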


Measuring the Spam Filter's Accuracy
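
Accuracy is the number of correctly classified test messages divided by the total number of test messages. The sketch below shows that calculation with a deliberately naive stand-in classifier; `accuracy`, `toy_classifier`, and the four test messages are hypothetical and unrelated to the figures reported in this write-up.

```python
def accuracy(classify_fn, labelled_messages):
    """Share of (label, text) pairs for which classify_fn(text) equals label."""
    correct = sum(
        1 for label, text in labelled_messages if classify_fn(text) == label
    )
    return correct / len(labelled_messages)


def toy_classifier(text):
    """Stand-in classifier: flags any message containing the word 'free'."""
    return "spam" if "free" in text.lower() else "ham"


# Invented test set (an assumption).
test_set = [
    ("spam", "FREE prize inside"),
    ("ham", "see you later"),
    ("ham", "free for lunch?"),
    ("ham", "ok"),
]
score = accuracy(toy_classifier, test_set)
```

Here the stand-in classifier gets three of the four messages right, so `score` is 0.75; the third message is the kind of ham it wrongly flags.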


Conclusion


From the result above (about 96.8% accuracy), the Naive Bayes algorithm is very accurate on the test set, well above the 80% goal, despite the relatively simple steps involved.