From the result above, we have a total of 5,572 rows with a label column and a content column. About 86.6% of the messages are ham and the other 13.4% are spam.
We are going to keep 80% of the data set for training and 20% for testing. The goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, and the 20% test set will be used to measure that.
With the randomized sample, the ratio of ham to spam is quite similar in both data sets.
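The split and the ratio check can be sketched as follows. This is a minimal illustration on toy data using only the standard library, not the notebook's original code; the helper names `train_test_split` and `label_share`, the seed, and the toy messages are all assumptions made for the example.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def label_share(rows, label):
    """Fraction of rows that carry the given label."""
    return sum(1 for lab, _ in rows if lab == label) / len(rows)

# Toy stand-in for the real data: (label, content) pairs, 13% spam.
messages = [("spam", f"win prize {i}") for i in range(13)] + \
           [("ham", f"see you at {i}") for i in range(87)]

train, test = train_test_split(messages)
print(len(train), len(test))
print(label_share(train, "spam"), label_share(test, "spam"))
```

On a data set as small as this toy one the two shares can differ noticeably; with thousands of rows a random shuffle tends to keep them close.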
As mentioned, we are going to use the Naive Bayes algorithm to classify each message as spam or ham. To do that, we need to count how often each word occurs in each message, which means cleaning the data a bit first.
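One common way to clean the text, shown here as an assumption rather than the exact steps used in this analysis, is to replace punctuation with spaces, lowercase everything, and split on whitespace. The function name `clean_message` is hypothetical.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    return re.sub(r"\W", " ", text).lower().split()

print(clean_message("WINNER!! Claim your FREE prize now."))
# ['winner', 'claim', 'your', 'free', 'prize', 'now']
```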
To apply Naive Bayes, we need a vocabulary, that is, the set of unique words in the training messages, so that we can count the occurrences of each word.
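A minimal sketch of building the vocabulary and the per-message word counts, assuming the messages have already been cleaned into lists of words. The toy messages and variable names are illustrative only.

```python
from collections import Counter

# Toy training messages, already cleaned and split into words.
train_words = [
    ["free", "prize", "now"],
    ["see", "you", "now"],
    ["free", "free", "entry"],
]

# Vocabulary: every unique word seen in the training messages.
vocabulary = sorted({word for message in train_words for word in message})

# One Counter per message; a missing word simply counts as 0.
word_counts_per_message = [Counter(message) for message in train_words]

print(vocabulary)
print(word_counts_per_message[2]["free"])
```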
The Naive Bayes algorithm needs the probability values of the two equations below to classify new messages:
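In the standard multinomial Naive Bayes formulation, which is assumed here, the two quantities for a message made of the words $w_1, \dots, w_n$ are:

$$
P(Spam \mid w_1, w_2, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
$$

$$
P(Ham \mid w_1, w_2, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
$$

The message is assigned to whichever class gives the larger value.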
Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, recall that we need to use these equations:
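With additive smoothing, the usual form of these equations (again assumed here, with $\alpha$ as the smoothing parameter) is:

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

Here $N_{w_i \mid Spam}$ is the number of times $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in all spam messages, and $N_{Vocabulary}$ is the number of unique words in the vocabulary. The ham quantities are defined the same way.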
The values we need to find out are:
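For a standard Naive Bayes filter these are typically the class priors, the total number of words per class, and the vocabulary size. The sketch below computes them on toy data; the variable names (`p_spam`, `n_spam`, `n_vocabulary`, and so on) are assumptions chosen for the example.

```python
# Toy training set: (label, cleaned words) pairs.
train = [
    ("spam", ["free", "prize", "now"]),
    ("ham", ["see", "you", "now"]),
    ("spam", ["free", "free", "entry"]),
    ("ham", ["call", "me"]),
]

spam_messages = [words for label, words in train if label == "spam"]
ham_messages = [words for label, words in train if label == "ham"]

# Class priors: share of training messages in each class.
p_spam = len(spam_messages) / len(train)
p_ham = len(ham_messages) / len(train)

# Total number of words (with repeats) in each class.
n_spam = sum(len(words) for words in spam_messages)
n_ham = sum(len(words) for words in ham_messages)

# Number of unique words across the whole training set.
n_vocabulary = len({word for _, words in train for word in words})

print(p_spam, p_ham, n_spam, n_ham, n_vocabulary)
```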
It is also worth mentioning that we are going to use Laplace smoothing, with the smoothing parameter set to @@5@@.
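Putting the pieces together, a smoothed Naive Bayes classifier can be sketched as below. This is an illustration under stated assumptions, not the notebook's own implementation: it assumes the conventional add-one value `ALPHA = 1`, skips words that are not in the vocabulary, and uses hypothetical function names `fit` and `classify` on toy data.

```python
from collections import Counter

ALPHA = 1  # assumption: conventional add-one (Laplace) smoothing

train = [
    ("spam", ["free", "prize", "now"]),
    ("ham", ["see", "you", "now"]),
    ("spam", ["free", "free", "entry"]),
    ("ham", ["call", "me"]),
]

def fit(train, alpha=ALPHA):
    """Estimate the prior and smoothed P(word | class) for each class."""
    vocabulary = {word for _, words in train for word in words}
    model = {}
    for label in ("spam", "ham"):
        docs = [words for lab, words in train if lab == label]
        counts = Counter(word for words in docs for word in words)
        denominator = sum(counts.values()) + alpha * len(vocabulary)
        model[label] = {
            "prior": len(docs) / len(train),
            "word_prob": {w: (counts[w] + alpha) / denominator for w in vocabulary},
        }
    return model

def classify(words, model):
    """Return 'spam', 'ham', or 'undecided' when both scores are equal."""
    scores = {}
    for label, params in model.items():
        score = params["prior"]
        for word in words:
            # Words never seen in training carry no information and are skipped.
            if word in params["word_prob"]:
                score *= params["word_prob"][word]
        scores[label] = score
    if scores["spam"] > scores["ham"]:
        return "spam"
    if scores["ham"] > scores["spam"]:
        return "ham"
    return "undecided"

model = fit(train)
print(classify(["free", "entry"], model))
print(classify(["call", "me"], model))
```

Smoothing matters here because a word such as "free" never appears in the toy ham messages; without it, P(free|Ham) would be 0 and would wipe out the whole product. For long messages, summing log probabilities instead of multiplying raw probabilities is a common way to avoid numerical underflow.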
From the result above (about 96.8% accuracy), the Naive Bayes algorithm is very accurate on the test set, well above our 80% goal, despite the relatively simple steps involved.
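Accuracy here means the share of test messages whose predicted label matches the true label. A minimal sketch of that calculation on made-up labels (the function name `accuracy` and the example labels are illustrative, not the actual test results):

```python
def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

true_labels = ["ham", "ham", "spam", "ham", "spam"]
predicted_labels = ["ham", "ham", "spam", "spam", "spam"]

print(accuracy(true_labels, predicted_labels))  # 0.8
```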