For this demo, we will be using the Amazon Fine Food Reviews dataset, which consists of 568,454 food reviews that Amazon users left up to October 2012.
This script is based on the Craigslist Word2Vec Demo.
We will begin by importing our review data into our H2O cluster. In this case, I will be starting up an H2O cluster on my local computer.
We will start by training a baseline model that does not use the review and instead uses other attributes in our dataset.
With an error of 22%, we can see that there is considerable room for improvement. To improve our model, we will train word embeddings for the review text.
The variable importance plot below shows us that the most important variable is
HelpfulnessNumerator. Looking at the partial dependence plot for that variable, we see that the more people who find the review helpful, the more likely it is a good review.
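These two plots can be produced directly from the trained model with H2O's built-in methods, for example:

```python
def explain_model(model, data):
    """Plot variable importances and the partial dependence of the
    most important variable for a trained H2O model."""
    model.varimp_plot()
    model.partial_plot(data=data, cols=["HelpfulnessNumerator"])
```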
Our first step will be to tokenize the words in the review column. We will do this by creating a function called
tokenize. This will split the reviews into words and remove any stop words, small words, or words with numbers in them.
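A sketch of such a tokenize function, in the style of the H2O word2vec demos (the methods used, such as `tokenize`, `tolower`, `nchar`, `grep`, and `isin`, are H2OFrame methods; the stop-word list is a shortened illustrative sample, not the demo's full list):

```python
# Illustrative stop-word sample; the demo's actual list is longer.
STOP_WORDS = ["i", "you", "we", "they", "the", "a", "an", "of", "or", "in",
              "for", "by", "on", "but", "is", "with", "not", "are", "this",
              "and", "it", "have", "from", "at", "my", "be", "that", "to"]

def tokenize(sentences, stop_words=STOP_WORDS):
    """Split reviews into lowercase words; drop stop words, short words,
    and words containing digits. NAs mark review boundaries, so keep them."""
    tokenized = sentences.tokenize("\\W+")       # split on non-word characters
    tokenized = tokenized.tolower()
    # drop very short tokens (keep NAs, which separate reviews)
    tokenized = tokenized[(tokenized.nchar() >= 2) | (tokenized.isna()), :]
    # drop tokens containing digits
    tokenized = tokenized[tokenized.grep("[0-9]", invert=True,
                                         output_logical=True), :]
    # drop stop words
    tokenized = tokenized[(tokenized.isna()) | (~tokenized.isin(stop_words)), :]
    return tokenized
```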
Now that we've tokenized our words, we can train a word2vec model. We can use the
find_synonyms function to sanity check our word2vec model after training.
Now that we have a word embedding for each word in our vocabulary, we will aggregate the words for each review using the
transform function. This will give us one aggregated word embedding for each review.
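In H2O this is a single call, `w2v.transform(words, aggregate_method="AVERAGE")`. Conceptually, averaging word vectors into one review vector looks like the pure-Python sketch below (the tiny 3-dimensional embeddings are made up for illustration):

```python
# Toy embeddings for illustration only; real vectors come from word2vec.
embeddings = {
    "great": [0.9, 0.1, 0.3],
    "coffee": [0.2, 0.8, 0.5],
}

def aggregate(words, embeddings):
    """Average the embedding vectors of a review's in-vocabulary words."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None  # the review had no in-vocabulary words
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```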
We will train a GBM model with the same parameters as our baseline GBM. This time, however, we will add the aggregated word embeddings as predictors.
We saw that the review column is not the only column with text. We also have a column called
Summary, which summarizes the review. We will add the word embeddings of the Summary column to see if this improves our model.
We can see that a low value of
C43 (one component of the aggregated review embedding) is associated with a smaller probability of a positive review. Let's see which words have a low C43 value.
The words with low C43, such as
refund, all seem to be related to contacting the seller for a refund. Words like
salmonella are obviously an indicator of a negative review for a food product.
Now that we've built a model we are satisfied with, we will see how the model performs on new reviews.