Predicting Good Amazon Reviews


For this demo, we will be using the Amazon Fine Food Reviews dataset, which consists of 568,454 food reviews left by Amazon users up to October 2012.

This script is based on the Craigslist Word2Vec Demo.

Import Data


We will begin by importing our review data into our H2O cluster. In this case, I will start up an H2O cluster on my local machine.
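A minimal sketch of this step, assuming the data lives in a local CSV file (the path is illustrative) and that the binary target is derived from the 1-5 star Score column; the column name PositiveReview is hypothetical:

```python
import h2o

h2o.init()  # start (or connect to) an H2O cluster on the local machine

# Illustrative path; point this at wherever the reviews CSV lives
reviews = h2o.import_file("amazon_fine_food_reviews.csv")

# Assumption: treat 4-5 star reviews as "good" and use that as the target
reviews["PositiveReview"] = (reviews["Score"] >= 4).asfactor()

train, valid = reviews.split_frame(ratios=[0.8], seed=1234)
```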


Train Baseline Model


We will start by training a baseline model that does not use the review text and instead uses the other attributes in our dataset.
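A sketch of the baseline, assuming the non-text predictors are the helpfulness counts and the review timestamp; the exact predictor list in the original notebook may differ:

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Assumed non-text predictors
baseline_predictors = ["HelpfulnessNumerator", "HelpfulnessDenominator", "Time"]

baseline_gbm = H2OGradientBoostingEstimator(model_id="baseline_gbm", seed=1234)
baseline_gbm.train(x=baseline_predictors,
                   y="PositiveReview",
                   training_frame=train,
                   validation_frame=valid)

print(baseline_gbm.model_performance(valid=True))
```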

We can see that there is considerable room for improvement: our error is 22%. To improve our model, we will train word embeddings for the review text.


The variable importance plot below shows us that the most important variable is HelpfulnessNumerator. Looking at the partial dependence plot for that variable, we see that the more people who find a review helpful, the more likely it is to be a good review.
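Both plots can be produced directly from the trained model; a sketch:

```python
# Variable importance for the baseline model
baseline_gbm.varimp_plot()

# Partial dependence of the prediction on HelpfulnessNumerator
baseline_gbm.partial_plot(data=train, cols=["HelpfulnessNumerator"])
```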


Tokenize Words in Review


Our first step will be to tokenize the words in the review column. We will do this by creating a function called tokenize, which splits the reviews into words and removes stop words, short words, and words containing numbers.
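A sketch of tokenize, closely following the Craigslist Word2Vec Demo this script is based on; the stop-word list here is abbreviated for illustration:

```python
STOP_WORDS = ["the", "a", "an", "of", "or", "in", "for", "by", "on", "but",
              "is", "not", "with", "as", "was", "if", "they", "are", "this",
              "and", "it", "have", "from", "at", "my", "be", "that", "to"]

def tokenize(sentences, stop_words=STOP_WORDS):
    tokenized = sentences.tokenize("\\W+")  # split on non-word characters
    tokenized = tokenized.tolower()
    # drop short words (fewer than 2 characters); keep the NA rows that
    # word2vec uses as review boundaries
    tokenized = tokenized[(tokenized.nchar() >= 2) | (tokenized.isna()), :]
    # drop words that contain numbers
    tokenized = tokenized[tokenized.grep("[0-9]", invert=True,
                                         output_logical=True), :]
    # drop stop words
    return tokenized[(tokenized.isna()) | (~tokenized.isin(stop_words)), :]

words = tokenize(reviews["Text"].ascharacter())
```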


Train Word2Vec Model


Now that we've tokenized our words, we can train a word2vec model. We can use the find_synonyms function to sanity-check our word2vec model after training.
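A sketch of the training step; vec_size, epochs, and the probe word are illustrative choices:

```python
from h2o.estimators.word2vec import H2OWord2vecEstimator

w2v_model = H2OWord2vecEstimator(model_id="w2v_model", vec_size=100, epochs=10)
w2v_model.train(training_frame=words)

# Sanity check: the nearest words should be plausible synonyms
print(w2v_model.find_synonyms("tasty", count=5))
```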


Now that we have a word embedding for each word in our vocabulary, we will aggregate the embeddings for each review using the transform function. This gives us one aggregated word embedding per review.
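A sketch of the aggregation; AVERAGE collapses each review's word vectors into a single mean vector:

```python
# One averaged embedding per review (columns C1...C100 by default)
review_vecs = w2v_model.transform(words, aggregate_method="AVERAGE")
```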


Train GBM Model to Predict Good Review


We will train a GBM model with the same parameters as our baseline GBM. This time, however, we will add the aggregated word embeddings as predictors.
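A sketch of the extended model, assuming the embedding columns are appended to the original frame and the frame is re-split with the same seed so the partition matches the baseline:

```python
reviews_ext = reviews.cbind(review_vecs)
train_ext, valid_ext = reviews_ext.split_frame(ratios=[0.8], seed=1234)

gbm_text = H2OGradientBoostingEstimator(model_id="gbm_text", seed=1234)
gbm_text.train(x=baseline_predictors + review_vecs.names,
               y="PositiveReview",
               training_frame=train_ext,
               validation_frame=valid_ext)

print(gbm_text.model_performance(valid=True))
```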


Adding Summary


We saw that the review column is not the only column with text. We also have a column called Summary, which summarizes the review. We will add the word embeddings of the summary to see if this improves our model.
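A sketch of the extension, reusing the same tokenize function and word2vec model for the summaries (an assumption; the original may train a separate embedding). The rename avoids clashing with the review embedding columns:

```python
summary_words = tokenize(reviews["Summary"].ascharacter())
summary_vecs = w2v_model.transform(summary_words, aggregate_method="AVERAGE")

# Rename to avoid colliding with the review embedding columns C1...C100
summary_vecs = summary_vecs.set_names(
    ["Summary_" + c for c in summary_vecs.names])

reviews_full = reviews_ext.cbind(summary_vecs)
train_full, valid_full = reviews_full.split_frame(ratios=[0.8], seed=1234)

gbm_full = H2OGradientBoostingEstimator(model_id="gbm_full", seed=1234)
gbm_full.train(x=baseline_predictors + review_vecs.names + summary_vecs.names,
               y="PositiveReview",
               training_frame=train_full,
               validation_frame=valid_full)
```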


We can see that a low C43 is associated with a smaller probability of a positive review. Let's see which words have low C43 values.
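One way to see which words sit at the low end of this embedding dimension is to pull per-word vectors (aggregate_method="NONE" returns one vector per token), attach the tokens, and sort; a sketch:

```python
word_vecs = w2v_model.transform(words, aggregate_method="NONE")

tokens = words[:, 0]
tokens.names = ["Word"]
words_with_vecs = tokens.cbind(word_vecs)

# Words with the lowest values along embedding dimension C43
print(words_with_vecs.sort(by="C43")[:20, ["Word", "C43"]])
```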


The words with low C43 values, such as contacted, answered, emails, phone, and refund, all seem to be related to contacting the seller for a refund. A word like salmonella is obviously an indicator of a negative review for a food product.

Predict on New Reviews


Now that we've built a model we are satisfied with, we will see how the model performs on the two new reviews below.

  • "The taste is great! especially when you cook it with some vegetable and egg. I like it very much, though it's more expensive than the other noodles"
  • "Quite tasteless and they make you order so many. I am stuck with 12 bags of this tasteless stuff. I am not ordering large amounts of anything from Amazon again. So often I don't like it and I am stuck with so much on hand."