Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.
Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to find patterns in the questions that could help us win.
The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions.
Before we can start analyzing the Jeopardy questions, we need to normalize all of the text columns (the Question and Answer columns). The idea is to lowercase words and remove punctuation so that Don't and don't aren't considered different words when we compare them.
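A minimal sketch of that normalization, assuming plain ASCII text. The function name `normalize_text` is illustrative and not taken from the original notebook; it lowercases the string and drops every character that isn't a letter, digit, or whitespace.

```python
import re

def normalize_text(text):
    """Lowercase the text and remove punctuation (assumes ASCII input)."""
    text = text.lower()
    # Drop anything that is not a lowercase letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse any runs of whitespace left behind by the removal
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Don't") == normalize_text("don't"))
```

With pandas, a function like this could be applied to the Question and Answer columns with `Series.apply`.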
The Value column should also be numeric, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.
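One way this conversion might look, as a sketch. The name `normalize_value` is hypothetical, and two choices here are assumptions rather than anything stated above: thousands separators such as commas are stripped along with the dollar sign, and entries with no digits at all are mapped to 0. It also assumes whole-dollar amounts.

```python
import re

def normalize_value(text):
    """Convert a string such as "$2,000" to the integer 2000."""
    # Keep only the digits (assumes whole-dollar amounts, no decimals)
    digits = re.sub(r"[^0-9]", "", str(text))
    # Assumption: entries without any digits become 0
    return int(digits) if digits else 0

print(normalize_value("$200"))
```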
The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.
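A small sketch of the date conversion using only the standard library. The helper name `parse_air_date` and the `YYYY-MM-DD` format are assumptions about how the dates are stored; adjust `fmt` if the file uses a different layout.

```python
from datetime import datetime

def parse_air_date(text, fmt="%Y-%m-%d"):
    """Parse an air date string into a datetime (format is an assumption)."""
    return datetime.strptime(text.strip(), fmt)

air_date = parse_air_date("2004-12-31")
print(air_date.year, air_date.month, air_date.day)
```

If the data is loaded with pandas, `pandas.to_datetime` applied to the whole column is the more usual route.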
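In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things: how often the answer can be deduced from the question, and how often new questions are repeats of older ones.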
We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.
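The first check can be sketched as a function that returns the share of answer words that also appear in the question. The name `count_matches` is illustrative, the inputs are assumed to be already normalized (lowercase, no punctuation), and common words such as "the" are not filtered out in this sketch.

```python
def count_matches(clean_question, clean_answer):
    """Fraction of words in the answer that also occur in the question."""
    answer_words = clean_answer.split()
    if not answer_words:
        # Avoid dividing by zero for empty answers
        return 0.0
    question_words = set(clean_question.split())
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# One of the three answer words ("river") appears in the question
print(count_matches("this river runs through paris", "the seine river"))
```

Averaging this value over all rows gives a rough measure of how often the answer can be read off the question.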
From the result above, we can see that about 6% of answers could be deduced from the questions. If you have no clue about an answer, guessing from the question might be a reasonable betting strategy.
Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.
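A rough sketch of one way to measure this, assuming the questions are already normalized and sorted by air date. The function name `repeat_ratios` and the `min_length` parameter are illustrative. For each question it keeps only the complex words (more than 6 characters) and records what share of them has already appeared in an earlier question.

```python
def repeat_ratios(clean_questions, min_length=7):
    """For each question, the share of its long words seen in earlier questions."""
    terms_used = set()
    ratios = []
    for question in clean_questions:
        # Keep only complex words (more than 6 characters by default)
        words = {w for w in question.split() if len(w) >= min_length}
        if words:
            seen = sum(1 for w in words if w in terms_used)
            ratios.append(seen / len(words))
        else:
            ratios.append(0.0)
        # Remember these words for the questions that follow
        terms_used.update(words)
    return ratios, terms_used

ratios, terms = repeat_ratios([
    "the capital of argentina",
    "argentina borders uruguay",
])
print(ratios)
```

This only compares single words, not phrases, so it is a loose proxy for repeated questions.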
The script above checks whether each term came up in previous questions. The overlap could be insignificant, but if certain topics are recycled from the past, it could be worth looking at the previous questions.
Let's say we only want to study high value questions instead of low value questions. This will help us earn more money when we're on Jeopardy.
You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories: high value and low value.
We can loop through each of the terms from terms_used and count how many high value and how many low value questions it appears in.
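The split and the per-term counting might be sketched as follows. The cutoff of 800 is purely a placeholder assumption, since the text above does not state one, and `observed_counts` is a hypothetical name. Rows are assumed to be `(clean_question, value)` pairs with normalized text and a numeric value.

```python
def observed_counts(term, rows, threshold=800):
    """Count the high and low value questions that contain `term`.

    `threshold` is a hypothetical cutoff: values above it count as high.
    """
    high_count = 0
    low_count = 0
    for clean_question, value in rows:
        if term in clean_question.split():
            if value > threshold:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

rows = [
    ("famous composer of vienna", 1000),
    ("a composer born here", 200),
    ("no match in this one", 2000),
]
print(observed_counts("composer", rows))
```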
We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
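As a sketch of that calculation: the expected counts come from the term's overall share of all questions, scaled by the size of each group, and the chi-squared value sums the squared differences between observed and expected counts, each divided by the expected count. The function name and the numbers in the example are made up for illustration. The sketch assumes the term occurs at least once and that both groups are non-empty.

```python
def chi_squared_for_term(high_count, low_count, total_high, total_low):
    """Chi-squared statistic for one term across high and low value questions."""
    total = high_count + low_count
    # Share of all questions in which the term appears
    total_prop = total / (total_high + total_low)
    # Counts we would expect if usage did not depend on value
    expected_high = total_prop * total_high
    expected_low = total_prop * total_low
    return ((high_count - expected_high) ** 2 / expected_high
            + (low_count - expected_low) ** 2 / expected_low)

# Illustrative numbers only: 1 high and 2 low occurrences,
# with 5000 high value and 15000 low value questions overall
chi_sq = chi_squared_for_term(1, 2, 5000, 15000)
print(round(chi_sq, 4))
```

With two categories there is one degree of freedom, so a value above roughly 3.841 would be significant at the 5% level. `scipy.stats.chisquare` computes the same statistic together with a p-value.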
None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 10, so the chi-squared test is less reliable here. It would be better to run this test with only terms that have higher frequencies.