While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
We will be cleaning and exploring the data set.
The data has several columns, including:
RespondentID is an identifier, so it should be unique and should not contain null values.
We can already see it has NaN values. Let's see if it has any non-unique values.
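One way to run this check is sketched below. The tiny frame and its ID values are made up for illustration; only the `RespondentID` column name comes from the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the ID values are invented.
star_wars = pd.DataFrame(
    {"RespondentID": [np.nan, 3292879998.0, 3292879538.0, 3292765271.0]}
)

ids = star_wars["RespondentID"]

# Rows that have no identifier at all.
n_missing = int(ids.isna().sum())

# Non-null identifiers that appear more than once.
n_duplicated = int(ids.dropna().duplicated().sum())

print(f"missing: {n_missing}, duplicated: {n_duplicated}")
```

Dropping the nulls before calling `duplicated()` keeps missing identifiers from being counted as duplicates of each other.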
No value is duplicated. Let's remove the rows that have a null RespondentID.
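A minimal sketch of the removal, assuming a toy frame with invented values (the `Gender` column is included only to show that whole rows are dropped):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; values are invented.
star_wars = pd.DataFrame(
    {
        "RespondentID": [np.nan, 1.0, 2.0],
        "Gender": [np.nan, "Male", "Female"],
    }
)

# Keep only the rows whose identifier is present.
star_wars = star_wars[star_wars["RespondentID"].notna()].reset_index(drop=True)

# Should be zero after the filter.
remaining_nulls = int(star_wars["RespondentID"].isna().sum())
print(len(star_wars), remaining_nulls)
```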
We can confirm that there are no more null values in the RespondentID column.
These two columns have Yes/No/NaN values. A value is NaN when the respondent chose not to answer the question. However, it is generally easier to work with boolean values than with string values. Let's convert the values.
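The conversion can be sketched with `Series.map` and a small dictionary. The short column names `seen_any` and `is_fan` are hypothetical placeholders for the two question columns, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical short names standing in for the two Yes/No question columns.
df = pd.DataFrame(
    {
        "seen_any": ["Yes", "No", "Yes"],
        "is_fan": ["Yes", np.nan, "No"],
    }
)

yes_no = {"Yes": True, "No": False}

for col in ["seen_any", "is_fan"]:
    # Values missing from the dictionary (here, NaN) stay NaN.
    df[col] = df[col].map(yes_no)

print(df)
```

Because `map` leaves unmatched values as NaN, unanswered questions remain distinguishable from an explicit "No".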
We can confirm that both columns now only contain True/False/NaN values from above.
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, "Which of the following Star Wars films have you seen? Please select all that apply."
The columns for this question are:
Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
The column names are also too long to work with comfortably. Let's change that first.
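A possible renaming is sketched below. It assumes the first column carries the question text and the remaining five are `Unnamed: 4` through `Unnamed: 8`; only `Unnamed: 8` is confirmed by the column list above. The `seen_N` names are a hypothetical choice.

```python
import numpy as np
import pandas as pd

question = (
    "Which of the following Star Wars films have you seen? "
    "Please select all that apply."
)

# Assumed raw names: the question text followed by Unnamed: 4 .. Unnamed: 8.
raw_cols = [question] + [f"Unnamed: {i}" for i in range(4, 9)]

# One empty row is enough to demonstrate the rename.
star_wars = pd.DataFrame([[np.nan] * 6], columns=raw_cols)

# Map each raw name to a short name, seen_1 .. seen_6 (hypothetical names).
new_names = {old: f"seen_{episode}" for episode, old in enumerate(raw_cols, start=1)}
star_wars = star_wars.rename(columns=new_names)

print(list(star_wars.columns))
```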
We are going to convert each column to contain only True or False.
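Since a cell holds either the movie title or NaN, one simple way to do this is to test for non-null values. The sketch below uses two hypothetical short column names and invented rows.

```python
import numpy as np
import pandas as pd

# Hypothetical short column names; each cell is a movie title or NaN.
seen = pd.DataFrame(
    {
        "seen_5": ["Star Wars: Episode V The Empire Strikes Back", np.nan],
        "seen_6": [
            "Star Wars: Episode VI Return of the Jedi",
            "Star Wars: Episode VI Return of the Jedi",
        ],
    }
)

# Any non-null cell counts as "seen"; NaN is treated as "not seen".
seen = seen.notna()

print(seen)
```

This relies on the assumption stated earlier that a missing answer means the respondent did not see the movie.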
We can confirm that the columns only contain True or False.
The next six columns ask the respondent to rank the Star Wars movies from most favorite to least favorite. A rank of 1 means the film was the respondent's favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
Fortunately, these columns don't require a lot of cleanup. We'll need to convert each column to a numeric type, though, then rename the columns so that we can tell what they represent more easily.
It's also a good idea to change the column names to something more descriptive.
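Both steps might look like the sketch below. The raw column names `raw_rank_a` and `raw_rank_b`, the target names `ranking_1` and `ranking_2`, and the values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Rankings often arrive as strings; names and values here are invented.
ranks = pd.DataFrame(
    {
        "raw_rank_a": ["3", "1", np.nan],
        "raw_rank_b": ["2", np.nan, "6"],
    }
)

# Convert every ranking column to float so NaN can be represented.
ranks = ranks.astype(float)

# Give the columns names that say which movie they rank (hypothetical names).
ranks = ranks.rename(columns={"raw_rank_a": "ranking_1", "raw_rank_b": "ranking_2"})

print(ranks.dtypes)
```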
Now that we've cleaned up the ranking columns, we can find the highest-ranked movie more quickly.
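One way to find it is to average each ranking column and pick the smallest mean, since a rank of 1 is the best. The column names and values in this sketch are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned ranking columns with invented values.
rankings = pd.DataFrame(
    {
        "ranking_4": [2.0, 1.0, 3.0],
        "ranking_5": [1.0, 2.0, 1.0],
        "ranking_6": [3.0, 3.0, np.nan],
    }
)

# mean() skips NaN by default, so unanswered rankings do not distort the average.
mean_rank = rankings.mean()

# The lowest mean rank corresponds to the most favored movie.
best = mean_rank.idxmin()

print(mean_rank)
print(best)
```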
Let's visualize it.
From the result above, Episode 5 was the most favored of the six films. Let's also see how many people have seen each movie and check whether that matches the rankings.
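Counting viewers reduces to summing the boolean columns, because True counts as 1 and False as 0. The sketch below assumes hypothetical `seen_N` column names and invented rows.

```python
import pandas as pd

# Hypothetical boolean "seen" columns with invented values.
seen = pd.DataFrame(
    {
        "seen_4": [True, False, True, False],
        "seen_5": [True, True, True, False],
        "seen_6": [True, True, False, False],
    }
)

# Summing booleans counts the True values in each column.
seen_counts = seen.sum()

# Column with the largest number of viewers.
most_seen = seen_counts.idxmax()

print(seen_counts)
```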
At a glance, it looks like the better a movie is ranked, the more people have watched it.
We know which movies the survey population as a whole ranked the highest. Now let's examine how certain segments of the survey population responded. Let's see if there is any difference between the genders.
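A segment comparison like this can be sketched with `groupby`. The `Gender` column name, the `ranking_5` and `seen_5` names, and every value below are assumptions made for illustration.

```python
import pandas as pd

# Invented rows; column names are assumed, not taken from the data set.
df = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female"],
        "ranking_5": [1.0, 2.0, 1.0, 3.0],
        "seen_5": [True, True, True, False],
    }
)

# Per-gender mean rank and viewer count for one movie.
by_gender = df.groupby("Gender").agg(
    mean_rank_5=("ranking_5", "mean"),
    seen_5=("seen_5", "sum"),
)

print(by_gender)
```

The same aggregation can be repeated for each ranking and seen column to compare all six movies across groups.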
Episode 5 is still the most favored and most watched among the male audience.
The result is slightly different among the female audience. Episode 5 is still the most watched and most favored. However, Episode 6 took second place, while among the male group, Episode 4 took second place.