Star Wars Survey

While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.

We will be cleaning and exploring the data set.

Load data

The data has several columns, including:

  • RespondentID - An anonymized ID for the respondent (person taking the survey)
  • Gender - The respondent's gender
  • Age - The respondent's age
  • Household Income - The respondent's income
  • Education - The respondent's education level
  • Location (Census Region) - The respondent's location
  • Have you seen any of the 6 films in the Star Wars franchise? - Has a Yes or No response
  • Do you consider yourself to be a fan of the Star Wars film franchise? - Has a Yes or No response
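
As a minimal sketch of the loading step, assuming the responses are stored as a CSV file and read with pandas: the helper name `load_survey` and the sample rows are hypothetical, and whether the real file needs an explicit `encoding` argument is an assumption the text does not confirm.

```python
import io

import pandas as pd


def load_survey(source, encoding=None):
    """Read the survey CSV from a path or file-like object into a DataFrame.

    `load_survey` is a hypothetical helper; pass `encoding` only if the
    real file turns out to need one.
    """
    return pd.read_csv(source, encoding=encoding)


# Tiny in-memory stand-in for the real file (made-up rows, illustration only).
sample_csv = io.StringIO(
    "RespondentID,Gender,Age\n"
    "1,Male,18-29\n"
    "2,Female,30-44\n"
)

star_wars = load_survey(sample_csv)
print(star_wars.shape)
print(star_wars.columns.tolist())
```

With the downloaded file, the same helper would be called with the file's path instead of the in-memory buffer.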

Data Clean Up

Unique Row Violation

RespondentID should be unique because it is an identifier. For the same reason, it should not contain null values either.

The column does contain NaN values. Next, let's check whether it has any non-unique values.
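
The two checks described so far (null values and repeated IDs) could look roughly like this. The DataFrame below uses made-up IDs purely for illustration; only the column name `RespondentID` comes from the data set.

```python
import numpy as np
import pandas as pd

# Made-up sample: three valid IDs and one missing ID.
star_wars = pd.DataFrame({
    "RespondentID": [1001.0, 1002.0, np.nan, 1003.0],
    "Gender": ["Male", "Female", np.nan, "Male"],
})

# Count missing identifiers.
null_count = star_wars["RespondentID"].isnull().sum()

# Count repeated identifiers, ignoring the missing ones.
ids = star_wars["RespondentID"].dropna()
duplicate_count = ids.duplicated().sum()

print("null IDs:", null_count)
print("duplicate IDs:", duplicate_count)
```

A duplicate count of zero means every non-null ID appears exactly once.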

There is no uniqueness violation. Let's remove the rows that have a null RespondentID.
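
One way to drop those rows, sketched on a made-up DataFrame, is `dropna` restricted to the identifier column. The original notebook may have used a boolean filter instead; both give the same result.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing identifier.
star_wars = pd.DataFrame({
    "RespondentID": [1001.0, np.nan, 1003.0, 1004.0],
    "Gender": ["Male", np.nan, "Female", "Male"],
})

# Keep only rows where RespondentID is present.
star_wars = star_wars.dropna(subset=["RespondentID"])

remaining_nulls = star_wars["RespondentID"].isnull().sum()
print(len(star_wars), remaining_nulls)
```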

We can confirm that there are no more null values in the RespondentID column.

  • Have you seen any of the 6 films in the Star Wars franchise?
  • Do you consider yourself to be a fan of the Star Wars film franchise?

These two columns have Yes/No/NaN values. The value is NaN when the respondent chose not to answer the question. However, it is generally easier to work with Boolean values than with string values, so let's convert them.
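
A minimal sketch of that conversion, using `Series.map` with a dictionary so that unanswered questions stay NaN. The sample answers are made up; the column names are the ones listed above.

```python
import numpy as np
import pandas as pd

seen_any = "Have you seen any of the 6 films in the Star Wars franchise?"
is_fan = "Do you consider yourself to be a fan of the Star Wars film franchise?"

# Made-up answers for illustration.
star_wars = pd.DataFrame({
    seen_any: ["Yes", "No", "Yes"],
    is_fan: ["Yes", np.nan, "No"],
})

# Values missing from the mapping (including NaN) come out as NaN.
yes_no = {"Yes": True, "No": False}
for col in (seen_any, is_fan):
    star_wars[col] = star_wars[col].map(yes_no)

print(star_wars[seen_any].tolist())
print(star_wars[is_fan].tolist())
```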

We can confirm that both columns now contain only True, False, or NaN values.

The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question "Which of the following Star Wars films have you seen? Please select all that apply."
The columns for this question are:

  • Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
  • Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
  • Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
  • Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
  • Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
  • Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.

It's also difficult to work with column names that are this long. Let's change that first.
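
The renaming could be sketched as follows. The new names `seen_1` through `seen_6` are hypothetical (the text does not say which names were chosen); the old names are the six columns listed above.

```python
import pandas as pd

old_names = [
    "Which of the following Star Wars films have you seen? "
    "Please select all that apply.",
    "Unnamed: 4",
    "Unnamed: 5",
    "Unnamed: 6",
    "Unnamed: 7",
    "Unnamed: 8",
]

# Hypothetical short names: seen_1 for Episode I up to seen_6 for Episode VI.
rename_map = {old: f"seen_{i}" for i, old in enumerate(old_names, start=1)}

# Empty frame with the original headers, just to show the renaming.
star_wars = pd.DataFrame(columns=["RespondentID"] + old_names)
star_wars = star_wars.rename(columns=rename_map)

print(star_wars.columns.tolist())
```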

We are going to convert each column to contain only True or False.
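
Because a cell holds either the movie name or NaN, one simple way to do this is `notnull()`, which yields True for a name and False for NaN. The sketch below assumes the columns have short names starting with `seen_` (hypothetical names) and uses made-up rows.

```python
import numpy as np
import pandas as pd

# Made-up sample: a movie title means "seen", NaN means "not seen".
star_wars = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I  The Phantom Menace", np.nan, np.nan],
    "seen_5": [
        "Star Wars: Episode V The Empire Strikes Back",
        "Star Wars: Episode V The Empire Strikes Back",
        np.nan,
    ],
})

seen_cols = [c for c in star_wars.columns if c.startswith("seen_")]
for col in seen_cols:
    # True where a title is present, False where the cell is NaN.
    star_wars[col] = star_wars[col].notnull()

print(star_wars)
```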

We can confirm that the columns contain only True or False.

The next six columns ask the respondent to rank the Star Wars movies in order of preference. 1 means the film was the respondent's favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:

  • Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
  • Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones
  • Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith
  • Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope
  • Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back
  • Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi

Fortunately, these columns don't require a lot of cleanup. We'll need to convert each column to a numeric type, though, and then rename the columns to something more descriptive so that we can tell what they represent more easily.
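
A rough sketch of both steps, assuming the rankings were read in as strings. The names `ranking_1` through `ranking_6` are hypothetical, and the two rows are made up.

```python
import numpy as np
import pandas as pd

old_names = [
    "Please rank the Star Wars films in order of preference with 1 being "
    "your favorite film in the franchise and 6 being your least favorite film.",
    "Unnamed: 10",
    "Unnamed: 11",
    "Unnamed: 12",
    "Unnamed: 13",
    "Unnamed: 14",
]

# Made-up rows: one full ranking stored as strings, one unanswered.
star_wars = pd.DataFrame(
    [["3", "2", "1", "4", "5", "6"], [np.nan] * 6],
    columns=old_names,
)

# Convert each ranking column to float (NaN stays NaN).
for col in old_names:
    star_wars[col] = star_wars[col].astype(float)

# Hypothetical short names: ranking_1 for Episode I up to ranking_6.
rename_map = {old: f"ranking_{i}" for i, old in enumerate(old_names, start=1)}
star_wars = star_wars.rename(columns=rename_map)

print(star_wars.dtypes)
```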

Now that we have cleaned up the ranking columns, we can find the highest-ranked movie more quickly.
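
The text does not say how the highest-ranked movie was computed. One plausible approach, shown here as an assumption, is to average each ranking column: since 1 means favorite, the lowest mean is the best-ranked film. Column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rankings for three of the six films (NaN = no answer).
star_wars = pd.DataFrame({
    "ranking_1": [3.0, 3.0, np.nan],
    "ranking_4": [2.0, 2.0, np.nan],
    "ranking_5": [1.0, 1.0, np.nan],
})

ranking_cols = [c for c in star_wars.columns if c.startswith("ranking_")]

# mean() skips NaN by default, so unanswered rows do not distort the result.
mean_rankings = star_wars[ranking_cols].mean()

# Lower is better because 1 means "favorite".
best_ranked = mean_rankings.idxmin()

print(mean_rankings)
print("best ranked:", best_ranked)
```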

Let's visualize it.

The results show that Episode 5 was the most favored of the six films. Let's also see how many people have seen each movie and whether that agrees with the ranking.
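
Counting viewers is straightforward once the "seen" columns are Boolean, because summing a Boolean column counts its True values. The sketch below assumes hypothetical `seen_` column names and made-up rows.

```python
import pandas as pd

# Made-up Boolean "seen" flags for three of the six films.
star_wars = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_5": [True, True, True],
    "seen_6": [True, True, False],
})

seen_cols = [c for c in star_wars.columns if c.startswith("seen_")]

# True counts as 1 and False as 0, so the sum is the number of viewers.
seen_counts = star_wars[seen_cols].sum()
most_seen = seen_counts.idxmax()

print(seen_counts)
print("most seen:", most_seen)
```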

At a glance, it looks like the better a movie is ranked, the more people have watched it.

We know which movies the survey population as a whole has ranked the highest. Now let's examine how certain segments of the survey population responded. Let's see whether there is any difference between the genders.
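
A sketch of the per-gender comparison using `groupby` on the Gender column. The original notebook may instead have filtered the data into separate male and female DataFrames; `groupby` is simply one compact option. All values and the `ranking_`/`seen_` column names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up sample; the last respondent did not report a gender.
star_wars = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "ranking_4": [2.0, 3.0, 2.0, 3.0, 1.0],
    "ranking_5": [1.0, 1.0, 1.0, 2.0, 2.0],
    "seen_5": [True, True, False, True, True],
})

# groupby drops rows whose key is NaN by default.
by_gender = star_wars.groupby("Gender")

# Mean rank per film and gender (lower is better).
mean_rankings = by_gender[["ranking_4", "ranking_5"]].mean()

# Number of viewers per film and gender.
seen_counts = by_gender[["seen_5"]].sum()

print(mean_rankings)
print(seen_counts)
```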

Episode 5 is still the most favored and most watched film among the male audience.

The result is slightly different among the female audience. Episode 5 is still the most watched and most favored film. However, Episode 6 took second place, whereas Episode 4 took second place among the male group.