Introduction


We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.

Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:

  • Rank - Rank by median earnings (the dataset is ordered by this column).
  • Major_code - Major code.
  • Major - Major description.
  • Major_category - Category of major.
  • Total - Total number of people with major.
  • Sample_size - Sample size (unweighted) of full-time.
  • Men - Male graduates.
  • Women - Female graduates.
  • ShareWomen - Women as share of total.
  • Employed - Number employed.
  • Median - Median salary of full-time, year-round workers.
  • Low_wage_jobs - Number in low-wage service jobs.
  • Full_time - Number employed 35 hours or more.
  • Part_time - Number employed less than 35 hours.

Using visualizations, we can start to explore questions from the dataset like:

  • Do students in more popular majors make more money?
    • Using scatter plots
  • How many majors are predominantly male? Predominantly female?
    • Using histograms
  • Which category of majors has the most students?
    • Using bar plots

We'll explore how to do these and more while primarily working in pandas. Before we start creating data visualizations, let's import the libraries we need and remove rows containing null values.

Environment Setup
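
A minimal setup sketch, assuming pandas and matplotlib are already installed. The aliases pd and plt are common conventions rather than requirements.

```python
# Core libraries for this walkthrough (assumes both are installed).
import pandas as pd               # data loading and DataFrame-based plotting
import matplotlib.pyplot as plt   # plotting library that pandas draws with

# Print the pandas version so the results can be reproduced later.
print(pd.__version__)
```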


Loading Data
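
The loading step might look roughly like the sketch below. The helper name load_grads, the file name recent-grads.csv, and the three sample rows are assumptions for illustration; the rows are made up and are not actual survey data. An in-memory buffer stands in for the file so the example runs on its own.

```python
import io
import pandas as pd

# Toy stand-in for the real file: three made-up rows, NOT actual survey data.
SAMPLE_CSV = """Rank,Major,Major_category,Total,Sample_size,Median
1,MAJOR A,Engineering,2000,30,70000
2,MAJOR B,Business,15000,200,45000
3,MAJOR C,Arts,,25,32000
"""

def load_grads(source):
    """Read the dataset from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)

# With the real data this would be load_grads("recent-grads.csv") (assumed name).
recent_grads = load_grads(io.StringIO(SAMPLE_CSV))

print(recent_grads.head())      # first rows, to check that the columns parsed
print(recent_grads.describe())  # summary statistics for the numeric columns
```

Calling head() and describe() right after loading is a quick way to confirm that the column names and value ranges look as expected.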


Dropping Missing Values


Dropping missing values is mainly for the benefit of the visualizations. If columns contain missing values, matplotlib can throw an error when plotting them. To avoid that, we'll drop the rows that have missing values.
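
A small sketch of the row-dropping step, using DataFrame.dropna() on a toy frame. The values are illustrative only and the variable names rows_before and rows_after are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row; values are illustrative only.
recent_grads = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Total": [2000.0, np.nan, 900.0],
    "Median": [70000, 45000, 32000],
})

rows_before = len(recent_grads)
# dropna() returns a new DataFrame without rows that contain any missing value.
recent_grads = recent_grads.dropna()
rows_after = len(recent_grads)

print(f"{rows_before} rows before, {rows_after} rows after dropping")
```

Comparing the row counts before and after shows how much data the cleanup discards.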

There are 173 rows in total before dropping rows with missing values.

Surprisingly, only one row had missing values, so 172 rows remain.

Visualization Through Pandas

Most of the plotting functionality in pandas is contained within the DataFrame.plot() method. When we call this method, we specify the data we want plotted as well as the type of plot. We use the kind parameter to specify the type of plot we want. We use x and y to specify the data we want on each axis. You can read about the different parameters in the documentation.
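
As a rough illustration of the kind, x, and y parameters, the sketch below draws one scatter plot from a toy DataFrame. The numbers are made up, and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

# Toy values for two of the dataset's columns; not actual survey data.
recent_grads = pd.DataFrame({
    "Sample_size": [30, 200, 25, 120],
    "Median": [70000, 45000, 32000, 50000],
})

# kind selects the plot type; x and y name the columns for each axis.
ax = recent_grads.plot(x="Sample_size", y="Median", kind="scatter",
                       title="Median vs. Sample_size")
```

The call returns a matplotlib Axes object, which can be used to adjust labels or limits afterwards.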


None of these plots seems to show a strong relationship between the variables.

Popularity (Sample_size) doesn't seem to be linked to graduates' median income. Popularity within a given gender doesn't seem to be linked to higher median income either. Likewise, whether graduates work full-time or not doesn't seem to affect median income.

Histogram


The most common median income falls between 30k and 40k. About 50% of majors are predominantly male and about 50% are predominantly female.
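
One way to arrive at figures like these is sketched below, assuming Series.hist() for the histogram and a boolean mean for the gender split. The data is a made-up toy sample, and the bin range and the name share_female_majority are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy values for two of the dataset's columns; not actual survey data.
recent_grads = pd.DataFrame({
    "Median": [70000, 45000, 32000, 36000, 38000],
    "ShareWomen": [0.12, 0.48, 0.77, 0.65, 0.30],
})

fig, ax = plt.subplots()
# Histogram of median salaries in six 10k-wide bins from 20k to 80k.
recent_grads["Median"].hist(ax=ax, bins=6, range=(20000, 80000))

# Fraction of majors in which women make up more than half of graduates.
share_female_majority = (recent_grads["ShareWomen"] > 0.5).mean()
print(f"Predominantly female majors: {share_female_majority:.0%}")
```

Taking the mean of a boolean Series gives the fraction of True values, which avoids reading the split off the histogram by eye.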

Scatter Matrix


Because scatter matrix plots are frequently used in exploratory data analysis, pandas contains a function named scatter_matrix() that generates the plots for us. This function is part of the pandas.plotting module and needs to be imported separately. To generate a scatter matrix plot for 2 columns, select just those 2 columns and pass the resulting DataFrame into the scatter_matrix() function.
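
A minimal sketch of that call on a toy DataFrame follows. The values are made up, the figsize is an arbitrary choice, and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
from pandas.plotting import scatter_matrix  # imported separately from pandas

# Toy values for two of the dataset's columns; not actual survey data.
recent_grads = pd.DataFrame({
    "Sample_size": [30, 200, 25, 120, 60],
    "Median": [70000, 45000, 32000, 50000, 41000],
})

# Selecting two columns yields a 2x2 grid of plots: by default, histograms
# on the diagonal and scatter plots in the off-diagonal cells.
axes = scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(6, 6))
```

The function returns a NumPy array of Axes, one per cell of the grid.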


Bar Plots
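
The introduction asks which category of majors has the most students. A possible approach, sketched on made-up data, is to sum Total per Major_category and draw the result as a bar plot. The grouping step and the name totals are assumptions about how the question could be answered, not necessarily what the original notebook did.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy values for two of the dataset's columns; not actual survey data.
recent_grads = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Business", "Arts"],
    "Total": [2000, 3000, 15000, 900],
})

# Sum the number of students per category, largest first.
totals = (recent_grads.groupby("Major_category")["Total"]
          .sum()
          .sort_values(ascending=False))

fig, ax = plt.subplots()
# One bar per category; the Series index supplies the bar labels.
totals.plot(kind="bar", ax=ax)
ax.set_ylabel("Total students")
```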
