We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
Using visualizations, we can start to explore questions from the dataset like:
Using bar plots
We'll explore how to do these and more while primarily working in pandas. Before we start creating data visualizations, let's import the libraries we need and remove rows containing null values.
Dropping these rows is mainly for the sake of visualization. If there are missing values, matplotlib can raise an error. To avoid that, we'll drop rows that have missing values.
There are 173 rows in total before dropping rows.
Surprisingly, only one row had missing values.
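The clean-up step described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the small DataFrame is made-up stand-in data, and the column names other than Sample_size (Major, ShareWomen, Median) are assumptions about the dataset's layout.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; all values are made up.
recent_grads = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C", "Major D"],
    "Sample_size": [36, 7, 3, 16],
    "ShareWomen": [0.12, np.nan, 0.15, 0.11],  # one missing value
    "Median": [110000, 75000, 73000, 70000],
})

# Count rows before and after dropping rows that contain any null value.
raw_row_count = len(recent_grads)
recent_grads = recent_grads.dropna()
cleaned_row_count = len(recent_grads)

print(f"Rows before: {raw_row_count}, rows after: {cleaned_row_count}")
```

Comparing the two counts is a quick way to see how many rows the null values cost us.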
Most of the plotting functionality in pandas is contained within the DataFrame.plot() method. When we call this method, we specify the data we want plotted as well as the type of plot. We use the kind parameter to specify the type of plot we want. We use x and y to specify the data we want on each axis. You can read about the different parameters in the documentation.
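As a rough sketch of how these parameters fit together, the example below draws a scatter plot with DataFrame.plot(). The data is a made-up stand-in, and the Median column name is an assumption about the dataset; only Sample_size is named in the text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import pandas as pd

# Hypothetical stand-in data; the values are made up for illustration.
recent_grads = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289],
    "Median": [110000, 75000, 73000, 70000, 65000],
})

# kind selects the plot type; x and y select the column for each axis.
ax = recent_grads.plot(
    x="Sample_size",
    y="Median",
    kind="scatter",
    title="Median vs. Sample_size",
)
```

The call returns a matplotlib Axes object, so the plot can be adjusted further (for example with ax.set_xlim()) after it is created.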
None of these plots seems to show a strong relationship between the variables.
The popularity of a major (Sample_size) doesn't seem to be linked to graduates' median income. Popularity within a given gender doesn't seem to be linked to higher median income either. Likewise, whether graduates work full time or not doesn't seem to affect median income.
The most common median income is in the $30,000 to $40,000 range. About 50% of majors are predominantly male and about 50% are predominantly female.
Because scatter matrix plots are frequently used in exploratory data analysis, pandas contains a function named scatter_matrix() that generates the plots for us. This function is part of the pandas.plotting module and needs to be imported separately. To generate a scatter matrix plot for two columns, select just those two columns and pass the resulting DataFrame into the scatter_matrix() function.
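The two-column case described above might look like the sketch below. The data is again a made-up stand-in, and the Median column name is an assumption about the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical stand-in data; the values are made up for illustration.
recent_grads = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289],
    "Median": [110000, 75000, 73000, 70000, 65000],
})

# Select just the two columns of interest and pass the resulting DataFrame in.
axes = scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(6, 6))
```

For two columns the function returns a 2x2 grid of axes: by default, histograms of each column sit on the diagonal and the scatter plots of one column against the other fill the remaining cells.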