Introduction to data analytics with pandas

#Introduction-to-data-analytics-with-pandas

Quentin Caudron

#Quentin-Caudron

PyData Seattle, July 2017

#PyData-Seattle,-July-2017

Systems check

#Systems-check

Do you have a working Python installation, with the pandas package ?

Note : This cell should run without raising a traceback. Assuming it runs, you can also try printing the value of pd.__version__ to see what version of pandas you have installed.

A little about me

#A-little-about-me
  • Lapsed computational physicist
  • PhD computational neuroscience, postdoc statistical epidemiology
  • Data Scientist at CBRE - www.cbredev.com
  • ATOM at Seattle

A little about the hero of this story

#A-little-about-the-hero-of-this-story

We'll be analysing a real-world dataset together. It's about my favourite thing in the world : coffee. This dataset was collected at the Mathematics Institute at the University of Warwick. It's a time-series dataset, describing the total number of coffees made by our espresso machine by a certain date.

A little about this workshop

#A-little-about-this-workshop

We'll be running through an analysis of this dataset as a way to expose you to the pandas API. The aim is to develop a little familiarity with how to work with pandas.

Slides are available at https://github.com/QCaudron/pydata_pandas. One notebook contains solutions; beware of spoilers.

The notebooks contain notes about what we're doing that I'll skip during this workshop, but try to explain on the way.

The pandas API is enormous. The documentation is excellent, don't hesitate to look things up.

Key questions

#Key-questions

The dataset : total number of coffees made vs time.

  • Who are the main contributors to this dataset, and when are contributions generally made ?
  • What are the department's weekday coffee habits ?
  • How much coffee are people drinking ?

Let's begin

#Let's-begin

Note : The second line here tells matplotlib to plot directly under the cell where any plotting code is called. pandas uses matplotlib to generate graphs, and without this, the graphs would appear outside the Jupyter notebook when you called plt.show() - but we just want them to appear without having to do this.

http://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1

Importing the data

#Importing-the-data

Let's import the coffee data from CSV.

Note : pandas can read from many data formats : CSV, JSON, Excel, HDF5, SQL, and more.

http://pandas.pydata.org/pandas-docs/version/0.20/io.html

What does this data look like ?

#What-does-this-data-look-like-?
Loading output library...

Let's just look at the first few rows.

#Let's-just-look-at-the-first-few-rows.
Loading output library...

We have an index, and three columns : timestamp, coffees, and contributor.

Uh-oh. Why is there a string of text, testing, in our coffee numbers ? What's going on in the coffees column in the row after that ?

Note : df.head(n=10) would show the first ten rows. The default is n=5.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

Let's look at that string in the third row.

#Let's-look-at-that-string-in-the-third-row.
Loading output library...

Definitely a string. We'll note this as something to fix after we finish looking around.

Note : .loc uses a label-based lookup, which means that the value you pass into the square brackets must be in the index. Another method, .iloc, is integer-location-based, so .iloc[2] would return the third row. In this case, they're the same, but had we changed our index, as we'll see later, things would work differently.

Indexing a dataframe with [] directly returns a pd.Series or pd.DataFrame by searching over columns, not rows. Indexing a pd.Series with [] is like indexing a dataframe with .iloc.

https://pandas.pydata.org/pandas-docs/stable/indexing.html

We should also take a look at that NaN. In fact, let's look at the first five values in coffees.

#We-should-also-take-a-look-at-that-NaN.-In-fact,-let's-look-at-the-first-five-values-in-
Loading output library...

Note : here, we're indexing a series ( a pd.Series object ). From a pd.DataFrame ( here, data ), when you access a single column ( data.coffees or data["coffees"] ), the object returned is a pd.Series. From that, indexing directly with [] works in an integer-location-based manner, and like with numpy arrays, you can take slices ( [:5] ).

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

How long is the dataset ?

#How-long-is-the-dataset-?

What else can we find out ?

#What-else-can-we-find-out-?
Loading output library...

Looks like we also have some missing data - we have 671 rows, but the coffees column only has 658 entries.

Note : .describe() returns different things based on what's in the dataframe, as we'll see later. For numerical columns, it will return things like the mean, standard deviation, and percentiles. For object columns ( strings or datetimes ), it will return the most frequent entry and the first and last items. For all columns, .describe() will return the count of objects in that column ( not counting NaNs ) and the unique number of entries. You can determine what's returned using .describe()'s keyword arguments.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html

Let's look at the dataframe where coffees is null.

#Let's-look-at-the-dataframe-where-
Loading output library...

Note : .isnull() returns a boolean array ( an array of Trues and Falses ), that you can then use to index the dataframe directly. Here, our boolean array tells us which entries in the coffees column are null, and we use that to index against the full dataframe - so we get back every column in the dataframe, but only those rows where coffees is null.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html

What type of Python objects are the columns ?

#What-type-of-Python-objects-are-the-columns-?
Loading output library...

The contributor column makes sense as object, because we expect strings there; but surely the timestamp should be a timestamp-type, and coffees should be numerical ?

It looks like the timestamp field was read from CSV as a string. That makes sense - CSV files are very basic. We'll have pandas interpret these strings as datetimes for us automatically.

Note : here's an example of using direct [] indexing on a pd.Series. We're accessing the first entry, just to see what type of object we have there.

On our first pass, what problems did we find ?

#On-our-first-pass,-what-problems-did-we-find-?
  • The timestamp column contains strings; these need to be datetimes
  • The coffees column contains some null values and at least one string

Cleaning the data

#Cleaning-the-data
Loading output library...
Loading output library...
Loading output library...
Loading output library...

So where do we stand ?

#So-where-do-we-stand-?
Loading output library...
Loading output library...

Note : .describe(include="all") is describing all attributes of all columns, but some don't make sense based on the column's dtype. For example, the contributor column has no first and last attributes, because those describe the first and last entries in an ordered series. That makes sense for the timestamp - those have an intuitive definition of sorting - but not so much for strings ( alphabetical order doesn't really matter when they're arbitrary strings ). Similary, the timestamp column has no mean or other numerical traits. What does it mean to calculate the mean timestamp ?

The time-series at a glance

#The-time-series-at-a-glance

Let's begin by visualising the coffee counts.

#Let's-begin-by-visualising-the-coffee-counts.
Loading output library...
Loading output library...

pandas is plotting the coffees against the index, which is just a series of integers.

Note : .plot() on a pd.Series will plot the data against the index. On a pd.DataFrame, the .plot() method allows plotting of one column against another.

By default, .plot() renders a line graph, but you can specify which type of plot you'd like - bar, line, histogram, area, scatter, etc..

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html

Loading output library...
Loading output library...

We have some very uneven spacing in places. We might start by cutting off the last few points of this time-series, which is missing a lot of data.

We'll inspect the last few points of this time-series.

#We'll-inspect-the-last-few-points-of-this-time-series.
Loading output library...

After mid-March, things start getting spaced rather erratically.

Let's cut off the tail of the time-series, anything after 2013-03-01.

#Let's-cut-off-the-tail-of-the-time-series,-anything-after-2013-03-01.
Loading output library...

Note : this is another example of boolean indexing. data.timestamp < "2013-03-01" is a boolean array, and can be passed into the dataframe immediately in [], much like with a np.ndarray.

One final look.

#One-final-look.
Loading output library...
Loading output library...

1. Contributions to the time-series

#1.-Contributions-to-the-time-series

Who are our main contributors ?

#Who-are-our-main-contributors-?
Loading output library...

Note : .value_counts() counts the unique values in a series. It's similar to doing a .groupby() followed by a .count(), as we'll see soon.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

Let's plot this.

#Let's-plot-this.
Loading output library...
Loading output library...

On which weekdays were contributions made ?

#On-which-weekdays-were-contributions-made-?
Loading output library...

Can we replace these integers with actual weekdays ?

#Can-we-replace-these-integers-with-actual-weekdays-?
Loading output library...

Let's group by these weekdays.

#Let's-group-by-these-weekdays.
Loading output library...

Note : this first line could be replaced by weekday_counts = data.weekdays.value_counts(), with the only difference being that that would return a series to us, and here, we got back a dataframe.

We can now visualise these weekday counts.

#We-can-now-visualise-these-weekday-counts.
Loading output library...
Loading output library...

2. Weekday trends

#2.-Weekday-trends

First, we'll set our timestamps to the dataframe's index

#First,-we'll-set-our-timestamps-to-the-dataframe's-index
Loading output library...

Let's add some rows at midnight on every day.

#Let's-add-some-rows-at-midnight-on-every-day.
Loading output library...

Note : pd.date_range creates a fixed-frequency DatetimeIndex. normalize=True ensures these datetimes are at midnight, and not at whatever time the starting point is.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html

Let's take the union of this index and our dataset's index.

#Let's-take-the-union-of-this-index-and-our-dataset's-index.
Loading output library...

Note : the union of these indices is just a new index where entries from both indices are present. It's sorted by time.

Now we can reindex our dataframe with this new index.

#Now-we-can-reindex-our-dataframe-with-this-new-index.
Loading output library...

Note : .reindex() keeps any values that conform to the new index, and inserts NaNs where we have no values.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

Loading output library...

We're now ready to resample the time-series at a daily frequency.

#We're-now-ready-to-resample-the-time-series-at-a-daily-frequency.
Loading output library...
Loading output library...
Loading output library...

Let's begin by figuring out how many coffees are made on any given day.

#Let's-begin-by-figuring-out-how-many-coffees-are-made-on-any-given-day.
Loading output library...

Note : we use .shift() here because if we look at the .diff() between a Monday and a Tuesday, those coffees are attributed to the Tuesday. However, what we want to say is "this many coffees were made at some point on the Monday", so we shift the entire series up one.

Now we can group this by weekday.

#Now-we-can-group-this-by-weekday.
Loading output library...

Let's order this series and then plot it.

#Let's-order-this-series-and-then-plot-it.
Loading output library...
Loading output library...

Wednesdays was seminar day...

3. Coffee per person

#3.-Coffee-per-person

We can now pull in data on how many people were in the department.

#We-can-now-pull-in-data-on-how-many-people-were-in-the-department.
Loading output library...

Let's join the datasets.

#Let's-join-the-datasets.
Loading output library...

Note : by default, inner joins are performed. That is, if a row from one of the datasets has an index that isn't in the other dataset, that row is dropped. You can specify whether you want outer, left, or right joins, as well plenty of other useful options. The pandas API for joining or merging datasets is very developed.

https://pandas.pydata.org/pandas-docs/stable/merging.html

Let's create a column for the number of coffees consumed per person.

#Let's-create-a-column-for-the-number-of-coffees-consumed-per-person.
Loading output library...

We can now plot this column.

#We-can-now-plot-this-column.
Loading output library...
Loading output library...

Those are strange plateaus. We'll pull in another dataset, telling us when the machine was broken.

#Those-are-strange-plateaus.-We'll-pull-in-another-dataset,-telling-us-when-the-machine-was-broken.
Loading output library...

Note : the parse_dates keyword argument takes several values. By passing in a list of strings, we're telling pandas to attempt to parse the dates in columns with those names.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Loading output library...

A quick trick to plot this as a time-series...

#A-quick-trick-to-plot-this-as-a-time-series...
Loading output library...
Loading output library...

Note : the first line here creates a boolean pd.Series, holding the value True when machine_status.status is "OK", and False otherwise. Because it's a pd.Series, its index stays the same as that of machine_status, which was a DatetimeIndex. Then, we can plot the boolean series ( True appearing as 1, and False appearing as 0 ), and just quickly scan to see that there are long areas where the coffee machine was operations, with short bouts ( thankfully ! ) of the machine being broken.

Let's join the datasets on the date field !

#Let's-join-the-datasets-on-the-date-field-!
Loading output library...

We'll bring in this numerical representation of status column into our dataframe too.

#We'll-bring-in-this-numerical-representation-of-status-column-into-our-dataframe-too.
Loading output library...

Let's plot both the coffees per person and the numerical status.

#Let's-plot-both-the-coffees-per-person-and-the-numerical-status.
Loading output library...
Loading output library...

We see a strong weekday-weekend effect. Resampling weekly will fix that.

#We-see-a-strong-weekday-weekend-effect.-Resampling-weekly-will-fix-that.
Loading output library...
Loading output library...

What have we achieved ?

#What-have-we-achieved-?

Cleaning

#Cleaning
  • Cast columns to the correct dtypes
  • Dropped rows with no data
  • Truncated the time-series when things got sparse

Exploring the contributions

#Exploring-the-contributions
  • Discovered who contributed to the dataset and how much
  • Established how contributions varied by day of the week

Exploring the time-series

#Exploring-the-time-series
  • Resampled the uneven time-series to regular daily intervals
  • Interpolated over missing data using our time index
  • Discovered on what days of the week coffee was especially popular
  • Joined the dataset with others to enrich the information available
  • Derived new columns from existing ones
  • Smoothed the time-series by weekly downsampling

Insights

#Insights
  • A small number of people contributed most of the data
  • Contributions were lacking at the weekends, and Mondays had the most contributions
  • Seminar Wednesdays is a strong driver of coffee consumption
  • Periods of low coffee drinking correlated strongly with periods where the machine was broken
  • A significant dip in consumption occurred in the summer months