Estimating and Predicting Number of Public Jupyter Notebooks on GitHub


Data Source


Data on the number of public notebooks on GitHub was downloaded from this repository by Peter Parente, a contributor to the Jupyter Project.

He created a script that scrapes the GitHub web search UI for the count, appends it to a CSV, and executes a notebook. The entire collection process is automated and runs on TravisCI on a daily schedule.

I've simply rewritten the plots in Plotly to make the graphs more readable and interactive. Enjoy!


Raw Hits


First, let's load the historical data into a DataFrame indexed by date. There might be missing counts for days that we failed to sample, so we build up the expected date range and insert NaNs for the dates we missed. Then we can plot the known notebook counts.
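The loading and reindexing step can be sketched roughly as follows. This is a minimal illustration, not the original notebook code: the inline CSV sample and the column names `date` and `hits` are assumptions standing in for the scraped file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the scraped CSV; the column names
# `date` and `hits` are assumptions, not confirmed by the source.
csv_text = """date,hits
2017-01-01,100
2017-01-02,110
2017-01-04,130
"""

# Parse the date column and use it as the index.
hits_df = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# Build the full expected daily range and reindex against it, so that
# days we failed to sample show up as NaN rows instead of being absent.
full_range = pd.date_range(hits_df.index.min(), hits_df.index.max(), freq="D")
hits_df = hits_df.reindex(full_range)
```

In this toy sample, 2017-01-03 was never sampled, so it appears in the reindexed frame with a NaN count.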




Next, let's look at various measurements of change.

The total change in the number of *.ipynb hits between the first day we have data and today is:


The mean daily change for the entire duration is:
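As a rough sketch, both figures (total change and mean daily change) can be derived from the first and last known counts. The sample series below is made up, and dividing by the number of elapsed calendar days is an assumption about how the mean is defined here.

```python
import numpy as np
import pandas as pd

# Made-up daily counts with one unsampled day (NaN).
idx = pd.date_range("2017-01-01", periods=5, freq="D")
hits = pd.Series([100.0, 110.0, np.nan, 130.0, 140.0], index=idx, name="hits")

# Only days with a recorded count can anchor the comparison.
known = hits.dropna()

# Total change between the first and the most recent sampled day.
total_change = known.iloc[-1] - known.iloc[0]

# Mean daily change, assuming we divide by elapsed calendar days.
elapsed_days = (known.index[-1] - known.index[0]).days
mean_daily_change = total_change / elapsed_days
```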


The change in hit count between any two consecutive days for which we have data looks like the following:
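One way to compute this, sketched on made-up data, is `Series.diff()` on the NaN-padded series: a difference is only defined where both the day and the day before it have a count, which matches "consecutive days for which we have data".

```python
import numpy as np
import pandas as pd

# Made-up daily counts with one unsampled day (NaN).
idx = pd.date_range("2017-01-01", periods=5, freq="D")
hits = pd.Series([100.0, 110.0, np.nan, 130.0, 140.0], index=idx, name="hits")

# diff() subtracts the previous day's value; any pair that touches a
# missing day yields NaN, so gaps never produce an inflated jump.
daily_change = hits.diff()
```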


The large jumps in the data are from GitHub reporting drastically different counts from one day to the next. Maybe GitHub was rebuilding a search index when we queried or had a search broker out-of-sync with the others?

Let's drop outliers, defined as values more than two standard deviations away from a centered 180-day rolling mean.
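A minimal sketch of such a filter is shown below. The helper name `drop_outliers` is hypothetical, and using the rolling standard deviation over the same centered window (rather than a global one) is an assumption.

```python
import numpy as np
import pandas as pd


def drop_outliers(series, window=180, n_std=2.0):
    """Replace values far from a centered rolling mean with NaN.

    Hypothetical helper: the standard deviation is taken over the same
    centered window as the mean, which is an assumption.
    """
    roll = series.rolling(window=window, center=True, min_periods=1)
    is_outlier = (series - roll.mean()).abs() > n_std * roll.std()
    # mask() keeps the index intact and blanks out only the flagged values.
    return series.mask(is_outlier)


# Made-up daily changes: small oscillation with one implausible spike.
values = np.where(np.arange(50) % 2 == 0, 9.0, 11.0)
values[25] = 1000.0
changes = pd.Series(values, index=pd.date_range("2017-01-01", periods=50, freq="D"))

cleaned = drop_outliers(changes)
```

Masking to NaN rather than deleting rows keeps the daily index complete, so a later interpolation step can fill the gaps.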


Now let's do a simple linear interpolation for missing values and then look at the rolling mean of change.
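The interpolation and smoothing can be sketched as follows on made-up data. The 3-day window is arbitrary and chosen only to suit the tiny sample; the window used for the actual plot is not stated here.

```python
import numpy as np
import pandas as pd

# Made-up daily counts with gaps left by missed samples or dropped outliers.
idx = pd.date_range("2017-01-01", periods=7, freq="D")
hits = pd.Series([100.0, np.nan, 120.0, 130.0, np.nan, np.nan, 160.0], index=idx)

# Linear interpolation fills each gap along a straight line between
# the nearest known neighbours.
filled = hits.interpolate(method="linear")

# Day-over-day change, then a rolling mean to smooth it.
# The window length of 3 is an arbitrary choice for this toy example.
daily_change = filled.diff()
rolling_change = daily_change.rolling(window=3, min_periods=1).mean()
```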




Now let's use fbprophet to forecast growth for the next 2 years. First, we'll try to forecast based on the raw search hit data with outliers removed.

The model appears to favor seasonality effects in the early data and replicate them throughout the forecast period. The density of early data versus the sparsity of later data is a likely cause.
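For reference, a forecast of this kind might be set up roughly as below. This is a sketch with default model settings, not the configuration used for the plots; the helper names `to_prophet_frame` and `forecast_hits` are hypothetical. Prophet expects an input frame with a `ds` date column and a `y` value column, and the package is imported lazily because it has been published under both `fbprophet` and `prophet`.

```python
import pandas as pd


def to_prophet_frame(series):
    """Convert a date-indexed series to Prophet's `ds`/`y` column layout."""
    return pd.DataFrame({"ds": series.index, "y": series.values})


def forecast_hits(series, days=730):
    """Fit a default Prophet model and forecast `days` days ahead.

    Hypothetical helper; default settings are an assumption.
    """
    # Imported lazily so the frame helper works without Prophet installed.
    try:
        from prophet import Prophet  # current package name
    except ImportError:
        from fbprophet import Prophet  # older package name
    model = Prophet()
    model.fit(to_prophet_frame(series))
    # Extend the training dates by the forecast horizon, then predict.
    future = model.make_future_dataframe(periods=days)
    return model, model.predict(future)
```

The returned forecast frame carries the point estimate in `yhat`, alongside `yhat_lower` and `yhat_upper` uncertainty bounds.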




Finally, it's nice to celebrate million-notebook milestones. We can use our model to predict when they're going to occur.
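One simple way to read milestones off a forecast is to find the first date on which the predicted count reaches each multiple of one million. The sketch below assumes a forecast frame sorted by date with `ds` and `yhat` columns, as Prophet's `predict` returns; the helper name `milestone_dates` and the sample numbers are made up.

```python
import pandas as pd


def milestone_dates(forecast, last_known, step=1_000_000):
    """Map each upcoming milestone to the first date the forecast reaches it.

    Hypothetical helper; assumes `forecast` is sorted by `ds`.
    """
    # First milestone strictly above the last observed count.
    first = (int(last_known) // step + 1) * step
    top = int(forecast["yhat"].max())
    result = {}
    for milestone in range(first, top + 1, step):
        reached = forecast.loc[forecast["yhat"] >= milestone, "ds"]
        result[milestone] = reached.iloc[0]
    return result


# Made-up forecast values for illustration only.
forecast = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=5, freq="D"),
    "yhat": [900_000, 1_000_000, 1_500_000, 2_100_000, 2_900_000],
})
milestones = milestone_dates(forecast, last_known=900_000)
```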
