I use Jupyter Notebooks extensively. They've almost completely replaced the Python REPL in my own development, my consulting projects, and with my data science students at Codementor. Notebooks are amazing and a great boon to interactive testing, teaching, and exploratory data science. But collaboration with and through them is...primitive. GitHub repos and Gists display Notebooks nicely, but that's really only a beginning.
I've begun testing kyso.io as a tool to improve notebook collaboration. What follows is a simple example of a lesson in approximation techniques and pitfalls. I'm using it as a test case for kyso.
So far kyso is working well for me. The only glitch I've found was related to mis-rendered inline math expressions (e.g. @@0@@). Once reported, that was quickly fixed.
Let's look at some basic plotting and curve-fitting code using Jupyter Notebook (Jupyter Lab, actually, because who doesn't love to be on the cutting edge?) along with standard Python data science tools.
We use numpy to generate a curve, which is the good ole' sine wave with some random noise added. We then attempt a polynomial approximation of the curve. matplotlib then plots the noisy sin curve, the real sin curve, and the polynomial approximation.
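A minimal sketch of those steps, assuming 50 sample points on [0, 2π], Gaussian noise with standard deviation 0.2, and a degree 3 fit via `np.polyfit`. The helper names (`make_noisy_sine`, `plot_fit`) and the specific parameter values are mine, chosen for illustration only.

```python
import numpy as np


def make_noisy_sine(n_points=50, noise_scale=0.2, seed=0):
    """Sample sin(x) on [0, 2*pi] and add Gaussian noise (assumed noise model)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 2 * np.pi, n_points)
    y_true = np.sin(x)
    y_noisy = y_true + rng.normal(scale=noise_scale, size=n_points)
    return x, y_true, y_noisy


def plot_fit(x, y_true, y_noisy, y_fit, degree):
    """Draw the noisy samples, the real sine, and the polynomial fit."""
    # Imported here so the numeric part runs even without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(x, y_noisy, ".", label="sin + noise")
    ax.plot(x, y_true, label="sin")
    ax.plot(x, y_fit, label=f"degree {degree} polynomial")
    ax.legend()
    return fig


x, y_true, y_noisy = make_noisy_sine()

# Least-squares polynomial fit to the noisy samples, then evaluate it on x.
coeffs = np.polyfit(x, y_noisy, 3)
y_fit = np.polyval(coeffs, x)
```

In a notebook cell, `plot_fit(x, y_true, y_noisy, y_fit, degree=3)` draws the three curves together.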
Note that while the polynomial approximation is imperfect, esp. in having a bit of phase offset from true sin, it has nice smoothness characteristics resembling sin quite closely. The degree of the polynomial, however, must be carefully managed. Too high a degree, esp. on a small number of data points, and you'll get significant overfitting. Too low a degree, esp. on larger, truly ebbing and flowing data, can significantly underfit.
For example, where our first degree 3 fit looked great above, a degree 15 fit below zigs and zags too much, trying to carefully match the original noisy data. A lumpy, poor fit results.
Choosing a polynomial degree is an important consideration. Doing so well takes knowing the number of raw data points, the length of the curve, and similar "omniscient" or "global" parameters of the data.
In later sessions, we'll look at ways to automate degree selection and evaluation. We'll compare that approach to the piecewise approximations so beloved in the creative arts (a.k.a. "splines").
However you look at it, and whatever your personal politics, this is not a good approximation of a sine wave.
In fact, if you didn't have the original sin curve as a reference, hadn't been told it was a curve fit to sin data points, or hadn't read the code, would you even guess it was a sin wave?
Here's the naked polynomial. If you had seen a lot of trig functions plotted over [0, 2π], or if sin + noise were your constant companion, maybe. But I'd argue it's a close call.
And this is with just modest noise. With a little more noise added, the case for "well, of course that was a sine wave!" becomes more tenuous.
In a lot of machine learning (ML) tasks, the assumption seems to be "we'll have a human there" to make sure the fit is good.
This xkcd panel sums it up nicely:
But is this really a sustainable proposition? Wouldn't it be superior to have some automatic way to measure and evaluate the quality of fit? In the ML context, the future is vast, fundamentally unknowable, and potentially fickle or chaotic. But even there, the practice is to mechanically evaluate model fit (ideally on a set of data held apart from the training data set, and never used in model construction).
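One way to picture that held-out evaluation, as a rough sketch: fit each candidate polynomial on a training subset only, then score it on points it never saw. The split size (20 of 60 points), the noise level, the candidate degrees, and the `heldout_rmse` helper are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# Randomly hold out a third of the points; they are never used for fitting.
idx = rng.permutation(x.size)
test_idx, train_idx = idx[:20], idx[20:]


def heldout_rmse(degree):
    """Fit on the training points only, then score on the held-out points."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    resid = np.polyval(coeffs, x[test_idx]) - y[test_idx]
    return float(np.sqrt(np.mean(resid ** 2)))


scores = {d: heldout_rmse(d) for d in (1, 3, 5, 9)}
best_degree = min(scores, key=scores.get)

for d, s in scores.items():
    print(f"degree {d}: held-out RMSE {s:.3f}")
```

Note that the score here is measured against noisy held-out samples, not against the true sine, which is the only option when the underlying curve is unknown.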
But let's begin with the simplified case, where we actually know the curve we're trying to fit. For ML, "knowing the underlying curve" is crazy-talk. But for other applications, it isn't so implausible or unusual; for a large class of phenomena we already have some established meta-models we can test against.
Here we try a lot of different potential polynomial degrees, and visually present their results against the actual sin curve. Things are going pretty well at degree 9, but 11 starts to look a little funky. By degree 15 it's getting worse and worse. But we don't stop there, because who doesn't love a good trainwreck?
Indeed, by this point it's likely that NumPy has emitted a few "RankWarning: Polyfit may be poorly conditioned" warnings. We've not only overfit, we've stressed out our numerical machinery.
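If you'd rather count those warnings than scroll past them, something like the following sketch can work. It sweeps a few degrees and records any conditioning warnings from `np.polyfit`. The degrees, point count, and noise level are assumptions; matching on the message text is a deliberate choice here, since the `RankWarning` class has lived in different places across NumPy versions.

```python
import warnings

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2 * np.pi, 30)
y_noisy = np.sin(x) + rng.normal(scale=0.2, size=x.size)

results = {}
for degree in (3, 9, 15, 25):
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of letting Python show each only once.
        warnings.simplefilter("always")
        coeffs = np.polyfit(x, y_noisy, degree)
    # Keep only the conditioning warnings, identified by their message text.
    rank_warnings = [w for w in caught if "poorly conditioned" in str(w.message)]
    results[degree] = (coeffs, len(rank_warnings))

for degree, (_, n_warn) in results.items():
    print(f"degree {degree}: {n_warn} conditioning warning(s)")
```

The higher degrees are where I'd expect warnings to show up, if anywhere; whether a given degree triggers one depends on the data and the NumPy version.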
Since we know we can easily generate an entire family of fit polynomials, the obvious next task is to define an evaluation function to choose amongst the options, then some sort of policy about choosing. (For sine, the policy is likely to be simple, but in real applications, factors such as minimum aggregate error, minimum instantaneous error, smoothness, computational cost, and maximum evaluation time must be balanced.)
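As a first sketch of what such an evaluation function and policy might look like: score each fit against the known sine on a dense grid, using RMSE as the aggregate error and maximum absolute error as the instantaneous one, then pick the lowest degree whose RMSE is within some slack of the best. The names `evaluate_fit` and `choose_degree`, the 10% slack, and the degree range are hypothetical choices, not a recommendation.

```python
import numpy as np


def evaluate_fit(coeffs, x_eval, y_true):
    """Return aggregate (RMSE) and instantaneous (max abs) error of a fit."""
    err = np.polyval(coeffs, x_eval) - y_true
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "max_abs": float(np.max(np.abs(err))),
    }


def choose_degree(scores, slack=1.10):
    """Pick the lowest degree whose RMSE is within `slack` times the best RMSE."""
    best = min(s["rmse"] for s in scores.values())
    return min(d for d, s in scores.items() if s["rmse"] <= slack * best)


rng = np.random.default_rng(3)
x = np.linspace(0, 2 * np.pi, 40)
y_noisy = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# Dense grid for scoring, so wiggles between the sample points are not missed.
x_dense = np.linspace(0, 2 * np.pi, 500)
y_dense = np.sin(x_dense)

scores = {
    d: evaluate_fit(np.polyfit(x, y_noisy, d), x_dense, y_dense)
    for d in range(1, 10)
}
chosen = choose_degree(scores)
print("chosen degree:", chosen, scores[chosen])
```

Preferring the lowest acceptable degree is a crude stand-in for the smoothness and computational-cost factors; a real policy would weigh those explicitly.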
But before we go there, let's look a little beyond the pre-defined interval we've fitted against. How do the polynomial models, carefully fitted on [0, 2π], do over a larger interval? [0, 4π] say?
As it turns out, terribly. Some of these models, so well-fit over our initial interval, immediately run off the tracks when applied to a larger interval. Even with an expanded frame, they run out of bounds. Even great interpolators can make lousy extrapolators.
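To put rough numbers on that, here is a sketch that fits on [0, 2π] and then compares the worst-case error inside the fitted interval with the worst-case error on (2π, 4π]. The degrees, sample count, and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x_fit = np.linspace(0, 2 * np.pi, 40)
y_fit = np.sin(x_fit) + rng.normal(scale=0.2, size=x_fit.size)

x_in = np.linspace(0, 2 * np.pi, 400)            # interpolation region
x_out = np.linspace(2 * np.pi, 4 * np.pi, 400)   # extrapolation region


def max_abs_error(coeffs, x):
    """Worst-case deviation of the polynomial from the true sine on x."""
    return float(np.max(np.abs(np.polyval(coeffs, x) - np.sin(x))))


errors = {}
for degree in (3, 5, 9):
    coeffs = np.polyfit(x_fit, y_fit, degree)
    errors[degree] = (max_abs_error(coeffs, x_in), max_abs_error(coeffs, x_out))
    print(f"degree {degree}: inside {errors[degree][0]:.3g}, "
          f"outside {errors[degree][1]:.3g}")
```

Since sine stays within [-1, 1] while any non-constant polynomial eventually grows without bound, the outside error has nowhere to go but up as the interval widens.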
Ok...why were we bothering again? If these curves are such crappy extrapolators, and if you need such caution to construct good curves even for interpolation, why bother with them? Patience, young Padawan! Yes, polynomial interpolations have all sorts of downsides. But they can make remarkably efficient estimators, and some of their defects can be patched up, e.g. with piecewise use. Let's take a moment to improve our motivation with that first case.