In this lab we'll look at images as data. In particular, we'll show how to use standard Python libraries to visualize and distort images, switch between matrix and vector representations, and run machine learning to perform a classification task. We'll be using the ever-popular MNIST digit recognition data set. This will be a two-part lab consisting of the following steps:
Part 1:
1. Generate random matrices and visualize as images
2. Data prep and exploration
-- Load MNIST data
-- Convert the vectors to a matrix and visualize
-- Look at pixel distributions
-- Heuristic feature reduction
-- Split data into train and validation

Part 2:
3. Classification
-- One vs. All using logistic regression
-- Random Forests on all
4. Error analysis
-- Visualize confusion matrix
-- Look at specific errors
5. Generate synthetic data and rerun RF
Before looking at actual data, let's introduce the basic image data construct in Python. A color image is generally a 3-dimensional array with dimensions N x M x 3. The first two dimensions index the pixel grid (rows, then columns), and the last dimension is the color channel. The values are typically integer pixel intensities between 0 and 255 for each color channel.
The first thing we'll do is generate a random image and look at it.
Now do it again with a much smaller grid to see how it looks.
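Both steps can be sketched with NumPy alone. The helper name `random_image` and the grid sizes are our own choices for illustration, not part of the original lab code.

```python
import numpy as np

def random_image(height, width, seed=None):
    """Return a height x width x 3 array of random uint8 pixel intensities."""
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so 256 yields values in 0..255
    return rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

big = random_image(200, 300, seed=0)   # looks like colored static
small = random_image(4, 6, seed=0)     # individual pixels become visible
print(big.shape, small.shape)
```

To look at either array, pass it to matplotlib's `plt.imshow`. On the small grid each cell of the array shows up as one colored square.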
Now let's do a little prep on the data. We need to split the training data to produce a validation set.
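Here is one minimal way to do the split, assuming the features and labels are already NumPy arrays. The function name, the 80/20 ratio and the stand-in arrays are assumptions for illustration; scikit-learn's `train_test_split` does the same job.

```python
import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    """Shuffle the row indices once, then slice off a validation set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(round(validation_fraction * len(X)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Stand-in data: 50 records with 4 features each (not the real MNIST file)
X = np.arange(50 * 4).reshape(50, 4)
y = np.arange(50) % 10
X_train, X_val, y_train, y_val = train_validation_split(X, y)
print(X_train.shape, X_val.shape)
```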
Note that this data is in a single vector format (i.e., the dimensionality of each record is 1xK, and not in the typical 2-dim photo layout). Before doing any data mining, let's first explore the data visually.
The first thing we can do is look at the actual images. What better way to visualize image data?
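To view a record as an image we first have to fold the 1xK vector back into a square grid. The sketch below assumes K is a perfect square (for MNIST, 784 = 28 x 28); `vector_to_image` is a hypothetical helper name.

```python
import numpy as np

def vector_to_image(pixels):
    """Reshape a flat pixel vector into a square 2-D array."""
    side = int(round(np.sqrt(pixels.size)))
    if side * side != pixels.size:
        raise ValueError("vector length is not a perfect square")
    return pixels.reshape(side, side)

record = np.arange(784)            # stand-in for one MNIST row
image = vector_to_image(record)
print(image.shape)

# ravel() undoes the reshape, so we can move freely between the two layouts
assert np.array_equal(image.ravel(), record)
```

The resulting 2-D array can be passed to `plt.imshow(image, cmap="gray")` to display the digit.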
As we can see from the black and white images, many pixels may be black (x=0) in every image. A pixel that never varies can't help distinguish one digit from another, so we can drop it. Write a function that finds all of the features with stdev=0 and then drops them from the training data.
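One possible solution, sketched on a tiny stand-in matrix. The function names are ours. Note that the same columns have to be dropped from the validation data as well, using the indices found on the training data.

```python
import numpy as np

def find_constant_features(X):
    """Indices of columns whose standard deviation is exactly zero."""
    return np.flatnonzero(X.std(axis=0) == 0)

def drop_features(X, columns):
    """Return a copy of X without the given columns."""
    return np.delete(X, columns, axis=1)

# Toy data: columns 0 and 3 never change across records
X_train = np.array([[0, 5, 0, 255],
                    [0, 9, 0, 255],
                    [0, 1, 3, 255]])
constant = find_constant_features(X_train)
X_reduced = drop_features(X_train, constant)
print(constant, X_reduced.shape)
```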
Let's now take a look at the distribution of pixel values in a more traditional way. Let's also look at the distribution by class to see if we see any differences. As there are many pixels, let's look at the top 4 by pixel variance.
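A sketch of the selection and per-class histogram steps, using random stand-in data rather than MNIST. The helper names and the choice of 8 bins are assumptions.

```python
import numpy as np

def top_variance_pixels(X, k=4):
    """Column indices of the k highest-variance pixels, largest first."""
    variances = X.var(axis=0)
    return np.argsort(variances)[::-1][:k]

def pixel_histogram_by_class(X, y, pixel, bins=8):
    """Histogram counts of one pixel's values, computed separately per label."""
    edges = np.linspace(0, 256, bins + 1)
    return {int(label): np.histogram(X[y == label, pixel], bins=edges)[0]
            for label in np.unique(y)}

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(300, 10))   # stand-in pixel matrix
X[:, 7] = 0                                # one dead pixel
y = rng.integers(0, 3, size=300)           # stand-in labels

top4 = top_variance_pixels(X, k=4)
for pixel in top4:
    print(pixel, pixel_histogram_by_class(X, y, pixel))
```

Each dictionary maps a label to bin counts, which can be drawn as overlaid or side-by-side bar charts, one panel per pixel.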
Given how polarized the pixels are, this is a bit hard to read. Next let's just look at means by label group.
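Group means are a one-liner per label with boolean masks. This is a minimal sketch on toy data; `class_means` is a hypothetical name.

```python
import numpy as np

def class_means(X, y):
    """Mean of every feature within each label group (one row per label)."""
    labels = np.unique(y)
    means = np.vstack([X[y == label].mean(axis=0) for label in labels])
    return labels, means

X = np.array([[0, 10], [2, 20], [4, 30], [6, 50]])
y = np.array([0, 0, 1, 1])
labels, means = class_means(X, y)
print(labels)
print(means)
```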
Can we start to see how the distributions differ by class?
With our data generally explored and prepped and a little intuition about how pixel values might help us discriminate between the digits, let's start to use our predictive modeling tools.
We'll start with a simple linear model. This is a multi-class scenario, so we can't just use an out-of-the-box logistic regression. Luckily, as with most things machine learning, scikit-learn has a tool for that.
Let's take a quick look under the hood of the OneVsRestClassifier object.
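The sketch below fits a one-vs-rest logistic regression and then inspects the fitted object. To keep it self-contained it uses scikit-learn's small bundled `load_digits` set (8x8 images) as a stand-in for MNIST. The split ratio and `max_iter` value are assumptions, not the lab's settings.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0   # this stand-in set has pixel values in 0..16
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)

# Under the hood: one binary "this digit vs. everything else" model per class
print(ovr.classes_)                      # the label each binary model targets
print(len(ovr.estimators_))              # number of binary models
print(ovr.estimators_[3].coef_.shape)    # one weight per pixel for digit 3

# Each column of scores comes from one binary model; the highest score wins
scores = ovr.decision_function(X_val[:5])
print(ovr.classes_[scores.argmax(axis=1)], ovr.predict(X_val[:5]))

lr_accuracy = ovr.score(X_val, y_val)
print(lr_accuracy)
```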
Now let's evaluate on test data using a confusion matrix.
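A confusion matrix just counts (true label, predicted label) pairs. Here is a minimal NumPy sketch on made-up labels. It follows the same layout as `sklearn.metrics.confusion_matrix`: rows are true labels, columns are predictions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of records with true label i predicted as j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # handles repeated index pairs correctly
    return cm

# Made-up labels for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()       # the diagonal holds the correct predictions
print(cm)
print(accuracy)
```

Off-diagonal cells are the errors, so a heatmap of `cm` (for example with `plt.imshow`) makes systematic confusions between particular digits easy to spot.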
Looking at the above, are any of the errors a bit weird?
This accuracy is pretty good for an untuned LR on a sample of data. Can we do better with a random forest? Before running it, why might an RF do better in this case?
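A minimal random forest run looks like this. As a stand-in for MNIST it again uses scikit-learn's bundled `load_digits` set, and the number of trees is an arbitrary, untuned choice.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# A random forest handles multiple classes natively, so no one-vs-rest wrapper
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

rf_accuracy = rf.score(X_val, y_val)
print(rf_accuracy)
```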
One interesting aspect of image data is that we can often improve our models by using synthetic data. The key is to produce such data in a way that it doesn't deviate too far from realistic examples. We can do this by making small perturbations to the image matrix, such as adding random noise, adding blur, and performing small rotations.
We'll do that here using tools from scipy alone (note there are other image libraries that offer even more functionality). We'll then take the augmented data set and build another random forest and see if we can improve our model.
First let's create a function that adds random Gaussian noise, rotates the digit by a small angle, and applies some Gaussian smoothing to create blur.
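One way such a function might look, using `scipy.ndimage`. The function name, parameter names and default values are assumptions chosen for illustration, and the order of the three steps is just one reasonable choice.

```python
import numpy as np
from scipy import ndimage

def distort(image, noise_sd=10.0, max_angle=15.0, blur_sigma=0.7, rng=None):
    """Return a noisy, slightly rotated, slightly blurred copy of a 2-D image."""
    rng = np.random.default_rng() if rng is None else rng
    # 1. additive Gaussian noise on every pixel
    noisy = image.astype(float) + rng.normal(0.0, noise_sd, size=image.shape)
    # 2. random rotation; reshape=False keeps the original grid size
    angle = rng.uniform(-max_angle, max_angle)
    rotated = ndimage.rotate(noisy, angle, reshape=False, mode="constant", cval=0.0)
    # 3. Gaussian smoothing to create blur
    blurred = ndimage.gaussian_filter(rotated, sigma=blur_sigma)
    # keep the result in the valid pixel range
    return np.clip(blurred, 0, 255)

# Stand-in "digit": a vertical bar on a 28x28 black background
image = np.zeros((28, 28))
image[4:24, 13:15] = 255
distorted = distort(image, rng=np.random.default_rng(0))
print(distorted.shape, distorted.min(), distorted.max())
```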
Before building more training data, let's first plot some examples here.
So we need to choose distortion parameters that are significant enough to create a new image, but don't make it unreadable. This is a subjective call (though it can be tested empirically... how?). What are some reasonable values?
Our next step will be to generate another 10k training data points, and then run a second random forest.
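The data generation step can be sketched as sampling training rows with replacement, distorting each one, and stacking the results under the originals. Everything named here is hypothetical: `augment` takes any distortion function, and `add_noise` is a noise-only stand-in so the example runs on its own.

```python
import numpy as np

def augment(X, y, n_new, distort_fn, seed=0):
    """Append n_new distorted copies of randomly chosen rows of X."""
    rng = np.random.default_rng(seed)
    picks = rng.integers(0, len(X), size=n_new)    # sample rows with replacement
    side = int(round(np.sqrt(X.shape[1])))         # assumes square images
    new_rows = [distort_fn(X[i].reshape(side, side), rng).ravel() for i in picks]
    X_aug = np.vstack([X, np.array(new_rows)])
    y_aug = np.concatenate([y, y[picks]])          # a distorted digit keeps its label
    return X_aug, y_aug

def add_noise(image, rng):
    """Stand-in distortion: Gaussian noise only."""
    return np.clip(image + rng.normal(0, 10, size=image.shape), 0, 255)

# Stand-in training set: 20 blank 28x28 "images"
X_train = np.zeros((20, 784))
y_train = np.arange(20) % 10
X_aug, y_aug = augment(X_train, y_train, n_new=10, distort_fn=add_noise)
print(X_aug.shape, y_aug.shape)
```

Only the training data should be augmented. The validation set stays untouched so that the second random forest is scored on real images.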
Note to students: choose your own values here, including the total number of new examples and the distortion parameters. We'll execute and then see as a class what works best (in a sense, we'll parallelize the exploration of the parameter space).