This notebook explains two crucial linear algebra operations used when coding neural networks: matrix multiplication and broadcasting.
Matrix multiplication is a way of combining two matrices (involving multiplying and summing their entries in a particular way). Broadcasting refers to how libraries such as Numpy and PyTorch can perform operations on matrices/vectors with mismatched dimensions (in particular cases, with set rules). We will use broadcasting to show an alternative way of thinking about matrix multiplication, different from the way it is standardly taught.
This is different from how most math courses are taught, where you have to learn all the individual elements before you can combine them (Harvard professor David Perkins calls this elementitis), but it is similar to how topics like driving and baseball are taught. That is, you can start driving without knowing how an internal combustion engine works, and children begin playing baseball before they learn all the formal rules.
(source: [Demba Ba](https://github.com/zalandoresearch/fashion-mnist) and [Arvind Nagaraj](https://medium.com/towards-data-science/thoughts-after-taking-the-deeplearning-ai-courses-8568f132153))
This notebook was originally created for a 40 minute talk I gave at the O'Reilly AI conference in San Francisco. If you want further resources for linear algebra, here are a few recommendations:
We will be using the open source deep learning library, fastai, which provides high level abstractions and best practices on top of PyTorch. This is the highest level, simplest way to get started with deep learning. Please note that fastai requires Python 3 to function. It is currently in pre-alpha, so items may move around and more documentation will be added in the future.
The fastai deep learning library uses PyTorch, a Python framework for dynamic neural networks with GPU acceleration, which was released by Facebook's AI team.
PyTorch has two overlapping, yet distinct, purposes. As described in the PyTorch documentation:
The neural network functionality of PyTorch is built on top of the Numpy-like functionality for fast matrix computations on a GPU. Although the neural network purpose receives way more attention, both are very useful. We'll implement a neural net from scratch today using PyTorch.
Further learning: If you are curious to learn what dynamic neural networks are, you may want to watch this talk by Soumith Chintala, Facebook AI researcher and core PyTorch contributor.
Graphics processing units (GPUs) allow matrix computations to be done with much greater speed, as long as you have a library such as PyTorch that takes advantage of them. Advances in GPU technology in the last 10-20 years have been a key part of why neural networks are proving so much more powerful now than they were a few decades ago.
You may own a computer with a GPU that can be used. For the many people who either don't have a GPU or have one that can't easily be accessed from Python, there are a few different options:
Run on the CPU instead by removing .cuda() wherever it appears.
Today we will be working with MNIST, a classic data set of hand-written digits. Solutions to this problem are used by banks to automatically recognize the amounts on checks, and by the postal service to automatically recognize zip codes on mail.
A matrix can represent an image, by creating a grid where each entry corresponds to a different pixel.
Let's download, unzip, and format the data.
Many machine learning algorithms behave better when the data is normalized, that is, when the mean is 0 and the standard deviation is 1. We will subtract the mean of our training set and divide by its standard deviation in order to normalize the data:
Note that for consistency (with the parameters we learn when training), we normalize our validation set using the mean and standard deviation of our training set.
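As a minimal sketch of this normalization step, the snippet below uses randomly generated arrays as hypothetical stand-ins for the flattened images (the names x_train and x_valid and the array sizes are assumptions, not the notebook's actual data):

```python
import numpy as np

# Hypothetical stand-ins for the flattened training and validation images.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 255, size=(100, 784))
x_valid = rng.uniform(0, 255, size=(20, 784))

# Compute the statistics on the training set only.
mean = x_train.mean()
std = x_train.std()

# Subtract the mean, then divide by the standard deviation.
x_train_norm = (x_train - mean) / std

# Reuse the training statistics for the validation set.
x_valid_norm = (x_valid - mean) / std

print(x_train_norm.mean(), x_train_norm.std())  # approximately 0 and 1
```

The validation set will end up with a mean close to, but not exactly, 0, because it is shifted and scaled by the training statistics rather than its own.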
In any sort of data science work, it's important to look at your data, to make sure you understand the format, how it's stored, what type of values it holds, etc. To make it easier to work with, let's reshape it into 2d images from the flattened 1d format.
It's the digit 3! And that's stored in the y value:
We can look at part of an image:
A function takes inputs and returns outputs. For instance, @@0@@ is an example of a function. If we input @@1@@, the output is @@2@@, or if we input @@3@@, the output is @@4@@
Functions have parameters. The above function @@5@@ is @@6@@, with parameters a and b set to @@7@@ and @@8@@.
Machine learning is often about learning the best values for those parameters. For instance, suppose we have the data points on the chart below. What values should we choose for @@9@@ and @@10@@?
In the above gif (from the fast.ai Practical Deep Learning for Coders course, intro to SGD notebook), an algorithm called stochastic gradient descent is being used to learn the best parameters to fit the line to the data (note: in the gif, the algorithm is stopping before the absolute best parameters are found). This process is called training or fitting.
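To make the idea of fitting concrete, here is a rough sketch of stochastic gradient descent learning the two parameters of a line. It assumes a model of the form y = a*x + b and a mean squared error loss; the data, the true values 3 and 8, the learning rate, and the batch size are all hypothetical choices for illustration, not taken from the course notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data scattered around the line y = 3x + 8.
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 8.0 + rng.normal(0, 0.1, size=200)

a, b = 0.0, 0.0   # initial guesses for the parameters
lr = 0.1          # learning rate (step size), an assumed value
batch_size = 20

for epoch in range(200):
    order = rng.permutation(len(x))          # visit the samples in random order
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (a * x[idx] + b) - y[idx]
        # Gradients of the mean squared error with respect to a and b.
        grad_a = 2 * np.mean(error * x[idx])
        grad_b = 2 * np.mean(error)
        # Step each parameter a little in the direction that reduces the error.
        a -= lr * grad_a
        b -= lr * grad_b

print(a, b)  # should land near 3 and 8
```

Each update uses only one small batch of points, which is what makes the method "stochastic".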
Most datasets will not be well-represented by a line. We could use a more complicated function, such as @@0@@. Now we have 4 parameters to learn: @@1@@, @@2@@, @@3@@, and @@4@@. This function is more flexible than @@5@@ and will be able to accurately model more datasets.
Neural networks take this to an extreme, and are infinitely flexible. They often have thousands, or even hundreds of thousands, of parameters. However, the core idea is the same as above. The neural network is a function, and we will learn the best parameters for modeling our data.
Possibly the most important idea in machine learning is that of having separate training & validation data sets.
As motivation, suppose you don't divide up your data, but instead use all of it. And suppose you have lots of parameters:
This is called over-fitting. A validation set helps prevent this problem.
The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it's not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.
This illustrates how using all our data can lead to overfitting.
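The effect can be explored with a small experiment. The sketch below is only an illustration under assumed conditions: the data is synthetic and roughly linear, every third point is held out for validation, and polynomials of degree 1 and 6 stand in for a simple model and one with lots of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy data that roughly follows a straight line.
x = np.linspace(0, 1, 30)
y = 2.0 * x + rng.normal(0, 0.2, size=x.shape)

# Hold out every third point for validation; train on the rest.
is_valid = np.arange(len(x)) % 3 == 0
x_train, y_train = x[~is_valid], y[~is_valid]
x_valid, y_valid = x[is_valid], y[is_valid]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

train_err, valid_err = {}, {}
for degree in (1, 6):
    # Fit on the training points only.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    valid_err[degree] = mse(coeffs, x_valid, y_valid)
    print(degree, train_err[degree], valid_err[degree])
```

The more flexible polynomial fits the training points at least as well as the line. Whether it generalizes is what the error on the held-out points reveals, which is why the validation error is the number to watch.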
We will use fastai's ImageClassifierData, which holds our training and validation sets and will provide batches of that data in a form ready for use by a PyTorch model.
We will begin with the highest level abstraction: using a neural net defined by PyTorch's Sequential class.
Each input is a vector of size @@0@@ pixels and our output is of size @@1@@ (since there are 10 digits: 0, 1, ..., 9).
We use the output of the final layer to generate our predictions. Often for classification problems (like MNIST digit classification), the final layer has the same number of outputs as there are classes. In our case, this is 10: one for each digit from 0 to 9. These can be converted to comparative probabilities. For instance, it may be determined that a particular hand-written image is 80% likely to be a 4, 18% likely to be a 9, and 2% likely to be a 3. Here, we are not interested in viewing the probabilities, and just want to see what the most likely guess is.
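One common way to turn the 10 outputs into probabilities is the softmax function; the sketch below assumes softmax is used and works on a made-up vector of scores rather than real network output.

```python
import numpy as np

# Hypothetical raw outputs of the final layer for one image (10 scores).
scores = np.array([0.1, -1.2, 0.3, 1.0, 4.0, 0.2, -0.5, 0.0, 0.4, 2.5])

# Softmax: exponentiate and normalize so the values sum to 1.
exp_scores = np.exp(scores - scores.max())  # subtracting the max avoids overflow
probs = exp_scores / exp_scores.sum()

# The most likely guess is the position of the largest probability.
prediction = int(np.argmax(probs))
print(prediction)  # 4, the index of the largest score
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw outputs gives the same guess, which is why the probabilities can be skipped when only the prediction matters.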
Sequential defines layers of our network, so let's talk about layers. Neural networks consist of linear layers alternating with non-linear layers. This creates functions which are incredibly flexible. Deeper layers are able to capture more complex patterns.
Layer 1 of a convolutional neural network:
Deeper layers can learn about more complicated shapes (although we are only using 2 layers in our network):
Next we will set a few inputs for our fit method:
Fitting is the process by which the neural net learns the best parameters for the dataset.
GPUs are great at handling lots of data at once (otherwise you don't get a performance benefit). We break the data up into batches, and the batch size specifies how many samples from our dataset we want to send to the GPU at a time. The fastai library defaults to a batch size of 64. On each iteration of the training loop, the error on 1 batch of data will be calculated, and the optimizer will update the parameters based on that.
An epoch is completed once each data sample has been used once in the training loop.
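The relationship between batches and epochs can be sketched as a simple loop. The dataset size of 1000 is a hypothetical value, and the loop body only counts samples where a real training loop would compute the error and update the parameters.

```python
import numpy as np

n_samples, batch_size = 1000, 64
data = np.arange(n_samples)  # hypothetical dataset: one number per sample

seen = 0
n_batches = 0
for start in range(0, n_samples, batch_size):
    batch = data[start:start + batch_size]  # the last batch may be smaller
    # (a real training loop would compute the loss and update parameters here)
    seen += len(batch)
    n_batches += 1

print(n_batches, seen)  # 16 batches cover all 1000 samples: one epoch
```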
Now that we have the parameters for our model, we can make predictions on our validation set.
Let's see how some of our predictions look!
These predictions are pretty good!
Recall that above we used PyTorch's Sequential to define a neural network with a linear layer, a non-linear layer (ReLU), and then another linear layer.
It turns out that Linear is defined by a matrix multiplication and then an addition. Let's try defining this ourselves. This will allow us to see exactly where matrix multiplication is used (we will dive into how matrix multiplication works in the next section).
Just as Numpy has np.matmul for matrix multiplication (in Python 3, this is equivalent to the @ operator), PyTorch has torch.matmul.
A PyTorch class for a neural net has two parts: a constructor, which declares the parameters, and a forward method, which describes how to calculate a prediction using those parameters. In other words, forward describes how the neural net converts inputs to outputs.
In PyTorch, the optimizer knows to try to optimize any attribute of type Parameter.
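The shape of such a class can be sketched without PyTorch. The snippet below is a plain NumPy stand-in, not the notebook's PyTorch code: the class name SimpleLinear, the hidden size of 256, and the weight initialization are all assumptions made for illustration, and there is no optimizer or Parameter type involved.

```python
import numpy as np

class SimpleLinear:
    """Hypothetical NumPy stand-in for a linear layer: x @ W + b."""

    def __init__(self, n_in, n_out, rng):
        # Constructor: declares the parameters (a weight matrix and a bias vector).
        self.weights = rng.normal(0, 1 / np.sqrt(n_in), size=(n_in, n_out))
        self.bias = np.zeros(n_out)

    def forward(self, x):
        # Matrix multiplication, then an addition (the bias is broadcast over rows).
        return x @ self.weights + self.bias

def relu(x):
    # Non-linear layer: replaces negative values with 0.
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
layer1 = SimpleLinear(28 * 28, 256, rng)   # 256 hidden units is an assumed size
layer2 = SimpleLinear(256, 10, rng)

batch = rng.normal(size=(64, 28 * 28))     # 64 fake flattened 28x28 images
out = layer2.forward(relu(layer1.forward(batch)))
print(out.shape)  # (64, 10)
```

Each row of the output holds the 10 scores for one image in the batch.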
We create our neural net and the optimizer. (We will use the same loss and metrics from above).
Now we can check our predictions:
Now let's dig into what we were doing with torch.matmul: matrix multiplication. First, let's start with a simpler building block: broadcasting.
Broadcasting and element-wise operations are supported in the same way by both Numpy and PyTorch.
Operators (+,-,*,/,>,<,==) are usually element-wise.
Examples of element-wise operations:
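A minimal sketch, using two small hypothetical NumPy arrays a and b (these values are chosen for illustration and are not the notebook's original cell):

```python
import numpy as np

a = np.array([10, 6, -4])
b = np.array([2, 8, 7])

# Each operator is applied to matching pairs of entries.
print(a + b)   # 12, 14, 3
print(a * b)   # 20, 48, -28
print(a < b)   # False, True, True
```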
The term broadcasting describes how arrays with different shapes are treated during arithmetic operations. The term was first used by Numpy, although it is now used in other libraries such as Tensorflow and Matlab; the rules can vary by library.
From the Numpy Documentation:
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.
In addition to the efficiency of broadcasting, it allows developers to write less code, which typically leads to fewer errors.
How are we able to do a > 0? 0 is being broadcast to have the same dimensions as a.
Remember above when we normalized our dataset by subtracting the mean (a scalar) from the entire data set (a matrix) and dividing by the standard deviation (another scalar)? We were using broadcasting!
Other examples of broadcasting with a scalar:
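A short sketch with hypothetical values (the arrays a and m below are made up for illustration). In each case the scalar is treated as if it had been stretched to the shape of the array:

```python
import numpy as np

a = np.array([10, 6, -4])
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(a > 0)    # True, True, False
print(a + 1)    # 11, 7, -3
print(2 * m)    # every entry of m doubled
```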
We can also broadcast a vector to a matrix:
Although Numpy does this automatically, you can also make the extra dimension explicit. The expand_dims function lets us convert the 1-dimensional array c into a 2-dimensional array (although one of those dimensions has size 1).
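The following sketch shows a vector being broadcast to a matrix, and how np.expand_dims changes which way it is broadcast. The values of c and m are hypothetical.

```python
import numpy as np

c = np.array([10, 20, 30])
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# c has shape (3,), so it is added to every row of m.
rows = m + c

# expand_dims adds an axis of size 1 at the given position.
c_row = np.expand_dims(c, 0)   # shape (1, 3)
c_col = np.expand_dims(c, 1)   # shape (3, 1)

# A (3, 1) array is broadcast across the columns instead.
cols = m + c_col

print(rows)
print(cols)
```

Adding c_row gives the same result as adding c, since Numpy treats a shape of (3,) as (1, 3) when broadcasting against a 2-dimensional array.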
When operating on two arrays, Numpy/PyTorch compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when they are equal, or when one of them is 1.
Arrays do not need to have the same number of dimensions. For example, if you have a @@0@@ array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible:
    Image  (3d array): 256 x 256 x 3
    Scale  (1d array):             3
    Result (3d array): 256 x 256 x 3
The numpy documentation includes several examples of what dimensions can and can not be broadcast together.
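The compatibility rule can be written out as a few lines of Python and checked against what Numpy actually does. The helper name compatible and the scale values are hypothetical; the helper is a sketch of the rule as described here, not a Numpy function.

```python
import numpy as np

def compatible(shape_a, shape_b):
    """Check broadcast compatibility by comparing dimensions from the trailing end."""
    # zip stops at the shorter shape, so missing leading dimensions are ignored.
    return all(x == y or x == 1 or y == 1
               for x, y in zip(reversed(shape_a), reversed(shape_b)))

image = np.ones((256, 256, 3))
scale = np.array([0.5, 1.0, 2.0])   # hypothetical per-channel scale factors

print(compatible(image.shape, scale.shape))  # True
print((image * scale).shape)                 # (256, 256, 3)

# A trailing dimension of 4 matches neither 3 nor 1, so this cannot be broadcast.
print(compatible((256, 256, 3), (4,)))       # False
try:
    image * np.ones(4)
except ValueError as err:
    print("cannot broadcast:", err)
```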
We are going to use broadcasting to define matrix multiplication.
We get the same answer using
The following is NOT matrix multiplication. What is it?
From a machine learning perspective, matrix multiplication is a way of creating features by saying how much we want to weight each input column. Different features are different weighted averages of the input columns.
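One way to see matrix multiplication as broadcasting followed by a sum is sketched below. The matrices m, n and the vector v are hypothetical values, and the indexing with None is just one possible way to line up the axes.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = np.array([10, 20, 30])
n = np.array([[10, 40], [20, 0], [30, -5]])

# Matrix-vector product: v is broadcast across the rows of m, so each column
# of m is weighted by one entry of v, and then each row is summed.
mv = (m * v).sum(axis=1)
print(mv, m @ v)   # both give 140, 320, 500

# Matrix-matrix product: m[:, :, None] has shape (3, 3, 1) and
# n[None, :, :] has shape (1, 3, 2); they broadcast to (3, 3, 2),
# and summing over the shared middle axis leaves a (3, 2) result.
mn = (m[:, :, None] * n[None, :, :]).sum(axis=1)
print(np.array_equal(mn, m @ n))   # True
```

The entries of v play the role of the weights described above: each output is a weighted sum of the input columns.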
Draw a picture
If you want to test your understanding of the above tutorial, I encourage you to work through it again, only this time use CIFAR 10, a dataset that consists of 32x32 color images in 10 different categories. Color images have an extra dimension, containing RGB values, compared to black & white images.
(source: [Cifar 10](https://www.cs.toronto.edu/~kriz/cifar.html))
Fortunately, broadcasting will make it relatively easy to add this extra dimension (for color RGB), but you will have to make some changes to the code.
The matrix below gives the probabilities of moving from 1 health state to another in 1 year. If the current health states for a group are:
what will be the % in each health state in 1 year?
A Tensor is a multi-dimensional matrix containing elements of a single data type, that is, a group of data all with the same type (e.g., a Tensor could store a 4 x 4 x 6 matrix of 32-bit signed integers).
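As a rough illustration of that example, the sketch below builds a 4 x 4 x 6 array of 32-bit signed integers. It uses a NumPy array as a stand-in for a PyTorch Tensor, which is an assumption made to keep the snippet self-contained; the two share the same idea of a shape plus a single data type.

```python
import numpy as np

# A 4 x 4 x 6 array in which every element is a 32-bit signed integer.
t = np.zeros((4, 4, 6), dtype=np.int32)

print(t.ndim)    # 3
print(t.shape)   # (4, 4, 6)
print(t.dtype)   # int32
```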