The purpose of this notebook is to serve as an explanation of two crucial linear algebra operations used when coding neural networks: matrix multiplication and broadcasting.

**Matrix multiplication** is a way of combining two matrices (involving multiplying and summing their entries in a particular way). **Broadcasting** refers to how libraries such as Numpy and PyTorch can perform operations on matrices/vectors with mismatched dimensions (in particular cases, with set rules). We will use broadcasting to show an alternative way of thinking about matrix multiplication from, different from the way it is standardly taught.

In keeping with the fast.ai teaching philosophy of "the whole game", we will:

- first use a pre-defined class for our neural network
- then define the net ourselves to see where it uses matrix multiplication & broadcasting
- and finally dig into the details of how those operations work

This is different from how most math courses are taught, where you have to learn all the individual elements before you can combine them (Harvard professor David Perkins call this *elementitis*), but it is similar to how topics like *driving* and *baseball* are taught. That is, you can start driving without knowing how an internal combustion engine works, and children begin playing baseball before they learn all the formal rules.

(source: [Demba Ba](https://github.com/zalandoresearch/fashion-mnist) and [Arvind Nagaraj](https://medium.com/towards-data-science/thoughts-after-taking-the-deeplearning-ai-courses-8568f132153))

This notebook was originally created for a 40 minute talk I gave at the O'Reilly AI conference in San Francisco. If you want further resources for linear algebra, here are a few recommendations:

- 3Blue1Brown Essence of Linear Algebra videos about
*geometric intuition*, which are gorgeous and great for visual learners - Khan Academy Linear Algebra videos covering traditional linear algebra material
- Immersive linear algebra free online textbook with interactive graphics
- Chapter 2 of Ian Goodfellow's Deep Learning Book for a fairly academic take
- Computational Linear Algebra: a free, online fast.ai course, originally taught in the University of San Francisco's Masters in Analytics program. It includes a free online textbook and series of videos. This course is very different from standard linear algebra (which often focuses on how
**humans**do matrix calculations), because it is about how to get**computers**to do matrix computations with speed and accuracy, and incorporates modern tools and algorithms. All the material is taught in Python and centered around solving practical problems such as removing the background from a surveillance video or implementing Google's PageRank search algorithm on Wikipedia pages.

We will be using the open source deep learning library, fastai, which provides high level abstractions and best practices on top of PyTorch. This is the highest level, simplest way to get started with deep learning. Please note that fastai requires Python 3 to function. It is currently in pre-alpha, so items may move around and more documentation will be added in the future.

The fastai deep learning library uses PyTorch, a Python framework for dynamic neural networks with GPU acceleration, which was released by Facebook's AI team.

PyTorch has two overlapping, yet distinct, purposes. As described in the PyTorch documentation:

The neural network functionality of PyTorch is built on top of the Numpy-like functionality for fast matrix computations on a GPU. Although the neural network purpose receives way more attention, both are very useful. We'll implement a neural net from scratch today using PyTorch.

**Further learning**: If you are curious to learn what *dynamic* neural networks are, you may want to watch this talk by Soumith Chintala, Facebook AI researcher and core PyTorch contributor.

If you want to learn more PyTorch, you can try this introductory tutorial or this tutorial to learn by examples.

Graphical processing units (GPUs) allow for matrix computations to be done with much greater speed, as long as you have a library such as PyTorch that takes advantage of them. Advances in GPU technology in the last 10-20 years have been a key part of why neural networks are proving so much more powerful now than they did a few decades ago.

You may own a computer that has a GPU which can be used. For the many people that either don't have a GPU (or have a GPU which can't be easily accessed by Python), there are a few differnt options:

**Don't use a GPU**: For the sake of this tutorial, you don't have to use a GPU, although some computations will be slower. The only change needed to the code is to remove`.cuda()`

wherever it appears.**Use crestle, through your browser**: Crestle is a service that gives you an already set up cloud service with all the popular scientific and deep learning frameworks already pre-installed and configured to run on a GPU in the cloud. It is easily accessed through your browser. New users get 10 hours and 1 GB of storage for free. After this, GPU usage is 34 cents per hour. I recommend this option to those who are new to AWS or new to using the console.**Set up an AWS instance through your console**: You can create an AWS instance with a GPU by following the steps in this fast.ai setup lesson.] AWS charges 90 cents per hour for this.

Today we will be working with MNIST, a classic data set of hand-written digits. Solutions to this problem are used by banks to automatically recognize the amounts on checks, and by the postal service to automatically recognize zip codes on mail.

A matrix can represent an image, by creating a grid where each entry corresponds to a different pixel.

Let's download, unzip, and format the data.

Many machine learning algorithms behave better when the data is *normalized*, that is when the mean is 0 and the standard deviation is 1. We will subtract off the mean and standard deviation from our training set in order to normalize the data:

Loading output library...

Note that for consistency (with the parameters we learn when training), we subtract the mean and standard deviation of our training set from our validation set.

Loading output library...

In any sort of data science work, it's important to look at your data, to make sure you understand the format, how it's stored, what type of values it holds, etc. To make it easier to work with, let's reshape it into 2d images from the flattened 1d format.

Loading output library...

Loading output library...

Loading output library...

Loading output library...

It's the digit 3! And that's stored in the y value:

Loading output library...

We can look at part of an image:

Loading output library...

Loading output library...

Loading output library...

A **function** takes inputs and returns outputs. For instance, @@0@@ is an example of a function. If we input @@1@@, the output is @@2@@, or if we input @@3@@, the output is @@4@@

Functions have **parameters**. The above function @@5@@ is @@6@@, with parameters a and b set to @@7@@ and @@8@@.

Machine learning is often about learning the best values for those parameters. For instance, suppose we have the data points on the chart below. What values should we choose for @@9@@ and @@10@@?

In the above gif fast.ai Practical Deep Learning for Coders course, intro to SGD notebook), an algorithm called stochastic gradient descent is being used to learn the best parameters to fit the line to the data (note: in the gif, the algorithm is stopping before the absolute best parameters are found). This process is called **training** or **fitting**.

Most datasets will not be well-represented by a line. We could use a more complicated function, such as @@0@@. Now we have 4 parameters to learn: @@1@@, @@2@@, @@3@@, and @@4@@. This function is more flexible than @@5@@ and will be able to accurately model more datasets.

Neural networks take this to an extreme, and are infinitely flexible. They often have thousands, or even hundreds of thousands of parameters. However the core idea is the same as above. The neural network is a function, and we will learn the best parameters for modeling our data.

Possibly **the most important idea** in machine learning is that of having separate training & validation data sets.

As motivation, suppose you don't divide up your data, but instead use all of it. And suppose you have lots of parameters:

This is called over-fitting. A validation set helps prevent this problem.

[Underfitting and Overfitting](https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted)

The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it's not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.

This illustrates how using all our data can lead to **overfitting**.

We will use fastai's ImageClassifierData, which holds our training and validation sets and will provide batches of that data in a form ready for use by a PyTorch model.

We will begin with the highest level abstraction: using a neural net defined by PyTorch's Sequential class.

Each input is a vector of size @@0@@ pixels and our output is of size @@1@@ (since there are 10 digits: 0, 1, ..., 9).

We use the output of the final layer to generate our predictions. Often for classification problems (like MNIST digit classification), the final layer has the same number of outputs as there are classes. In that case, this is 10: one for each digit from 0 to 9. These can be converted to comparative probabilities. For instance, it may be determined that a particular hand-written image is 80% likely to be a 4, 18% likely to be a 9, and 2% likely to be a 3. In our case, we are not interested in viewing the probabilites, and just want to see what the most likely guess is.

Sequential defines layers of our network, so let's talk about layers. Neural networks consist of **linear layers alternating with non-linear layers**. This creates functions which are incredibly flexible. Deeper layers are able to capture more complex patterns.

Layer 1 of a convolutional neural network:

[Matthew Zeiler and Rob Fergus](http://www.matthewzeiler.com/wp-content/uploads/2017/07/arxive2013.pdf)

Layer 2:

[Matthew Zeiler and Rob Fergus](http://www.matthewzeiler.com/wp-content/uploads/2017/07/arxive2013.pdf)

Deeper layers can learn about more complicated shapes (although we are only using 2 layers in our network):

[Matthew Zeiler and Rob Fergus](http://www.matthewzeiler.com/wp-content/uploads/2017/07/arxive2013.pdf)

Next we will set a few inputs for our *fit* method:

**Optimizer**: algorithm for finding the minimum. typically these are variations on*stochastic gradient descent*, involve taking a step that appears to be the right direction based on the change in the function.**Loss**: what function is the optimizer trying to minimize? We need to say how we're defining the error.**Metrics**: other calculations you want printed out as you train

*Fitting* is the process by which the neural net learns the best parameters for the dataset.

Loading output library...

GPUs are great at handling lots of data at once (otherwise don't get performance benefit). We break the data up into **batches**, and that specifies how many samples from our dataset we want to send to the GPU at a time. The fastai library defaults to a batch size of 64. On each iteration of the training loop, the error on 1 batch of data will be calculated, and the optimizer will update the parameters based on that.

An **epoch** is completed once each data sample has been used once in the training loop.

Now that we have the parameters for our model, we can make predictions on our validation set.

Let's see how some of our preditions look!

Loading output library...

These predictions are pretty good!

Recall that above we used PyTorch's `Sequential`

to define a neural network with a linear layer, a non-linear layer (`ReLU`

), and then another linear layer.

It turns out that `Linear`

is defined by a matrix multiplication and then an addition. Let's try defining this ourselves. This will allow us to see exactly where matrix multiplication is used (we will dive in to how matrix multiplication works in teh next section).

Just as Numpy has `np.matmul`

for matrix multiplication (in Python 3, this is equivalent to the `@`

operator), PyTorch has `torch.matmul`

.

PyTorch class has two things: constructor (says parameters) and a forward method (how to calculate prediction using those parameters) The method `forward`

describes how the neural net converts inputs to outputs.

In PyTorch, the optimizer knows to try to optimize any attribute of type **Parameter**.

We create our neural net and the optimizer. (We will use the same loss and metrics from above).

Loading output library...

Loading output library...

Now we can check our predictions:

Loading output library...

Now let's dig in to what we were doing with `torch.matmul`

: matrix multiplication. First, let's start with a simpler building block: **broadcasting**.

Broadcasting and element-wise operations are supported in the same way by both numpy and pytorch.

Operators (+,-,*,/,>,<,==) are usually element-wise.

Examples of element-wise operations:

Loading output library...

Loading output library...

The term **broadcasting** describes how arrays with different shapes are treated during arithmetic operations. The term broadcasting was first used by Numpy, although is now used in other libraries such as Tensorflow and Matlab; the rules can vary by library.

From the Numpy Documentation:

```
1
2
3
4
5
6
7
```

```
The term broadcasting describes how numpy treats arrays with
different shapes during arithmetic operations. Subject to certain
constraints, the smaller array is “broadcast” across the larger
array so that they have compatible shapes. Broadcasting provides a
means of vectorizing array operations so that looping occurs in C
instead of Python. It does this without making needless copies of
data and usually leads to efficient algorithm implementations.
```

In addition to the efficiency of broadcasting, it allows developers to write less code, which typically leads to fewer errors.

*This section was adapted from Chapter 4 of the fast.ai Computational Linear Algebra course.*

Loading output library...

Loading output library...

How are we able to do a > 0? 0 is being **broadcast** to have the same dimensions as a.

Remember above when we normalized our dataset by subtracting the mean (a scalar) from the entire data set (a matrix) and dividing by the standard deviation (another scalar)? We were using broadcasting!

Other examples of broadcasting with a scalar:

Loading output library...

Loading output library...

Loading output library...

We can also broadcast a vector to a matrix:

Loading output library...

Loading output library...

Although numpy does this automatically, you can also use the `broadcast_to`

method:

Loading output library...

Loading output library...

The numpy `expand_dims`

method lets us convert the 1-dimensional array `c`

into a 2-dimensional array (although one of those dimensions has value 1).

Loading output library...

Loading output library...

Loading output library...

Loading output library...

Loading output library...

When operating on two arrays, Numpy/PyTorch compares their shapes element-wise. It starts with the **trailing dimensions**, and works its way forward. Two dimensions are **compatible** when

- they are equal, or
- one of them is 1

Arrays do not need to have the same number of dimensions. For example, if you have a @@0@@ array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:

```
1
2
3
```

```
Image (3d array): 256 x 256 x 3
Scale (1d array): 3
Result (3d array): 256 x 256 x 3
```

The numpy documentation includes several examples of what dimensions can and can not be broadcast together.

We are going to use broadcasting to define matrix multiplication.

Loading output library...

Loading output library...

We get the same answer using `torch.matmul`

:

Loading output library...

The following is **NOT** matrix multiplication. What is it?

Loading output library...

Loading output library...

Loading output library...

Loading output library...

From a machine learning perspective, matrix multiplication is a way of creating features by saying how much we want to weight each input column. **Different features are different weighted averages of the input columns**.

The website matrixmultiplication.xyz provides a nice visualization of matrix multiplcation

Draw a picture

Loading output library...

Loading output library...

Loading output library...

Loading output library...

If you want to test your understanding of the above tutorial. I encourage you to work through it again, only this time use **CIFAR 10**, a dataset that consists of 32x32 *color* images in 10 different categories. Color images have an extra dimension, containing RGB values, compared to black & white images.

(source: [Cifar 10](https://www.cs.toronto.edu/~kriz/cifar.html))

Fortunately, broadcasting will make it relatively easy to add this extra dimension (for color RGB), but you will have to make some changes to the code.

Here are some other examples of where matrix multiplication arises. This material is taken from Chapter 1 of my Computational Linear Algebra course.

The matrix below gives the probabilities of moving from 1 health state to another in 1 year. If the current health states for a group are:

- 85% asymptomatic
- 10% symptomatic
- 5% AIDS
- 0% death

what will be the % in each health state in 1 year?

Loading output library...

Loading output library...

A Tensor is a *multi-dimensional matrix containing elements of a single data type*: a group of data, all with the same type (e.g. A Tensor could store a 4 x 4 x 6 matrix of 32-bit signed integers).