In the previous homework you implemented a fully-connected two-layer neural network on CIFAR-10. The implementation was simple but not very modular since the loss and gradient were computed in a single monolithic function. This is manageable for a simple two-layer network, but would become impractical as we move to bigger models. Ideally we want to build networks using a more modular design so that we can implement different layer types in isolation and then snap them together into models with different architectures.

In this exercise we will implement fully-connected networks using a more modular approach. For each layer we will implement a `forward` and a `backward` function. The `forward` function will receive inputs, weights, and other parameters and will return both an output and a `cache` object storing data needed for the backward pass, like this:

```
def layer_forward(x, w):
    """Receive inputs x and weights w."""
    # Do some computations ...
    z = ...  # ... some intermediate value
    # Do some more computations ...
    out = ...  # the output
    cache = (x, w, z, out)  # Values we need to compute gradients
    return out, cache
```

The backward pass will receive upstream derivatives and the `cache` object, and will return gradients with respect to the inputs and weights, like this:

```
def layer_backward(dout, cache):
    """
    Receive derivative of loss with respect to outputs and cache,
    and compute derivatives with respect to inputs and weights.
    """
    # Unpack cache values
    x, w, z, out = cache
    # Use values in cache to compute derivatives
    dx = ...  # Derivative of loss with respect to x
    dw = ...  # Derivative of loss with respect to w
    return dx, dw
```

After implementing a bunch of layers this way, we will be able to easily combine them to build classifiers with different architectures.

In addition to implementing fully-connected networks of arbitrary depth, we will also explore different update rules for optimization, and introduce Dropout as a regularizer and Batch Normalization as a tool to more efficiently optimize deep networks.

Open the file `cs231n/layers.py` and implement the `affine_forward` function.

Once you are done, you can test your implementation by running the following:

Now implement the `affine_backward` function and test your implementation using numeric gradient checking.
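As a rough illustration of what the affine layer and its numeric gradient check might look like, here is a minimal sketch. It assumes the layer takes a bias `b` in addition to `x` and `w`, and that each input example is flattened to a row vector before the matrix product. The helpers `numeric_gradient` and `rel_error` are hypothetical names written for this example, not the assignment's own utilities.

```python
import numpy as np

def affine_forward(x, w, b):
    """Flatten each example to a row vector, then compute rows @ w + b."""
    out = x.reshape(x.shape[0], -1) @ w + b
    cache = (x, w, b)  # values needed by the backward pass
    return out, cache

def affine_backward(dout, cache):
    """Return gradients of the loss with respect to x, w, and b."""
    x, w, b = cache
    dx = (dout @ w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0], -1).T @ dout
    db = dout.sum(axis=0)
    return dx, dw, db

def numeric_gradient(f, x, dout, h=1e-5):
    """Centered-difference gradient of sum(f(x) * dout) with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old  # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

def rel_error(a, b):
    """Largest elementwise relative error between two arrays."""
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 3))   # 4 examples, each of shape (2, 3)
w = rng.standard_normal((6, 5))
b = rng.standard_normal(5)
dout = rng.standard_normal((4, 5))   # pretend upstream gradient

out, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

dx_num = numeric_gradient(lambda x_: affine_forward(x_, w, b)[0], x, dout)
dw_num = numeric_gradient(lambda w_: affine_forward(x, w_, b)[0], w, dout)
db_num = numeric_gradient(lambda b_: affine_forward(x, w, b_)[0], b, dout)

print("dx error:", rel_error(dx, dx_num))
print("dw error:", rel_error(dw, dw_num))
print("db error:", rel_error(db, db_num))
```

Because the affine layer is linear in each argument, the centered difference should agree with the analytic gradient up to floating-point rounding.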

Implement the forward pass for the ReLU activation function in the `relu_forward` function and test your implementation using the following:

Now implement the backward pass for the ReLU activation function in the `relu_backward` function and test your implementation using numeric gradient checking:
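For reference, a ReLU layer in the forward/backward style described earlier could be sketched as follows. This is one possible implementation, assuming the cache only needs to hold the input `x`. The gradient at exactly zero is taken to be zero here, which is a common convention rather than something the text specifies.

```python
import numpy as np

def relu_forward(x):
    """Elementwise max(0, x); cache the input for the backward pass."""
    out = np.maximum(0, x)
    cache = x
    return out, cache

def relu_backward(dout, cache):
    """Pass the upstream gradient through only where the input was positive."""
    x = cache
    dx = dout * (x > 0)
    return dx

x = np.array([[-1.0, 0.5], [2.0, -3.0]])
out, cache = relu_forward(x)
dx = relu_backward(np.ones_like(x), cache)
print(out)
print(dx)
```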

There are some common patterns of layers that are frequently used in neural nets. For example, affine layers are frequently followed by a ReLU nonlinearity. To make these common patterns easy, we define several convenience layers in the file `cs231n/layer_utils.py`.

For now take a look at the `affine_relu_forward` and `affine_relu_backward` functions, and run the following to numerically gradient check the backward pass:
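To make the idea of a convenience layer concrete, the sketch below chains a self-contained affine layer and ReLU layer into a single forward/backward pair. The function bodies are assumptions written for this example and may differ from what `cs231n/layer_utils.py` actually contains. The point is only that the combined cache is a tuple of the per-layer caches, and the backward pass unwinds them in reverse order.

```python
import numpy as np

def affine_forward(x, w, b):
    out = x.reshape(x.shape[0], -1) @ w + b
    return out, (x, w, b)

def affine_backward(dout, cache):
    x, w, b = cache
    dx = (dout @ w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0], -1).T @ dout
    db = dout.sum(axis=0)
    return dx, dw, db

def relu_forward(x):
    return np.maximum(0, x), x

def relu_backward(dout, cache):
    return dout * (cache > 0)

def affine_relu_forward(x, w, b):
    """Affine layer followed by ReLU; the cache holds both layer caches."""
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    return out, (fc_cache, relu_cache)

def affine_relu_backward(dout, cache):
    """Backpropagate through ReLU first, then through the affine layer."""
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    return affine_backward(da, fc_cache)

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 4))
w = rng.standard_normal((4, 5))
b = rng.standard_normal(5)
dout = rng.standard_normal((3, 5))

out, cache = affine_relu_forward(x, w, b)
dx, dw, db = affine_relu_backward(dout, cache)
print(out.shape, dx.shape, dw.shape, db.shape)
```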

You implemented these loss functions in the last assignment, so we'll give them to you for free here. You should still make sure you understand how they work by looking at the implementations in `cs231n/layers.py`.

You can make sure that the implementations are correct by running the following:
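As a reminder of how one such loss layer works, here is a minimal sketch of a softmax (cross-entropy) loss that returns both the loss and the gradient with respect to the scores. The name `softmax_loss` and its `(x, y)` signature are assumptions for this example. The provided implementations in `cs231n/layers.py` may be organized differently.

```python
import numpy as np

def softmax_loss(x, y):
    """
    x: (N, C) array of class scores; y: (N,) array of integer labels.
    Returns the mean cross-entropy loss and its gradient with respect to x.
    """
    n = x.shape[0]
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(n), y].mean()
    dx = np.exp(log_probs)          # softmax probabilities
    dx[np.arange(n), y] -= 1        # subtract 1 at the correct class
    dx /= n
    return loss, dx

scores = np.zeros((4, 10))          # all classes equally likely
labels = np.array([0, 3, 5, 9])
loss, dx = softmax_loss(scores, labels)
print(loss, np.log(10))             # the two values should match
```

With all-zero scores every class gets probability 1/10, so the loss equals log(10), and each row of the gradient sums to zero.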

In the previous assignment you implemented a two-layer neural network in a single monolithic class. Now that you have implemented modular versions of the necessary layers, you will reimplement the two-layer network using these modular implementations.

Open the file `cs231n/classifiers/fc_net.py` and complete the implementation of the `TwoLayerNet` class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation.

In the previous assignment, the logic for training models was coupled to the models themselves. Following a more modular design, for this assignment we have split the logic for training models into a separate class.

Open the file `cs231n/solver.py` and read through it to familiarize yourself with the API. After doing so, use a `Solver` instance to train a `TwoLayerNet` that achieves at least 50% accuracy on the validation set.

Next you will implement a fully-connected network with an arbitrary number of hidden layers.

Read through the `FullyConnectedNet` class in the file `cs231n/classifiers/fc_net.py`.

Implement the initialization, the forward pass, and the backward pass. For the moment don't worry about implementing dropout or batch normalization; we will add those features soon.

As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. Do the initial losses seem reasonable?

For gradient checking, you should expect to see errors around 1e-6 or less.
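One way to judge whether an initial loss is reasonable: with small random weights the class scores are all close to zero, so a softmax classifier over C classes should start near log(C). The sketch below checks this for 10 classes, using small random scores as a stand-in for the output of a freshly initialized network. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_examples = 10, 50

# Stand-in for the scores of a freshly initialized network: tiny random values.
scores = 1e-3 * rng.standard_normal((num_examples, num_classes))
labels = rng.integers(0, num_classes, size=num_examples)

# Softmax cross-entropy, computed in a numerically stable way.
shifted = scores - scores.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
initial_loss = -log_probs[np.arange(num_examples), labels].mean()

expected_loss = np.log(num_classes)
print(initial_loss, expected_loss)
```

An L2 penalty is non-negative, so turning regularization on can only raise the initial loss above this baseline.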

As another sanity check, make sure you can overfit a small dataset of 50 images. First we will try a three-layer network with 100 units in each hidden layer. You will need to tweak the learning rate and initialization scale, but you should be able to overfit and achieve 100% training accuracy within 20 epochs.

Now try to use a five-layer network with 100 units on each layer to overfit 50 training examples. Again you will have to adjust the learning rate and weight initialization, but you should be able to achieve 100% training accuracy within 20 epochs.

Did you notice anything about the comparative difficulty of training the three-layer net versus training the five-layer net?

Training the five-layer net was much harder than training the three-layer net. I believe this is because five layers compute a much more complicated function (which also raises the chance of overfitting), and the signal passes through more weight matrices, so a poorly chosen initialization scale compounds at every layer. Therefore, **the five-layer net is much more sensitive to the weight initialization scale than the three-layer net.**
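The sensitivity to initialization can be illustrated with a small experiment. This is a sketch under assumptions rather than a reproduction of the assignment's networks. It pushes random inputs through a stack of randomly initialized affine + ReLU layers with 100 units each and reports the spread of the final activations. The helper name `final_activation_std` and the input dimension of 100 are choices made for this example.

```python
import numpy as np

def final_activation_std(num_layers, weight_scale, dim=100, seed=0):
    """Std of activations after num_layers random affine + ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((50, dim))          # 50 random input examples
    for _ in range(num_layers):
        w = weight_scale * rng.standard_normal((dim, dim))
        h = np.maximum(0, h @ w)                # affine (no bias) + ReLU
    return h.std()

for scale in (1e-2, 1e-1):
    shallow = final_activation_std(3, scale)
    deep = final_activation_std(5, scale)
    print(f"weight_scale={scale:g}: 3 layers -> {shallow:.3e}, 5 layers -> {deep:.3e}")
```

Each layer rescales the activations by a factor that depends on the weight scale, so the effect is multiplied once per layer. With a small scale the activations (and hence the gradients) of the deeper stack shrink far more than those of the shallower one, which leaves a narrower range of initialization scales that train well.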

So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD.

Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochastic gradient descent.

Open the file `cs231n/optim.py` and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function `sgd_momentum` and run the following to check your implementation. You should see errors less than 1e-8.
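A minimal sketch of the momentum update follows. It assumes an update-rule API of the form `(w, dw, config) -> (next_w, config)`, with the velocity stored in the `config` dictionary. Check the documentation at the top of `cs231n/optim.py` for the actual contract, since the key names and defaults used here are assumptions.

```python
import numpy as np

def sgd_momentum(w, dw, config=None):
    """
    One SGD+momentum step.
    Assumed config keys: learning_rate, momentum, velocity.
    """
    config = {} if config is None else config
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("momentum", 0.9)
    v = config.get("velocity", np.zeros_like(w))

    # Decay the old velocity, then add the scaled negative gradient.
    v = config["momentum"] * v - config["learning_rate"] * dw
    next_w = w + v

    config["velocity"] = v
    return next_w, config

# Two steps on f(w) = sum(w ** 2), whose gradient is 2 * w.
w = np.array([1.0, -2.0])
config = None
for _ in range(2):
    dw = 2 * w
    w, config = sgd_momentum(w, dw, config)
print(w)
```

The velocity accumulates past gradients, so consecutive steps in a consistent direction grow larger. This is the usual explanation for the faster convergence compared with vanilla SGD.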

Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster.

RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.

In the file `cs231n/optim.py`, implement the RMSProp update rule in the `rmsprop` function and implement the Adam update rule in the `adam` function, and check your implementations using the tests below.

[1] Tijmen Tieleman and Geoffrey Hinton. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4 (2012).

[2] Diederik Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization." ICLR 2015.
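The two update rules can be sketched as plain functions. The names `rmsprop_step` and `adam_step`, their argument lists, and the default hyperparameters are hypothetical choices for this illustration. They do not follow the `config`-dictionary API of `cs231n/optim.py`, and the Adam variant shown includes the bias correction from the paper.

```python
import numpy as np

def rmsprop_step(w, dw, cache, lr=1e-2, decay=0.99, eps=1e-8):
    """Scale each step by a running average of squared gradients."""
    cache = decay * cache + (1 - decay) * dw ** 2
    w = w - lr * dw / (np.sqrt(cache) + eps)
    return w, cache

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Running averages of the gradient (m) and squared gradient (v),
    with bias correction for their zero initialization."""
    t += 1
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v, t

w0 = np.array([1.0, -2.0])
dw = np.array([0.5, -4.0])

w_rms, cache = rmsprop_step(w0, dw, np.zeros_like(w0))
w_adam, m, v, t = adam_step(w0, dw, np.zeros_like(w0), np.zeros_like(w0), 0)
print(w_rms, w_adam)
```

On the first step both rules move every parameter by roughly the same amount regardless of the gradient's magnitude (about `lr / sqrt(1 - decay)` for RMSProp and about `lr` for Adam), which is what "per-parameter learning rates" means in practice.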

Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:

Train the best fully-connected model that you can on CIFAR-10, storing your best model in the `best_model` variable. We require you to get at least 50% accuracy on the validation set using a fully-connected net.

If you are careful it should be possible to get accuracies above 55%, but we don't require it for this part and won't assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional nets rather than fully-connected nets.

You might find it useful to complete the `BatchNormalization.ipynb` and `Dropout.ipynb` notebooks before completing this part, since those techniques can help you train powerful models.

Run your best model on the validation and test sets. You should achieve above 50% accuracy on the validation set.