Backprop Workbook 00: Forward Propagation


For these questions, assume that an @@0@@ input has 1024 dimensions, that the first hidden layer should have @@1@@ units, a second layer has @@2@@ units, and that there are @@3@@ classes to choose from at the end.


Questions about shapes of @@0@@, @@1@@


1. What do the rows of @@0@@ represent? What do the columns of @@1@@ represent? What is the shape of @@2@@?

Rows represent each image in the batch.

Columns represent each pixel in the image; a column of values is the same pixel's value in each of the images.

The shape of @@0@@ is @@1@@. @@2@@ is the "batch size", or the number of @@3@@ examples we want to simultaneously train on in our "batch."

2. You have a first matrix of weights @@0@@? Why have I written these superscripts?

The shape of @@0@@.

This is the first set of weights and biases used to calculate the pre-activations for the first hidden layer. There will be more weights and biases.

3. For a single @@0@@ input of shape @@1@@, what is the formula to calculate the hidden pre-activation values @@2@@?

Formula is:


The dimensionality of @@1@@.

4. For a batch input matrix @@0@@ of shape @@1@@, what is the formula to calculate the hidden pre-activation values @@2@@?

Hint: in numpy, if @@3@@ has dimension @@4@@, then you can add @@5@@ a vector of dimension @@6@@, and @@7@@ is the @@8@@ matrix where each row of @@9@@ has the vector @@10@@ added to it. This is called a broadcasting addition. You may write the formula with @@11@@ interpreted to allow broadcasting addition.


The dimensionality of @@1@@.
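
Here is a minimal numpy sketch of the broadcasting hint above. The input dimension of 1024 comes from this workbook; the batch size of 4, the hidden width of 256, the random values, and the names `X`, `W1`, `b1`, and `Z1` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# D = 1024 is from the workbook; N and H are assumed values for this sketch.
N, D, H = 4, 1024, 256

X = rng.normal(size=(N, D))    # one row per example in the batch
W1 = rng.normal(size=(D, H))   # one column of weights per hidden unit
b1 = rng.normal(size=(H,))     # one bias per hidden unit

# Batched pre-activations: b1 is broadcast across the N rows of X @ W1.
Z1 = X @ W1 + b1

# The same computation for a single example (row 0 of X).
z_single = X[0] @ W1 + b1

print(Z1.shape)  # (4, 256)
```

Row 0 of `Z1` should match `z_single`, which is what "each row has the vector added to it" means in practice.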

Questions about the @@0@@ function


1. How do I convert a @@0@@ odds of winning to a @@1@@% probability of winning? If the odds of an event are @@2@@, how do I convert that to a percentage?

If the odds are @@0@@, then the formula for @@1@@ to @@2@@ is @@3@@.
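
As a small sketch of the odds-to-probability conversion: odds of `wins` to `losses` correspond to a probability of `wins / (wins + losses)`. The function name `odds_to_probability` and the example odds below are assumptions for illustration, not values taken from the question.

```python
def odds_to_probability(wins, losses=1.0):
    """Convert odds of wins:losses into a probability of winning."""
    return wins / (wins + losses)


print(odds_to_probability(3, 1))  # 3:1 odds -> 0.75
print(odds_to_probability(1, 1))  # even odds -> 0.5
```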

2. The @@0@@ function turns a number from @@1@@ into a number from @@2@@ that can be used as a probability. What is the formula for @@3@@? Give me the version with @@4@@ in the numerator.


3. Let's say @@0@@?


4. If we usually interpret the input of @@0@@ as an odds, then if we try to interpret @@1@@ as an odds, what does that imply we would interpret @@2@@ as?

We interpret @@0@@ as the log of an odds.

5a. What @@0@@ value has @@1@@ (50% probability) equivalent to an odds of @@2@@?


5b. When is the probability @@0@@, when is the probability @@1@@?

When @@0@@ is negative, the probability will be less than one half; when @@1@@ is positive, the probability will be greater than one half.

Note: @@2@@ isn't necessarily a probability. It can be the "percent activated."

6a. What is the problem numerically with the @@0@@ is really large? What will we calculate on our CPU? What do we want to calculate theoretically?

When @@0@@ is really large then the floating point representation of @@1@@ can overflow and be @@2@@.

That's a problem because both the numerator and denominator will be @@3@@, which means their ratio is not a number (NaN). We want it to be: @@4@@.

6b. Is there a problem for very negative @@0@@s with this formula?

No. @@0@@ will round to @@1@@, and that's harmless: the result is zero divided by one, which is zero, which is correct.

7. How do we fix this problem? What's the better formula for a computer?

@@0@@ values.
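
The overflow problem and one common fix can be sketched in numpy as follows. This is an illustration under assumptions, not necessarily the exact formula intended above: the names `sigmoid_naive` and `sigmoid_stable` are hypothetical, and the fix shown picks, for each input, whichever algebraically equivalent form only ever exponentiates a non-positive number.

```python
import numpy as np


def sigmoid_naive(z):
    """Sigmoid with e^z in the numerator; overflows for large positive z."""
    ez = np.exp(z)
    return ez / (1.0 + ez)


def sigmoid_stable(z):
    """Sigmoid that never calls exp on a large positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0 use 1 / (1 + e^{-z}); e^{-z} is at most 1.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0 use e^z / (1 + e^z); e^z is below 1.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out


z_test = np.array([-1000.0, 0.0, 1000.0])

# Silence the overflow/invalid warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = sigmoid_naive(z_test)

stable = sigmoid_stable(z_test)

print(naive)   # the last entry is nan
print(stable)  # [0.  0.5 1. ]
```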

Activations and Pre-Activations


1. How do I calculate the activation values @@0@@?


2. What is another name for a function like @@0@@ when used to calculate hidden activations? What are other examples?

Activation function. ReLU, softmax.

3. What do I call the linear transformation of the @@0@@ values before being input into the activation function? What symbols do I use?

We wrote it as @@0@@ and it is called pre-activations.

4. What is the purpose of an activation function? What is an interpretation of the activation function @@0@@?

Without activation functions, we just have a series of linear functions, which is equivalent to just a single linear function.
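
The collapse of stacked linear layers into a single one can be checked numerically. In this sketch all shapes, values, and names (`W1`, `b1`, `W2`, `b2`) are small assumed examples, not the sizes used elsewhere in this workbook.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 8))
W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)

# Two linear layers applied in sequence, with no activation in between.
two_layers = (X @ W1 + b1) @ W2 + b2

# One linear layer whose weights and bias combine the two layers.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))  # True
```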

We want the neurons to be able to represent binary feature detectors: the feature is present or not present. Thus it makes sense to reduce to a range of @@0@@ with zero meaning "not detected" and one meaning "detected."

Intermediate values mean "sort-of detected."

Because of the non-linearity, subsequent layers look for features in the first-layer features.

TODO: Something something universal function approximators.

Investigating columns and rows of @@0@@


1. How do we notate the @@0@@th row of @@1@@th column?


2. What is the formula for calculating a specific pre-activation @@0@@?


3. What does a column of @@0@@ represent?

The column consists of weights for each input dimension @@0@@ used to calculate the pre-activation @@1@@th hidden unit.

The row consists of weights for a single input dimension @@2@@ used to compute the contribution of @@3@@ to each of the hidden pre-activations @@4@@.
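
The column and row interpretations can be illustrated with a tiny numpy sketch. The sizes (6 inputs, 3 hidden units), the indices `i` and `j`, and the variable names are assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 6, 3                  # assumed small sizes for illustration
x = rng.normal(size=D)       # a single input
W = rng.normal(size=(D, H))  # weights: rows index inputs, columns index hidden units
b = rng.normal(size=H)

z = x @ W + b                # all pre-activations at once

# Column j holds the weights used to compute pre-activation j.
j = 2
z_j = x @ W[:, j] + b[j]

# Row i holds the weights that spread input dimension i over every hidden unit.
i = 4
contribution_i = x[i] * W[i, :]

# Adding up the row contributions of all inputs, plus the bias, recovers z.
z_from_rows = sum(x[k] * W[k, :] for k in range(D)) + b
```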

2nd Hidden Layer, Output Layer


1. What are the dimensions of @@0@@?

@@0@@ and @@1@@

2. What is the dimension of the first-layer activations @@0@@?


3. What are the formulas for @@0@@?


4. What does the row @@0@@ represent?

The @@0@@th column of @@1@@th pre-activation of the second hidden layer.

The @@2@@th row of @@3@@.

5. What are the dimensions of @@0@@?

@@0@@ and @@1@@. @@2@@.

6. What is the formula for calculating @@0@@?


Output Layer Activations: @@0@@


1. What is the formula for calculating @@0@@ this time!


2. The @@0@@ function maps a 10-dimensional vector @@1@@ calculated?
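
Assuming the output activation in this section is the softmax function, a row-wise numpy sketch might look like the following. Subtracting each row's maximum before exponentiating is a standard numerical-stability trick that I am adding here, not something stated in the question, and the name `Z3` for the output pre-activations is also an assumption.

```python
import numpy as np


def softmax(Z):
    """Row-wise softmax of an (N, C) array of output pre-activations."""
    # Shifting by the row maximum leaves the result unchanged but avoids overflow.
    shifted = Z - np.max(Z, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)


# Two assumed examples with 10 classes each; the second would overflow naively.
Z3 = np.array([[1.0, 2.0, 3.0] + [0.0] * 7,
               [1000.0] * 10])
A3 = softmax(Z3)

print(A3.shape)        # (2, 10)
print(A3.sum(axis=1))  # each row sums to 1 (up to rounding)
```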


3. How is the @@0@@ function like the @@1@@ function?


Target Outputs @@0@@, @@1@@, and @@2@@


1. In our problem, we want to classify an input @@0@@ as one of ten classes. Let's represent the correct answer in our training dataset as @@1@@. What is the shape of @@2@@? What is its range of values?

The shape of @@0@@ is (), i.e. it is just a scalar. Its values are integers from zero to nine.

2. We will denote the one-hot encoding of @@0@@ as simply @@1@@. What is the shape and range of the values of @@2@@?

This is a ten-dimensional vector, where all the values are zero, except at one position. At the position @@0@@, the value is @@1@@.

In a sense, the one-hot @@2@@ representation is a "perfect" probability distribution for the correct answer.

3. What is the format for @@0@@, which is the one-hot encoding of the correct class @@1@@ for each example @@2@@ in the batch?

The shape is @@0@@, and each row @@1@@ is a one-hot encoding of @@2@@ (@@3@@-th correct class).
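
One way to build such a batch of one-hot rows in numpy is to index the identity matrix with the integer labels. The three labels below are assumed example values; the ten classes come from the problem statement.

```python
import numpy as np

num_classes = 10
y = np.array([3, 0, 9])     # assumed correct classes for a batch of 3 examples

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with y picks out one one-hot row per example.
Y = np.eye(num_classes)[y]

print(Y.shape)  # (3, 10)
print(Y[0])     # 1.0 at position 3, zeros elsewhere
```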

Loss Function: Preliminaries


1. What properties do we want out of our last layer @@0@@ (a.k.a. the output layer)?

All values @@0@@ must be between zero and one because otherwise they're not a valid probability.

The probabilities should sum to one so that @@1@@ forms a proper probability distribution.

For an input @@2@@, we ideally want @@3@@: all zeros except at the position of the correct answer, where it has a value of 1.0.

2. The probability we assign to the correct class is @@0@@?

1.0 or 100%.

3. What is the ideal value of @@0@@?


4. What is the worst value of @@0@@?

0.0 and @@0@@.

5. If larger values of @@0@@ are better? Why?

Larger ones, because the function is monotonically increasing.

6. What are the properties of a loss function?

A loss function should be non-negative, and it should be zero when the prediction is perfect.

The worse the prediction, the greater the loss.

7. Can we use @@0@@ as a loss function?

No. It goes negative, and greater values are better, whereas a loss should be non-negative and smaller for better predictions.

We use @@0@@ as the loss function.

8. Is there a deep reason for using @@0@@?

TODO: Maximum likelihood of dataset, add up cross entropy losses.

9. What do we call this loss function?

Cross entropy.
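
Assuming the loss here is the negative log of the probability assigned to the correct class (which is what cross entropy reduces to for a one-hot target), a quick numpy check shows that it has the properties listed above. The probabilities in `p_correct` are made-up example values.

```python
import numpy as np

# Assumed probabilities assigned to the correct class, from best to worst.
p_correct = np.array([1.0, 0.9, 0.5, 0.1, 0.01])

log_p = np.log(p_correct)  # never positive; larger is better
loss = -log_p              # never negative; smaller is better

print(log_p)
print(loss)
```

The log probability on its own fails as a loss because it is never positive and grows as predictions improve; negating it gives a quantity that is zero for a perfect prediction and increases as the prediction gets worse.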

Cross Entropy Calculations


1. If I give you @@0@@ for a single example, what is the formula for the cross-entropy loss?


2. Now that you can write @@0@@.


3a. We use Python loops to implement @@0@@s. Python loops are slow. Numpy operations are fast. Let's step-by-step learn to eliminate the explicit summation for @@1@@. Okay?


3b. @@0@@. That involves taking the negative log of a dot product. Let's perform this dot product using numpy. A dot product first multiplies corresponding entries in a vector. How do we get numpy to do this for a matrix?


3c. The next step of a dot product is to sum out the products. We need to do this per row. We can do this using np.sum. What named argument must we pass np.sum? What is the shape of the result? How would you describe the result in words?


This is a vector of shape @@1@@. The @@2@@-th entry is equal to the probability assigned to the correct class @@3@@.
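
Steps 3b and 3c can be sketched in numpy as below. To keep the arrays readable, this example assumes a batch of 2 examples and only 3 classes rather than ten; the names `A` (predicted probabilities) and `Y` (one-hot targets) and all the values are assumptions for illustration.

```python
import numpy as np

# Assumed predicted probabilities, one row per example.
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

# One-hot targets: example 0 has correct class 0, example 1 has correct class 2.
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

# Step 3b: multiply corresponding entries (elementwise, not a matrix product).
products = A * Y

# Step 3c: sum the products within each row.
p_correct = np.sum(products, axis=1)

print(p_correct.shape)  # (2,)
print(p_correct)        # [0.7 0.6]
```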

3d. Use the above formula to calculate the cross entropies for each example in the batch @@0@@. What is the shape of this?


The shape is @@1@@.

3e. Last, use np.sum to calculate the total cross entropy for the batch, then divide by the batch size to get the mean.
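
Putting the steps together, here is a hedged sketch that compares an explicit Python-loop version against the vectorized numpy version and then averages over the batch. As before, the small arrays `A` and `Y` (2 examples, 3 classes) are assumed example values rather than data from this workbook.

```python
import math

import numpy as np

A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])   # assumed predicted probabilities
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])   # one-hot targets
N, C = A.shape

# Loop version: an explicit summation for each example's dot product.
loop_losses = []
for i in range(N):
    dot = 0.0
    for k in range(C):
        dot += A[i, k] * Y[i, k]
    loop_losses.append(-math.log(dot))

# Vectorized version: elementwise product, row sums, then negative log.
losses = -np.log(np.sum(A * Y, axis=1))   # shape (N,)

# Total cross entropy over the batch, divided by the batch size for the mean.
mean_loss = np.sum(losses) / N

print(losses.shape)  # (2,)
print(mean_loss)
```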