The MNIST data set is a famous example in machine learning. It is a modified version of data from the National Institute of Standards and Technology.

The dataset is a collection of labelled handwritten digits (the numbers 0-9) from Census Bureau employees and high school students. The challenge is to determine which digits unlabelled images represent.

It has become something of a challenge to predict the MNIST data using a variety of techniques, and it is routinely used to validate new models and methods.

Preprocess the data. Visualize one element from each class. Visualize the mean of each class.

Let's first take a look at what the dataset looks like.

Each row appears to correspond to an image of a digit, stored as 784 pixel values. These have to be reshaped to 28x28 pixels to form the image. Let's make a dataframe for easy visualization.
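
The reshaping and per-class averaging described above can be sketched as follows. This is only an illustration: the arrays `pixels` and `labels` are hypothetical stand-ins filled with random values, not the actual MNIST dataframe used in this notebook.

```python
import numpy as np

# Hypothetical stand-in data: 100 random "images" of 784 pixels, labels 0-9.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(100, 784))
labels = np.arange(100) % 10  # guarantees that every class is present

# One example per class: take the first row with that label, reshape to 28x28.
examples = {d: pixels[labels == d][0].reshape(28, 28) for d in range(10)}

# Mean image per class: average all rows of the class, then reshape.
class_means = {d: pixels[labels == d].mean(axis=0).reshape(28, 28)
               for d in range(10)}

print(examples[3].shape, class_means[3].shape)
```

Each 28x28 array can then be passed to an image plotting function such as matplotlib's `imshow`.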

Try fitting a logistic regression with its solver set to be the 'lbfgs' algorithm. (If you'd like, you can try the other solvers/optimizers and observe the differences in computation time.)
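
A minimal sketch of this comparison, assuming scikit-learn is available. To stay self-contained it uses scikit-learn's small bundled 8x8 digits set (`load_digits`) as a stand-in for MNIST, so the timings and accuracies will not match the ones in this notebook.

```python
import time

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 8x8 digits, pixel values scaled to [0, 1].
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.25, random_state=0)

# Fit the same model with several solvers; record fit time and test accuracy.
results = {}
for solver in ["lbfgs", "newton-cg", "saga"]:
    model = LogisticRegression(solver=solver, max_iter=1000)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    results[solver] = (elapsed, model.score(X_test, y_test))
    print(f"{solver}: {elapsed:.2f}s, accuracy {results[solver][1]:.3f}")
```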

For educators: the next two blocks of cells are not meant to answer the question. They are just for my own fun, to show misclassified digits at random.

In theory, reducing the dimensionality reduces the computation time, because the model has fewer features per data point to process.
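
One way to measure this is sketched below, again on scikit-learn's bundled 8x8 digits as a stand-in for MNIST (an assumption; the component counts are chosen for 64-pixel images, not 784-pixel ones).

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.25, random_state=0)

timings, accuracies = {}, {}
for n in [2, 8, 16, 32, 64]:
    # Fit PCA on the training data only, then project both sets.
    pca = PCA(n_components=n).fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

    # Time only the classifier fit on the reduced features.
    model = LogisticRegression(solver="lbfgs", max_iter=2000)
    start = time.perf_counter()
    model.fit(Z_train, y_train)
    timings[n] = time.perf_counter() - start
    accuracies[n] = model.score(Z_test, y_test)
    print(f"{n} components: {timings[n]:.3f}s, accuracy {accuracies[n]:.3f}")
```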

As can be seen from the graph above, reducing the number of components dramatically reduces the computation time (red line).

Reducing the number of data points will also reduce the computation time, since there is again less data to process.

To reduce the number of data points, I will further split the current training data into two parts, a training set and a validation set. By training the model on different fractions of the training set, I can see how the computation time changes.
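
A rough sketch of this experiment, assuming scikit-learn and using its bundled 8x8 digits as a stand-in for MNIST. The fractions below are illustrative and are not the ratios used in this notebook.

```python
import time

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# train_test_split shuffles, so the first n training rows are a random subset.
X_train, X_val, y_train, y_val = train_test_split(
    X / 16.0, y, test_size=0.25, random_state=0)

timings, accuracies = {}, {}
for frac in [0.1, 0.25, 0.5, 0.75, 1.0]:
    n = int(frac * len(X_train))
    model = LogisticRegression(solver="lbfgs", max_iter=2000)
    start = time.perf_counter()
    model.fit(X_train[:n], y_train[:n])
    timings[frac] = time.perf_counter() - start
    # Always validate on the same held-out set so accuracies are comparable.
    accuracies[frac] = model.score(X_val, y_val)
    print(f"{n} points: {timings[frac]:.3f}s, accuracy {accuracies[frac]:.3f}")
```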

As can be seen from the graph above, increasing the number of data points increases the accuracy (blue line) considerably while also increasing the run time (red line). The run time varies linearly with the number of data points.

From question 2.1, one clear advantage of reducing dimensionality is the reduced computation time while still maintaining accuracy to a certain extent. The accuracy curve resembles logistic growth in the number of components: it rises rapidly and then plateaus, beyond which the gain in accuracy is small, and accuracy even drops once the number of components exceeds 150. One disadvantage of reducing dimensionality is that keeping too few components discards information, so the model predicts new data poorly. Another is the loss of interpretability: since PCA finds a new set of orthogonal dimensions ranked by the variance of the data along them, the new dimensions no longer correspond to the original features.

From question 2.2, one advantage of reducing the number of data points is also the reduced computation time to fit the model. Unfortunately, as can be seen from the graph, the disadvantage is compromised accuracy. Unlike the case of reducing the number of dimensions, there is no distinct plateau region: the accuracy keeps increasing with the number of data points.

Use 5-fold cross-validation with a KNN classifier to model the data. Try various values of k (e.g. from 1 to 15) to find the ideal number of neighbors for the model.

From questions 2.1 and 2.2, it seems that using PCA with 50 or more components and more than half of the data points is sufficient to give an accuracy of 85% or above. So for question 3, I'm using PCA with 50 components and half of the data points to train and validate my model.
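
The search over k can be sketched as below, assuming scikit-learn and its bundled 8x8 digits as a stand-in for MNIST. Because those images have only 64 pixels, the sketch uses 30 PCA components rather than 50, and it places PCA inside a pipeline so that it is refit on each training fold, which may differ from how the notebook applied it.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each number of neighbors.
cv_scores = {}
for k in range(1, 16):
    knn = make_pipeline(PCA(n_components=30),
                        KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"best k = {best_k}, CV accuracy = {cv_scores[best_k]:.4f}")
```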

As you can see from the two graphs above, the best cross-validation accuracy (95.05%) is obtained when the number of neighbors is 3.

Also, we can see that fitting the data doesn't take a significant amount of time; cross-validation is the cause of the high computation time. This is because the training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the validation/testing phase, each new data point is classified by assigning the label that is most frequent among the k training samples nearest to that query point, which requires computing distances to the training samples and hence more computation.

Due to time constraints, I will test values of k up to the maximum number of training points, 25,000.

From the graph above, increasing the number of neighbors from 1 to 25,000 dramatically reduces the cross-validation accuracy of the model. Cross-validation accuracy is the validation accuracy averaged over the folds, so it reflects performance on data the model was not trained on. When k = 1, the bias is essentially 0: each training point is its own nearest neighbor, so the training accuracy is 100%. However, with k = 1 the variance is high, and hence the cross-validation accuracy is not 100% but only 90% or slightly above.

When k is very high, for example 5000 and above, the bias of the model increases and the training accuracy decreases. Since the model is very biased, it underfits and generalizes poorly to new data, and hence the cross-validation accuracy decreases tremendously.
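
The effect of k on training and validation accuracy can be illustrated with the sketch below. It assumes scikit-learn and uses the bundled 8x8 digits (about 1,800 images) as a stand-in, so the k values are scaled down from the 25,000 used here.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# For each k, record the mean training and mean validation accuracy
# over 5 folds.
scores = {}
for k in [1, 5, 50, 500, 1000]:
    result = cross_validate(KNeighborsClassifier(n_neighbors=k), X, y,
                            cv=5, return_train_score=True)
    scores[k] = (result["train_score"].mean(), result["test_score"].mean())
    print(f"k={k}: train {scores[k][0]:.3f}, validation {scores[k][1]:.3f}")
```

Reporting the training score next to the validation score separates the two regimes: a small k fits the training data almost perfectly, while a very large k lowers both scores.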

From the graph above, the cross-validation accuracy plateaus at 81.77% once the tree depth exceeds 30. When depth = 1, the cross-validation score is very low since it's a very simple model (a tree with a single split has only two leaves, so it can predict only 2 of our 10 classes). A complicated decision tree, one that is very deep, has low bias and high variance since it has more decision nodes to go through. In such a case, the training accuracy will be 100%, but when we introduce a new data point whose features are slightly different, it might be assigned to a different class, resulting in decreased validation accuracy. Hence, with this bias-variance trade-off in mind, the cross-validation accuracy reaches a plateau once the tree has a certain depth, as seen in the graph.

In both of these models, as in all machine learning models, there is a bias-variance trade-off. The trade-off could be illustrated more clearly with separate training and validation scores instead of the cross-validation score, but since the question asks for the cross-validation score, I only display that here.
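
A small sketch of the depth sweep, assuming scikit-learn and its bundled 8x8 digits as a stand-in for MNIST (so the plateau value will differ from 81.77%).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each maximum tree depth.
cv_scores = {}
for depth in [1, 2, 5, 10, 20, 30]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_scores[depth] = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: CV accuracy {cv_scores[depth]:.3f}")
```

With ten roughly balanced classes, a depth-1 tree can output only two labels, so its accuracy is capped at about 20%.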

For the best model I get from logistic regression, the class confusion matrix is actually pretty good. The diagonal entries are by far the largest, which means that the predicted label usually matches the actual label. Since the diagonal entries are all greater than 800, let's replace them with 0 so that we can see the patterns of misclassified labels.
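
Zeroing the diagonal can be sketched on a toy example. The labels `y_true` and `y_pred` below are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = np.array([3, 3, 3, 5, 5, 4, 9, 9, 4, 7])
y_pred = np.array([3, 5, 3, 5, 3, 9, 9, 4, 4, 7])

# Rows are actual labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=np.arange(10))

# Copy the matrix and zero the diagonal so only misclassifications remain.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)

# List every (actual, predicted) pair that was confused at least once.
for actual, predicted in zip(*np.nonzero(off_diag)):
    print(f"{actual} mistaken for {predicted}: "
          f"{off_diag[actual, predicted]} time(s)")
```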

From the "normalized" confusion table, we can now clearly see which numbers are often misclassified. Let's focus on numbers that have been misclassified more than 30x. Number 3 is often mistaken for 2 (35 times) and 5 (46 times). Number 4 is often mistaken for 4 (42 times). Number 5 is mistaken for 4 (35 times) and 8 (41 times). Number 7 is mistaken for 9 (42 times). Number 8 is mistaken for 5 (32 times). Number 9 is often mistaken for 4 (37 times) and 7 (44 times).

First of all, let's visualize the coefficient weights corresponding to each class. This means we need a (10 x 784) matrix, where 10 is the total number of classes and 784 is the number of pixels in each image.

If we perform PCA first, we can get this matrix by multiplying the matrix of logistic regression coefficients (10 classes x number of PCA components) by the matrix of PCA components (number of PCA components x 784 pixels). Visualizing the weights shows us how logistic regression classifies the data.

Let's also apply different regularization values to show the corresponding coefficient weights. From question 4.2, we can focus on c_values = 1e-10, 1e-9, 1e-8 and 1, since they give us different ranges of accuracy.
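
The shapes involved in this multiplication can be checked with a short sketch. The arrays `coef` and `components` are random stand-ins for a fitted model's `coef_` and a fitted PCA's `components_` (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_components, n_pixels = 10, 50, 784

# Random stand-ins for LogisticRegression.coef_ and PCA.components_.
coef = rng.normal(size=(n_classes, n_components))
components = rng.normal(size=(n_components, n_pixels))

# (10 x 50) @ (50 x 784) -> (10 x 784): one weight per pixel per class.
pixel_weights = coef @ components

# Reshape each class's 784 weights into a 28x28 image for plotting.
weight_images = pixel_weights.reshape(n_classes, 28, 28)
print(pixel_weights.shape, weight_images.shape)
```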

The coefficient weights represent the size and direction of the relationship between a predictor and the response variable. In logistic regression, a change in the predictor variable makes an event more or less likely to happen. From the images above, we can see that when the accuracy is low, the coefficient weights look very distinctly like the digit of the label. This is expected: to be classified as that particular digit, an image's pixels need to look as close as possible to the true digit (as visualized by the coefficient weights). As a result, the model can predict digits that look very much like the template but not those that are slightly 'off' in the handwriting, which reduces the accuracy. As the accuracy increases, the coefficient-weight images become blurrier, allowing badly handwritten digits to be classified correctly. The better model, in a way, allows for more 'variance' in the handwritten digits.

Now, let's investigate only labels 4 and 9.

As before, increasing accuracy results in a blurred image of the coefficient weights. Now, to investigate this better, let's look at the difference between the means of 4 and 9.

They both look exactly the same! This means that when the model does an OK job of differentiating between 4 and 9 (82%), the image of the coefficient weights cannot separate the 'troublemaker' 4s and 9s, because they are equally likely to be either 4 or 9 (as shown by the image of the difference between 4 and 9).

The difference between the means of 4 and 9 resembles the deviation term in the variance formula. Now, if you square this difference (as the variance formula also does):

Those pixels are the 'troublemaker' area in differentiating between 4 and 9. Intuitively, the difference between a 4 and a 9 lies mostly in that region.
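
The squared difference of the class means can be computed as sketched below. The arrays `fours` and `nines` are random stand-ins for the rows labelled 4 and 9 (an assumption for illustration), so the highlighted pixels here carry no meaning.

```python
import numpy as np

# Random stand-ins for the 784-pixel rows labelled 4 and 9.
rng = np.random.default_rng(0)
fours = rng.random((200, 784))
nines = rng.random((200, 784))

# Per-pixel difference of the class means, squared so the sign is dropped.
diff = fours.mean(axis=0) - nines.mean(axis=0)
squared_diff = (diff ** 2).reshape(28, 28)

# Locations of the 10 pixels where the two mean images differ the most.
top = np.argsort(squared_diff.ravel())[::-1][:10]
rows, cols = np.unravel_index(top, squared_diff.shape)
print(list(zip(rows.tolist(), cols.tolist())))
```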

Your goal is to train a model that maximises the predictive performance (accuracy in this case) on this task.

Optimise your model's hyperparameters if it has any. Give evidence why you believe the hyperparameters that you found are the best ones.

Provide visualizations that demonstrate the model's performance.

For this question, I will use only 20% of my data for training and validating the model, to cut down the computation time while narrowing the search for the hyperparameters. I will also use 50 PCA components to cut down the computation time.

To optimize the hyperparameters, I will start with a coarse grid search and then progressively narrow the search range.

From the first grid search, we can see that the best result is obtained when C = 10 and gamma = 0.001. Let's narrow the search to C values around 10 and gamma values around 0.0001. Furthermore, let's run the model and see what kind of accuracy we get.
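
A coarse-then-fine grid search might look like the sketch below. The model is an assumption: C and gamma suggest an RBF-kernel support vector classifier, so the sketch uses scikit-learn's `SVC`. The data is the bundled 8x8 digits set, and the grids are illustrative rather than the ones used in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Coarse search over several orders of magnitude.
coarse = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.1, 1, 10, 100],
                       "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
                      cv=3)
coarse.fit(X, y)
best_C = coarse.best_params_["C"]
best_gamma = coarse.best_params_["gamma"]

# Finer search centred on the coarse optimum, which stays in the grid.
fine = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [best_C * m for m in (0.5, 1, 2)],
                     "gamma": [best_gamma * m for m in (0.5, 1, 2)]},
                    cv=3)
fine.fit(X, y)
print(coarse.best_params_, round(coarse.best_score_, 4))
print(fine.best_params_, round(fine.best_score_, 4))
```

Keeping the coarse optimum in the finer grid means the second search cannot score worse than the first on the same folds.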

Now, let's save this model to a pickle file and then load it back to see whether the model runs.
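
A minimal sketch of the save-and-reload round trip, assuming an `SVC` model trained on scikit-learn's bundled digits as a stand-in. The file name `model.pkl` and the temporary directory are hypothetical choices for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
model = SVC(C=10, gamma=0.001).fit(X, y)

# Hypothetical location: a file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save the fitted model to disk.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that it predicts exactly as the original does.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X), restored.predict(X))
print("predictions match:", same)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is generally expected to be loaded with the same library version that saved it.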