# Introduction

The aim of this notebook is to investigate whether a sample of 36 MRI images of the lumbosacral spine (lower back) can be used to build a machine learning (or artificial intelligence, AI) model that predicts abnormality (normal vs. abnormal) with high accuracy.

I will first load the required libraries and then walk through a simple explanation of the task. I will conclude with some evaluation metrics and a short discussion.

# Dataset

The sample was received in compressed format. After unzipping, the archive contained two folders, abnormal and normal, holding images classified accordingly by a medical expert.

First, let's look at some simple statistics about our sample: how many images in total, how many abnormal, and how many normal?
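A minimal sketch of this count, assuming the unzipped folders sit under a local `data/` directory (the path is an assumption; adjust it to wherever the archive was extracted):

```python
import os

# Assumed location of the unzipped dataset; adjust as needed.
data_dir = "data"

abnormal_files = os.listdir(os.path.join(data_dir, "abnormal"))
normal_files = os.listdir(os.path.join(data_dir, "normal"))

print("Total images:   ", len(abnormal_files) + len(normal_files))
print("Abnormal images:", len(abnormal_files))
print("Normal images:  ", len(normal_files))
```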

[Output: counts of total, abnormal, and normal images]

# Load Sample Image

OK, now let's load one sample image from our data (see below). Each image can be represented in 2-dimensional space (rows, columns): rows start from the top-left corner at location (0, 0) and increase downward along the y-axis, while columns increase to the right along the x-axis. Each horizontal line is therefore a row and each vertical line a column. The intersections of rows and columns are known as pixels, and each pixel carries color information.

Since our image is grayscale (black-and-white), the colors are encoded as integers between 0 and 255. Each pixel is therefore stored in computer memory as an integer representing a shade of gray.

The dimensions of the image below are 384 rows by 384 columns (384 x 384). This means there are 384 x 384 = 147,456 pixels, each carrying color intensity information in the range 0-255, where 0 is dark black and 255 is bright white (see the color bar to the right of the image). The computer stores this data as a 2-D array, as shown in the printout below the image.

This means that what we humans perceive as colors in images is basically a collection of numbers stored in the computer, where each number encodes a specific color at a given point (or pixel) in the image.
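A sketch of loading and inspecting one image with PIL, NumPy, and matplotlib (the filename is hypothetical; substitute any image from the sample):

```python
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Hypothetical filename; any image from the sample works here.
img = np.array(Image.open("data/abnormal/sample1.png").convert("L"))

print(img.shape)  # (rows, columns), e.g. (384, 384)
print(img)        # the raw 2-D array of intensities in 0-255

plt.imshow(img, cmap="gray", vmin=0, vmax=255)
plt.colorbar()    # the 0 (black) to 255 (white) intensity scale
plt.show()
```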

[Output: the sample image with its grayscale color bar, followed by the raw 2-D pixel array]

# Scaling image dimensions

Because the dimensions of each image might differ, we need to scale the images to the same dimensions for our task.

Below are the dimensions of each image in the dataset.

We can scale each image to the same dimensions. This is necessary for later processing, since the algorithms we will use require inputs of equal size. Luckily, this is easy to do with grayscale images without much loss of quality.

Below is the code to scale each image to 256 x 256 (65,536 pixels).
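One way to do this, sketched with PIL for loading and scikit-image for resizing (the notebook's actual resizing function is not shown here, but scikit-image's `resize` matches the normalization behavior discussed in the next section; paths are assumptions):

```python
import os
import numpy as np
from PIL import Image
from skimage.transform import resize

def load_images(folder):
    """Load every image in a folder as a 2-D grayscale array."""
    return [np.array(Image.open(os.path.join(folder, f)).convert("L"))
            for f in sorted(os.listdir(folder))]

abnormal_imgs = load_images("data/abnormal")
normal_imgs = load_images("data/normal")

# Resize to 256 x 256. Note that skimage's resize also converts the
# 0-255 integer intensities to floats in the 0.0-1.0 range, which is
# why the scaled histograms below span roughly 0.0-0.99.
scaled_abnormal = [resize(im, (256, 256)) for im in abnormal_imgs]
scaled_normal = [resize(im, (256, 256)) for im in normal_imgs]
```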

At this stage we can check the result of our scaling on one image, as shown below. On the left is the original (unscaled) image and on the right is the scaled version. Notice that the ranges on the x and y axes reflect the dimensions of each image.

[Figure: the original (unscaled) image on the left and the scaled 256 x 256 version on the right]

# Image histograms

Now, since each image is a 2-dimensional matrix of numbers encoding colors, we can plot a histogram that captures the frequency of each color (in other words, the frequency of each color-encoding number). This can be done for both the original (unscaled) images and the new scaled images.

Below is a histogram for the image above, again with the original unscaled version on the left and the scaled version on the right; the corresponding histogram appears below each image.
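A sketch of this 2 x 2 layout, assuming `img` is the original array loaded earlier:

```python
import matplotlib.pyplot as plt
from skimage.transform import resize

scaled = resize(img, (256, 256))

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].imshow(img, cmap="gray")
axes[0, 0].set_title("Original")
axes[0, 1].imshow(scaled, cmap="gray")
axes[0, 1].set_title("Scaled")

# ravel() flattens each 2-D array so all pixel values go into one histogram.
axes[1, 0].hist(img.ravel(), bins=64)
axes[1, 1].hist(scaled.ravel(), bins=64)
plt.show()
```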

Do you notice the difference in the color intensity range (x-axis) between the two histograms? (Continue reading below.)

[Figure: original and scaled image, each with its intensity histogram below]

Notice that the histograms above differ in their displayed color ranges: the left histogram spans approximately 0-255 while the right spans approximately 0.0-0.99. The reason is that the function used to resize the images to 256 x 256 also applies a normalization step that rescales the color range from 0-255 down to 0-1. It is well known that machine learning (AI) algorithms perform better when the input data is scaled to the 0-1 range. The input to the classifier will therefore be color intensities encoded in this 0-1 range.

The two histograms above describe the same image and look very similar overall, but they are not identical: the image was resized and its color range collapsed to 0-1, so there are fewer distinct intensity values and frequency variations.

# Classification task

Let's move now to the most interesting questions:

  • What classification accuracy can we achieve given this small sample of MRI images?
  • And how reliable are the accuracy results? Can we trust an AI model with this critical task, which is usually performed by medical experts with years of experience and training?

For this task, the aim of the classification (or predictive) models developed using AI is to discriminate the input and assign the most probable outcome label (normal vs. abnormal). At the lowest level, discrimination means that when the numbers representing an image are given as input, the algorithm tries to find patterns of numbers that distinguish normal from abnormal images. At a higher level, distorted or curved white lines, or black regions inside white regions, may indicate abnormality. Ideally, therefore, the role of an AI algorithm is to associate high-level image features (or low-level patterns of numbers) with the correct outcome class or label of an image.

To appreciate the complexity of this task, I would like to start by plotting the histograms of all the scaled images. See if you can spot any difference between the histograms of abnormal and normal images. In other words, can the frequency patterns of color intensities alone serve as a predictive feature for assigning the correct label?

The plot below contains each scaled image with its corresponding histogram underneath. It might be easiest to right-click, save the plot as an image, and use external software to zoom in and compare the histograms of the abnormal images (first row) with those of the normal images (second row).
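A rough sketch of such a grid, with each scaled image above its histogram (the exact layout in the notebook may differ):

```python
import matplotlib.pyplot as plt

all_scaled = scaled_abnormal + scaled_normal  # abnormal first, then normal
n = len(all_scaled)

fig, axes = plt.subplots(2, n, figsize=(2 * n, 5))
for i, im in enumerate(all_scaled):
    axes[0, i].imshow(im, cmap="gray")
    axes[0, i].axis("off")
    axes[1, i].hist(im.ravel(), bins=32)
    axes[1, i].set_xticks([])
    axes[1, i].set_yticks([])
plt.show()
```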

[Figure: all scaled images (abnormal on the first row, normal on the second) with their histograms]

After checking the plots above, I am sure you will conclude that the histograms are not useful features for assigning normal and abnormal labels to images. In fact, some histograms of abnormal images look very different from each other. Worse, some histograms of abnormal images look very similar to histograms of normal images.

This should give you a flavor of how difficult it is to come up with an algorithm that takes a bunch of numbers and their locations in an image and discovers regions of color, or patterns of numbers, that indicate abnormality. The task is completely different for a human, since our perception of reality differs from a computer's. Clearly, there is no easy way to design a step-by-step algorithm that takes an image and finds a region, or set of numbers, indicative of a particular outcome label.

AI algorithms are powerful for this complex task because they are not explicitly told how to solve a problem. Rather, an AI algorithm (more specifically, the machine learning algorithms used here) is designed to automatically fit a model (find patterns) given a sample of data and the corresponding classification labels. In general, the algorithm learns patterns from the data that can be used to assign an outcome label. This is why a large sample of data is important: the algorithm is likely to work well if certain features, such as distorted lines, are repeated in one class (e.g. abnormal images) but not in the other (e.g. normal images). The role of the AI programmer is also very important in preparing the data for further processing. This role involves, for instance, applying a number of transformations or preprocessing steps to highlight or produce the features most likely to distinguish between outcome labels; this is referred to as feature engineering. For the latest AI technologies, such as deep learning neural networks (Convolutional Neural Networks, or CNNs), researchers claim that feature engineering, which can be very difficult, can be avoided entirely given a larger sample of data (thousands of images), because CNNs can automatically discover high-level features from images and use them in the final classification.

For this small sample of data, I would like to begin with common AI algorithms and without feature engineering, i.e. using the raw color encodings of each image as input (sketched below). First, though, we need to discuss the cross validation strategy that can be used to obtain accuracy scores for AI algorithms.
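A sketch of preparing that raw input: each 256 x 256 image is flattened into a vector of 65,536 intensity values, with a label of 1 for abnormal and 0 for normal (the label encoding is my choice for illustration):

```python
import numpy as np

# Flatten each 256 x 256 image into a vector of 65,536 raw intensities.
X = np.array([im.ravel() for im in scaled_abnormal + scaled_normal])
y = np.array([1] * len(scaled_abnormal) + [0] * len(scaled_normal))

print(X.shape)  # (36, 65536)
```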

# Cross Validation

To measure the performance of machine learning algorithms without bias, we need to split the data into two sets of samples: training and testing. This is standard practice: a small collection of data is held out from the algorithm (never seen during training) and reserved for final validation and accuracy measurement (scoring). Training is the process of tuning the algorithm's parameters on the sample to obtain a model that predicts outcome labels for the task. The training samples are seen by the algorithm, while the testing samples are used only to obtain an accuracy score. Using the same sample of data for training and for measuring accuracy is a major flaw and biases the reported performance of AI algorithms.

Because I have a very small sample of images, I use a [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) strategy, in which the data is split into small sets of equal size called folds. For instance, with 3-fold cross validation we use 2 folds for training and 1 fold for testing (scoring the algorithm). Each fold contains approximately the same number of samples, in this case 36/3 = 12. The training/testing is repeated 3 times, so that in each iteration 2 folds train the algorithm while the remaining fold is used for testing. The mean of the scores across folds can then be reported, along with the standard deviation.
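With scikit-learn (an assumption; the notebook does not name its library, but this workflow fits it), 3-fold cross validation takes only a few lines:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 3 folds of ~12 images each; shuffling changes which images land in each fold.
cv = StratifiedKFold(n_splits=3, shuffle=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)

print(scores)                       # one accuracy score per fold
print(scores.mean(), scores.std())  # mean and standard deviation across folds
```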

Below I will train and test different algorithms, collect the accuracy scores for each, and plot the results.

# Results using 3-fold cross validation

Below I use 11 different algorithms to classify the images using 3-fold cross validation. Each printed line gives the trial (run) number, the name of the algorithm, and the 3 scores obtained from the 3 folds. The trial is repeated 5 times (runs 0 to 4) because each run may produce a different train/test split of the data; this shows the robustness of each algorithm to changes in the data splits.
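A condensed sketch of this loop, with three classifiers standing in for the full set of 11 (the exact list and parameters in the notebook are not shown here):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative subset of the 11 algorithms compared in the notebook.
classifiers = {
    "RBF SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(),
    "Nearest Neighbors": KNeighborsClassifier(3),
}

for run in range(5):  # 5 runs, each with a different random fold split
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=run)
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=cv)
        print(run, name, scores)
```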

After the printout you will find a chart of the mean scores from each trial for each algorithm. For most algorithms, the mean is constant (does not change between runs).

[Output: per-run fold scores for each algorithm, followed by a chart of the mean scores]

The plot shows that the algorithms' mean scores are below 60% (0.6 on the plot's y-axis scale). The highest mean score is obtained by the RBF SVM algorithm. The parameters of each algorithm could be tuned further to try to improve these scores, but one must be careful not to overfit the given sample. Overfitting is a well-known drawback of machine learning algorithms, in which we obtain high scores on training samples but low scores on testing samples. Investigating overfitting is beyond the scope of this notebook.

Notice that the mean scores of some algorithms change between runs/trials while remaining constant for others. Below I will compare the predictions of the RBF SVM (constant mean scores) and Decision Tree (changing mean scores) algorithms.

# SVM vs. Decision Tree classification scores

Below I print the confusion matrices for both SVM and Decision Tree, again using 3-fold cross validation. You can see clearly that SVM acts like a dummy classifier, predicting every sample as abnormal; because there are more abnormal labels than normal in each fold (abnormal = 7, normal = 5), it still gets 58% accuracy (7/12 = 0.58). In comparison, the Decision Tree picks up some nuances in color features between samples and is able to associate them with outcome labels. In this sense, the higher accuracy obtained by SVM can be dismissed as bogus. We can now focus on the Decision Tree algorithm to see whether its scores can be improved; I will try this in the next section.
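A sketch of producing these matrices; note that `cross_val_predict` aggregates predictions over all folds into one confusion matrix per classifier, whereas the notebook prints one matrix per fold:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

for name, clf in [("RBF SVM", SVC(kernel="rbf")),
                  ("Decision Tree", DecisionTreeClassifier())]:
    y_pred = cross_val_predict(clf, X, y, cv=cv)  # out-of-fold predictions
    print(name)
    print(confusion_matrix(y, y_pred))  # rows: true labels, columns: predicted
```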

[Output: confusion matrices for SVM and Decision Tree across the 3 folds]

# Improved Accuracy with Decision Tree

One trick to obtain higher scores is to increase the number of folds. The reason this might yield higher accuracy scores is that the algorithm will have fewer samples for testing/scoring in each fold (and correspondingly more for training). For instance, with 3 folds the split is 36/3 = 12 samples per fold (7 abnormal and 5 normal). If we increase the number of folds to 5, each fold has only 36/5 ≈ 7 samples. Below is the resulting plot for fold counts increasing from 3 to 10.
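A sketch of sweeping the fold count, assuming the Decision Tree classifier and scikit-learn as before:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

for k in range(3, 11):  # 3-fold up to 10-fold
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
    print(k, "folds: mean accuracy =", round(scores.mean(), 3))
```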

[Figure: mean Decision Tree scores for 3 to 10 folds]

The figure above shows that, in general, the mean score increases with more folds. The highest score, about 72%, is obtained with 10 folds.

The last thing I will do is repeat the 10-fold cross validation 10 times, calculate the mean score of each trial/run, and plot the results.
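A sketch of these repeated trials; each trial reshuffles the data before the 10-fold split:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

trial_means = []
for trial in range(10):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=trial)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
    trial_means.append(scores.mean())

print(trial_means)  # one mean accuracy per trial; plot these to see the spread
```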

[Figure: mean scores across the 10 repeated trials of 10-fold cross validation]

# Discussion and Conclusion

The final accuracy scores, using 10-fold cross validation and the Decision Tree algorithm on this small sample, show that in principle an accuracy score above 80% is achievable, but it is unreliable. The reason is simply that repeating the scoring experiments yields a large spread of mean scores (from 64% to 84%, a 20-point margin!). This is unacceptable from a purely statistical point of view and can be rejected.

In addition, the only features we used to classify images are grayscale intensities and their positions within each image. Looking at the histograms, we can see clearly that such features are unlikely to discriminate between outcome labels. What I believe is happening here is that the algorithm picks up nuances of color regions between images that have nothing to do with the actual outcome label.

To increase confidence in the reported accuracy, we need more sample data that is much more representative of the abnormal vs. normal classes. Furthermore, the input to the machine learning algorithms should be preprocessed to highlight the regions or sections of an image that can be used to reach an abnormal classification. There are a number of [image processing algorithms](https://en.wikipedia.org/wiki/Feature_detection_(computer_vision)) that may be used to segment an image and obtain color regions for this purpose.

Finally, although I am skeptical that such a small sample can be used to obtain reliable accuracy scores with AI, it is still worthwhile to experiment with more advanced and recent deep learning technology, such as Convolutional Neural Networks, which can spare the effort of extensive feature engineering and image processing.

                                                    Author: Abdulrahman Khalifa (abdulrahman.k.rus@cas.edu.om)