Programming Assignment 1: k-Nearest Neighbor Model for Binary Classification


Part A: Model Code


5. Euclidean distance of two vectors

  • Formula: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
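A minimal NumPy sketch of such a function (the name euclidean_distance is illustrative, not a required signature):

    import numpy as np

    def euclidean_distance(x, y):
        # L2 distance between two equal-length vectors
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return np.sqrt(np.sum((x - y) ** 2))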

Example:


6. Manhattan distance of two vectors

  • Formula: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
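A matching sketch for the L1 distance (again, the name manhattan_distance is illustrative):

    import numpy as np

    def manhattan_distance(x, y):
        # L1 distance: sum of absolute coordinate differences
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return np.sum(np.abs(x - y))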

Example:


7. The accuracy and generalization error of two vectors

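A minimal sketch of how the two quantities can be computed, assuming the generalization error reported here is simply the misclassification rate (1 - accuracy) on the evaluation vectors:

    import numpy as np

    def accuracy(y_pred, y_true):
        y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
        return np.mean(y_pred == y_true)

    def generalization_error(y_pred, y_true):
        # misclassification rate = 1 - accuracy
        return 1.0 - accuracy(y_pred, y_true)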

Example:

8. Function: Precision, Recall and F1 Score

  • Precision: tp/(tp+fp). Intuitively, precision tells us, out of all the samples the classifier predicted as positive, what fraction are truly positive.
  • Recall: tp/(tp+fn). Intuitively, recall tells us the classifier's ability to find the positive samples in the data.
  • F1 score: the harmonic mean of precision and recall, $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
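A minimal sketch of how the three metrics can be computed from the tp/fp/fn counts (the function name and the positive-label argument are assumptions):

    import numpy as np

    def precision_recall_f1(y_pred, y_true, positive=1):
        y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
        tp = np.sum((y_pred == positive) & (y_true == positive))
        fp = np.sum((y_pred == positive) & (y_true != positive))
        fn = np.sum((y_pred != positive) & (y_true == positive))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        return precision, recall, f1

The values can be cross-checked against sklearn.metrics.precision_score, recall_score and f1_score, which is the point of the comparison example below.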

Example (Comparing result with scikit function):


9. Confusion matrix of two vectors


  • True Positive (TP): predicted yes and ground truth = yes
  • True Negative (TN): predicted no and ground truth = no
  • False Positive (FP): predicted yes and ground truth = no
  • False Negative (FN): predicted no and ground truth = yes
  • NOTE: to match our implementation with scikit-learn, the confusion matrix is ordered as follows
    Predicted    0    1
    True
    0            tn   fp
    1            fn   tp

    np.array([[tn, fp],
              [fn, tp]])
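A sketch of a confusion-matrix helper that follows this ordering (the function name is illustrative):

    import numpy as np

    def confusion_matrix_2x2(y_pred, y_true, positive=1):
        y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
        tp = np.sum((y_pred == positive) & (y_true == positive))
        tn = np.sum((y_pred != positive) & (y_true != positive))
        fp = np.sum((y_pred == positive) & (y_true != positive))
        fn = np.sum((y_pred != positive) & (y_true == positive))
        # same layout as sklearn.metrics.confusion_matrix for labels [0, 1]
        return np.array([[tn, fp],
                         [fn, tp]])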

Example:


10. Function to generate ROC curve

  • The receiver operating characteristic (ROC) curve is a diagnostic tool that shows how a binary classifier behaves as the probability threshold varies. It is used to select the best probability threshold when building a binary classification model.
  • It is created by plotting the TPR against the FPR at various threshold settings.
  • Formula link
  • True positive rate (TPR): $TPR = \frac{TP}{TP + FN}$

  • False positive rate (FPR): $FPR = \frac{FP}{FP + TN}$
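A minimal sketch of how the ROC points can be generated by sweeping a threshold over the predicted probabilities (the function name and the threshold grid are assumptions):

    import numpy as np

    def roc_points(y_scores, y_true, thresholds=None):
        y_scores, y_true = np.asarray(y_scores, dtype=float), np.asarray(y_true)
        if thresholds is None:
            thresholds = np.linspace(0.0, 1.0, 101)
        fpr, tpr = [], []
        for t in thresholds:
            y_pred = (y_scores >= t).astype(int)
            tp = np.sum((y_pred == 1) & (y_true == 1))
            fn = np.sum((y_pred == 0) & (y_true == 1))
            fp = np.sum((y_pred == 1) & (y_true == 0))
            tn = np.sum((y_pred == 0) & (y_true == 0))
            tpr.append(tp / (tp + fn) if (tp + fn) > 0 else 0.0)
            fpr.append(fp / (fp + tn) if (fp + tn) > 0 else 0.0)
        return np.array(fpr), np.array(tpr)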

Example:


11. Compute area under curve (AUC) for the ROC curve

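Given the (fpr, tpr) points from the previous step, one common way to approximate the area is the trapezoidal rule; a sketch, not necessarily the exact method used here:

    import numpy as np

    def roc_auc(fpr, tpr):
        order = np.argsort(fpr)                  # integrate with FPR increasing
        fpr, tpr = np.asarray(fpr)[order], np.asarray(tpr)[order]
        # trapezoidal rule: sum of trapezoid areas between consecutive points
        return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)

sklearn.metrics.auc(fpr, tpr) can serve as a cross-check.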

Example:

12. (BONUS) Function to generate the precision-recall curve


Example:


13. kNN model class

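A compact sketch of what such a class might look like; the class name, constructor arguments and defaults are assumptions rather than the assignment's exact interface:

    import numpy as np

    class KNNClassifier:
        # Minimal sketch: 'euclidean'/'manhattan' distance metrics and
        # 'uniform'/'distance' (inverse-distance) neighbor weights.

        def __init__(self, k=5, metric='euclidean', weights='uniform'):
            self.k, self.metric, self.weights = k, metric, weights

        def fit(self, X, y):
            self.X = np.asarray(X, dtype=float)
            self.y = np.asarray(y)
            return self

        def _distances(self, x):
            diff = self.X - x
            if self.metric == 'manhattan':
                return np.sum(np.abs(diff), axis=1)
            return np.sqrt(np.sum(diff ** 2, axis=1))    # euclidean by default

        def predict_proba(self, X):
            proba = []
            for x in np.asarray(X, dtype=float):
                d = self._distances(x)
                idx = np.argsort(d)[:self.k]             # k nearest neighbors
                if self.weights == 'distance':
                    w = 1.0 / (d[idx] + 1e-12)           # inverse-distance weights
                else:
                    w = np.ones(len(idx))                # uniform weights
                proba.append(np.sum(w * (self.y[idx] == 1)) / np.sum(w))
            return np.array(proba)

        def predict(self, X, threshold=0.5):
            return (self.predict_proba(X) >= threshold).astype(int)

Usage would look like KNNClassifier(k=5, metric='euclidean').fit(X_train, y_train).predict(X_test), where X_train, y_train and X_test are placeholder names.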

Example (Check different model metrics)


Example (Check model performance on 'uniform' weight metric)


Example (Check model performance on 'distance' weight metric)


Example (Sanity check the score board when y_labels is not given)


B. Data processing


14. Read in white wine portion of the wine quality dataset from UCI's repository

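A sketch of loading the file with pandas; the URL below is the usual UCI location of the white-wine CSV (semicolon-separated) and may need adjusting:

    import pandas as pd

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
    df = pd.read_csv(url, sep=';')   # 11 physicochemical features plus the 'quality' score
    df.head()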

15. Convert target to two-category class

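One possible way to binarize the target; the cut-off below (quality >= 7 as the positive class) is an assumption and may differ from the split actually used:

    # Hypothetical threshold: wines rated 7 or higher form the positive class.
    df['label'] = (df['quality'] >= 7).astype(int)
    df = df.drop(columns='quality')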

16. Explore and summarize the dataset


Describe the numerical record


Dimension of the data


Visualize the data


17. Shuffle the data

  • Use df = df.sample(frac=1)

18. Generate pair plot


19. Drop the redundant features


List out columns name


Correlation between label and features


Choose the most correlated features


20. Partition the data into train and test set

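A sketch of a simple partition helper; the function name, the 80/20 split and the seed are assumptions:

    import numpy as np

    def partition(df, test_frac=0.2, seed=0):
        # shuffle row indices, then carve off the first test_frac as the test set
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(df))
        n_test = int(len(df) * test_frac)
        return df.iloc[idx[n_test:]], df.iloc[idx[:n_test]]   # train, test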

Create dataset


Sanity check partition function


21. Naively run kNN model on train dataset with k=5 using L2


Define function to evaluate implemented and sklearn model


21a. Use Accuracy and F1 score to compare predictions to the expected variable


Performance of implemented model (uniform weight) - unscaled


Performance of Scikit-learn model (uniform weight) - unscaled


Visualize Precision-Recall Curve


21b. Standardized data (subtract mean and divide by standard deviation)

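A sketch of the standardization step; X_train and X_test are placeholder names, and the key point is that the mean and standard deviation are computed on the training split only and then reused for the test split:

    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    X_train_std = (X_train - mu) / sigma    # z-score using training statistics
    X_test_std = (X_test - mu) / sigma      # same statistics applied to the test set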

21c. Rerun the KNN model on the standardized data


21d. Compare the two accuracy values and the F1 scores


Performance of implemented model (uniform weight) - scaled


Performance of Scikit-learn model (uniform weight) - scaled


Visualize Precision-Recall Curve


21d. Whether to normalize data or not

  • After the comparison above, the standardized data is used for the remainder of the assignment, as it gives better performance.

21e. Evaluate model with inverse distance weight metric


Perform on vanilla data


Performance of implemented model (distance weight) - unscaled


Performance of Scikit-learn model (distance weight) - unscaled


Performance on Normalized data


Performance of implemented model (distance weight) - scaled


Performance of Scikit-learn model (distance weight) - scaled


After the comparison above, inverse-distance weighting is preferred, as it gives better performance.

Part C: Model Evaluation


22. Implement K-fold(S-fold) cross validation function

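A minimal sketch of an S-fold cross-validation routine; the names, the sklearn-style fit/predict interface and the misclassification-error metric are assumptions, and the data are assumed to be shuffled already:

    import numpy as np

    def s_fold_cv(model, X, y, s=5):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        folds = np.array_split(np.arange(len(y)), s)    # s roughly equal folds
        errors = []
        for i in range(s):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(s) if j != i])
            model.fit(X[train_idx], y[train_idx])
            errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
        return np.mean(errors), errors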

23. Use S-fold function to evaluate the performance of models


Create performance table for the report with respect to the different numbers of neighbors and the different distance metrics


23d. Determine the best model based on the overall performance (lowest average error)


24. Evaluate and report performance of model


25. Calculate and report the 95% confidence interval on the generalization error estimate (5 pts)

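One common way to build this interval is the normal approximation to the binomial, error ± z * sqrt(error * (1 - error) / n_test) with z = 1.96 for 95% confidence; a sketch, assuming n_test is the number of test samples:

    import numpy as np

    def error_confidence_interval(error, n_test, z=1.96):
        # 95% CI for the generalization error under the normal approximation
        half_width = z * np.sqrt(error * (1.0 - error) / n_test)
        return error - half_width, error + half_width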