Machine learning


Regression


We need to measure the error


So we can compare models by how well they fit the data.
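One common error measure is the mean squared error (MSE); a minimal sketch in numpy:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

# The better-fitting model has the lower error.
print(mse([1, 2, 3], [1, 2, 3]))   # perfect fit -> 0.0
print(mse([1, 2, 3], [2, 2, 2]))   # -> (1 + 0 + 1) / 3
```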


Algorithm

  • Initialize the parameter(s) with some random state (p = 42)
  • Calculate the error/cost of the function with this parameter
  • Change the parameter so that the error/cost decreases
  • Repeat until the error/cost stops decreasing
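The loop above can be sketched as gradient descent on a one-parameter model f(x) = w*x (the data is synthetic and the learning rate is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)   # true slope is 3

w = rng.normal()                        # 1. random initial parameter
lr, prev_cost = 0.1, np.inf
while True:
    cost = np.mean((y - w * x) ** 2)    # 2. current error/cost (MSE)
    if prev_cost - cost < 1e-9:         # 4. stop when it stops decreasing
        break
    prev_cost = cost
    grad = -2 * np.mean((y - w * x) * x)
    w -= lr * grad                      # 3. change w so the cost decreases
print(w)  # close to the true slope, 3
```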

Minimizing an error function


minimize.gif

Classification


Using regression for classification

  • Use the same approach from optimization
  • Use a different error measure
  • Map the regression output into a probability in [0, 1]
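The standard way to squash a regression output into [0, 1] is the logistic (sigmoid) function:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued regression output to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5: the decision boundary
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```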

What if the data is not easily separable?


Regularization


So what is regularization?

  • It's a penalty on the parameters
  • Add the penalty to the cost function:

    Cost = MSE(y, f(w)) + 1/C * sum(|w|),
    where f(x) = w*x

  • A bigger parameter means a larger cost function
  • The optimizer will therefore keep the parameters small
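Using the notation above (f(x) = w*x, with an absolute-value penalty), the regularized cost can be sketched as:

```python
import numpy as np

def regularized_cost(w, x, y, C=1.0):
    """MSE plus an L1-style penalty; smaller C means a stronger penalty."""
    mse = np.mean((y - x @ w) ** 2)
    penalty = np.sum(np.abs(w))
    return mse + penalty / C

x = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
# w = [1.0] fits perfectly (MSE = 0) but still pays the penalty;
# a bigger weight pays more, so the optimizer favors small parameters.
print(regularized_cost(np.array([1.0]), x, y))
print(regularized_cost(np.array([2.0]), x, y))
```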

Two main types of regularization

  • L1 or sum(|w|) - will push parameters to exactly zero
  • L2 or sum(w^2) - will limit parameter growth
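A quick comparison of the two penalties on the same weight vector:

```python
import numpy as np

w = np.array([0.5, -3.0, 0.0, 2.0])

l1 = np.sum(np.abs(w))  # 5.5 - grows linearly, so even small weights keep
                        # paying, which is why L1 drives them to exactly zero
l2 = np.sum(w ** 2)     # 13.25 - grows quadratically, so the largest weights
                        # are punished hardest, limiting their growth
print(l1, l2)
```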

Coming back to the dataset

  • Let's use L1 regularization
  • This will eliminate useless polynomial features
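Why L1 eliminates features: its soft-thresholding step subtracts a fixed amount from every weight and clips at zero, so weights below the threshold land exactly at zero. A sketch (the weights and threshold here are illustrative, not from the dataset):

```python
import numpy as np

def soft_threshold(w, t):
    """Shrinkage step used for the L1 penalty: shrink by t, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Imagined weights for polynomial features of increasing degree:
w = np.array([2.0, -1.5, 0.05, -0.03])
print(soft_threshold(w, 0.1))  # the small weights become exactly 0
```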

SVM

  • Linear model
  • Transforms data into higher dimensions
  • More efficient than a manual transform
  • Optimizes the margin between classes
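The margin objective shows up in the hinge loss, which is zero only when a point is classified correctly with room to spare (a sketch of the loss, not a full SVM solver):

```python
import numpy as np

def hinge_loss(y, score):
    """y in {-1, +1}; loss is zero only when y*score >= 1 (outside the margin)."""
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))   # 0.0: correct and outside the margin
print(hinge_loss(+1, 0.5))   # 0.5: correct but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0: misclassified
```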

Let's try a real problem


kaggle.png


Precision-recall and F1 metrics
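From the counts in a confusion matrix, precision, recall, and the F1 score can be computed directly (the counts below are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```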

Area under receiver operating characteristic curve (ROC curve)


roc.png
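ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, which gives a tiny reference implementation (quadratic in the number of examples, for illustration only):

```python
import itertools

def roc_auc(y_true, scores):
    """AUC as P(score_pos > score_neg), with ties counting half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in itertools.product(pos, neg))
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```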


Let's use logistic regression


tree1


What about more than two dimensions?


Getting rid of useless features


How to visualize


Summary

  • Know your data
  • Visualize everything
  • Start from simple models
  • Choose the right metrics for evaluation
  • Simple heuristics are usually a good start