Part 1 - Gentle Introduction to Pipeline


Scikit-learn's Pipeline class chains a series of data transformations and a final estimator into a single, manageable object.

Pipelines are useful for:

  • Convenience in creating a coherent and easy-to-understand workflow
  • Enforcing workflow implementation and the desired order of step applications
  • Reproducibility
  • Persistence of entire pipeline objects (which supports both reproducibility and convenience)

Build 3 pipelines, each with a different estimator, using default hyperparameters:

  • Logistic Regression
  • Support Vector Machine
  • Decision Tree

The transform stage of each pipeline consists of:

  • Feature Scaling
  • Dimensionality Reduction (PCA)

The transformed data is then used to fit the final estimator.
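The three pipelines described above could be assembled roughly as follows. This is a minimal sketch, not necessarily the original notebook's code: the helper name `make_pipeline_for`, the step names, `n_components=2`, and the `random_state` on the decision tree are all assumptions made for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_pipeline_for(estimator):
    """Hypothetical helper: scaling -> PCA -> the given estimator."""
    return Pipeline([
        ("scaler", StandardScaler()),   # feature scaling
        ("pca", PCA(n_components=2)),   # assumed number of components
        ("clf", estimator),             # final estimator
    ])


# One pipeline per estimator; hyperparameters are left at their defaults,
# except random_state, which is fixed here only to make runs repeatable.
pipelines = {
    "Logistic Regression": make_pipeline_for(LogisticRegression()),
    "Support Vector Machine": make_pipeline_for(SVC()),
    "Decision Tree": make_pipeline_for(DecisionTreeClassifier(random_state=42)),
}

for name, pipe in pipelines.items():
    print(name, "->", [step_name for step_name, _ in pipe.steps])
```

Nothing is fitted at this point; each Pipeline only records the order in which its steps will be applied.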

To simulate a full-fledged workflow:

  • Follow up by scoring the test data
  • Compare pipeline model accuracies
  • Identify the "best" model, meaning the one with the highest accuracy on our test data
  • Persist (save to file) the entire pipeline of the "best" model
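The workflow steps above might look like the sketch below. The dataset (iris), the 80/20 split, the `random_state` values, `n_components=2`, and the file name `best_pipeline.pkl` are assumptions chosen for illustration; the text does not specify them. The file is written to the system temporary directory to keep the example side-effect free.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset and split; substitute your own data as needed.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

estimators = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
pipelines = {
    name: Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", est),
    ])
    for name, est in estimators.items()
}

# Fit each pipeline on the training data and score it on the test data.
accuracies = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    accuracies[name] = pipe.score(X_test, y_test)
    print(f"{name}: test accuracy = {accuracies[name]:.3f}")

# "Best" here means highest test accuracy (ties go to the first entry).
best_name = max(accuracies, key=accuracies.get)
best_pipeline = pipelines[best_name]
print("Best model:", best_name)

# Persist the entire fitted pipeline, then reload it to confirm it works.
model_path = os.path.join(tempfile.gettempdir(), "best_pipeline.pkl")
joblib.dump(best_pipeline, model_path)
reloaded = joblib.load(model_path)
print("Reloaded pipeline accuracy:", reloaded.score(X_test, y_test))
```

Because the scaler and PCA are saved together with the estimator, the reloaded object can be applied to raw, untransformed data directly.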


We will construct pipelines for Logistic Regression, Support Vector Machine, and Decision Tree.

Part 2 - Integrating Grid Search


Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which evaluates combinations of hyperparameter values to find the one that performs best.

In summary, the pipeline applies feature scaling (scaler), then dimensionality reduction (pca), and finally the estimator (clf).
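To make the step names concrete, here is a small sketch of one such pipeline and of how its named steps can be inspected after fitting. The iris dataset, `n_components=2`, and `random_state=42` are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # assumed dataset

pipe = Pipeline([
    ("scaler", StandardScaler()),                     # feature scaling
    ("pca", PCA(n_components=2)),                     # dimensionality reduction
    ("clf", DecisionTreeClassifier(random_state=42)), # final estimator
])
pipe.fit(X, y)

# Each step is reachable by the name it was given.
print("Steps:", list(pipe.named_steps))
print("PCA explained variance ratio:",
      pipe.named_steps["pca"].explained_variance_ratio_)

# Step parameters are exposed as "<step name>__<parameter>", which is the
# naming convention a grid search over a pipeline relies on.
tree_params = [p for p in pipe.get_params() if p.startswith("clf__")]
print("Tunable decision tree parameters:", tree_params)
```

The `clf__` prefix seen in the last line is what links a hyperparameter in a search grid to the right step of the pipeline.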

Part 2.1 - Adding Grid Search to the Pipeline


The purpose of grid search is to find the hyperparameter values that maximize the model's accuracy. Grid search will be applied to the following hyperparameters of the decision tree estimator:

  • criterion - This is the function used to evaluate the quality of a split; we will try two options available in Scikit-learn: Gini impurity and information gain (entropy)
  • min_samples_leaf - This is the minimum number of samples required for a valid leaf node; we will use the integer range of 1 to 5
  • max_depth - This is the maximum depth of the tree; we will use the integer range of 1 to 5
  • min_samples_split - This is the minimum number of samples required in order to split a non-leaf node; we will use the integer range of 2 to 5, since Scikit-learn requires an integer value of at least 2 for this parameter
  • presort - This indicates whether to presort the data in order to speed up the search for the best splits during fitting; it has no effect on the resulting model accuracy (only on training time), but is included for the benefit of having a True/False hyperparameter in our grid search. Note that presort was deprecated in Scikit-learn 0.22 and removed in 0.24, so it applies only to older versions.
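A grid search over the hyperparameters listed above could be wired to the pipeline as in the sketch below. Several details are assumptions rather than the notebook's own code: the iris dataset, the train/test split, `n_components=2`, 5-fold cross-validation, and the `random_state` values. Two deliberate deviations are worth noting: `min_samples_split` starts at 2 because Scikit-learn rejects an integer value of 1, and `presort` is left out because the parameter no longer exists in recent Scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # assumed dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

param_range = [1, 2, 3, 4, 5]

# Keys use the "<step name>__<parameter>" convention to target the tree.
param_grid = {
    "clf__criterion": ["gini", "entropy"],
    "clf__min_samples_leaf": param_range,
    "clf__max_depth": param_range,
    "clf__min_samples_split": param_range[1:],  # 2..5; integer 1 is invalid
    # "clf__presort": [True, False],  # only valid on scikit-learn < 0.24
}

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,  # assumed number of cross-validation folds
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)
print("Test accuracy of best model:", grid.score(X_test, y_test))
```

Because the whole pipeline is passed to GridSearchCV, the scaler and PCA are refitted on each training fold, so no information from the validation fold leaks into the preprocessing.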