Data Analysis in Python


Data Wrangling

  • Identify and handle missing values
  • Data formatting
  • Data normalization (centering/scaling)
  • Data binning
  • Turning categorical values to numeric values

How to deal with missing data?

  • Check with the data collection source
  • Drop the missing data: either the whole variable (column) or just the single data entry (row)
  • Replace the missing values
    • Replace it with an average (of similar data points)
    • Replace categorical variables with the mode (the most frequent observation in the available data)
    • Replace it based on other functions (context, reasons for missing values, common patterns in missing values)
  • Leave it as missing data
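
The options above can be sketched with pandas roughly as follows. This is a minimal illustration, not a prescribed workflow: the DataFrame, its column names and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0, 115.0],
    "num_doors": ["four", "two", "four", None, "four"],
    "price": [13495.0, 16500.0, np.nan, 13950.0, 17450.0],
})

# Identify: count the missing values in each column.
missing_counts = df.isna().sum()

# Drop: remove the rows where the target ("price") is missing.
cleaned = df.dropna(subset=["price"]).copy()

# Replace a numeric gap with the column average.
cleaned["horsepower"] = cleaned["horsepower"].fillna(cleaned["horsepower"].mean())

# Replace a categorical gap with the mode (most frequent value).
cleaned["num_doors"] = cleaned["num_doors"].fillna(cleaned["num_doors"].mode()[0])

print(missing_counts)
print(cleaned)
```

Which option is appropriate depends on why the values are missing; dropping the row is usually the safest choice when the missing value is the target itself.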

Data Formatting

  • Data are usually collected from different places and stored in different formats.
  • Bringing data into a common standard of expression allows users to make meaningful comparisons

Data types in pandas

  • object: Letters or words (strings)
  • int64: Integers
  • float64: Real numbers
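
A short sketch of checking and fixing data types, and of converting a column to a common unit. The DataFrame and its column names are hypothetical; the factor 235.215 is the approximate conversion from miles per US gallon to litres per 100 km.

```python
import pandas as pd

# Hypothetical data: prices were read in as text instead of numbers.
df = pd.DataFrame({
    "make": ["audi", "bmw"],
    "price": ["13950", "16430"],
    "city_mpg": [24, 23],
})

# Inspect the types pandas inferred for each column.
print(df.dtypes)

# Convert the price column from text to integers.
df["price"] = df["price"].astype("int64")

# Bring fuel consumption to a common standard: mpg -> L/100km.
df["city_l_per_100km"] = 235.215 / df["city_mpg"]

print(df.dtypes)
```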

Data normalization


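Variables often have very different ranges of values. We may want to normalize them so that their ranges are consistent and no variable dominates the analysis simply because of its scale.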

Normalization techniques

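  • Simple feature scaling: $x_{new} = \dfrac{x_{old}}{x_{max}}$. For non-negative data, values range from 0 to 1.

  • Min-max: $x_{new} = \dfrac{x_{old} - x_{min}}{x_{max} - x_{min}}$. Values range from 0 to 1.

  • Z-score or standard score: $x_{new} = \dfrac{x_{old} - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation. Values typically fall between -3 and 3, but can be higher or lower.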
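
The three techniques can be sketched on a single pandas Series. The values below are hypothetical; note that pandas' `std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical car lengths; the values are illustrative only.
length = pd.Series([168.8, 171.2, 176.6, 192.7], name="length")

# Simple feature scaling: divide by the maximum.
simple_scaled = length / length.max()

# Min-max: shift by the minimum, divide by the range.
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: subtract the mean, divide by the standard deviation
# (sample standard deviation, ddof=1, is the pandas default).
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple_scaled, "min_max": min_max, "z": z_score}))
```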


Binning data

  • Binning: Grouping values into "bins"
  • Converts numeric values into categorical values
  • Group a set of numerical values into a set of bins
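
A minimal sketch of binning with `pandas.cut`, assuming three equal-width bins. The prices and the labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the values are illustrative only.
price = pd.Series([5000, 7500, 12000, 18000, 30000, 45000])

# Four equally spaced edges define three equal-width bins.
bins = np.linspace(price.min(), price.max(), 4)
labels = ["low", "medium", "high"]

# include_lowest=True keeps the minimum value inside the first bin.
price_binned = pd.cut(price, bins=bins, labels=labels, include_lowest=True)

print(price_binned.value_counts())
```

Equal-width bins are only one choice; the bin edges can also be set by hand when domain knowledge suggests natural cut-offs.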

Turning categorical variables into quantitative

  • Problem: Most statistical models cannot take objects/strings as input.
  • Solution: Add a dummy variable for each unique category and assign 0 or 1 to it in each row (one-hot encoding)
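
As a sketch, `pandas.get_dummies` creates one 0/1 column per category. The `fuel` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One dummy column per unique category, holding 0 or 1.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns to the original data.
df = pd.concat([df, dummies], axis=1)

print(df)
```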

Exploratory Data analysis

  • Summarize main characteristics of the data
  • Gain a better understanding of the data set
  • Uncover relationships between variables
  • Extract important variables

What are the characteristics that have the most impact on the car price?


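  • In a box plot, the upper and lower extremes are calculated as 1.5 times the interquartile range (IQR) above the 75th percentile and below the 25th percentile. Points beyond the extremes are treated as outliers.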
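
The 1.5 × IQR rule can be sketched with NumPy as follows. The data is hypothetical, and `np.percentile` uses linear interpolation by default, so the quartiles may differ slightly from other conventions.

```python
import numpy as np

# Hypothetical values with one obvious outlier.
prices = np.array([7, 8, 9, 10, 11, 12, 13, 40], dtype=float)

# 25th and 75th percentiles and the interquartile range.
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Upper and lower extremes: 1.5 * IQR beyond the quartiles.
lower_extreme = q1 - 1.5 * iqr
upper_extreme = q3 + 1.5 * iqr

# Anything outside the extremes is flagged as an outlier.
outliers = prices[(prices < lower_extreme) | (prices > upper_extreme)]

print(lower_extreme, upper_extreme, outliers)
```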

Analysis of variance ANOVA

  • Statistical comparison of groups
  • Finds the correlation between different groups of a categorical variable
  • F-test score: the variation between the sample group means divided by the variation within each sample group. A small F implies poor correlation between the variable categories and the target variable; a large F implies strong correlation.
  • P-value: the confidence degree, i.e. whether the obtained result is statistically significant
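
To make the F-test score concrete, here is a sketch that computes it by hand for a one-way ANOVA. The function name `anova_f` and the price groups are hypothetical; in practice `scipy.stats.f_oneway` returns both the F statistic and the p-value.

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F statistic: between-group / within-group variation."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical prices for two makes with very different means.
honda = [7000, 8000, 9000]
jaguar = [32000, 35000, 36000]
f_stat = anova_f([honda, jaguar])

# Two groups with identical values give F = 0.
f_same = anova_f([[1, 2, 3], [1, 2, 3]])

print(f_stat, f_same)
```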



Correlation statistics


Model development

  • A model can be thought of as a mathematical equation used to predict a value given one or more other values
  • It relates one or more independent variables to a dependent variable

Linear regression

  • Simple linear regression uses one independent variable to make a prediction
  • Multiple linear regression uses multiple independent variables to make a prediction

Simple linear regression



$$\hat{y} = b_0 + b_1 x$$

where the parameter $b_0$ is the intercept and the parameter $b_1$ is the slope.

  • The uncertainty is taken into account by assuming that a small random value is added to each point on the line; this is called noise.
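
A minimal sketch of fitting a line to noisy data with NumPy. The true intercept (3.0), slope (2.0) and noise level are assumptions chosen for the illustration; `b0` and `b1` denote the fitted intercept and slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Points on a line plus a small random value (noise).
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns slope, then intercept.
b1, b0 = np.polyfit(x, y, deg=1)

# Predictions from the fitted line.
y_hat = b0 + b1 * x

print(f"intercept={b0:.3f}, slope={b1:.3f}")
```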

Multiple linear regression




Model evaluation using visualization


Regression plot


Residual plot

  • The residual plot represents the error between the actual values and the predicted values.
  • The residual is obtained by subtracting the predicted value from the actual target value.
  • Then we plot that value on the vertical axis, with the independent variable on the horizontal axis.
  • We expect the residuals to have zero mean and to be distributed evenly around the x axis with similar variance, with no curvature. This type of residual plot suggests that a linear model is appropriate.
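
The residuals themselves are easy to compute; the sketch below uses hypothetical data and a NumPy line fit. The resulting `residuals` array is what would go on the vertical axis of the plot.

```python
import numpy as np

# Hypothetical data that is close to linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a line and compute the predicted values.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# Residual = actual value - predicted value.
residuals = y - y_hat

# For a least-squares line with an intercept, the residuals average to zero.
print(residuals, residuals.mean())
```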

Distribution plot

  • Compares the distribution of the predicted values with the distribution of the actual values
  • These plots are extremely useful for visualizing models with more than one independent variable or feature

Polynomial regression and pipelines

  • A special case of the general linear regression model
  • Useful for describing curvilinear relationships
  • Curvilinear relationship: obtained by squaring the predictor variables or adding higher-order terms of them

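  • Quadratic (2nd-order) model: $\hat{y} = b_0 + b_1 x + b_2 x^2$

  • Cubic (3rd-order) model: $\hat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3$

  • Higher-order models: $\hat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots$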


  • The degree of the regression makes a big difference and can result in a better fit if you pick the right value.


  • There are many steps to getting a prediction: normalization, polynomial transformation, and linear regression. A pipeline chains these steps so that they run in sequence.
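
The three steps can be chained with scikit-learn's `Pipeline`, sketched below. The data is a noiseless quadratic made up for the illustration, and the step names ("scale", "polynomial", "model") are arbitrary labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data following a quadratic curve exactly.
x = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2

# Normalization -> polynomial transformation -> linear regression.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

# Fitting and predicting run every step in order.
pipe.fit(x, y)
y_hat = pipe.predict(x)

print(np.abs(y_hat - y).max())
```

Changing `degree` is the only edit needed to try a different polynomial order, which is the main convenience of the pipeline.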

Measures for In-sample evaluation

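  • Mean squared error (MSE): the average of the squared differences between the actual values $y_i$ and the predicted values $\hat{y}_i$:

    $$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

  • R-squared ($R^2$): the coefficient of determination.
  • It measures how close the data is to the fitted regression line.
  • $R^2$: The percentage of variation of the target variable (Y) that is explained by the linear model:

    $$R^2 = 1 - \frac{\text{MSE of the regression line}}{\text{MSE of the mean of the data}}$$

  • A negative $R^2$ means the model fits worse than simply predicting the mean; this can be due to overfitting.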
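
Both measures can be computed directly from the actual and predicted values. The helper names `mse` and `r_squared` and the numbers below are hypothetical.

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared differences between actual and predicted values."""
    return float(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    """1 - (squared error of the model) / (squared error of the mean)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual and predicted values.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

model_mse = mse(y, y_hat)
model_r2 = r_squared(y, y_hat)

# Predicting the mean for every point gives R-squared = 0.
baseline_r2 = r_squared(y, np.full_like(y, y.mean()))

print(model_mse, model_r2, baseline_r2)
```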

Prediction and decision making


To determine final best fit, we look at a combination of:

  • Do the predicted values make sense?
  • Visualization
  • Numerical measures for evaluation
  • Comparing models

Model Evaluation

  • In-sample evaluation tells us how well our model fits the data used to train it
  • Problem: It doesn't tell us how well the trained model can be used to predict new data.
  • Solution: Split the data into a training set and a test set (for example, 70% and 30%)
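
A sketch of a 70/30 split using scikit-learn's `train_test_split`. The data is synthetic, and `random_state=0` is only there to make the shuffle reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations of one feature.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * X.ravel() + rng.normal(0, 1.0, size=100)

# Hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train on the training set only, then score on unseen data.
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(len(X_train), len(X_test), test_r2)
```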

Generalization performance

  • Generalization error is a measure of how well our model does at predicting previously unseen data.
  • The error we obtain using our testing data is an approximation of this error

Cross validation

  • One of the most common out-of-sample evaluation methods
  • More effective use of data (each observation is used for both training and testing)
    • In this method, the data set is split into k equal groups; each group is referred to as a fold. Some of the folds are used as a training set to train the model, and the remaining folds are used as a test set to test it.
    • For example, with 4 folds we can use 3 folds for training and then use one fold for testing. This is repeated until each partition has been used for both training and testing. At the end, we use the average result as the estimate of the out-of-sample error.
    • The evaluation metric depends on the model.
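
The fold procedure is what scikit-learn's `cross_val_score` does; a sketch with 4 folds on synthetic data, using $R^2$ as the metric (an assumption for this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 80 observations of one feature.
X = rng.uniform(0, 10, size=(80, 1))
y = 3.0 + 2.0 * X.ravel() + rng.normal(0, 1.0, size=80)

# 4 folds: each fold is used once for testing and three times for training.
scores = cross_val_score(LinearRegression(), X, y, cv=4, scoring="r2")

# The average over the folds estimates the out-of-sample performance.
mean_r2 = scores.mean()

print(scores, mean_r2)
```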

Overfitting, underfitting and model selection

  • Underfitting: The model is too simple to fit the data. If we increase the order of the polynomial, the model fits better.
  • Overfitting: The model is too flexible and fits the noise rather than the underlying function.
  • The training error decreases with the order of the polynomial, so it cannot be used to choose the order. The test error is a better means of estimating the error of a polynomial model.
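
A sketch of this behaviour with NumPy polynomial fits. The underlying function (`sin(3x)`), the noise level and the range of orders are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    # Hypothetical underlying function that the noise is added to.
    return np.sin(3 * x)

x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-0.95, 0.95, 20)
y_train = true_function(x_train) + rng.normal(0, 0.2, size=20)
y_test = true_function(x_test) + rng.normal(0, 0.2, size=20)

train_mse, test_mse = [], []
for order in range(1, 9):
    # Fit a polynomial of the given order on the training data only.
    coeffs = np.polyfit(x_train, y_train, order)
    train_mse.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

# Select the order with the lowest test error.
best_order = 1 + int(np.argmin(test_mse))

print(train_mse)
print(test_mse)
print(best_order)
```

The training error never increases as the order grows, while the test error typically falls and then rises again once the model starts fitting the noise.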

Ridge Regression

  • Ridge regression prevents overfitting
  • In high-order polynomial regressions the coefficients usually have a very large magnitude
  • Ridge regression controls the magnitude of the polynomial coefficients by introducing the parameter alpha
  • As alpha increases, the coefficients get smaller
  • If alpha is too large, the coefficients will approach zero and the model will underfit the data
  • In order to select alpha, we use cross validation

To determine alpha, we use some data for training. We use a second set called validation data; this is similar to test data, but it is used to select parameters like alpha.

We select the value of alpha that maximizes R-squared on the validation data. We can also use other metrics, such as MSE, to select the value of alpha.
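
The shrinking effect of alpha can be sketched with scikit-learn's `Ridge` on 10th-order polynomial features. The data, the polynomial order and the alpha values are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical noisy data.
x = np.linspace(-1, 1, 30).reshape(-1, 1)
y = np.sin(3 * x.ravel()) + rng.normal(0, 0.2, size=30)

# High-order polynomial features, where large coefficients are common.
X_poly = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x)

alphas = [0.01, 0.1, 10.0, 1000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_poly, y)
    # Overall magnitude of the fitted coefficients for this alpha.
    coef_norms.append(float(np.linalg.norm(model.coef_)))

for alpha, norm in zip(alphas, coef_norms):
    print(f"alpha={alpha:<8} coefficient norm={norm:.4f}")
```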

Grid Search

  • Parameters like the alpha term in ridge regression are not part of the fitting or training process. These values are called hyperparameters.

  • Grid search takes the model or object you would like to train and different values of the hyperparameters.

  • To select the hyperparameter that minimizes the error of the model, we split our data set into three parts: the training set, the validation set and the test set.

    • We train the model for different hyperparameters.
    • We evaluate each model with the desired performance metric.
    • We select the hyperparameter that minimizes the MSE or maximizes $R^2$ on the validation set.
    • We finally test our model performance using the test data.
  • The grid search takes the scoring method (in this case $R^2$), the number of folds, the model or object, and the free parameter values.
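
A sketch with scikit-learn's `GridSearchCV`, which takes exactly these inputs. The synthetic data, the candidate alpha values and the choice of 4 folds are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations, 3 features.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=100)

# Free parameter values to try.
param_grid = {"alpha": [0.001, 0.1, 1.0, 10.0, 100.0]}

# Model object, parameter values, number of folds and scoring method.
grid = GridSearchCV(Ridge(), param_grid, cv=4, scoring="r2")
grid.fit(X, y)

# Hyperparameter with the best average validation score.
best_alpha = grid.best_params_["alpha"]

print(best_alpha, grid.best_score_)
```

Here cross-validation folds play the role of the validation set; a separate test set would still be held back for the final check.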