A Few Practical Thoughts on Supervised Learning


Caution: I am a newbie to Kaggle and to real-world machine learning problems, so this notebook contains a kind of "failure story" and some questions for you, ladies and gentlemen. I will appreciate your answers!



There are many valuable notebooks on Kaggle (although I have read only a few of the most-voted ones), and many of them used categorical variables directly rather than converting them to dummy variables. Then a comment pointing out the use of dummies also got many votes. This is my motivation: how do dummies improve the model?


Since I think other similar topics can be included here as well, these are the main questions:

  • Nominal variables vs. dummies
  • Does scaling really have an effect?
    If so, which variables should it be applied to?
  • How much does cross-validation improve the model accuracy?
  • Is regularization really helpful for generalization?

Here only logistic regression is used as the training method, with no comparison among various methods, simply because I do not know other sophisticated methods in much detail. Logistic regression is chosen since, for me, it is simple and easy to understand.

The Contents Copied and Pasted from Kaggle


FYI, the dataset is from Kaggle. For those reading this outside Kaggle, below is a copy from https://www.kaggle.com/c/titanic.

Data Description




The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the "ground truth") for each passenger. Your model will be based on "features" like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

Data Dictionary

  • survival: Survival (0 = No, 1 = Yes)
  • pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • age: Age in years
  • sibsp: # of siblings / spouses aboard the Titanic
  • parch: # of parents / children aboard the Titanic
  • ticket: Ticket number
  • fare: Passenger fare
  • cabin: Cabin number
  • embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Variable Notes

  • pclass: A proxy for socio-economic status (SES)
    1st = Upper
    2nd = Middle
    3rd = Lower
  • age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
  • sibsp: The dataset defines family relations in this way...
    Sibling = brother, sister, stepbrother, stepsister
    Spouse = husband, wife (mistresses and fiancés were ignored)
  • parch: The dataset defines family relations in this way...
    Parent = mother, father
    Child = daughter, son, stepdaughter, stepson
    Some children travelled only with a nanny, therefore parch=0 for them.

Part 1. Sightseeing


Here I will do almost the same as other notebooks. I would like to inspect the training data to see where preprocessing is necessary. I will also follow some preprocessing strategies from other notebooks (e.g. a strategy to fill NAs in Age).

Let's check the data type first, and see the first few lines.
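In code, this first look is just a few pandas calls. Below is a sketch using a tiny inline stand-in for train.csv (in the notebook itself, `pd.read_csv("train.csv")` is used instead):

```python
import io

import pandas as pd

# A tiny stand-in for train.csv; the column names match the Kaggle data.
csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,7.25,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,71.2833,C\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,,0,0,7.925,S\n'
)
train = pd.read_csv(csv)

print(train.dtypes)          # column types
print(train.head())          # first few rows
print(train.isnull().sum())  # missing values per column
```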


Summary statistics next, for numerical and categorical variables respectively.


And check the missing data.



Q. Since the goal of this notebook is to compare the effects of some practices, I try to keep the data as raw as possible. For instance, I won't make a new variable Family by adding SibSp and Parch. Any thoughts are welcome!

From the quick glance above, I classify the variables as follows:

  • Numerical variables:
    • Age (has missing data): we fill the missing values with the mean within each Title group
    • SibSp
    • Parch
    • Fare (note that there is missing data in the test set!)
  • Ordinal variables:
    • Pclass
  • Nominal variables (all converted to their K-1 dummies):
    • Sex (female or male)
    • Title, extracted from Name (Master, Miss, Mrs, Mr, Rare)
    • Embarked (Q, S, C): missing data are filled with S, the mode of the variable
  • Discarded variables (I couldn't convince myself how to use them):
    • PassengerId
    • Name
    • Ticket
    • Cabin
Below is the function that extracts the Title information from the Name column, borrowed from other famous Kaggle notebooks.
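The function itself is not reproduced here, so below is a sketch of the usual approach from those notebooks: a regex pulls out the word before a period in Name, synonyms are merged, and uncommon titles are collapsed into "Rare" (the exact mapping is my assumption). It also shows the Title-group mean fill for Age described above.

```python
import pandas as pd

def extract_title(name_series):
    # A title is the word followed by a period, e.g. "Braund, Mr. Owen Harris" -> "Mr".
    titles = name_series.str.extract(r" ([A-Za-z]+)\.", expand=False)
    # Merge synonyms, then group everything uncommon into "Rare"
    # (this grouping follows common Kaggle notebooks; the mapping is an assumption).
    titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
    return titles.where(titles.isin({"Mr", "Mrs", "Miss", "Master"}), "Rare")

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Palsson, Miss. Torborg", "Uruchurtu, Don. Manuel E"],
    "Age": [22.0, None, 8.0, 40.0],
})
df["Title"] = extract_title(df["Name"])

# Fill missing Age with the mean age within each Title group, as described above.
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("mean"))
print(df[["Title", "Age"]])
```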

Outline for Simulations

  • No regularization
    • Dummy variables, or not
    • Scaling only the numeric variables, scaling all, or no scaling at all
  • Regularization with cross-validation
    • Dummy variables, or not
    • Scaling only the numeric variables, scaling all, or no scaling at all

It will be quite repetitive work! To run these simulations with less code, below are universal functions for preprocessing and training. They are quite long but convenient to use.
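As a sketch of what such a universal preprocessing helper might look like (the function name, column lists, and switches below are my invention, not the notebook's actual code):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed column groupings for the Titanic data.
NUMERIC = ["Age", "SibSp", "Parch", "Fare"]
NOMINAL = ["Sex", "Embarked"]

def preprocess(df, use_dummies=True, scale="none"):
    """Return a model-ready frame. scale: 'none', 'numeric', or 'all'."""
    out = df.copy()
    if use_dummies:
        out = pd.get_dummies(out, columns=NOMINAL, drop_first=True)
    else:
        # Keep nominal variables as plain integer codes.
        for col in NOMINAL:
            out[col] = out[col].astype("category").cat.codes
    if scale == "numeric":
        cols = NUMERIC
    elif scale == "all":
        cols = list(out.columns)
    else:
        cols = []
    if cols:
        out[cols] = StandardScaler().fit_transform(out[cols].astype(float))
    return out

demo = pd.DataFrame({"Age": [22.0, 38.0], "SibSp": [1, 1], "Parch": [0, 0],
                     "Fare": [7.25, 71.28], "Sex": ["male", "female"],
                     "Embarked": ["S", "C"]})
print(preprocess(demo, use_dummies=True, scale="numeric").columns.tolist())
```

With one function taking the two switches, each of the experiment cells below reduces to a single call.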

Part 2. No Regularization, No CV


Without having to worry about the regularization parameter, the training procedure gets simpler.

Q. Honestly, I am not sure whether it is right to omit CV here.
What would be the merit of CV for such simpler models, which have no regularization parameter?

Split the Data for Validation


The raw data are split into a training set and a validation set for assessment of the model.

1. Dummies or Not


Training and Validation


Let's apply simple logistic regression and check the accuracy on the validation set.
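A minimal sketch of this split-train-validate step with scikit-learn, on a toy feature matrix standing in for the preprocessed Titanic data (a very large C effectively turns regularization off):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed features and the Survived labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out part of the training data as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# C is the inverse regularization strength; a huge C means (almost) no regularization.
model = LogisticRegression(C=1e9, max_iter=1000).fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```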

Oh, dummies do some help.

2. Standardize or Not


Following the results above, I will use dummy variables.

I found that the effect of standardization on dummy variables may differ case by case. So the goal is to check that effect on this Titanic data set. Therefore I conduct an experiment that standardizes

  • nothing (already computed above),
  • only the numeric variables,
  • all variables (numeric + dummies).

Then a similar training-and-validation cycle follows.
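The three settings can be run in one loop; below is a sketch on toy data where columns 0-1 stand in for numeric features and columns 2-3 for 0/1 dummies (note the scaler is fit on the training split only, to avoid leaking validation statistics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Columns 0-1 mimic numeric features, columns 2-3 mimic 0/1 dummies.
numeric = rng.normal(loc=30.0, scale=10.0, size=(300, 2))
dummies = rng.integers(0, 2, size=(300, 2)).astype(float)
X = np.hstack([numeric, dummies])
y = (numeric[:, 0] > 30).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

for setting, cols in [("none", []), ("numeric only", [0, 1]), ("all", [0, 1, 2, 3])]:
    Xt, Xv = X_tr.copy(), X_val.copy()
    if cols:
        scaler = StandardScaler().fit(Xt[:, cols])  # fit on training data only
        Xt[:, cols] = scaler.transform(Xt[:, cols])
        Xv[:, cols] = scaler.transform(Xv[:, cols])
    acc = accuracy_score(y_val, LogisticRegression(max_iter=1000).fit(Xt, y_tr).predict(Xv))
    print(f"standardize {setting}: {acc:.3f}")
```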

Unfortunately, standardization shows no effect here. So I continue with regularization.

Part 3. Regularization with Cross-Validation


So far our experiments have given no implication about standardization. However, I continue to test it in a more general situation by adding regularization.

I use the cross-validation technique to find the best regularization parameter. I leave references to scikit-learn here and there.
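In scikit-learn this is conveniently bundled in LogisticRegressionCV, which cross-validates over a grid of the inverse regularization strength C and refits on all the data with the best value; a sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Toy stand-in for the preprocessed training data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# 5-fold CV over 10 values of C on a log grid; the model is then
# refit on all of (X, y) with the best C found.
model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X, y)
print("best C:", model.C_[0])
print("training accuracy:", model.score(X, y))
```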

1. Dummies or Not


Again, dummies do good.

2. Standardize or Not


Since it is confirmed that dummies do help, I will use dummy variables for the next experiment.


I see from this figure that the "standardize-all" strategy is better than the others.

Final Tests


I will pass the final test data to every model trained so far. To list them:

  • logistic1: No regularization, no CV, with dummies, no standardization
  • logistic2: No regularization, no CV, no dummies, no standardization
  • logistic3: No regularization, no CV, with dummies, standardized numeric variables
  • logistic4: No regularization, no CV, with dummies, all standardized
  • logistic5: Regularization with CV, with dummies, no standardization
  • logistic6: Regularization with CV, no dummies, no standardization
  • pipe7: Regularization with CV, with dummies, standardized numeric variables
  • pipe8: Regularization with CV, with dummies, all standardized

Q. Coding such experiments gives me a lot of duplicated work. I think it is not the fault of scikit-learn, and it seems quite inevitable, but I am curious how others handle such things.



Note that the demo gender_submission.csv provided by Kaggle gave me 76.555%.

  • No dummies, No std: 76.555% / 76.076%
  • Dummies, No std: 78.947% / 78.468%
  • Dummies, Numeric std: 78.947% / 78.468%
  • Dummies, All std: 78.947% / 78.468%

What I learned


I feel bad to say it, but I could not draw many implications from these experiments.

  • Dummies surely help to make the model better.
  • I need to compare with more complex models!
    I am not sure, but simpler models may be robust to scaling or standardization.

    Q. Really?

  • Plotting with CV makes it comfortable to visually compare some models.
    But from the final test results, I may say that it sometimes does not help.

Any feedback, suggestions, or corrections of my mistakes are welcome.