Caution: I am a newbie to Kaggle and to real-world machine learning problems, so this notebook is a kind of "failure story" with some open questions for you, ladies and gentlemen. I would appreciate your answers!
There are many excellent notebooks on Kaggle (although I have read only a few of the most-voted ones), and many of them use categorical variables directly rather than dummy variables. A comment pointing out the use of dummies also received many votes. That is my motivation: how do dummies improve the model?
References so far:
Since other, related topics may also come up along the way, the main questions are listed below:
Here only logistic regression is used as the training method, and there is no comparison among various methods, simply because I do not know the details of other, more sophisticated ones. Logistic regression was chosen because it is simple and easy for me to understand.
FYI, the dataset is from Kaggle. For those reading this outside Kaggle, the following is copied from https://www.kaggle.com/c/titanic.
The data has been split into two groups:
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the "ground truth") for each passenger. Your model will be based on "features" like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Here I will do almost the same as other notebooks do. I would like to check the training data to see where preprocessing is necessary. I will also follow some preprocessing strategies from other notebooks (e.g. a strategy to fill missing values).
Let's check the data types first, and see the first few lines.
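A minimal sketch of that check, using a tiny stand-in frame in place of `pd.read_csv("train.csv")`:

```python
import pandas as pd

# Stand-in for the real training data; a few Titanic-like columns.
train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
})

print(train.dtypes)   # data type of each column
print(train.head())   # first few rows
```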
Summary statistics next, for numerical and categorical variables respectively.
And check the missing data.
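The missing-data check can be done in one line with pandas; again a toy frame stands in for the real data:

```python
import pandas as pd

train = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
})

# Count missing values per column, largest first.
missing = train.isnull().sum().sort_values(ascending=False)
print(missing)
```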
Q. Since the goal of this notebook is to compare the effect of some practices, I try to keep the data as raw as possible. For instance, I won't make a new variable `Family` by adding `SibSp` and `Parch`. Any thoughts are welcome!
From the quick glance above, I would characterize the variables as...
C): Missing data are filled with `S`, the mode of the variable.
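That fill is a one-liner in pandas; `train` here is a toy stand-in for the real data:

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})

# Fill the missing port with the most frequent value ("S" in this data).
mode = train["Embarked"].mode()[0]
train["Embarked"] = train["Embarked"].fillna(mode)
```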
Below is the function that extracts Title information from the Name column, borrowed from other well-known Kaggle notebooks.
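A common version of such a function (my own sketch, not the exact code from any particular notebook) uses a regular expression that grabs the word before the period:

```python
import re

def get_title(name):
    """Extract the title (Mr, Mrs, Miss, ...) from a Name string."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

print(get_title("Braund, Mr. Owen Harris"))  # Mr
```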
It would be quite repetitive work! To run these experiments with less code, below are universal functions for preprocessing and training. They are quite long but convenient to reuse.
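The preprocessing helper might look something like the following sketch; the name `preprocess` and its options are my own invention, covering the combinations tested below (dummies vs. integer codes, and no / numeric-only / all standardization):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df, use_dummies=True, standardize="none"):
    """Encode features; standardize is "none", "numeric", or "all"."""
    num_cols = df.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in df.columns if c not in num_cols]
    if use_dummies:
        X = pd.get_dummies(df, columns=cat_cols)
    else:
        X = df.copy()
        for c in cat_cols:
            # Integer codes instead of one-hot columns.
            X[c] = X[c].astype("category").cat.codes
    if standardize == "numeric":
        X[num_cols] = StandardScaler().fit_transform(X[num_cols])
    elif standardize == "all":
        X = pd.DataFrame(StandardScaler().fit_transform(X),
                         columns=X.columns, index=X.index)
    return X

toy = pd.DataFrame({"Pclass": [1, 2, 3], "Sex": ["male", "female", "male"]})
print(preprocess(toy).columns.tolist())  # ['Pclass', 'Sex_female', 'Sex_male']
```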
Without having to worry about the regularization parameter, the training procedure gets simpler.
Q. Honestly, I am not sure whether it is right to omit CV here. What would be the merit of CV for such simpler models, which have no regularization parameter?
The raw data are split into training and validation sets to assess the model.
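The split can be done with scikit-learn's `train_test_split`; a toy matrix stands in for the real features, and the seed is fixed for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the Titanic data.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, X_val.shape)  # (8, 2) (2, 2)
```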
Let's apply simple logistic regression and check the accuracy on the validation set.
Oh, the dummies do help.
Based on the above results, I will use dummy variables from here on.
I found there that the effect of standardization on dummy variables may differ case by case. So the goal is to check that effect on this Titanic data set, and I conduct an experiment with different standardization strategies.
Then a similar training-and-validation cycle follows.
Unfortunately, standardization shows no effect here. So I continue with regularization.
So far our experiments have shown no effect of standardization. However, I continue to test it in a more general situation by adding regularization.
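With regularization, the strength is tuned by cross-validation; scikit-learn's `LogisticRegressionCV` does both in one estimator. A sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Toy data: class 1 sits at larger feature values.
X = np.array([[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Search the regularization strength C over a small grid with 3-fold CV.
clf = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0], cv=3).fit(X, y)
print(clf.C_)  # C chosen by cross-validation
```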
Again, the dummies help.
Since it is confirmed that dummies help, I will use dummy variables for the next experiment.
The figure shows that the "Standardize-all" strategy is better than the others.
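The "Standardize-all" setup is naturally expressed as a pipeline, so the same scaling is applied to the training, validation, and final test data; this is my own sketch on toy data, not the notebook's exact code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy features: first column informative, second a dummy-like 0/1 column.
X = np.array([[0.0, 1.0], [0.5, 0.0], [2.0, 1.0], [3.0, 0.0]])
y = np.array([0, 0, 1, 1])

# "Standardize-all": scale every column (dummies included) before the fit;
# the pipeline reapplies the fitted scaler to any later data automatically.
pipe = make_pipeline(StandardScaler(), LogisticRegressionCV(cv=2))
pipe.fit(X, y)
print(pipe.predict(X))
```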
I will pass the final test data to every model trained so far. To list them:
logistic1: No regularization, no CV, with dummies, no standardization
logistic2: No regularization, no CV, no dummies, no standardization
logistic3: No regularization, no CV, with dummies, standardized numeric variables
logistic4: No regularization, no CV, with dummies, all standardized
logistic5: Regularization with CV, with dummies, no standardization
logistic6: Regularization with CV, no dummies, no standardization
pipe7: Regularization with CV, with dummies, standardized numeric variables
pipe8: Regularization with CV, with dummies, all standardized
Q. Coding such experiments involves a lot of duplicated work. I think it is not scikit-learn's fault, and it seems quite inevitable. I am curious how others handle such things.
Note that the demo `gender_submission.csv` provided by Kaggle gave me 76.555%.
| Preprocessing | No regularization | Regularization with CV |
|---|---|---|
| No dummies, no std | 76.555% | 76.076% |
| Dummies, no std | 78.947% | 78.468% |
| Dummies, numeric std | 78.947% | 78.468% |
| Dummies, all std | 78.947% | 78.468% |
I feel bad saying this, but I could not draw any conclusions from these experiments.
Any feedback, suggestions, or corrections of my mistakes are welcome.