Titanic: Machine Learning from Disaster

Titanic Survivor Prediction through Data Analysis and Machine Learning Algorithm

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Data Dictionary

The data is available from the Kaggle Titanic competition.

survival: Survival (0 = No, 1 = Yes)
pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

Load Dataset
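
The original loading cell is not shown here, so the following is only a sketch of how the data could be read with pandas. It parses a two-row inline sample laid out like Kaggle's train.csv so that it runs on its own; the file name in the comment and the choice of PassengerId as the index are assumptions.

```python
import io

import pandas as pd

# A small inline sample in the column layout of Kaggle's train.csv.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C\n'
)

# With the downloaded file this would be something like
# pd.read_csv("train.csv", index_col="PassengerId") (path is an assumption).
train = pd.read_csv(sample_csv, index_col="PassengerId")

print(train.shape)
print(train.head())
```

The test set would be loaded the same way; it has the same columns except Survived.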

Exploratory Data Analysis

For visualization I am going to use seaborn and matplotlib.

Sex

The first variable that I am looking into is sex.

The graph shows that women were far more likely to survive the incident than men.

To see the exact numbers:

The result shows that the survival rate is 74.2% for women and 18.9% for men. We can conclude that women were far more likely to survive than men.
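
One way such a rate can be computed is a group-by on the 0/1 Survived column. This is a sketch on toy data, not the real Titanic counts, so the printed rates differ from the figures above.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Survived is 0/1, so the mean within each group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```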

Pclass

The next variable is Pclass, which represents the ticket class. The hypothesis is that passengers with a higher ticket class were more likely to survive. Let's see if the hypothesis is valid.

From the graph we can see:

  • More people survived than perished in first class.
  • Slightly more people perished than survived in second class.
  • Considerably more people perished than survived in third class.

About 62.9% of first-class passengers survived, compared with 47.3% of second-class and 24.2% of third-class passengers. The data therefore supports the hypothesis.
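
A per-class breakdown like this can also be produced with a normalized cross-tabulation. The sketch below uses toy rows, so its shares are illustrative rather than the percentages quoted above.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Rows are ticket classes, columns are 0 (perished) and 1 (survived).
# normalize="index" turns the counts into shares within each class.
share = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(share)
```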

Embarked

Let's see whether the port where passengers boarded affects their survival rate.

Survival Rate

Cherbourg, C: 55.4%
Queenstown, Q: 39%
Southampton, S: 33.7%

Age & Fare

Let's look at age and fare with a scatter plot to see whether they are correlated and whether they affect survival.

We can see there are outliers who paid more than 500 for their ticket, while the average ticket price is much lower. I'm going to remove the outliers and look further into the data.
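
Removing the outliers can be sketched as a boolean filter on Fare. The rows below are toy values, and the variable name low_fare is made up for this example.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 35.0, 4.0],
    "Fare": [7.25, 71.28, 512.33, 16.70],
    "Survived": [0, 1, 1, 1],
})

# Keep only the passengers who paid less than 500.
low_fare = toy[toy["Fare"] < 500]
print(low_fare)
```

The same kind of mask, with a different threshold, narrows the view to any cheaper fare range.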

Without the outliers, we can see a pattern: passengers who paid more than 100 dollars were more likely to survive. This is not surprising, since we already saw that the higher the ticket class, the more likely a passenger was to survive. There doesn't seem to be a strong correlation between age and fare.

Now, to look deeper, I'm going to focus on the passengers who paid less than 100 dollars.

One pattern stands out: passengers under 15 who paid less than 20 dollars mostly survived.

SibSp, Parch

SibSp is the number of siblings and spouses on board with the passenger, and Parch is the number of parents and children on board. Looked at separately they may not show much, but there might be a relationship when they are combined.

Looking at the result we can conclude that:

  • Passengers travelling alone were unlikely to survive.
  • By comparison, passengers with up to 3 family members on board were more likely to survive.
  • However, this trend stops when more than 4 family members were on board.

I'm going to divide these into 3 groups: 'Single', 'Nuclear', and 'Big'.
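
The grouping could look like the sketch below. The exact definition is not given in the text, so counting the passenger in FamilySize (the + 1), the cut-offs (1 = Single, 2 to 4 = Nuclear, 5 or more = Big), and the FamilyType column name are all assumptions.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"SibSp": [0, 1, 1, 3, 5], "Parch": [0, 0, 2, 2, 2]})

# Relatives aboard plus the passenger (the + 1 is an assumption).
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1


def family_type(size):
    """Map a family size to one of three groups (cut-offs are assumptions)."""
    if size == 1:
        return "Single"
    if size <= 4:
        return "Nuclear"
    return "Big"


toy["FamilyType"] = toy["FamilySize"].apply(family_type)
print(toy)
```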

Name

Now let's look at the Name column. It might not seem important, but if you look closely you will notice a title in the middle of each name. The title might affect the survival rate, considering that society was class-based when the incident happened.
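
Names in this dataset are written as "Surname, Title. Given names", so the title can be pulled out with a regular expression. This is one possible sketch; the pattern is my own and may not match what the original notebook used.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# Capture the text between the first comma and the following period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.value_counts())
```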

Let's visualize the four most common titles: 'Mr', 'Miss', 'Mrs', and 'Master'.

Mr, Mrs, and Miss follow the trend that females were more likely to survive than males. However, Master, a title used at the time for boys and young unmarried males, doesn't follow the trend.

You can see that the survival rate for Master is 57.5%, while for Mr it is 15.8%.

Preprocessing

Now I need to preprocess the data so it fits the machine learning algorithm. I'm going to use the decision tree classifier in scikit-learn for this analysis.

In order to do that:

  1. The data should be numeric, so I need to encode text columns such as Sex and Embarked as numbers.
  2. There should be no null values.

Encode Sex

There are only two classes, male and female. I'm going to assign 0 to male and 1 to female.
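
A minimal sketch of this encoding with a dictionary lookup; the new column name Sex_encoded is an assumption.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Replace each label with its numeric code: male -> 0, female -> 1.
toy["Sex_encoded"] = toy["Sex"].map({"male": 0, "female": 1})
print(toy)
```

The same mapping has to be applied to both the training and the test table so that the codes agree.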

Fill in missing fare

A column must not contain null values if it is to be used for training.

There is a null value in the Fare column of the test data. Only one row is missing the value, so little is lost even if I fill it in with 0.
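
Finding and filling the gap can be sketched as follows, on a toy column with one missing value.

```python
import pandas as pd

# Toy stand-in for the test table, with one missing fare.
toy_test = pd.DataFrame({"Fare": [7.83, None, 9.69]})

# Count the missing values before filling.
n_missing = int(toy_test["Fare"].isnull().sum())
print(n_missing)

# Replace the missing fare with 0.
toy_test["Fare"] = toy_test["Fare"].fillna(0)
print(toy_test)
```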

Encode Embarked

The Embarked column also needs to be encoded. We could encode it like this:

  • C == 0
  • S == 1
  • Q == 2

However, if this is fed into the decision tree algorithm, the algorithm might get confused, as in the following scenario.

If S is 1 and Q is 2, then is 2 * S == Q? Or S + S == Q? The column only represents categories, but the algorithm might interpret it as a quantitative value.

To avoid that, I'm going to use one-hot encoding: each port gets its own True/False column.

Now even if the algorithm adds or subtracts, it won't get confused.

In Python, True == 1 and False == 0, so each of these columns can be read as a 0/1 indicator.

pandas offers a function for creating dummy columns, and I'm going to use it here.
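
A sketch of one-hot encoding with pandas.get_dummies on toy rows. The Embarked_ prefix is an assumption about how the columns were named; depending on the pandas version the dummy columns come back as booleans or as 0/1 integers.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port: Embarked_C, Embarked_Q, Embarked_S.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Attach the indicator columns next to the original column.
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Exactly one indicator is set in each row, so no ordering between the ports is implied.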

Age

From the earlier result, we know that a passenger under 15 had a better chance of survival. However, when the decision tree builds a node, it doesn't always pick the best range. To help with this, I'm going to create a Child column, which is true if the passenger is under 15 and false otherwise.
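
The Child flag can be sketched as a single comparison. Note that a missing age compares as False here, so passengers with unknown age are treated as non-children; whether the original notebook handled missing ages differently is not shown.

```python
import pandas as pd

# Toy data for illustration only; the last age is missing.
toy = pd.DataFrame({"Age": [4.0, 14.0, 15.0, 40.0, None]})

# True for passengers under 15; a missing age yields False.
toy["Child"] = toy["Age"] < 15
print(toy)
```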

FamilySize

I'm adding a FamilySize column for the same reason as above.

Name

Lastly, I found that passengers with the title Master go against the tendency that males were unlikely to survive. Therefore, I'm going to make a column that distinguishes them from the other groups.
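
Such a flag could be built with a plain substring check on the name, as in this sketch. The column name Master and the use of a substring match (rather than a parsed title) are assumptions.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Palsson, Master. Gosta Leonard",
    "Heikkinen, Miss. Laina",
]})

# regex=False makes the trailing period a literal character, not a wildcard.
toy["Master"] = toy["Name"].str.contains("Master.", regex=False)
print(toy)
```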

Train

Now, based on the findings and preprocessing, I'm going to train the algorithm.

Creating sets

  • X_train = training feature table
  • y_train = training label
  • X_test = prediction feature table
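
Building the three tables could look like the sketch below. The feature list is hypothetical; the text does not say exactly which columns were fed to the model.

```python
import pandas as pd

# Toy stand-ins for the preprocessed tables.
train = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex_encoded": [0, 1, 1],
    "Fare": [7.25, 71.28, 7.92],
    "Child": [False, False, False],
    "Survived": [0, 1, 1],
})
test = pd.DataFrame({
    "Pclass": [3, 2],
    "Sex_encoded": [0, 1],
    "Fare": [7.83, 13.0],
    "Child": [False, True],
})

# Hypothetical feature list; the label is the Survived column.
feature_names = ["Pclass", "Sex_encoded", "Fare", "Child"]
label_name = "Survived"

X_train = train[feature_names]
y_train = train[label_name]
X_test = test[feature_names]

print(X_train.shape, y_train.shape, X_test.shape)
```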

Use Decision Tree
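
A minimal sketch of fitting scikit-learn's DecisionTreeClassifier on toy features. The max_depth and random_state values are assumptions chosen for the example, not settings taken from the original notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy training data in which Sex_encoded alone separates the labels.
X_train = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 2, 3],
    "Sex_encoded": [1, 0, 1, 0, 1, 0],
})
y_train = pd.Series([1, 0, 1, 0, 1, 0], name="Survived")

# max_depth limits how deep the tree may grow; random_state fixes tie-breaking.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the training data itself (not a measure of generalization).
print(model.score(X_train, y_train))
```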

Visualize

The decision tree can be visualized with the graphviz module.
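
As a sketch, scikit-learn's export_graphviz can turn a fitted tree into Graphviz DOT text. The tiny model, feature names, and class names below are placeholders for the example.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Tiny toy model so the example runs on its own.
X = [[1, 1], [1, 0], [3, 1], [3, 0]]
y = [1, 0, 1, 0]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None returns the DOT description as a string instead of writing a file.
dot_source = export_graphviz(
    model,
    out_file=None,
    feature_names=["Pclass", "Sex_encoded"],
    class_names=["Perished", "Survived"],
    filled=True,
)
print(dot_source)
```

In a notebook with the graphviz Python package and the Graphviz binaries installed, graphviz.Source(dot_source) would render this text as a diagram.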

Predict

Now that training is done, we can predict whether each passenger in the test set survived.
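
Prediction is a single call on the fitted model. The sketch below refits a toy tree so that it runs on its own; the data and column names are placeholders.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy training data in which Sex_encoded alone separates the labels.
X_train = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 2, 2],
    "Sex_encoded": [1, 0, 1, 0, 1, 0],
})
y_train = [1, 0, 1, 0, 1, 0]
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The test table must have the same feature columns in the same order.
X_test = pd.DataFrame({"Pclass": [3, 1], "Sex_encoded": [1, 0]})
predictions = model.predict(X_test)
print(predictions)
```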

Submit

Now, create the submission file.
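
Kaggle expects a CSV with a PassengerId column and a Survived column. The sketch below builds one from placeholder predictions; the output file name in the comment is an assumption.

```python
import pandas as pd

# Placeholder ids and predictions for illustration only.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the row index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)

# To write a file instead: submission.to_csv("submission.csv", index=False)
```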
