The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
The data is available from the Kaggle Titanic competition.
survival: Survival 0 = No, 1 = Yes
pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
For the visualization, I am going to use seaborn and matplotlib.
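A count plot of survival by sex could be drawn roughly as follows. This is only a sketch: the tiny `train` table is made up for illustration, the column names `Sex` and `Survived` are assumed to match the Kaggle CSV headers, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import pandas as pd
import seaborn as sns

# Made-up sample standing in for the Kaggle training table.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# One bar per (Sex, Survived) combination.
ax = sns.countplot(data=train, x="Sex", hue="Survived")
ax.set_title("Survival count by sex")
```

In a notebook the figure is displayed automatically; in a script, `matplotlib.pyplot.show()` or `ax.get_figure().savefig(...)` would be needed.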
From the graph, it is clear that women were more likely to survive the incident than men.
To see the exact numbers:
The result shows that the survival rate for women is 74.2% and for men is 18.9%. We can conclude that women were much more likely to survive than men.
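Because `Survived` is coded as 0 or 1, the survival rate of a group is simply the mean of that column within the group. A minimal sketch, using a made-up sample rather than the real data (so the numbers differ from the ones quoted above):

```python
import pandas as pd

# Made-up sample; the real rates come from the full Kaggle training table.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group = fraction of survivors in that group.
rate_by_sex = train.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same pattern works for any categorical column, for example grouping by `Pclass` instead of `Sex`.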
The next variable is Pclass, which represents the ticket class. The hypothesis is that passengers with a higher ticket class were more likely to survive. Let's see if the hypothesis is valid.
From the graph we can see:
About 62.9% of first-class passengers survived, compared with 47.3% of second-class and 24.2% of third-class passengers. The data therefore support the hypothesis.
Let's look at age and fare with a scatter plot to see whether they are correlated and whether they affect survival.
We can see there are outliers who paid more than 500 for their ticket, when the average ticket price is much lower. I'm going to remove the outliers and look further into the data.
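Removing the outliers can be done with a boolean filter on the fare column. In this sketch the sample rows are made up, and the cutoff of 500 is taken from the text above:

```python
import pandas as pd

# Made-up sample with one extreme fare.
train = pd.DataFrame({
    "Age": [22.0, 38.0, 35.0, 4.0],
    "Fare": [7.25, 71.28, 512.33, 16.70],
    "Survived": [0, 1, 1, 1],
})

FARE_CUTOFF = 500  # fares above this are treated as outliers

# Keep only the rows at or below the cutoff.
no_outliers = train[train["Fare"] <= FARE_CUTOFF]
print(len(train), "->", len(no_outliers))
```

The same filter with a lower cutoff (for example 100) narrows the view further.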
Without the outliers, we can see a pattern: passengers who paid more than 100 dollars were more likely to survive. This is not surprising, since we already saw that the higher the ticket class, the more likely a passenger was to survive. There does not seem to be a strong correlation between age and fare.
Now, to look deeper, I'm going to dive into the passengers who paid less than 100 dollars.
There is a pattern: passengers who paid less than 20 dollars and were under 15 years old mostly survived.
SibSp is the number of spouses and siblings on board together, and Parch is the number of parents and children on board together. Looked at separately they may not reveal much; however, there might be a relationship when they are combined.
Looking at the result we can conclude that:
I'm going to divide them into three groups: 'Single', 'Nuclear', and 'Big'.
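One way to build such groups is to add SibSp and Parch into a family size and then bin it. The sketch below rests on two assumptions that the text does not state: the passenger is counted as part of the family (hence the `+ 1`), and the boundaries are 1 for 'Single', 2 to 4 for 'Nuclear', and 5 or more for 'Big'. The column names `FamilySize` and `FamilyType` are also hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"SibSp": [0, 1, 1, 4], "Parch": [0, 0, 2, 2]})

# Assumption: count the passenger as part of the family.
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1

# Assumed boundaries: (0, 1] Single, (1, 4] Nuclear, (4, inf) Big.
train["FamilyType"] = pd.cut(
    train["FamilySize"],
    bins=[0, 1, 4, float("inf")],
    labels=["Single", "Nuclear", "Big"],
)
print(train)
```

The bin edges are the part most worth adjusting after looking at the survival rate for each family size.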
Now let's look into the Name column. It might not seem important. However, if you look closely, you will notice that there is a title in the middle of each name. The title might affect the survival rate, considering that society was class-based when the incident happened.
Mr, Mrs, and Miss follow the trend that females were more likely to survive than males. However, Master, a title used at the time for boys and young unmarried males, does not follow the trend.
You can see that the survival rate for Master is 57.5%, while for Mr it is 15.8%.
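The title can be pulled out of the name with a regular expression, assuming names follow the "Surname, Title. Given names" pattern. The sample rows below are only for illustration, and the `Title` column name is hypothetical:

```python
import pandas as pd

train = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "Survived": [0, 1, 1, 0],
})

# Capture the text between the comma and the first period.
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Survival rate per title (mean of the 0/1 Survived column).
rate_by_title = train.groupby("Title")["Survived"].mean()
print(train["Title"].tolist())
```

On the full dataset, rare titles would probably need to be grouped together before comparing rates.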
Now I need to preprocess the data so that it fits the machine learning algorithm. I'm going to use the decision tree classifier in scikit-learn for this analysis.
In order to do that,
The Sex column has only two classes, male and female. I'm going to assign 0 to male and 1 to female.
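A minimal sketch of this encoding with a dictionary lookup, assuming the column is named `Sex` and holds the strings "male" and "female":

```python
import pandas as pd

train = pd.DataFrame({"Sex": ["male", "female", "female"]})

# Replace each label with its integer code; unknown labels would become NaN.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
print(train["Sex"].tolist())
```

The same mapping has to be applied to the test table so both tables use identical codes.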
A column must not contain null values if it is to be used for training.
There is a null value in the Fare column of the test data. Only one row is missing this value, so it would not be much of a loss even if I fill in 0.
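Checking for and filling the missing fare might look like this. The table `test_df` and its values are made up; only the idea of filling with 0 comes from the text:

```python
import pandas as pd

# Made-up test table with a single missing fare.
test_df = pd.DataFrame({"Fare": [7.83, None, 9.69]})

# Count the missing values before filling.
n_missing = int(test_df["Fare"].isna().sum())

# Fill the gap with 0, as described above.
test_df["Fare"] = test_df["Fare"].fillna(0)
print(n_missing, test_df["Fare"].tolist())
```

Filling with the median fare is a common alternative when more than a handful of rows are affected.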
The Embarked column also needs to be encoded. We could map each port to an integer, like:
Q == 2
However, if this is fed into the decision tree algorithm, the algorithm might get confused, as in the following scenario.
If S is 1 and Q is 2, then does 2 * S == Q? Or S + S == Q? When a column is just a label for a category, the algorithm might interpret it as a quantitative, ordered value.
To avoid that, I'm going to use the one-hot encoding technique.
Now, even if the algorithm adds or subtracts, it won't get confused.
In Python, True == 1 and False == 0, so it can be interpreted as the following.
pandas offers a function for creating dummy variables, and I'm going to use it to create the dummy columns.
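The pandas function for this is `pd.get_dummies`. A minimal sketch, assuming the column is named `Embarked` and using a made-up sample; the `Embarked_` prefix is a choice made here, not something the text specifies:

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; exactly one is set in each row.
dummies = pd.get_dummies(train["Embarked"], prefix="Embarked")

# Attach the indicator columns to the original table.
train = pd.concat([train, dummies], axis=1)
print(train.columns.tolist())
```

Depending on the pandas version the indicator columns are boolean or small integers; either works as model input, since True and False behave as 1 and 0.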
From the earlier result, we know that if you are under 15, your chance of survival goes up. However, when the decision tree builds a node, it does not always pick the best range. To help with this, I'm going to create a Child column, which is true if the passenger is under 15 and false otherwise.
Lastly, I found that passengers with the title Master differ from the tendency that males were unlikely to survive. Therefore, I'm going to make a column that distinguishes them from the other groups.
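Both flags can be derived with one line each. In this sketch the sample rows are for illustration only, the column names `Child` and `Master` follow the text, and matching the literal string "Master." in the name is an assumption about how the title is detected:

```python
import pandas as pd

train = pd.DataFrame({
    "Name": [
        "Palsson, Master. Gosta Leonard",
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
    ],
    "Age": [2.0, 22.0, 26.0],
})

# True for passengers under 15; a missing age compares as False.
train["Child"] = train["Age"] < 15

# True when the name contains the literal title "Master.".
train["Master"] = train["Name"].str.contains("Master.", regex=False)
print(train[["Child", "Master"]])
```

Note that passengers with a missing age end up with Child set to false, which may or may not be the desired behaviour.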
Now, based on the findings and the preprocessing, I'm going to train the algorithm.
X_train = training feature table
y_train = training label
X_test = prediction feature table
Now that training is done, we can predict whether someone survived or not using the test set.
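The training and prediction steps can be sketched with scikit-learn's `DecisionTreeClassifier`. Everything specific here is an assumption made for illustration: the tiny made-up tables, the three-feature list, and the `max_depth` and `random_state` settings; the actual analysis would use the full set of preprocessed columns.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature subset; the real list would include every engineered column.
feature_names = ["Pclass", "Sex", "Child"]

# Made-up, already-encoded training and test tables.
train = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1, 3],
    "Sex": [1, 0, 1, 0, 0, 1],
    "Child": [0, 0, 1, 0, 0, 0],
    "Survived": [1, 0, 1, 0, 0, 1],
})
test_df = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1], "Child": [0, 0]})

X_train = train[feature_names]   # training feature table
y_train = train["Survived"]      # training label
X_test = test_df[feature_names]  # prediction feature table

# Assumed hyperparameters; a shallow tree limits overfitting on small data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# One 0/1 prediction per row of the test table.
predictions = model.predict(X_test)
print(predictions)
```

The predictions can then be paired with the passenger IDs from the test table to build a submission file.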