This report aims to provide insights to the New York City Council's Transportation Committee on vehicle collisions in New York City (NYC). The report was conducted by the Data Operations team and the findings are derived from analyzing the dataset NYPD Motor Vehicle Collisions. With the Validate, Clean, Assess and Relate (VCAR) framework in mind, the analysis is divided as follows:
Analyzing collisions in NYC is a complex challenge that can be modeled using different approaches. With that in mind, we decided to tackle the analysis as a classification problem. Consequently, the report leaves out approaches that could potentially be convenient, such as statistical tests or prediction. We decided to use a classification setting for practical reasons: we are more interested in which collisions result in injuries and deaths than in predicting the exact number of injuries and deaths given a collision.
In this section, we will collect the raw data from NYC Open Data. A few things to know about the raw data are:
1. The total number of observations is 1,315,244, ranging from July 01, 2012 to July 29, 2018.
2. Each observation is one vehicle collision and each observation has 29 features.
3. The 29 features can be divided into the following groups:
- Location features: borough, cross_street_name, latitude, location, longitude, zip_code, off_street_name, and on_street_name.
- Injury and death counts: number_of_cyclist_injured, number_of_cyclist_killed, number_of_motorist_injured, number_of_motorist_killed, number_of_pedestrians_injured, number_of_pedestrians_killed, number_of_persons_injured, and number_of_persons_killed.
- Contributing factors and vehicle types: contributing_factor_vehicle_1, contributing_factor_vehicle_2, contributing_factor_vehicle_3, contributing_factor_vehicle_4, contributing_factor_vehicle_5, vehicle_type_code1, vehicle_type_code2, vehicle_type_code_3, vehicle_type_code_4, and vehicle_type_code_5.
The following table allows us to identify the proportion of NaN observations per feature in our data.
Seven out of the 29 features have more than 83% of their observations as NaNs. One of these seven, cross_street_name, does not concern us because its information can be substituted with other location features. The high percentage of NaNs in the other six is a result of the contributing_factor_vehicle_X and vehicle_type_codeX features allowing NYPD officers to record up to five contributing factors or vehicle types per collision. For example, if one collision involved only two vehicle types, three vehicle_type_codeX features will be NaN.
Since most classification models do not work well with NaN features, we will consolidate the contributing_factor_vehicle_X and vehicle_type_codeX columns into single features. In other words, instead of having up to five contributing factors per observation, we will have one list of contributing factors per observation. Consolidating these variables into single features has two main benefits: 1) it corrects the potential NYPD officer's error of reporting the same contributing factor or vehicle type more than once per observation, and 2) it helps us generate one hot encodings for these features in the following sections.
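The consolidation step can be sketched as follows. This is a minimal example with a hypothetical `consolidate` helper (the column names mirror the dataset, but the function name and toy data are illustrative only):

```python
import pandas as pd

def consolidate(df, prefix, n, new_col):
    """Collapse prefix1..prefixN columns into one deduplicated list per row,
    dropping NaNs and repeated values recorded for the same collision."""
    cols = [f"{prefix}{i}" for i in range(1, n + 1)]

    def row_to_list(row):
        seen = []
        for value in row:
            if pd.notna(value) and value not in seen:
                seen.append(value)
        return seen

    df[new_col] = df[cols].apply(row_to_list, axis=1)
    return df.drop(columns=cols)

# Toy data: the first collision has the same factor reported twice
df = pd.DataFrame({
    "contributing_factor_vehicle_1": ["Unspecified", "Fatigued/Drowsy"],
    "contributing_factor_vehicle_2": ["Unspecified", None],
})
df = consolidate(df, "contributing_factor_vehicle_", 2, "contributing_factors")
```

Deduplicating inside `row_to_list` is what corrects the repeated-factor reporting error mentioned above.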
Additionally, we will drop the observations where contributing_factor_vehicle_1, vehicle_type_code1, latitude, or longitude are NaN. This will ensure all observations have information about contributing factors, vehicle types, and location.
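The row-dropping step amounts to a single `dropna` call on those four columns; here is a minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "contributing_factor_vehicle_1": ["Unspecified", None],
    "vehicle_type_code1": ["TAXI", "SEDAN"],
    "latitude": [40.7, 40.6],
    "longitude": [-73.9, None],
})
# Keep only rows where factor, vehicle type, and coordinates are all present
df = df.dropna(subset=["contributing_factor_vehicle_1",
                       "vehicle_type_code1", "latitude", "longitude"])
```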
After dealing with NaNs, let's make sure the data type of our features is correct. We will pay particular attention to the date and time features. Converting these features to datetime type will be helpful in the following sections.
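The conversion can be done with `pd.to_datetime`; a minimal sketch, assuming the raw date and time come as separate string columns in the format used by NYC Open Data:

```python
import pandas as pd

df = pd.DataFrame({"date": ["07/01/2012"], "time": ["14:30"]})
# Combine the separate date and time strings into one datetime feature
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"],
                                format="%m/%d/%Y %H:%M")
```

Once the column is a datetime, components like year, weekday, and hour are available through the `.dt` accessor.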
As a final step in our collection and validation process, we will create the response features that will be used throughout the analysis. Specifically, we will create three binary variables indicating whether the collision resulted in 1) injuries, 2) deaths, or 3) deaths or injuries. Additionally, we will create features indicating the count of injuries, deaths, and injuries and deaths per collision. These features will be helpful in the data exploration section.
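Creating these response features is a few vectorized operations; a minimal sketch on toy counts (the binary feature names mirror the ones used later in the report):

```python
import pandas as pd

df = pd.DataFrame({"number_of_persons_injured": [0, 2, 0],
                   "number_of_persons_killed": [0, 0, 1]})

# Count of injuries plus deaths per collision
df["n_injured_killed"] = (df["number_of_persons_injured"]
                          + df["number_of_persons_killed"])
# Binary responses: did the collision result in injuries / deaths / either?
df["is_injured"] = (df["number_of_persons_injured"] > 0).astype(int)
df["is_killed"] = (df["number_of_persons_killed"] > 0).astype(int)
df["is_injured_killed"] = (df["n_injured_killed"] > 0).astype(int)
```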
Once we have collected and validated our data, it is time to clean it and conduct feature engineering. In general terms, this is what we are going to do:
1. Clean the location features: borough, cross_street_name, off_street_name, on_street_name, zip_code, latitude, and longitude.
2. Extract year, month, weekday, day, and hour from datetime. We will also create sine and cosine transformations for weekday, hour, and minutes elapsed. These features are useful for data modeling because they capture the cyclical nature of the variables. Without the transformation, 11:00 pm would be farther away from 2:00 am than from 11:00 am.
3. Generate one hot encodings for contributing_factors and vehicle_types. The one hot encodings will help us use these features as inputs for our classification models.
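The cyclical transformation maps each time value onto the unit circle, so clock-adjacent hours end up numerically close. A minimal sketch for the hour feature:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [23, 2, 11]})
# Map each hour onto the unit circle: 23:00 and 02:00 become neighbors
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

def dist(a, b):
    """Euclidean distance between two rows in (sin, cos) space."""
    return np.hypot(df.loc[a, "hour_sin"] - df.loc[b, "hour_sin"],
                    df.loc[a, "hour_cos"] - df.loc[b, "hour_cos"])
```

With this encoding, the distance between 11:00 pm and 2:00 am is smaller than the distance between 11:00 pm and 11:00 am, which is exactly the property described above.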
Since we are generating one hot encodings for a fairly large dataset, this function takes a few minutes to run. A more efficient implementation would be to use scikit-learn's MultiLabelBinarizer. However, we prefer to use native Python or pandas to make the logic behind the process clearer.
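A native-pandas version of the one hot encoding for a list-valued column can be sketched as follows (the toy column and `vt_` prefix are illustrative, not the report's exact names):

```python
import pandas as pd

df = pd.DataFrame({"vehicle_types": [["taxi", "bicycle"], ["taxi"]]})

# Explode the lists into one row per value, build dummy columns,
# then collapse back to one row per original observation
dummies = (df["vehicle_types"].explode()
           .str.get_dummies()
           .groupby(level=0).max())
df = df.join(dummies.add_prefix("vt_"))
```

`explode` keeps the original index on each list element, which is what lets `groupby(level=0)` reassemble one row per collision.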
After validating and cleaning our data, we are ready to start exploring it. Given the limited scope of this report, we will only explore the following visualizations:
Since we are using interactive graphs, you can show or hide each series by clicking its name in the legend box.
Let's start by exploring the sum of injuries and deaths by 1) year and month, 2) weekday, and 3) hour of the day. Here are some takeaways derived from these plots:
After exploring the sum of injuries and deaths by date and time variables, let's move on to explore the top 10 factors that contributed to collisions by hour of the day. Here are some takeaways derived from the plots:
Driver Inattention/Distraction, Failure to Yield Right-of-Way, Fatigued/Drowsy, Following Too Closely, Backing Unsafely, Other Vehicular, Turning Improperly, Lost Consciousness, Passing or Lane Usage Improper, and Traffic Control Disregarded.
Driver Inattention/Distractions, followed by Failure to Yield Right-of-Way and
Let's also check the top 10 vehicle types involved in collisions by hour of the day. A few takeaways are the following:
passenger vehicle, sport utility / station wagon, unknown, taxi, van, pick-up truck, other, sedan, station wagon/sport utility vehicle, bicycle, and small com veh (4 tires).
sport utility / station wagon.
sport utility / station wagon, unknown, and station wagon/sport utility vehicle. While natural language processing to correct this error is beyond the scope of this report, it is worth noting that a more exhaustive analysis would have to address this problem.
Maps are extremely helpful tools to visualize the frequency of injuries or deaths across NYC. The following map shows all locations (determined by latitude and longitude) where there was more than one death from July 2012 to July 2018. In the map, green circles mean between 2 and 4 deaths, yellow circles mean between 5 and 9 deaths, and red circles mean 10 or more deaths. While creating this map, this post was extremely useful. A few takeaways are the following:
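The color scheme described above can be sketched as a small binning function (the exact thresholds follow the description in the text; the function name is illustrative):

```python
def death_color(n_deaths):
    """Map a location's death count to the marker color used in the map."""
    if n_deaths >= 10:
        return "red"
    if n_deaths >= 5:
        return "yellow"
    if n_deaths >= 2:
        return "green"
    return None  # locations with fewer than 2 deaths are not plotted
```

A function like this would be called once per (latitude, longitude) group when adding circle markers to the map.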
Choosing the right model is always challenging. Given the limited scope of this report, we will utilize Logistic Regression and Gradient Boosting Classifier. We arbitrarily chose these models to utilize one of the classical classification algorithms (Logistic Regression) and one of the most widely used complex black box algorithms (Gradient Boosting Classifier).
Before jumping to modeling, it is important to check for unbalanced class labels. Since checking for unbalanced labels depends on the response variable that we choose, we conduct the check in this section of the report rather than in the validation or cleaning section. This will allow us to apply our modeling pipeline to each of our potential response variables. Let's start by checking how unbalanced our data is for each response variable:
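The balance check itself is one line per response variable with `value_counts(normalize=True)`; a minimal sketch on toy labels:

```python
import pandas as pd

# Toy labels roughly matching the 81% / 19% split reported below
df = pd.DataFrame({"is_injured_killed": [0] * 81 + [1] * 19})
balance = df["is_injured_killed"].value_counts(normalize=True)
```

`balance` then holds the share of each class, which is what the proportions quoted in the next paragraph come from.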
As we can see, our labels are extremely unbalanced. For the
is_injured_killed response variable, there are about 81% observations without injuries or deaths and 19% observations with either injuries or deaths. The case of the
is_killed response variable is even more extreme: about 99.895% of observations involve no deaths and only 0.105% involve at least one death. The problem with unbalanced data is that metrics like precision must be taken with a grain of salt and, most importantly, unbalanced data can bias our classifiers because most classifiers rely on a 50% probability cutoff.
Working with unbalanced data is a big topic in data science, and several alternatives exist to address the challenge. However, we will balance our data using subsampling, the simplest approach. In a nutshell, subsampling randomly selects a subset of the majority class that matches the number of observations in the minority class. Consequently, subsampling will result in data with the same number of observations with and without injuries and deaths.
We will use subsampling mainly for two reasons. On the one hand, it takes advantage of the considerably large dataset that we have (over 1 million observations). On the other, exploring other methods requires further analysis, which is outside the scope of this report. Finally, it is important to be aware that subsampling has an important downside: we lose information from the majority class.
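The subsampling step can be sketched as follows (the `subsample` helper name and toy data are illustrative):

```python
import pandas as pd

def subsample(df, label_col, seed=0):
    """Randomly downsample the majority class (label 0) to the size of the
    minority class (label 1), then shuffle the result."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    majority_sub = majority.sample(n=len(minority), random_state=seed)
    return (pd.concat([minority, majority_sub])
            .sample(frac=1, random_state=seed))

# Toy data: 5 positive and 95 negative observations
df = pd.DataFrame({"x": range(100), "is_killed": [1] * 5 + [0] * 95})
balanced = subsample(df, "is_killed")
```

Note how much data the majority class loses here: 90 of the 95 negative rows are discarded, which mirrors the downside discussed above.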
Before subsampling our data, let's select the features that we will use to fit our models. As a basic approach for the purposes of this report, let's remove all features that are not part of the four following categories:
So far, we have decided to use Logistic Regression and Gradient Boosting Classifier, we have realized that we need to undersample our data depending on the response variable, and we have also chosen our features. Now, let's create a class that fits each model to each response variable utilizing the given features. Each instance of the class will have the following attributes:
X_train, X_validation, X_test, y_train, y_validation, y_test
accuracy, precision, recall, f1
It is worth noting that exhaustive hyperparameter tuning with cross-validation should be conducted when choosing the right model for a classification problem. However, in order to keep the scope of this report limited, we will proceed with mostly default hyperparameters for our two models.
We are ready to compare the performance of our models on our set of response variables. Let's write a small wrapper to fit the models and make the comparison easier in a dataframe and in a grid of subplots.
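A stripped-down version of such a wrapper might look like the following. This is a sketch on synthetic data, not the report's actual implementation; the function name, splits, and metric set are assumptions chosen to match the attributes listed above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

def fit_and_score(models, X, y, seed=0):
    """Fit each model on a train split and report metrics on a test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        results[name] = {"accuracy": accuracy_score(y_te, pred),
                         "precision": precision_score(y_te, pred),
                         "recall": recall_score(y_te, pred),
                         "f1": f1_score(y_te, pred)}
    return results

# Synthetic stand-in for the subsampled collision data
X, y = make_classification(n_samples=500, random_state=0)
scores = fit_and_score({
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
}, X, y)
```

The nested-dictionary output converts directly into a comparison dataframe with `pd.DataFrame(scores)`.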
Once we have fitted all our models, we can also access all of them using the dictionary
trained_models, where the keys are the following:
As we can see in the results shown above, the model that had the best performance was Logistic Regression using "is_killed" as response. The bright side is that our best classifier gives us information on mortality, which is the main interest of the staffer from the Council’s Transportation Committee. The downside is that subsampling
is_killed resulted in losing about 99.8% of our data. In any case, exploring other methods to balance the data is beyond the scope of this report. Let's move on to analyze the results of our best classifier.
Keeping the confusion matrix in mind, let's interpret each of the performance metrics:
From a public policy perspective, we are most concerned about correctly classifying collisions that actually resulted in deaths. Therefore, recall is the most important metric for our classifier. Unfortunately, it is also the metric where our classifier performs worst, at 70%.
On the bright side, if we compare our baseline accuracy of 50% with our overall accuracy of 75%, our model correctly classifies 25 percentage points more collisions resulting in death than random classification.
After evaluating the performance of our classifier, let's take a look at the features ranked by our model as most important for classifying collisions resulting in deaths. Before jumping to interpreting the coefficients, it is important to note that ideally this interpretation should be conducted more carefully: some of our features have different scales and, since our data is not normalized, the coefficients may overestimate or underestimate the impact on the response variable.
However, for the purposes of this report, let's assume that the coefficients are reliable. The logic behind the interpretation is the following: the higher the absolute weight, the higher the impact. A positive weight means the feature increases the likelihood of death given a collision, while a negative weight means the feature decreases it.
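Extracting and ranking the coefficients can be sketched as follows, on synthetic data with hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the collision features; names are illustrative
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
features = [f"feature_{i}" for i in range(5)]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by coefficient: positive weights raise the estimated
# likelihood of death given a collision, negative weights lower it
weights = (pd.Series(model.coef_[0], index=features)
           .sort_values(ascending=False))
```

The head of `weights` then lists the features that most increase the likelihood of death, and the tail lists those that most decrease it.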
Here are the main takeaways from analyzing the features that increase the likelihood of death given a collision:
Here are the main takeaways from analyzing the features that decrease the likelihood of death given a collision:
With the information gathered from our models and the data that we cleaned, we are now ready to answer the questions posed by the Council’s Transportation Committee. Let's go one by one.
Unfortunately, answering that question with robust evidence will be hard. Since all the observations in our data represent collisions in NYC, we cannot directly compare observations that resulted in a collision against observations that did not result in a collision. We can, however, search for patterns within our collision data.