This report aims to provide insights to the New York City Council's Transportation Committee on vehicle collisions in New York City (NYC). The report was conducted by the Data Operations team, and the findings are derived from analyzing the NYPD Motor Vehicle Collisions dataset. With the Validate, Clean, Assess, and Relate (VCAR) framework in mind, the analysis is divided as follows:

Analyzing collisions in NYC is a complex challenge that can be modeled using different approaches. With that in mind, we decided to frame the analysis as a classification problem. Consequently, the report leaves out approaches that could potentially be convenient, such as statistical tests or prediction. We chose a classification setting for practical reasons: we are more interested in which collisions result in injuries and deaths than in predicting the exact number of injuries and deaths given a collision.

1. Data collection and validation


In this section, we will collect the raw data from NYC Open Data. A few things to know about the raw data:

  1. The total number of observations is 1,315,244, ranging from July 01, 2012 to July 29, 2018.
  2. Each observation is one vehicle collision, and each observation has 29 features.
  3. The 29 features can be divided into the following groups:

  • Primary key identifier (unique_key).
  • Location features (borough, cross_street_name, latitude, location, longitude, zip_code, off_street_name, and on_street_name).
  • Injuries or deaths (number_of_cyclist_injured, number_of_cyclist_killed, number_of_motorist_injured, number_of_motorist_killed, number_of_pedestrians_injured, number_of_pedestrians_killed, number_of_persons_injured, and number_of_persons_killed).
  • Categorical variables (contributing_factor_vehicle_1, contributing_factor_vehicle_2, contributing_factor_vehicle_3, contributing_factor_vehicle_4, contributing_factor_vehicle_5, vehicle_type_code1, vehicle_type_code2, vehicle_type_code_3, vehicle_type_code_4, and vehicle_type_code_5).
  • The contributing factor variables refer to the factors that the NYPD officer recorded as contributing to the collision. There are up to 5 factors per collision, giving the NYPD officer the opportunity to list more than one contributing factor. There are 59 different contributing factors in the data.
  • Similar to the contributing factor variables, the vehicle type variables refer to the vehicles that the NYPD officer recorded as involved in the collision. There are up to 5 vehicle types per collision, giving the NYPD officer the opportunity to list more than one vehicle type. There are 402 different vehicle types in the data.

1.1 Data collection


1.2 Dealing with NaN


The following table shows the proportion of NaN values per feature in our data.


Seven of the 29 features have more than 83% of their observations as NaN. One of these 7 features, cross_street_name, does not concern us because its information can be substituted with other location features. The other 6 features with a high percentage of NaNs result from the fact that the contributing_factor_vehicle_X and vehicle_type_codeX features allow NYPD officers to record up to 5 contributing factors or vehicle types. For example, if a collision involved only two vehicle types, three vehicle_type_codeX features will be NaN.

Since most classification models do not work well with NaN features, we will consolidate contributing_factor_vehicle_X and vehicle_type_codeX into single features. In other words, instead of having up to five contributing factors per observation, we will have one list of contributing factors per observation. Consolidating these variables into single features has two main benefits: 1) it corrects the potential NYPD officer's error of reporting the same contributing factor or vehicle type more than once per observation, and 2) it helps us generate one hot encodings for these features in the following sections.

Additionally, we will drop the observations where contributing_factor_vehicle_1, vehicle_type_code1, latitude, and longitude are NaN. This will ensure all observations have information about contributing factors, vehicle types, and location.
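The consolidation and row-dropping steps above can be sketched in pandas as follows. This is a minimal sketch on a toy frame: the column names follow the report (the real data has five factor columns), but the values are made up.

```python
import pandas as pd

# Toy frame standing in for the raw collisions data.
df = pd.DataFrame({
    "contributing_factor_vehicle_1": ["Unsafe Speed", "Driver Inattention/Distraction", None],
    "contributing_factor_vehicle_2": ["Unsafe Speed", "Failure to Yield Right-of-Way", None],
    "latitude": [40.70, 40.65, None],
    "longitude": [-73.94, -73.90, None],
})

factor_cols = [c for c in df.columns if c.startswith("contributing_factor_vehicle_")]

def consolidate(row):
    # Keep non-null values, dropping duplicates while preserving order;
    # this also corrects an officer recording the same factor twice.
    seen = []
    for value in row:
        if pd.notna(value) and value not in seen:
            seen.append(value)
    return seen

df["contributing_factors"] = df[factor_cols].apply(consolidate, axis=1)

# Drop observations missing the first factor or the coordinates.
df = df.dropna(subset=["contributing_factor_vehicle_1", "latitude", "longitude"])
```

The same pattern applies to the vehicle_type_codeX columns.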


1.3 Parsing data types


After dealing with NaNs, let's make sure the data types of our features are correct. We will pay particular attention to the date and time features. Converting these features to datetime type will be helpful in the following sections.
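A sketch of this parsing step, assuming the raw date and time arrive as strings in the US-style format used by the NYC Open Data export:

```python
import pandas as pd

# Hypothetical raw date/time strings.
df = pd.DataFrame({"date": ["07/01/2012", "07/29/2018"],
                   "time": ["0:05", "23:45"]})

# Combine both columns and parse them into a single datetime feature.
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"],
                                format="%m/%d/%Y %H:%M")
```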

1.4 Generating response variables


As a final step in our collection and validation process, we will create the response features that will be used throughout the analysis. Specifically, we will create three binary variables indicating whether the collision resulted in 1) injuries, 2) deaths, or 3) deaths or injuries. Additionally, we will create features indicating the count of injuries, deaths, and injuries and deaths per collision. These features will be helpful in the data exploration section.
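A sketch of how these response features could be generated, using the persons-level count columns (the report's actual implementation may combine the cyclist, motorist, and pedestrian columns as well):

```python
import pandas as pd

# Toy counts per collision.
df = pd.DataFrame({"number_of_persons_injured": [0, 2, 0],
                   "number_of_persons_killed": [0, 0, 1]})

# Binary response variables.
df["is_injured"] = (df["number_of_persons_injured"] > 0).astype(int)
df["is_killed"] = (df["number_of_persons_killed"] > 0).astype(int)
df["is_injured_killed"] = ((df["is_injured"] + df["is_killed"]) > 0).astype(int)

# Count feature for the exploration section.
df["num_injured_killed"] = (df["number_of_persons_injured"]
                            + df["number_of_persons_killed"])
```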


2. Clean data


Once we have collected and validated our data, it is time to clean it and conduct feature engineering. In general terms, this is what we are going to do:

  • We will not change the location variables borough, cross_street_name, off_street_name, on_street_name, zip_code, latitude, and longitude.
  • We will not change the numerical variables number_of_cyclist_injured, number_of_cyclist_killed, number_of_motorist_injured, number_of_motorist_killed, number_of_pedestrians_injured, number_of_pedestrians_killed, number_of_persons_injured, and number_of_persons_killed.
  • We will also not change the primary key unique_key.
  • We will process the datetime to get year, month, weekday, day, and hour. We will also create sine and cosine transformations for weekday, hour, and minutes elapsed. These features are useful for data modeling because they capture the cyclical nature of the variables. Without transformation, 11:00 pm would be farther away from 2:00 am than 11:00 am.
  • We will create one hot encodings of the consolidated variables contributing_factors and vehicle_types. The one hot encodings will help us use these features as inputs for our classification models.
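The cyclical transformation mentioned above can be sketched as follows. Note how 11:00 pm ends up closer to 2:00 am than to 11:00 am once the hour is mapped onto the unit circle (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [23, 2, 11]})

# Map the hour of day onto the unit circle.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

def dist(i, j):
    # Euclidean distance between two hours in the transformed space.
    return np.hypot(df.loc[i, "hour_sin"] - df.loc[j, "hour_sin"],
                    df.loc[i, "hour_cos"] - df.loc[j, "hour_cos"])

# 11 pm is now closer to 2 am (3 hours apart) than to 11 am (12 hours apart).
```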

2.1 Clean datetime variable


2.2 Generate one hot encodings for contributing factors and vehicles types


Since we are generating one hot encodings for a fairly large dataset, this function takes a few minutes to run. A more efficient implementation might use Scikit-learn's MultiLabelBinarizer. However, we prefer to use native Python or pandas to make the logic behind the process clearer.
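A pandas-only sketch of the idea, assuming the consolidated column already holds one list of factors per observation as produced in section 1.2:

```python
import pandas as pd

# Toy consolidated column: one list of factors per collision.
df = pd.DataFrame({"contributing_factors": [
    ["Unsafe Speed"],
    ["Unsafe Speed", "Driver Inattention/Distraction"],
]})

# Explode the lists into one row per factor, build 0/1 indicator columns,
# then collapse back to one row per original observation.
dummies = (pd.get_dummies(df["contributing_factors"].explode(), dtype=int)
             .groupby(level=0)
             .max())
df = df.join(dummies)
```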


3. Assess data: exploration


After validating and cleaning our data, we are ready to start exploring it. Given the limited scope of this report, we will only explore the following visualizations:

  • Sum of injuries and deaths by year and month, weekday, and hour of the day.
  • Top 10 contributing factors to collisions by hour of the day.
  • Top 10 vehicle types involved in collisions by hour of the day.
  • Map of locations with more than one death.

Since we are using interactive graphs, you can show or hide each series by clicking on the name of the series in the legend box.

3.1 Sum of injuries and deaths by year and month, weekday, and hour of the day


3.1.1 Sum of injuries and deaths by year and month


Let's start by exploring the sum of injuries and deaths by 1) year and month, 2) weekday, and 3) hour of the day. Here are some takeaways derived from these plots:

  • There seems to be a declining trend in accidents from December to February across all years.
  • The number of people killed in accidents slightly decreased from 2013 to early 2017. However, this number has been increasing since early 2017.
  • Friday, Saturday, and Monday are the weekdays with the most deaths, with Friday having the highest count. Friday is also the day of the week with the most injuries, followed by Thursday and Saturday.
  • The distributions of deaths and injuries across hours of the day are considerably different. Most deaths occur between 5:00 pm and midnight, although 4:00 am is the single hour with the most deaths. In contrast, most injuries occur between 2:00 pm and 7:00 pm, with a peak at 5:00 pm.

3.1.2 Sum of injuries and deaths by weekday


3.1.3 Sum of injuries and deaths by hour of the day


3.2 Top 10 contributing factors to collisions by hour of the day


After exploring the sum of injuries and deaths by date and time variables, let's move on to the top 10 factors that contributed to collisions by hour of the day. Here are some takeaways derived from the plots:

  • The top ten contributing factors to collisions are Driver Inattention/Distraction, Failure to Yield Right-of-Way, Fatigued/Drowsy, Following Too Closely, Backing Unsafely, Other Vehicular, Turning Improperly, Lost Consciousness, Passing or Lane Usage Improper, and Traffic Control Disregarded.
  • By far, the contributing factor associated with the most collisions is Driver Inattention/Distraction, followed by Failure to Yield Right-of-Way and Fatigued/Drowsy.
  • All top ten contributing factors follow a distribution by hour of the day similar to that of injuries: they increase after 5:00 am and peak between 4:00 and 5:00 pm.

3.3 Top 10 vehicle types involved in collisions by hour of the day


Let's also check the top 10 vehicle types involved in collisions by hour of the day. A few takeaways are the following:

  • The top vehicle types involved in collisions are passenger vehicle, sport utility / station wagon, unknown, taxi, van, pick-up truck, other, sedan, station wagon/sport utility vehicle, bicycle, and small com veh (4 tires).
  • By far, the vehicle types most frequently involved in collisions are passenger vehicle and sport utility / station wagon.
  • Two of the top vehicle types involved in collisions are clearly the same category recorded under different names: sport utility / station wagon and station wagon/sport utility vehicle. While using natural language processing to correct this error is beyond the scope of this report, it is worth noting that a more exhaustive analysis would have to address this problem.

3.4 Map of locations with more than one death


Maps are extremely helpful for visualizing the frequency of injuries or deaths across NYC. The following map shows all locations (determined by latitude and longitude) where there was more than one death from July 2012 to July 2018. In the map, green circles mean between 2 and 4 deaths, yellow circles mean between 5 and 9 deaths, and red circles mean 10 or more deaths. While creating this map, this post was extremely useful. A few takeaways are the following:

  • Special priority should be given to the two locations with red circles, where there have been 10 and 16 deaths respectively.
  • The 16 locations showing between 5 and 9 deaths (yellow circles) should also be carefully analyzed.
  • Another insightful map would be to color the circles depending on the hour of the day. This would allow us to map the locations with more deaths by hour.

4. Assess data: modeling


Choosing the right model is always challenging. Given the limited scope of this report, we will utilize Logistic Regression and Gradient Boosting Classifier. We arbitrarily chose these models to include one of the classical classification algorithms (Logistic Regression) and one of the most widely used complex black-box algorithms (Gradient Boosting Classifier).

4.1 Checking for unbalanced labels


Before jumping to modeling, it is important to check for unbalanced class labels. Since checking for unbalanced labels depends on the response variable that we choose, we conduct the check in this section of the report rather than in the validation or cleaning section. This will allow us to apply our modeling pipeline to each of our potential response variables. Let's start by checking how unbalanced our data is for each response variable:

  • is_injured
  • is_killed
  • is_injured_killed

As we can see, our labels are extremely unbalanced. For the is_injured and is_injured_killed response variables, about 81% of observations have no injuries or deaths and 19% have either injuries or deaths. The case of the is_killed response variable is even more extreme: about 99.895% of observations have no deaths and only 0.105% have at least one death. The problem with unbalanced data is that metrics like precision must be taken with a grain of salt and, most importantly, unbalanced data can bias our classifiers because most classifiers rely on a 50% probability cutoff.

Working with unbalanced data is a big topic in data science, and several alternatives exist to address the challenge. However, we will balance our data using subsampling, the simplest approach. In a nutshell, subsampling randomly selects a subset of the majority class that matches the number of observations of the minority class. Consequently, subsampling will result in data with the same number of observations with and without injuries and deaths.

We will use subsampling mainly for two reasons. On the one hand, it takes advantage of our considerably large dataset (over 1 million observations). On the other, exploring other methods requires further analysis, which is outside the scope of this report. Finally, it is important to be aware that subsampling has an important downside: we are losing information from the majority class.
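Subsampling as described above can be sketched in a few lines of pandas (toy data; the report applies this to each response variable in turn):

```python
import pandas as pd

# Toy unbalanced data: 98 collisions without deaths, 2 with deaths.
df = pd.DataFrame({"is_killed": [0] * 98 + [1] * 2,
                   "x": range(100)})

minority = df[df["is_killed"] == 1]
# Randomly pick as many majority-class rows as there are minority rows.
majority = df[df["is_killed"] == 0].sample(n=len(minority), random_state=0)

# Concatenate and shuffle; the result is perfectly balanced.
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)
```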

4.2 Determining the features for our models


Before subsampling our data, let's select the features that we will use to fit our models. As a basic approach for the purposes of this report, let's remove all features that are not part of the following four categories:

  • Numerical location variables (latitude and longitude)
  • Datetime variables (sine and cosine transformations, as well as year and month number)
  • One hot encoding of contributing factors
  • One hot encoding of vehicle types

4.3 Fitting models


So far, we have decided to use Logistic Regression and Gradient Boosting Classifier, we have established that we need to undersample our data depending on the response variable, and we have chosen our features. Now, let's create a class that fits each model to each response variable using the given features. Each instance of the class will have the following attributes:

  • Original number of observations as original_num_obs
  • Number of observations lost due to subsampling as obs_lost_due_subsampling
  • Percentage of observation lost due to subsampling as prop_obs_lost_due_subsampling
  • Dataframe balanced with subsampling as df_subsampling
  • Train, validation, and test splits as X_train, X_validation, X_test, y_train, y_validation, y_test
  • Accuracy, precision, recall, and f1 as accuracy, precision, recall, f1
  • Dataframe with features ranked by importance as features_df
  • Confusion matrix seaborn object as confusion_matrix

It is worth noting that exhaustive hyperparameter tuning with cross-validation should be conducted when choosing the right model for the classification problem. However, to keep the scope of this report limited, we will proceed with mostly default hyperparameters for our two models.
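A minimal sketch of the fit-and-evaluate step, using synthetic balanced data in place of the collisions features (the report's actual class also tracks subsampling statistics, feature importances, and the confusion matrix):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced collisions data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Mostly default hyperparameters, as in the report.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
precision = precision_score(y_test, clf.predict(X_test))
recall = recall_score(y_test, clf.predict(X_test))
```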

4.4 Comparing performance


We are ready to compare the performance of our models on our set of response variables. Let's write a nice little wrapper to fit the models and make the comparison easier in a dataframe and in a grid of subplots.


Once we have fitted all our models, we can also access all of them through the dictionary trained_models, where the keys are the following:

  • 'LogisticRegression-is_killed'
  • 'LogisticRegression-is_injured'
  • 'LogisticRegression-is_injured_killed'
  • 'GradientBoostingClassifier-is_killed'
  • 'GradientBoostingClassifier-is_injured'
  • 'GradientBoostingClassifier-is_injured_killed'

4.5 Analyzing performance metrics of the final model


As we can see in the results shown above, the model with the best performance was Logistic Regression using is_killed as the response. The bright side is that our best classifier gives us information on mortality rate, which is the main interest of the staffer from the Council's Transportation Committee. The downside is that subsampling is_killed resulted in losing about 99.8% of our data. In any case, exploring other methods to balance the data is beyond the scope of this report. Let's move on to analyze the results of our best classifier.

Keeping the confusion matrix in mind, let's interpret each of the performance metrics:

  • The accuracy achieved was 75%. In other words, if we give our classifier features describing a new collision, we should expect it to correctly predict whether the collision resulted in one or more deaths 75 out of 100 times.
  • The precision achieved was 79%. In other words, among the collisions that the classifier predicted as having deaths, about 79% actually resulted in deaths.
  • The recall achieved was 70%. In other words, among the collisions that actually resulted in deaths, the classifier correctly predicted 70 out of 100 cases.
  • The baseline accuracy is 50%. It refers to the number of correct labels we would get if we randomly assigned labels to our collisions. Since our classification problem is binary, our baseline accuracy is equal to the mean of the response variable. It is worth noting that we have a baseline accuracy of 50% because we balanced our data.

From a public policy perspective, we are most concerned about correctly classifying collisions that actually resulted in deaths. Therefore, recall is the most important metric for our classifier. Unfortunately, it is also the metric where our classifier performs worst, at 70%.

On the bright side, comparing our baseline accuracy of 50% with our overall accuracy of 75%, our model correctly classifies 25 percentage points more collisions than random classification would.
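To make the metric definitions above concrete, here is a small worked example with hypothetical confusion-matrix counts chosen to roughly match the reported 75% accuracy, 79% precision, and 70% recall (illustrative numbers, not the model's actual confusion matrix):

```python
# Hypothetical counts: 100 collisions with deaths, 100 without.
tp, fn = 70, 30   # collisions with deaths: predicted correctly / missed
tn, fp = 81, 19   # collisions without deaths: predicted correctly / false alarms

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 151/200 = 0.755
precision = tp / (tp + fp)                   # 70/89 ≈ 0.79
recall = tp / (tp + fn)                      # 70/100 = 0.70
f1 = 2 * precision * recall / (precision + recall)
```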

4.6 Analyzing most important features


After evaluating the performance of our classifier, let's take a look at the features ranked by our model as most important for classifying collisions resulting in deaths. Before jumping to interpreting the coefficients, however, it is important to note that ideally these coefficients should be interpreted more carefully. Some of our features have different scales and, since our data is not normalized, the coefficients may overestimate or underestimate the impact on the response variable.

However, for the purposes of this report, let's assume that the coefficients are reliable. The logic behind the interpretation is the following: the larger the magnitude of the weight, the higher the impact; a positive weight means the feature increases the likelihood of death given a collision, while a negative weight means the feature decreases it.

Here are the main takeaways from analyzing the features that increase the likelihood of death given a collision:

  • By far, riding a motorcycle is the most important factor that increases the likelihood of death if involved in an accident.
  • Traffic control disregarded, unsafe speed, illness, and passenger distraction are also key factors that increase the likelihood of death if involved in an accident.
  • With smaller magnitude, location variables are related to an increased likelihood of death given a collision.
  • Time elapsed in the day also increases the likelihood of death given a collision. In other words, the later in the day the collision happens, the more likely it is to result in deaths.

Here are the main takeaways from analyzing the features that decrease the likelihood of death given a collision:

  • If there is a collision in NYC, it is less likely to result in deaths if the contributing factor was turning improperly, fatigued/drowsy, passing or lane usage improper, or reaction to another uninvolved vehicle.
  • If there is a collision in NYC, it is less likely to result in deaths if the vehicle type involved in the accident is a sedan, other vehicle, a passenger vehicle, or a livery vehicle.
  • Location and time variables do not seem to have much impact in decreasing the likelihood of death given a collision.

5. Relate data: Answering the questions


With the information gathered from our models and the data that we cleaned, we are now ready to answer the questions posed by the Council's Transportation Committee. Let's go one by one.

5.1 What factors impact the likelihood of collisions?


Unfortunately, answering this question with robust evidence will be hard. Since all the observations in our data represent collisions in NYC, we cannot directly compare observations that resulted in a collision against observations that did not result in a collision. We can, however, search for patterns within our collision data.

5.2 What factors impact the likelihood of mortality rate from collisions?


5.3 What measures can the public take to avoid collisions?


5.4 Is mortality rate impacted by the type of vehicles driven?


5.5 What are the best and worst times to be driving when collision safety is your only consideration?


5.6 What factors impact injury and mortality rates, and how so?


5.7 What effect does vehicle type have on mortality rate given an accident?
