The data used in this project was obtained from the Stanford Open Policing Project and is made available under the Open Data Commons Attribution License. Their working paper can be found at this link.
I have recently begun learning about data science, and this project is meant to demonstrate some of the things I have learned. It explores the stop data from the state of Vermont. Most of my learning so far has been in Python, so I will use Python to explore the data.
First, we do some preliminary exploratory data analysis. The .info method from Pandas shows how many non-null entries each column has. Since there are 10,000 total rows, any smaller count means that some data is missing from that column. We can see that most columns have very little missing data, with the exception of the search_type column, which appears to have only 91 non-null entries.
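As a minimal sketch of this check, the snippet below builds a tiny stand-in table and counts the missing values per column. The rows are made up for illustration, and the stop_date and driver_race column names are assumptions; only search_type is named in the text above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Vermont stops table (hypothetical rows, not the real data).
df = pd.DataFrame({
    "stop_date": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
    "driver_race": ["White", "Black", np.nan, "White"],
    "search_type": [np.nan, "Warrant", np.nan, np.nan],
})

# .info() prints one line per column with its non-null count and dtype.
df.info()

# The same information as Series objects, which are easier to sort and compare.
non_null = df.count()        # non-null entries per column
missing = df.isna().sum()    # missing entries per column
print(missing.sort_values(ascending=False))
```

Sorting the missing counts puts the sparsest column first, which is how a column like search_type would stand out.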
We will examine the search_type column to see why it has so many missing values. This may be because searches were conducted in only 91 of the 10,000 total stops in this data. We see that search_type_raw has more non-null entries, so it might contain more information. From the head printed above, we see that No Search Conducted seems to be one common entry.
Using the value_counts() method, we see that there are 9800 rows where no search was conducted, which explains why search_type has so many missing entries. Also, among the rows where a search was conducted, there were only 3 different types of searches: Probable Cause, Reasonable Suspicion, and Warrant. To examine the searches in more detail, we will subset the data to select just those rows.
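A rough sketch of the counting and subsetting steps is shown here on made-up rows. The exact raw labels other than No Search Conducted are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only "No Search Conducted" is a label taken from the text.
df = pd.DataFrame({
    "search_type_raw": ["No Search Conducted", "No Search Conducted",
                        "Probable Cause", "Warrant", np.nan],
    "search_type": [np.nan, np.nan, "Probable Cause", "Warrant", np.nan],
})

# Count each raw label, keeping NaN as its own category.
raw_counts = df["search_type_raw"].value_counts(dropna=False)
print(raw_counts)

# Keep only the rows where a search type was actually recorded.
searches = df[df["search_type"].notna()]
print(searches["search_type"].value_counts())
```

Filtering on notna() keeps exactly the rows with a recorded search type, so the later questions are answered on searches only.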
There are many questions we could ask about these searches. How many resulted in arrest? In how many searches was contraband found? For these searches, how many drivers were male and how many were female? What was the race of each driver?
We see that most of the drivers were not arrested after the search was performed. How many had contraband?
It is interesting that most searches turned up contraband, yet most searches did not end in an arrest. We can use a pivot table to see the breakdown of contraband and arrests.
We can see, then, that there were only 4 drivers who were arrested after having their car searched and not having any contraband. On the other hand, 29 drivers were not arrested despite having contraband found upon being searched.
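One way to build such a breakdown is pd.crosstab, sketched below on made-up rows. The column names contraband_found and is_arrested are assumptions, since the text does not name them, and crosstab is a stand-in for whatever pivot call was originally used.

```python
import pandas as pd

# Hypothetical searched stops (not the real data); column names are assumed.
searches = pd.DataFrame({
    "contraband_found": [True, True, True, False, False, True],
    "is_arrested":      [True, False, False, True, False, False],
})

# Rows: was contraband found? Columns: was the driver arrested?
# Each cell counts the searches that fall into that combination.
table = pd.crosstab(searches["contraband_found"], searches["is_arrested"])
print(table)
```

The cell in the contraband_found=True row and is_arrested=False column corresponds to drivers who had contraband but were not arrested.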
Next, let's look into some demographic information about the drivers. We examine how many drivers of each race were stopped and how many of each race were searched.
These values only sum to 89, but there were 91 searches conducted, so it must be that the race is not known for 2 of the drivers who were searched. Of the 89 known drivers, we calculate the percentage for each race.
So 89.9% of searched drivers were White, 4.5% were Black, and 4.5% were Hispanic.
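The percentage calculation can be sketched as follows, using a made-up series of driver races in which two values are missing. value_counts skips missing values by default, which is why the counts sum to fewer rows than there are searches. The series name driver_race is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical races of searched drivers; two entries are unknown (NaN).
searched_race = pd.Series(
    ["White"] * 8 + ["Black", "Hispanic", np.nan, np.nan],
    name="driver_race",
)

# NaN is excluded by default, so the counts cover known races only.
counts = searched_race.value_counts()
print(counts.sum(), "known of", len(searched_race), "searches")

# normalize=True divides by the number of known values, giving shares.
pct = (searched_race.value_counts(normalize=True) * 100).round(1)
print(pct)
```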
From the .info() method call above, we see that the driver race is known for 9773 drivers. Using these value counts, we can calculate the percentages for each race.
We see that of the 9773 drivers whose race is known, 95.8% were White, 1.9% were Black, 1.3% were Asian, and only 0.7% were Hispanic. So White and Asian drivers were underrepresented, and Black and Hispanic drivers were overrepresented, among drivers whose vehicles were searched after being stopped.
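To make the comparison concrete, the sketch below puts the percentages quoted above side by side for the three races that have a figure in both groups, and divides one by the other. The ratio column is my own illustrative measure, not something computed in the original analysis: a value above 1 means the group makes up a larger share of searched drivers than of stopped drivers.

```python
import pandas as pd

# Percentages quoted in the text for races with a figure in both groups.
shares = pd.DataFrame(
    {"stopped_pct": [95.8, 1.9, 0.7], "searched_pct": [89.9, 4.5, 4.5]},
    index=["White", "Black", "Hispanic"],
)

# Ratio of share among searched drivers to share among stopped drivers.
shares["ratio"] = shares["searched_pct"] / shares["stopped_pct"]
print(shares.round(2))
```

With only 89 searches of known race, these ratios rest on very small counts, so they are better read as a rough indication than as precise rates.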
In the bar graph above, the percentage of White drivers is so much greater than the rest that it is hard to compare the values for non-White drivers. We will drop White drivers from the plot to get a better idea of the disparity among minority drivers.
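Dropping one category before plotting might look like the sketch below. It reuses the percentages quoted earlier for the races that have a figure in both groups; the axis labels and title are my own choices, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook
import pandas as pd

# Percentages quoted in the text for races with a figure in both groups.
shares = pd.DataFrame(
    {"stopped_pct": [95.8, 1.9, 0.7], "searched_pct": [89.9, 4.5, 4.5]},
    index=["White", "Black", "Hispanic"],
)

# Remove the dominant category so the remaining bars share a readable scale.
minority = shares.drop(index="White")

# Grouped bar chart: one pair of bars (stopped, searched) per race.
ax = minority.plot.bar(rot=0)
ax.set_xlabel("Driver race")
ax.set_ylabel("Percent of drivers")
ax.set_title("Stopped vs. searched drivers (White omitted)")
```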
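When looking only at the percentages for minority drivers, we see that Black, Hispanic, and "Other" drivers made up a noticeably larger share of searched drivers than of stopped drivers, which suggests that, once stopped, they were more likely to be searched.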