Analysis of St. Louis City neighborhoods based on demographic, crime, and places-of-interest data.
St. Louis is a midwestern city in the state of Missouri. It has a population of around 350,000, and the city is divided into 79 neighborhoods. The goal of this project is to gather and explore data and apply machine learning techniques in order to cluster these neighborhoods according to their similarities. Specifically, we will look at places of interest in each neighborhood (data provided by the FourSquare API) as well as crime rates in the city (data provided by the St. Louis Metropolitan Police Department). The results of the analysis are targeted towards city dwellers and business owners, who can use them to make informed decisions in their day-to-day life in the city.
The data for this project is gathered from three main sources, which are combined for the final exploration and analysis:
The data from these sources is combined and analyzed in the following manner: the Wikipedia article provides the neighborhood names, which are then used to obtain longitude and latitude information. After correlating the Police Department data with the Wikipedia data, a merged view is created in which crime and demographic data can be seen together. Finally, based on the neighborhood coordinates computed earlier, the Foursquare API is used to source venues of interest, which become the main input for clustering.
This section describes in detail the steps performed to extract meaningful information from the data sources mentioned above.
As with any data science project, the first step was to import the necessary Python packages and define the constants used later on.
Once the packages were imported, the next step was to fetch the Wikipedia article and extract the table with the relevant data. The Python packages BeautifulSoup and requests were used for this.
The dataframe built has the following columns -
A sample extract of the data is shown below:
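The scraping step can be sketched as follows. The snippet parses a small inline HTML fragment standing in for the Wikipedia page; the table markup, column names, and population figures here are illustrative assumptions, not the article's actual content, and in the real notebook the HTML would come from `requests.get(url).text`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for the Wikipedia neighborhoods table
# (placeholder rows and figures, not real data).
html = """
<table class="wikitable">
  <tr><th>Neighborhood</th><th>Population</th></tr>
  <tr><td>Downtown</td><td>3721</td></tr>
  <tr><td>Dutchtown</td><td>14458</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr")[1:]:          # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({"Neighborhood": cells[0], "Population": int(cells[1])})

df = pd.DataFrame(rows)
print(df)
```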
The next step was to use the neighborhood names to look up longitude and latitude information and append it to the dataframe. GeoPy was used for this. The GeoPy data is noisy, and data was missing for 22 neighborhoods. The code and output below show the list of neighborhoods for which no data was found.
The data collected so far was written to a CSV file for further processing.
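The lookup-and-collect-missing pattern can be sketched as below. A stubbed lookup table stands in for live Nominatim calls (which need network access), so the names and coordinates here are purely illustrative.

```python
import pandas as pd

# Stand-in for geopy's Nominatim geocoder; in the notebook this would be
# geolocator.geocode(name + ", St. Louis, MO"), returning a Location or None.
fake_geocode = {
    "Downtown": (38.6270, -90.1994),   # illustrative coordinates
    "Dutchtown": (38.5700, -90.2500),
}

df = pd.DataFrame({"Neighborhood": ["Downtown", "Dutchtown", "The Gate District"]})

lats, lons, missing = [], [], []
for name in df["Neighborhood"]:
    loc = fake_geocode.get(name)       # real code: geolocator.geocode(...)
    if loc is None:
        missing.append(name)           # collect neighborhoods with no result
        lats.append(None)
        lons.append(None)
    else:
        lats.append(loc[0])
        lons.append(loc[1])

df["Latitude"], df["Longitude"] = lats, lons
print("No data found for:", missing)
```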
For the neighborhoods where the geopy geolocator could not find coordinates, a manual Google search was done. After the Google search, 2 out of 79 neighborhoods (The Gate District and Peabody Darst Webbe) still had no results. For these two, coordinates were taken from the related geojson file. Once the cleanup was done, the final CSV was read from disk for further analysis and processing. Sample data is shown below:
To verify that the coordinate and neighborhood data covers the whole of St. Louis, the data was plotted on a map using Folium. Minor adjustments had to be made to a couple of areas to match the dataframe and geojson files. The output is shown below. The data seems mostly accurate, except for a couple of coordinates where the neighborhood centers seem off, and one neighborhood entry that is completely missing from the geojson. For the scope of this project, it was decided that these anomalies can be tolerated.
The St. Louis Crime Report data (available at this link) records the neighborhood as a numeric value. A reference sheet in the FAQ document lists the ID-to-Neighborhood mapping. The next step in this project is to load the mapping from this document, join it with df_merged, and clean any data which is missing or different. The details of these steps can be seen in the code section below.
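The join can be sketched with pandas as below; the column names (NeighborhoodId, Neighborhood) and the sample rows are assumptions about the mapping sheet's layout, not its actual contents.

```python
import pandas as pd

# Hypothetical ID-to-name mapping loaded from the police FAQ document.
id_map = pd.DataFrame({
    "NeighborhoodId": [1, 2, 3],
    "Neighborhood": ["Carondelet", "Patch", "Holly Hills"],
})

# Neighborhood data assembled earlier (names, coordinates; values illustrative).
df_merged = pd.DataFrame({
    "Neighborhood": ["Carondelet", "Patch", "Holly Hills"],
    "Latitude": [38.556, 38.546, 38.558],
    "Longitude": [-90.252, -90.257, -90.271],
})

# Left-join so every neighborhood keeps its row even if its ID is missing,
# then flag unmatched rows for manual cleanup.
df_merged = df_merged.merge(id_map, on="Neighborhood", how="left")
unmatched = df_merged[df_merged["NeighborhoodId"].isna()]
print(df_merged)
print("Rows needing cleanup:", len(unmatched))
```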
For the year 2018, crime data was obtained from the St. Louis Metro Police Dept website. In total there were 46,742 crimes of various categories; we will get into the details as we go. Monthly crime records were downloaded and are available on GitHub in the crime_records folder. These will be verified for shape and mergeability into a single dataset.
df_merged - Neighborhood data with long/lat, demographics and NeighborhoodIds
merged_crime_df - 2018 crime data, not yet grouped by neighborhood
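Combining the monthly files can be sketched as below; the file contents are inlined via StringIO to keep the example self-contained, and the column names are assumptions about the police department's CSV layout.

```python
import io
import pandas as pd

# Two inline "monthly" CSVs standing in for files in the crime_records folder.
jan = io.StringIO("Crime,Neighborhood\nLarceny,35\nAssault,61\n")
feb = io.StringIO("Crime,Neighborhood\nBurglary,35\nLarceny,28\n")

monthly = [pd.read_csv(f) for f in (jan, feb)]

# Verify every month has the same columns before merging.
assert all(list(m.columns) == list(monthly[0].columns) for m in monthly)

# One row per reported crime across the whole year.
merged_crime_df = pd.concat(monthly, ignore_index=True)
print(merged_crime_df.shape)
```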
Before that, we perform exploratory data analysis of crime versus neighborhood for St. Louis and show the results on a choropleth map. The crime dataframe was joined and updated with neighborhood names so that descriptive data can be presented when visualizing the data both in tabular form and on the maps. The table below shows the data after this processing.
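Counting crimes per neighborhood reduces to a group-and-count; a sketch on illustrative rows (not real crime records):

```python
import pandas as pd

# Crime records already joined with neighborhood names (illustrative rows).
crime = pd.DataFrame({
    "Neighborhood": ["Downtown", "Downtown", "Dutchtown", "Downtown", "Dutchtown"],
    "Crime": ["Larceny", "Assault", "Larceny", "Burglary", "Assault"],
})

# One row per reported crime, so counting rows per group gives the totals.
crime_counts = (crime.groupby("Neighborhood")
                     .size()
                     .reset_index(name="CrimeCount")
                     .sort_values("CrimeCount", ascending=False))
print(crime_counts)
```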
This data was plotted on a choropleth map to show the crime statistics of the various neighborhoods graphically. The map below can be browsed to see each neighborhood and its crime statistics. In the final map of this project, this data will be combined with the venue data to present a unified picture of the classification.
It is expected that the higher the population, the more crimes there are. Here we validate this assumption by plotting a scatter plot of the independent variable (population) against the dependent variable (crime count).
While it is evident that the number of crimes is, in general, linearly related to population, there are some outliers: neighborhoods where the number of crimes is disproportionately high compared to the population. Sorting the data by number of crimes, it can be seen that Downtown lies in this category, with a low population compared to the number of crimes reported yearly. The last column in the table is the crime count, while the second column is the population. Note that while the crime count is from 2018, the population figures are from the 2010 census; however, we do not expect a significant change in this pattern, as the overall population of St. Louis has not changed drastically in recent years.
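The relationship can be checked numerically as well as visually. The sketch below computes the population-crime correlation and flags outliers via residuals from a linear fit, on made-up figures rather than the real ones; the matplotlib call is commented out so the example runs headless.

```python
import numpy as np

# Illustrative population and crime-count pairs (not the real figures);
# the first row plays the role of a Downtown-like outlier.
population = np.array([3700, 14500, 9300, 6200, 2100])
crimes     = np.array([3100,  2600, 1500,  900,  300])

# Pearson correlation: close to 1 means crimes scale roughly with population.
r = np.corrcoef(population, crimes)[0, 1]
print(f"correlation: {r:.2f}")

# A large positive residual from the fitted line flags outliers like Downtown.
slope, intercept = np.polyfit(population, crimes, 1)
residuals = crimes - (slope * population + intercept)
print("largest residual at index:", int(np.argmax(residuals)))

# Plotting, if matplotlib is available:
# import matplotlib.pyplot as plt
# plt.scatter(population, crimes); plt.xlabel("Population"); plt.ylabel("Crimes")
```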
FourSquare is used in this project to extract venues of interest for each neighborhood based on its latitude and longitude. The number of venues per service call to Foursquare was limited to 100. Unlike the projects in the Coursera labs, no radius parameter was specified in the service call; this way, the radius is automatically determined by the Foursquare API based on the density of data around each neighborhood. Below are the steps that were performed, in brief, and how the data was transformed for analysis by the K-Means clustering algorithm.
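Building the request and flattening the response can be sketched as below. The network call itself is commented out (it needs real credentials), and a small hand-written dict stands in for Foursquare's actual JSON response; the venue shown is illustrative.

```python
# Parameters for the legacy Foursquare /v2/venues/explore endpoint.
# CLIENT_ID / CLIENT_SECRET are placeholders, not real credentials.
params = {
    "client_id": "CLIENT_ID",
    "client_secret": "CLIENT_SECRET",
    "v": "20180605",            # API version date
    "ll": "38.6270,-90.1994",   # neighborhood latitude,longitude
    "limit": 100,               # cap venues per call; no radius, per the text
}
# real call: requests.get("https://api.foursquare.com/v2/venues/explore", params=params)

# Hand-written stand-in for the JSON response structure.
response = {"response": {"groups": [{"items": [
    {"venue": {"name": "City Museum",
               "location": {"lat": 38.633, "lng": -90.200},
               "categories": [{"name": "Museum"}]}},
]}]}}

# Flatten each item to (name, lat, lng, top category).
venues = []
for item in response["response"]["groups"][0]["items"]:
    v = item["venue"]
    venues.append((v["name"], v["location"]["lat"], v["location"]["lng"],
                   v["categories"][0]["name"]))
print(venues)
```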
In this step, venue data for the very first neighborhood was extracted and validated for correctness.
Here we repeated the steps for all 79 neighborhoods and extracted the data into the dataframe stl_venues for further processing.
The table below shows a sample of the extracted data from the stl_venues dataframe.
Similar to the Toronto/New York analysis, one-hot encoding was applied to the dataframe to prepare the data for fitting with the K-Means algorithm. The modified dataframe had 214 attributes - the number of categories for which the FourSquare API returned data. Sample data for this transformed dataframe is shown below.
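The encoding step can be sketched as below on a toy venue table; grouping the one-hot rows by neighborhood and taking the mean yields each category's frequency per neighborhood.

```python
import pandas as pd

# Toy extract of the venues table (illustrative rows, not real Foursquare data).
stl_venues = pd.DataFrame({
    "Neighborhood": ["Downtown", "Downtown", "Dutchtown"],
    "Category": ["Hotel", "Bar", "Bakery"],
})

# One-hot encode the venue category; one indicator column per category.
onehot = pd.get_dummies(stl_venues["Category"])
onehot["Neighborhood"] = stl_venues["Neighborhood"]

# Mean of the indicator columns = frequency of each category per neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```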
Functions were created to show the most common categories for each neighborhood. This was done by sorting the features for each neighborhood according to their frequency, as shown in the previous table, and then taking the top 10 and displaying them in the table below.
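The sorting helper can be sketched as below, assuming a grouped frequency table with one row per neighborhood and one column per category (the figures are illustrative).

```python
import pandas as pd

# Grouped frequency table: one row per neighborhood, one column per category.
grouped = pd.DataFrame({
    "Neighborhood": ["Downtown"],
    "Hotel": [0.4], "Bar": [0.35], "Bakery": [0.25],
})

def top_categories(row: pd.Series, n: int = 10) -> list:
    """Return the n most frequent venue categories for one neighborhood row."""
    freqs = row.drop("Neighborhood")           # keep only the category columns
    return list(freqs.sort_values(ascending=False).index[:n])

top3 = top_categories(grouped.iloc[0], n=3)
print(top3)
```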
After some analysis (described in section 5), it was decided to use k = 6 for K-Means clustering. This relatively high number of clusters was chosen because we did not see a sharp drop with the elbow method. The Silhouette score was also best for k = 6, even though it did not reach the usual threshold of > 0.5. More details of this analysis can be found in section 5. The cluster labels for the neighborhoods, and sample data with the cluster label attached, can be seen below.
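The clustering step can be sketched as below on a toy frequency matrix; in the notebook the input is the one-hot frequency table minus the Neighborhood column, and k = 6 rather than the k = 2 used here for the toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy category-frequency matrix: one row per neighborhood (illustrative).
X = np.array([
    [0.8, 0.2, 0.0],
    [0.7, 0.3, 0.0],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
])

# k = 2 for this toy data; the project used k = 6.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)   # one cluster label per neighborhood row
```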
Finally, the cluster label data was joined with the master merged data to present a unified view, as shown below. This view has all the parameters of interest to us -
We prepared the data for visualization on a map with cluster labels, crime count, and the top two place-of-interest categories. The interactive map below can be browsed to explore this data. We retained the earlier choropleth layer, so the map classifies neighborhoods in two ways at once.
It is also observed that, as expected, the prominent places of interest around the downtown area are hotels and bars, which is typical of a downtown setting.
As k is a hyperparameter, the Elbow Method and Silhouette Score were used to determine the value that works best for this dataset. As shown below, the elbow method did not show a drop significant enough to clearly indicate a best value. Based on a Silhouette Score of 0.25, and the fact that the elbow curve flattened afterwards, the value k = 6 was chosen. The Elbow Method and Silhouette Score charts are shown below.
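The selection loop can be sketched as below; it records inertia (the elbow-method input) and silhouette score for each candidate k, on synthetic blobs standing in for the 79-neighborhood feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic data: three tight blobs standing in for the real feature matrix.
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                        # elbow-method input
    silhouettes[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)
```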
Based on the analysis, we can summarize that St. Louis, with a population of ~350,000, is a closely clustered city. Its neighborhoods are comparatively small when compared with those of bigger cities such as New York. The places of interest classify neighborhoods into the same cluster largely based on their geographical proximity. The crime data sheds light on the fact that Downtown, despite its small population, remains one of the most crime-prone areas. Dutchtown, with a comparatively higher population, comes second in crime count. Central West End, classified as label 5 (Bar, Restaurant), is one of the largest neighborhoods and an up-and-coming district where the Cortex Innovation Center is located. Unfortunately, this area also ranks high in the number of crimes.
The Foursquare data by itself seems insufficient for the most accurate clustering of a city. Data such as crime rate, median house prices, taxes, and business revenue would be of further interest to refine the model. Unfortunately, not all such data is available in the public domain. But the knowledge gained here is a good starting point for performing such an analysis.
Overall, I found this project to be very interesting. After completing the Capstone, I intend to keep working on it and add more data points as I discover them. There is currently an effort, known as Better Together, to merge St. Louis County and the City. Based on the knowledge gained in this project, I intend to perform some analysis related to that effort, to predict whether such a measure would benefit the community.