Coursera Capstone Project - IBM Data Science Specialization


Project Title


Analysis of St. Louis City neighborhoods based on demographic data, crime data, and places of interest.

Introduction


St. Louis is a Midwestern city in the state of Missouri. It has a population of around 350,000 and is divided into 79 neighborhoods. The goal of this project is to gather and explore data and apply machine learning techniques in order to cluster these neighborhoods according to their similarities. Specifically, we look at places of interest in each neighborhood (data provided by the Foursquare API) as well as crime in the city (data provided by the St. Louis Metropolitan Police Department). The results of the analysis are targeted at city dwellers and business owners, who can use them to make informed decisions in their day-to-day life in the city.

Data


The data for this project is gathered from three main sources and combined for the final exploration and analysis:

  • Wikipedia article about the neighborhoods of St. Louis and their demographic data
  • St. Louis Metropolitan Police Department 2018 crime data
  • Foursquare API for gathering venues given neighborhood coordinates

The data from these sources is combined and analyzed in the following manner: the Wikipedia article provides the neighborhood names, which are then used to obtain latitude and longitude information. After some correlation between the Police Department data and the Wikipedia data, a merged view is created in which crime and demographic data can be seen together. Finally, based on the neighborhood coordinates obtained earlier, the Foursquare API is used to retrieve venues of interest, which become the main input for clustering.

Methodology


This section describes in detail the steps performed to extract meaningful information from the data sources mentioned above.

1. Data extraction and cleaning from Wikipedia and the GeoPy API

As with any data science project, the first step was to import the necessary Python packages and define the constants used afterwards.
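
A minimal sketch of this setup is shown below; the constant names and values (such as WIKI_URL and the CSV file name) are illustrative assumptions, not necessarily the exact ones used in the notebook.

```python
# Core packages used throughout the notebook.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Constants referenced in later steps; names and values are assumptions.
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_neighborhoods_of_St._Louis"
NEIGHBORHOODS_CSV = "stl_neighborhoods.csv"  # intermediate file written to disk
FOURSQUARE_LIMIT = 100                       # max venues returned per Foursquare call
```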

Once the packages were imported, the next step was to fetch the Wikipedia article and extract the table with the relevant data. The Python packages BeautifulSoup and requests were used for this.
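
A minimal sketch of the scraping step, assuming the demographics table is the first wikitable on the page (the table index and column handling in the original notebook may differ):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Fetch the article and parse its HTML (WIKI_URL is defined in the setup above).
page = requests.get(WIKI_URL)
soup = BeautifulSoup(page.text, "html.parser")

# Locate the demographics table and let pandas convert it to a dataframe.
table = soup.find("table", {"class": "wikitable"})
df_neighborhoods = pd.read_html(str(table))[0]

print(df_neighborhoods.shape)
df_neighborhoods.head()
```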

The dataframe built has the following columns:

  • Neighborhood
  • Population
  • White population %
  • Black population %
  • Hispanic/Latino population %
  • AIAN (American Indian and Alaska Native) population %
  • Asian population %
  • Mixed Race population %
  • Corridor

A sample of the extracted data is shown below:

[Output: sample of the scraped neighborhood dataframe]

Extraction of longitude and latitude data and appending to the dataframe


The next step was to use the neighborhood names to obtain latitude and longitude information and append it to the dataframe. GeoPy was used for this. The GeoPy data is noisy and is missing coordinates for 22 neighborhoods. The code and output below show the list of neighborhoods for which no data was found.
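
A sketch of this geocoding step, assuming GeoPy's Nominatim geocoder; the column names (Neighborhood, Latitude, Longitude) are assumptions:

```python
from time import sleep

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="stl_capstone")

latitudes, longitudes = [], []
for name in df_neighborhoods["Neighborhood"]:
    # Query "<neighborhood>, St. Louis, MO" and record None when nothing is found.
    location = geolocator.geocode(f"{name}, St. Louis, MO")
    latitudes.append(location.latitude if location else None)
    longitudes.append(location.longitude if location else None)
    sleep(1)  # be polite to the free Nominatim service

df_neighborhoods["Latitude"] = latitudes
df_neighborhoods["Longitude"] = longitudes

# Neighborhoods for which GeoPy returned no coordinates.
missing = df_neighborhoods[df_neighborhoods["Latitude"].isna()]["Neighborhood"].tolist()
print(len(missing), missing)

# Persist the intermediate result to CSV for manual cleanup.
df_neighborhoods.to_csv(NEIGHBORHOODS_CSV, index=False)
```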

[Output: list of neighborhoods with no GeoPy coordinates]

The data collected so far was written to a CSV file for further processing.

Manual Data cleaning


For the neighborhoods where the GeoPy geolocator could not find coordinates, a manual Google search was done to get them. Even after the Google search, 2 out of the 79 neighborhoods (The Gate District and Peabody Darst Webbe) had no results; for these two, coordinates were taken from the related GeoJSON file. Once the cleanup was done, the final CSV was read back from disk for further analysis and processing. Sample data is shown below:

[Output: sample of the cleaned neighborhood dataframe with coordinates]

Plotting the neighborhoods on a map using GeoJSON and Folium

To verify that the coordinate and neighborhood data covers the whole of St. Louis, the data was plotted on a map using Folium. Minor adjustments had to be made to a couple of areas so that the dataframe and GeoJSON files match. The output is shown below. The data looks mostly accurate, except for a couple of coordinates where the center of the neighborhood seems off and one neighborhood entry that is completely missing from the GeoJSON. For the scope of this project, these anomalies were considered tolerable.
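
A sketch of this verification map, assuming a local boundary file named stl_neighborhoods.geojson (the file name is an assumption):

```python
import json

import folium

# Approximate center of St. Louis City.
stl_map = folium.Map(location=[38.63, -90.24], zoom_start=12)

# Draw the neighborhood boundaries from the GeoJSON file.
with open("stl_neighborhoods.geojson") as f:
    stl_geojson = json.load(f)
folium.GeoJson(stl_geojson, name="neighborhood boundaries").add_to(stl_map)

# Add one marker per neighborhood to check that the coordinates look right.
for _, row in df_neighborhoods.iterrows():
    folium.CircleMarker(
        location=[row["Latitude"], row["Longitude"]],
        radius=4,
        popup=row["Neighborhood"],
        fill=True,
    ).add_to(stl_map)

stl_map
```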

[Output: Folium map of St. Louis neighborhoods]

2. St. Louis Crime Data Extraction


St. Louis Police Department ID to Neighborhood mapping


The St. Louis crime report data (available at this link) encodes the neighborhood as a numeric value. The department provides a reference sheet in its FAQ document that lists the ID-to-neighborhood mapping. The next step in this project is to load the mapping from this document, join it with df_merged, and clean any data that is missing or different. The details of these steps can be seen in the code section below.
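
A sketch of that join, assuming the mapping has been loaded into a dataframe df_nbh_ids with columns NeighborhoodId and Neighborhood (the actual names in the notebook may differ):

```python
import pandas as pd

# df_nbh_ids: NeighborhoodId-to-name mapping taken from the police FAQ document.
# Normalize names on both sides so small spelling differences do not break the merge.
df_nbh_ids["Neighborhood"] = df_nbh_ids["Neighborhood"].str.strip()
df_merged["Neighborhood"] = df_merged["Neighborhood"].str.strip()

df_merged = df_merged.merge(df_nbh_ids, on="Neighborhood", how="left")

# Rows without an ID point to missing or differently spelled names that need manual fixes.
print(df_merged[df_merged["NeighborhoodId"].isna()]["Neighborhood"].tolist())
```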

[Output: neighborhood ID to name mapping joined with df_merged]

St. Louis City Crime data for the year 2018


Crime data for the year 2018 was obtained from the St. Louis Metropolitan Police Department website. In total there were 46,742 crimes of various categories; we will get into the details as we go. The monthly crime records were downloaded and are available on GitHub in the crime_records folder. These were verified for shape and merged into a single dataset.
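
A sketch of combining the monthly files, assuming they are CSVs under a crime_records/ folder (file layout and encoding are assumptions):

```python
import glob

import pandas as pd

# Read every monthly crime file and check that the column layout matches.
frames = []
for path in sorted(glob.glob("crime_records/*.csv")):
    monthly = pd.read_csv(path, encoding="latin-1")  # encoding is an assumption
    print(path, monthly.shape)
    frames.append(monthly)

# The monthly files share the same columns, so they can be stacked into one dataframe.
merged_crime_df = pd.concat(frames, ignore_index=True)
print(merged_crime_df.shape)  # roughly 46,742 rows for 2018
```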

[Output: shapes of the monthly files and the merged 2018 crime dataframe]

So far we have the following dataframes:

  • df_merged - neighborhood data with latitude/longitude, demographics and neighborhood IDs
  • merged_crime_df - 2018 crime data, not yet grouped by neighborhood

The next step will be to collect venue data from the Foursquare API.

Exploratory Data Analysis of St. Louis Crime Data


Before moving on to the venue data, we perform exploratory data analysis of crime by neighborhood for St. Louis and show the results in a choropleth map. The crime dataframe was joined and updated with neighborhood names so that descriptive data can be presented when visualizing the data in tabular form and on maps. The table below shows the data after this processing.

[Output: crime counts by neighborhood in tabular form]

This data was plotted on a choropleth map to show the crime statistics of the various neighborhoods graphically. The map below can be used to explore each neighborhood and its crime statistics. In the final map of this project, this data is combined with the venue data to present a unified picture of the classification.
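
A sketch of the choropleth, assuming crimes have been counted per neighborhood name and that the GeoJSON features expose the neighborhood name under a property called NHD_NAME (the property key is an assumption):

```python
import folium

# Aggregate 2018 crimes per neighborhood name.
crime_counts = (
    merged_crime_df.groupby("Neighborhood")
    .size()
    .reset_index(name="CrimeCount")
)

crime_map = folium.Map(location=[38.63, -90.24], zoom_start=12)

folium.Choropleth(
    geo_data=stl_geojson,                  # GeoJSON loaded earlier
    data=crime_counts,
    columns=["Neighborhood", "CrimeCount"],
    key_on="feature.properties.NHD_NAME",  # property name is an assumption
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.3,
    legend_name="Crimes reported in 2018",
).add_to(crime_map)

crime_map
```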

[Output: choropleth map of 2018 crime counts by neighborhood]

Population vs Crime relation


It is expected that the higher the population, the more crimes there are. Here we examine that assumption by plotting a scatter plot with population as the independent variable and crime count as the dependent variable.
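
A sketch of that plot, assuming the population and crime counts have been merged into df_merged under the column names used below:

```python
import matplotlib.pyplot as plt

# Population (2010 census) on the x-axis, 2018 crime count on the y-axis.
plt.figure(figsize=(10, 6))
plt.scatter(df_merged["Population"], df_merged["CrimeCount"], alpha=0.7)
plt.xlabel("Population (2010 census)")
plt.ylabel("Crimes reported in 2018")
plt.title("Population vs. crime count by neighborhood")
plt.show()
```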

[Output: scatter plot of population vs. crime count]

While it is evident that the number of crimes is, in general, linearly related to population, there are some outliers: neighborhoods where the number of crimes is very high compared to the population. Sorting the data by number of crimes, it can be seen that Downtown falls into this category, with a low population relative to the number of crimes reported per year. In the table below, the last column is the crime count and the second column is the population. Note that while the crime counts are from 2018, the population figures are from the 2010 census; however, we do not expect a significant change in this pattern, as the overall population of St. Louis has not changed drastically in recent years.

[Output: neighborhoods sorted by number of crimes]

3. Venue data for each neighborhood from Foursquare

Foursquare is used in this project to extract venues of interest based on the latitude and longitude of each neighborhood. The number of venues per Foursquare call was limited to 100. Unlike the Coursera lab projects, no radius parameter was specified in the call; this way, the radius is determined automatically by the Foursquare API based on the density of data around each neighborhood. Below, in brief, are the steps that were performed and how the data was transformed for analysis by the K-Means clustering algorithm.

Extract location data for a single neighborhood

In this step, venue data for the very first neighborhood was extracted and validated for correctness.
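
A sketch of that call using the legacy Foursquare v2 explore endpoint; CLIENT_ID, CLIENT_SECRET and VERSION are assumed to be defined elsewhere with the reader's own credentials:

```python
import requests

lat = df_merged.loc[0, "Latitude"]
lng = df_merged.loc[0, "Longitude"]

# CLIENT_ID, CLIENT_SECRET and VERSION hold the Foursquare credentials (defined elsewhere).
# No radius parameter is passed, so Foursquare chooses one based on venue density.
url = (
    "https://api.foursquare.com/v2/venues/explore"
    f"?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}"
    f"&ll={lat},{lng}&limit=100"
)

results = requests.get(url).json()
items = results["response"]["groups"][0]["items"]

# Print a few venues with their primary category as a sanity check.
for item in items[:5]:
    venue = item["venue"]
    print(venue["name"], "-", venue["categories"][0]["name"])
```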

[Output: venues returned for the first neighborhood]

Repeat for all neighborhoods of St. Louis

Here we repeated the steps for all 79 neighborhoods and collected the extracted data in the dataframe stl_venues for further processing.

The table below shows a sample of the extracted data from the stl_venues dataframe.

[Output: sample rows from the stl_venues dataframe]

One-hot encoding

Similar to the Toronto/New York analysis, one-hot encoding was applied to the dataframe to prepare the data for fitting with the K-Means algorithm. The modified dataframe has 214 attributes, which is the number of venue categories for which the Foursquare API returned data. Sample data for this transformed dataframe is shown below.
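
A sketch of the encoding and grouping steps, mirroring the pattern from the course labs; the column names in stl_venues are assumptions:

```python
import pandas as pd

# One-hot encode the venue category column.
stl_onehot = pd.get_dummies(stl_venues[["Venue Category"]], prefix="", prefix_sep="")
stl_onehot["Neighborhood"] = stl_venues["Neighborhood"]

# Group by neighborhood and take the mean frequency of each category.
stl_grouped = stl_onehot.groupby("Neighborhood").mean().reset_index()
print(stl_grouped.shape)  # one row per neighborhood, ~214 category columns
```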

[Output: sample of the one-hot encoded dataframe]

Most common venues for each neighborhood


Functions were created to list the most common venue categories for each neighborhood. This was done by sorting the features for each neighborhood according to their frequency, as shown in the previous table, then taking the top 10 and displaying them in the table below.
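
A sketch of such a helper, following the pattern from the course labs and using the stl_grouped dataframe from the previous step:

```python
import numpy as np
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    """Return the venue categories of one row, sorted by descending frequency."""
    row_categories = row.iloc[1:]  # drop the Neighborhood column
    return row_categories.sort_values(ascending=False).index.values[:num_top_venues]

num_top_venues = 10
columns = ["Neighborhood"] + [f"No. {i + 1} most common venue" for i in range(num_top_venues)]

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted["Neighborhood"] = stl_grouped["Neighborhood"]

for ind in np.arange(stl_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        stl_grouped.iloc[ind, :], num_top_venues
    )

neighborhoods_venues_sorted.head()
```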

[Output: top 10 venue categories for each neighborhood]

4. Cluster neighborhoods with the K-Means algorithm

After some analysis (described in section 5), it was decided to use k = 6 for K-Means clustering. This relatively high number of clusters was chosen because we did not see a sharp drop when using the elbow method. The silhouette score was also best for k = 6, even though it did not reach an ideal value of > 0.5. More details of this analysis can be found in section 5. The cluster labels for the neighborhoods and sample data with the cluster label attached can be seen below.
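
A sketch of the clustering step, assuming the stl_grouped and neighborhoods_venues_sorted dataframes from the previous steps:

```python
from sklearn.cluster import KMeans

k = 6  # chosen via the elbow method and silhouette score (see section 5)

# Cluster on the venue-category frequencies only (drop the name column).
stl_clustering = stl_grouped.drop("Neighborhood", axis=1)
kmeans = KMeans(n_clusters=k, random_state=0).fit(stl_clustering)

# Attach the resulting labels to the sorted-venues table.
neighborhoods_venues_sorted.insert(0, "Cluster Label", kmeans.labels_)
neighborhoods_venues_sorted.head()
```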

[Output: cluster labels and sample data with the cluster label attached]

Finally, the cluster label data was joined with the master merged data to present a unified view, as shown below. This view has all the parameters of interest to us:

  • Name of neighborhood
  • Population distribution
  • Crime statistics
  • Coordinates
  • Cluster labels
  • 10 most common venue categories

[Output: merged dataframe with cluster labels]

Neighborhood map with cluster label, name, crime count, and its analysis

We prepared the data for visualization on a map with cluster labels, crime counts, and the top two venue categories. The interactive map below can be browsed to explore this data. We retained the choropleth layer from before, so the map classifies neighborhoods in two ways:

  • Based on crime rate, the choropleth coloring classifies Downtown and Dutchtown as similar (high crime).
  • Based on the cluster labels, which are calculated from the venue data, they receive different labels: 4 for Downtown and 2 for Dutchtown. This gives the user the ability to analyze the data further to find similar neighborhoods. It is also observed that the algorithm tends to cluster neighborhoods together based on their geographic proximity; for example, all the neighborhoods near Forest Park are classified together with label 3, with the Zoo being the most important feature distinguishing them from the rest of the neighborhoods.

It is also observed that the prominent places of interest around Downtown are hotels and bars, which is to be expected in a downtown setting.
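
A sketch of this combined map, assuming df_final is the merged view with cluster labels and crime counts (the dataframe and column names are assumptions) and crime_map is the choropleth built earlier:

```python
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

# One distinct color per cluster label.
k = 6
palette = [colors.rgb2hex(cm.rainbow(i / (k - 1))) for i in range(k)]

# Overlay cluster-colored markers on top of the crime choropleth.
for _, row in df_final.iterrows():
    label = int(row["Cluster Label"])
    popup_text = (
        f"{row['Neighborhood']} | cluster {label} | "
        f"{int(row['CrimeCount'])} crimes | {row['No. 1 most common venue']}"
    )
    folium.CircleMarker(
        location=[row["Latitude"], row["Longitude"]],
        radius=5,
        popup=folium.Popup(popup_text, parse_html=True),
        color=palette[label],
        fill=True,
        fill_color=palette[label],
        fill_opacity=0.8,
    ).add_to(crime_map)

crime_map
```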

[Output: interactive map with crime choropleth and cluster-labeled markers]

5. Finding the optimal k for the K-Means algorithm

Since k is a hyperparameter, the elbow method and the silhouette score were used to determine the value best suited to this dataset. As shown below, there was no sharp drop in the elbow curve that would clearly indicate the best number of clusters. Based on a silhouette score of about 0.25, and the fact that the drop in the elbow curve beyond that point was small, the value k = 6 was chosen. The elbow method and silhouette score charts are shown below.
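
A sketch of this selection procedure, assuming stl_clustering holds the one-hot venue frequencies used for clustering:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = [], []
k_values = list(range(2, 11))

for k in k_values:
    km = KMeans(n_clusters=k, random_state=0).fit(stl_clustering)
    inertias.append(km.inertia_)                                     # elbow method
    silhouettes.append(silhouette_score(stl_clustering, km.labels_))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(k_values, inertias, marker="o")
ax1.set(xlabel="k", ylabel="Inertia (within-cluster SSE)", title="Elbow method")
ax2.plot(k_values, silhouettes, marker="o")
ax2.set(xlabel="k", ylabel="Silhouette score", title="Silhouette analysis")
plt.show()
```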

[Output: elbow method and silhouette score charts]

Results


Based on the analysis of the data, we can summarize that St. Louis, with a population of around 350,000 people, is a closely clustered city. The neighborhoods are comparatively small compared to those of bigger cities such as New York. The places of interest tend to put neighborhoods into the same cluster based on their geographic proximity. The crime data sheds important light on the fact that Downtown, even with its smaller population, remains one of the most crime-prone areas. Dutchtown, with a comparatively higher population, comes second in crime count. Central West End, which is classified as label 5 (bars and restaurants), is one of the largest neighborhoods and an up-and-coming district where the Cortex Innovation Center is located; unfortunately, this area also ranks high in number of crimes.

Discussion


The Foursquare data by itself seems insufficient for the most accurate clustering of a city. Data such as crime rate, median house prices, taxes, and business revenue would be of further interest to refine the classification model. Unfortunately, not all of this data is available in the public domain. Still, the knowledge gained here is a good starting point for taking such an analysis further.

Conclusion


Overall, I found this project to be very interesting. After completing the Capstone, I intend to keep working on it and add more data points as I discover them. There is currently an effort, known as Better Together, to merge St. Louis County and the City. Based on the knowledge gained in this project, I intend to perform some analysis related to that in order to assess whether such a measure would be beneficial to the community.

Credit


Much of the work and code in this project is based on what I learned in the IBM Data Science Specialization offered on Coursera. I would like to thank the team at IBM for coming up with such an amazing introductory course on data science.