Clustering the postal codes of Toronto, Canada, by venue density

Introduction

This notebook presents the segmentation and clustering of the postal code divisions of the city of Toronto, in the province of Ontario, Canada, extracted from the Wikipedia article "List of postal codes of Canada: M" (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The Foursquare API was used to find the venues in each postal code zone, using a radius based on the area covered by each postcode so that the search areas do not overlap, and a maximum of 100 venues per postal code. Using the K-Means clustering algorithm, the postal codes were grouped by venue density (venues/area), and the result was shown on a map of Toronto.

Table of Contents

  1. Extract data of Toronto neighborhoods from Wikipedia

  2. Explore and clean neighborhoods dataset

  3. Get venues

  4. Analyze venues dataset

  5. Cluster Postcodes

  6. Examine Clusters

1. Extract data of Toronto neighborhoods from Wikipedia

The BeautifulSoup library is used to scrape the Wikipedia article that lists the Toronto neighborhoods. The neighborhood data, presented as a table in the article, is parsed and stored in a list in which each element is one row of the table: the Postcode, Borough and Neighborhood name.
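As a rough illustration of this step, the sketch below parses a small inline HTML table with BeautifulSoup. The `sample_html` string, the `wikitable` class name and the `parse_postcode_table` helper are assumptions made for the example; in the notebook the HTML would come from the live article (for instance, downloaded with `requests`).

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the article's HTML (an assumption, not the live page).
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""


def parse_postcode_table(html):
    """Return one [postcode, borough, neighborhood] list per table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return rows


neighborhood_info = parse_postcode_table(sample_html)
print(neighborhood_info)
```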


Transform the data into a pandas dataframe

The neighborhood_info list is then passed to pandas to create a DataFrame.
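A minimal sketch of this conversion, assuming `neighborhood_info` is a list of three-element rows. The column names and the two sample rows are illustrative choices, not necessarily the ones used in the notebook.

```python
import pandas as pd

# Illustrative rows in the [postcode, borough, neighborhood] layout.
neighborhood_info = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
]

# Column names are an assumption made for this example.
df = pd.DataFrame(neighborhood_info, columns=["Postcode", "Borough", "Neighborhood"])
print(df)
```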


2. Explore and clean neighborhoods dataset

The returned data has missing information, such as "Not assigned" boroughs and neighborhoods.

The rows with a "Not assigned" Borough are removed.


The "Not assigned" values in the Neighborhood column are replaced with the Borough name of the same row.
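Both cleaning steps can be sketched with boolean masks, as below. The three sample rows and the column names are assumptions for illustration only.

```python
import pandas as pd

# Illustrative sample rows; column names are assumed.
df = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# Step 1: drop the rows whose borough is unassigned.
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2: where only the neighborhood is unassigned, reuse the borough name.
unassigned = df["Neighborhood"] == "Not assigned"
df.loc[unassigned, "Neighborhood"] = df.loc[unassigned, "Borough"]
print(df)
```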


The dataframe has 103 postal codes but 212 rows, because each postal code can contain more than one neighborhood (210 in total). Therefore, the dataframe should be grouped by postal code, ending with a dataframe of 103 rows.
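One possible way to do this grouping is shown below. Joining the neighborhood names with a comma is an assumption, since the text does not say how the names are combined; the sample rows are illustrative.

```python
import pandas as pd

# Illustrative rows: one postcode appears twice with different neighborhoods.
df = pd.DataFrame(
    [
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# One row per postcode; neighborhood names are concatenated (assumed separator).
grouped = (
    df.groupby(["Postcode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```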


To add the coordinates to the neighborhood dataframe, a join is performed using the postcodes as keys.
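The join could look like the following sketch. The coordinates table is built inline with rounded, illustrative values, and its column names (`Postal Code`, `Latitude`, `Longitude`) are assumptions about the layout of the coordinates file.

```python
import pandas as pd

neighborhoods = pd.DataFrame(
    {
        "Postcode": ["M3A", "M5A"],
        "Borough": ["North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Harbourfront, Regent Park"],
    }
)

# Stand-in for the coordinates file; values are rounded and illustrative.
coords = pd.DataFrame(
    {
        "Postal Code": ["M5A", "M3A"],
        "Latitude": [43.654, 43.753],
        "Longitude": [-79.361, -79.330],
    }
)

# Left join keeps every postcode even if its coordinates are missing.
merged = neighborhoods.merge(
    coords, left_on="Postcode", right_on="Postal Code", how="left"
).drop(columns="Postal Code")
print(merged)
```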


With the coordinates of each postal code, a map of Toronto with markers indicating the Postcode position is generated


The map shows that the postal codes are not evenly spaced and that, with a radius of 500 meters, the areas covered by some of them overlap. Using a different radius for each postcode gives a better venue search, because it avoids misrepresenting the number of venues per postcode through a radius that is too large or too small.


To define the radius used with Foursquare, it is necessary to find the closest point to each postcode.

To explore the distance function, the closest postcode to the first example in the dataframe is found.


A distance column is added to the DataFrame and is used as the radius covered by each postcode.
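The text does not show which distance function is used. The sketch below assumes the haversine great-circle formula and computes, for each postcode, the distance in meters to its nearest neighbor. The helper names, the coordinates and the postcode labels are hypothetical.

```python
import math

import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance(df):
    """Distance from each row to its closest other row (brute force, O(n^2))."""
    points = list(zip(df["Latitude"], df["Longitude"]))
    result = []
    for i, (lat, lon) in enumerate(points):
        result.append(
            min(haversine_m(lat, lon, la, lo) for j, (la, lo) in enumerate(points) if j != i)
        )
    return result


# Hypothetical coordinates for three postcodes.
df = pd.DataFrame(
    {
        "Postcode": ["P1", "P2", "P3"],
        "Latitude": [43.65, 43.66, 43.75],
        "Longitude": [-79.38, -79.38, -79.33],
    }
)
df["Distance"] = nearest_distance(df)
print(df)
```

With about a hundred postcodes, the brute-force comparison of all pairs is cheap, so no spatial index is needed.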


The map is plotted using a different radius for each postal code. Overlapping is now avoided and more of the city's area is covered; consequently, more venues are retrieved.


The next step is to explore each postcode to get its venues using the Foursquare API. For that, the credentials must be declared.

3. Get venues

In order to get the venues within the perimeter of each postal code, it is necessary to get the geographical coordinates (lat and lng) of each one and add them to the dataframe. The geopy library is not compatible with Canadian postcodes, and geocoder is unreliable. For that reason, the coordinates are read from the CSV file 'Geospatial_Coordinates.csv'.

To explore the data returned by the Foursquare API, a maximum of 100 venues from the first postcode are requested in a radius of 500 meters.
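The request itself is not shown in the text. As a sketch, assuming the legacy Foursquare v2 `venues/explore` endpoint, the URL for such a query could be assembled as follows. The `build_explore_url` helper, the placeholder credentials, the coordinates and the version date are all hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy v2 explore API.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius_m, limit, client_id, client_secret, version="20180605"):
    """Build the query URL; no network request is made here."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,  # API version date (placeholder value)
        "ll": f"{lat},{lng}",
        "radius": radius_m,  # search radius in meters
        "limit": limit,  # maximum number of venues returned
    }
    return EXPLORE_URL + "?" + urlencode(params)


url = build_explore_url(43.65, -79.38, 500, 100, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL would then be fetched with an HTTP client and the response decoded as JSON.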


In this case, the relevant fields are venue.categories, venue.location.lat, venue.location.lng and venue.name.


It is necessary to extract the category (shortName) from the JSON data.
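A sketch of how these fields might be flattened with `pandas.json_normalize`. The `results` dictionary is a hypothetical, heavily trimmed response in the shape of the legacy explore endpoint, and `category_short_name` is an illustrative helper.

```python
import pandas as pd

# Hypothetical, trimmed response; real responses carry many more fields.
results = {
    "response": {
        "groups": [
            {
                "items": [
                    {
                        "venue": {
                            "name": "Example Cafe",
                            "categories": [{"name": "Cafe", "shortName": "Cafe"}],
                            "location": {"lat": 43.65, "lng": -79.38},
                        }
                    },
                    {
                        "venue": {
                            "name": "Example Park",
                            "categories": [],
                            "location": {"lat": 43.66, "lng": -79.39},
                        }
                    },
                ]
            }
        ]
    }
}

items = results["response"]["groups"][0]["items"]
venues = pd.json_normalize(items)  # nested keys become dotted column names

columns = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
venues = venues[columns].copy()


def category_short_name(categories):
    """Return the shortName of the first category, or None if there is none."""
    return categories[0]["shortName"] if categories else None


venues["venue.categories"] = venues["venue.categories"].apply(category_short_name)

# Keep only the last part of each dotted name: name, categories, lat, lng.
venues.columns = [c.split(".")[-1] for c in venues.columns]
print(venues)
```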


The next step is to get the venues for each postal code.
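The loop over postcodes might be organized as in the sketch below. To keep the example runnable offline, the API call is replaced by a stub: `fake_fetch` is a hypothetical function that takes a latitude, a longitude and a radius and returns a list of venue items. `collect_venues`, the column names and the sample data are also assumptions.

```python
import pandas as pd


def collect_venues(postcodes, fetch_items):
    """Query every postcode with its own radius and stack the results."""
    rows = []
    for pc in postcodes.itertuples(index=False):
        for item in fetch_items(pc.Latitude, pc.Longitude, pc.Distance):
            venue = item["venue"]
            rows.append(
                {
                    "Postcode": pc.Postcode,
                    "Venue": venue["name"],
                    "Venue Latitude": venue["location"]["lat"],
                    "Venue Longitude": venue["location"]["lng"],
                }
            )
    return pd.DataFrame(
        rows, columns=["Postcode", "Venue", "Venue Latitude", "Venue Longitude"]
    )


def fake_fetch(lat, lng, radius_m):
    """Stub standing in for the API call: small radii return no venues."""
    if radius_m < 400:
        return []
    return [{"venue": {"name": "Example Cafe", "location": {"lat": lat, "lng": lng}}}]


# Hypothetical postcodes with per-postcode radii in meters.
postcodes = pd.DataFrame(
    {
        "Postcode": ["P1", "P2"],
        "Latitude": [43.65, 43.66],
        "Longitude": [-79.38, -79.39],
        "Distance": [600.0, 300.0],
    }
)

venues = collect_venues(postcodes, fake_fetch)

# Postcodes for which nothing was returned.
missing = sorted(set(postcodes["Postcode"]) - set(venues["Postcode"]))
print(venues)
print(missing)
```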


There is one postal code with no venues returned from the Foursquare API


4. Analyze venues dataset

To get a better sense of the best way to cluster the postal codes, it is necessary to analyze the venues data returned by Foursquare.


The minimum number of venues in a postcode is 0, since M5E was added with no venues, and the maximum is 100, as expected given the limit set in the request sent to the Foursquare API. 50% of the postcodes have 26 or fewer venues.
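Summary statistics of this kind can be obtained with `describe()`. The sketch below uses a tiny hypothetical venues table and reindexes the counts so that a postcode with no venues appears with a count of 0.

```python
import pandas as pd

# All postcodes of interest, including one with no venues returned.
postcodes = pd.Series(["M1B", "M4E", "M5E"], name="Postcode")

# Hypothetical venues table: one row per venue.
venues = pd.DataFrame(
    {"Postcode": ["M1B", "M1B", "M4E"], "Venue": ["a", "b", "c"]}
)

# Count venues per postcode and fill in 0 for postcodes that are absent.
counts = (
    venues.groupby("Postcode")
    .size()
    .reindex(postcodes, fill_value=0)
    .rename("Venues")
)

summary = counts.describe()  # count, mean, std, min, quartiles, max
print(summary)
```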

The frequency distribution of the number of venues per postcode is presented next.


Given that a different radius is passed to the venues request for each postcode, it is better to represent the venues per postcode as a density, that is, venues per area covered by each postcode. Here, the area covered in the venues search is defined by the distance to the closest postcode.
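The exact denominator and units of the density are not shown in the text. The sketch below assumes the search area is a circle of radius `Distance` (in meters) and expresses density as venues per square kilometer. The column names and values are hypothetical.

```python
import math

import pandas as pd

# Hypothetical venue counts and search radii (meters).
df = pd.DataFrame(
    {
        "Postcode": ["P1", "P2"],
        "Venues": [100, 4],
        "Distance": [500.0, 1000.0],
    }
)

# Assumed area model: circle of radius Distance, converted to km^2.
area_km2 = math.pi * (df["Distance"] / 1000) ** 2
df["Density"] = df["Venues"] / area_km2
print(df)
```

Under this assumption, 100 venues within 500 meters give roughly 127 venues per square kilometer, while 4 venues within 1 kilometer give about 1.3.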


The histogram shows that 60% of the postcodes have a density between 0 and 30 venues per area (expressed as radius). That is expected given that Toronto has a low population density. The last three bars of the plot have very low values, so that data could be merged and five venue-density ranges used for the clustering.

5. Cluster Postcodes

Next, the postcodes are clustered based on venue density. One important hyperparameter is the number of clusters; based on the previous analysis, a tentative value is five. The elbow method is then used to get a better sense of the optimal number.
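A minimal sketch of the elbow method with scikit-learn: K-Means is fitted for several values of k, and the inertia (within-cluster sum of squares) is recorded for each. The density values are synthetic, generated around five arbitrary levels, and are not the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-dimensional densities around five arbitrary levels.
rng = np.random.default_rng(0)
density = np.concatenate(
    [rng.normal(center, 3, 20) for center in (10, 35, 70, 115, 210)]
).reshape(-1, 1)

# Fit K-Means for k = 1..8 and record the inertia of each fit.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(density)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting inertia against k and looking for the point where the curve flattens gives the "elbow"; with real data the bend is usually less sharp than with this synthetic sample.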


Using the elbow method, the optimal number of clusters was determined to be 5, which matches the value suggested by the histogram analysis.


6. Examine Clusters

Check the centroid values of venue density and the number of postcodes per cluster.


Based on the centroids of each cluster, the cluster names can be defined as:

  1. 'Low Venues Density', with a centroid equal to 11

  2. 'Medium-Low Venues Density', with a centroid equal to 33

  3. 'Medium-High Venues Density', with a centroid equal to 72

  4. 'High Venues Density', with a centroid equal to 114

  5. 'Very High Venues Density', with a centroid equal to 211
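Since K-Means assigns cluster ids in no particular order, one way to attach these names is to rank the clusters by centroid value. The sketch below does this on synthetic densities (not the notebook's data); the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

names = [
    "Low Venues Density",
    "Medium-Low Venues Density",
    "Medium-High Venues Density",
    "High Venues Density",
    "Very High Venues Density",
]

# Synthetic densities around five arbitrary levels, lowest group first.
rng = np.random.default_rng(0)
density = np.concatenate(
    [rng.normal(center, 3, 20) for center in (10, 35, 70, 115, 210)]
).reshape(-1, 1)

model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(density)

# Cluster ids sorted from the lowest to the highest centroid.
order = np.argsort(model.cluster_centers_.ravel())
label_to_name = {int(cluster_id): names[rank] for rank, cluster_id in enumerate(order)}

cluster_names = [label_to_name[int(label)] for label in model.labels_]
print(sorted(model.cluster_centers_.ravel().round(1)))
print(cluster_names[0], "|", cluster_names[-1])
```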


Usage

The results shown on the map could be useful in, among other areas:

  1. Real estate: as part of a property cost model (venue density could be related to the cost of a property) or as a tool for property search.

  2. Epidemiology research: venue density could be related to noise, pollution or crime.