by We You Toh
In my casual daily observation, I find that where there is a bookstore, I can usually find a coffee shop nearby quite easily. From a business intelligence perspective, it could be useful to find out whether it is common for bookstores and coffee shops to be close to each other. Since my focus is on bookstores, I also plan to test whether it is common for bookstores to be near each other. These are the objectives of this project.
In this project, we will use the Foursquare API to build a "venue profile" for the bookstores in the city of San Francisco. With the venue information gathered and with the help of the Folium library, we will place markers on the map of San Francisco to visualize the bookstore locations. We then go further and compare San Francisco's bookstore profile against New York City's. Finally, we will run a few statistical tests to check whether the differences in venue profiles are significant.
Note: This notebook is shared publicly in a GitHub repository. The Folium interactive maps and some Markdown features don't display on GitHub the same way as they do on a local host. To interact with these features, it may be necessary to download the Jupyter notebook and host it locally. If you wish, you may re-run the whole notebook on your own. You may also wish to provide your own Foursquare API credentials.
Before we get the data and start exploring it, let's download all the dependencies that we will need.
As mentioned earlier, I've chosen to work on the questions of whether it is a common phenomenon that
1. bookstores are near one another.
2. bookstores and coffee shops are close to one another.
The quantifiable way of using data to look at the questions will be to find out the proportion of bookstores near coffee shops, and the proportion of bookstores near one another.
This analysis will follow a descriptive approach to provide the information we want.
To achieve the analysis objectives, I will be gathering the following data:
1. Venue/location data - This will come from Foursquare, and the data collected will be stored in a dataframe.
2. Proximity data - This feature will be engineered using the venue/location data. Each venue will be checked to determine separately if other bookstores, or coffee shops, are nearby. The results will be stored in separate columns, which will then be appended to the dataframe.
For consistency in the data collection, the following quantifiable specifications are set:
1. How near is "near"? - We'll quantify this as a 250m radius (equivalent to about 1 to 2 city blocks).
2. What is the specific set of latitude and longitude to use? - We will rely on Foursquare's defined location for San Francisco city.
3. How far should the search cover? - To ensure consistency in the search process, we'll set the coverage to a 4000m radius, which should cover San Francisco city substantially.
The steps taken can be summarized as follows:
1. Data gathering/storing - Gather venue data from Foursquare and store it in a dataframe.
2. Data preparation - Run logical tests to check whether other bookstores or coffee shops are nearby for each bookstore, and append the results to the dataframe.
3. Data analysis
   - Compute the statistics: calculate the proportion of bookstores near each other and near coffee shops.
   - Data visualization: place the bookstores and coffee shops on a map to provide a visual sense of their closeness to one another.
Similarly, we repeat the steps to gather another dataset for New York City.
In the final step, we'll carry out a few tests to compare the results between San Francisco and New York. Since we are working with proportions, and we'll be sampling two sets of data, it is appropriate to carry out a two-sample proportion Z-test on the datasets.
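To make the test concrete, here is a minimal sketch of a pooled two-sample proportion Z-test using only the standard library. The function name `two_proportion_ztest` and the counts in the usage line are hypothetical and chosen for illustration; they are not the notebook's own code or data, and a library routine could be used instead.

```python
import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: p_a == p_b, using the pooled proportion.

    Raises ZeroDivisionError if the pooled proportion is exactly 0 or 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 both samples share one proportion, estimated by pooling the counts
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not the data collected in this project)
z, p = two_proportion_ztest(70, 100, 80, 100)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A negative z here means the first sample's proportion is lower than the second's; the sign carries no weight in the two-sided p-value.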
To use Foursquare to get the venue information we need, we require three values for authentication in an API call: Client ID, Client Secret and Version.
A Python dict named 'creds' is used to store the Client ID and Client Secret. It contains two keys: 'id' and 'secret'. To avoid misuse of these credentials, the contents of 'creds' have been hidden. Reviewers of this project are encouraged to use their own Foursquare credentials to follow along with this project. Foursquare credentials can be obtained through https://developer.foursquare.com.
'20181201' is the version used in this project.
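As a rough illustration of how the three authentication values fit into a request, the sketch below assembles a venue search URL without sending it. It assumes the Foursquare v2 `venues/search` endpoint and its `near`, `query`, `radius` and `limit` parameters; the helper name `build_search_url` and the placeholder credentials are hypothetical, and the notebook's own call may differ.

```python
from urllib.parse import urlencode

# Placeholder credentials: substitute your own values
creds = {"id": "YOUR_CLIENT_ID", "secret": "YOUR_CLIENT_SECRET"}
VERSION = "20181201"

def build_search_url(near, query, radius=4000, limit=50):
    """Assemble a venue search URL (assumed endpoint: v2 venues/search)."""
    params = {
        "client_id": creds["id"],
        "client_secret": creds["secret"],
        "v": VERSION,      # API version date
        "near": near,      # place name resolved by Foursquare
        "query": query,    # free-text search term
        "radius": radius,  # coverage radius in meters
        "limit": limit,    # maximum number of results requested
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

url = build_search_url("San Francisco, CA", "bookstore")
print(url)
```

Keeping the URL construction separate from the network call makes it easy to inspect the request before spending API quota.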
These intermediary functions are defined to support get4SquareVenues, which is defined below. Calls are made through this function to retrieve the list of Foursquare venues.
It has been brought to my attention that there is variability in the category labels that Foursquare returns.
For the purpose of this project, we shall assume that the results returned from Foursquare are good, accurate and consistent. In this case, 'Coffee Shop' and 'Café' are treated as equivalent, and the data collected do not need additional cleaning/scrubbing.
These functions gather information about the other venues in the proximity of a given venue.
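One way such proximity checks could look is sketched below, assuming each venue is a plain dict with `lat` and `lng` keys and that "near" means within 250m by great-circle (haversine) distance. The names `haversine_m` and `has_neighbor`, and all coordinates, are made up for illustration and are not the notebook's own functions or data.

```python
import math

NEAR_METERS = 250  # the "near" threshold chosen in this project

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    r = 6371000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def has_neighbor(venue, others, limit_m=NEAR_METERS):
    """True if any venue in `others` (excluding `venue` itself) lies within limit_m."""
    return any(
        other is not venue
        and haversine_m(venue["lat"], venue["lng"], other["lat"], other["lng"]) <= limit_m
        for other in others
    )

# Made-up coordinates for illustration only
bookstores = [
    {"name": "A", "lat": 37.7800, "lng": -122.4200},
    {"name": "B", "lat": 37.7810, "lng": -122.4200},  # about 111 m north of A
    {"name": "C", "lat": 37.8000, "lng": -122.4000},  # more than 2 km from A and B
]
coffee_shops = [{"name": "X", "lat": 37.8001, "lng": -122.4000}]  # about 11 m from C

# Flag each bookstore, mirroring the two proximity columns described earlier
for b in bookstores:
    b["near_bookstore"] = has_neighbor(b, bookstores)
    b["near_coffee"] = has_neighbor(b, coffee_shops)

# Proportions of bookstores near another bookstore / near a coffee shop
prop_near_bookstore = sum(b["near_bookstore"] for b in bookstores) / len(bookstores)
prop_near_coffee = sum(b["near_coffee"] for b in bookstores) / len(bookstores)
print(prop_near_bookstore, prop_near_coffee)
```

The pairwise comparison is quadratic in the number of venues, which is acceptable for a few hundred venues per city.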
A quick check back on the dataframe.
We'll be using the Folium library to create a couple of maps to help us visualize the dataset.
This heat map indicates to me that the bookstores are concentrated in certain areas within the city.
This map gives me a visual indication that it is indeed not uncommon to find bookstores near coffee shops.
Next, I will be comparing San Francisco with another city. I have chosen New York because the number of bookstores and coffee shops in New York recorded in the Foursquare database will be sufficient for the analysis.
dataset_by_city is essentially a wrapper function that assembles the steps taken in the analysis we've just done above. Creating this function allows me to update the city parameter and store the results conveniently.
We're able to reproduce the same numbers for San Francisco, so the dataset_by_city function is working fine. We'll proceed to run the same queries for New York.
The proportions collected for New York seem to be quite close to those collected for San Francisco. In the next segment, statistical tests are conducted to tell us whether the differences in these proportion values are significant.
The two-sample proportion Z-test is used to compare the proportions of the bookstore data in the two cities. The following assumptions allow the statistical test to be carried out meaningfully:
Based on the two-sample proportion Z-test, using the Foursquare data collected with a coverage radius of 4000m, the proportion of bookstores that are within 250m of another bookstore in San Francisco (.707) is significantly different from that in New York (.814), z = -2.19, p = .03.
Based on the two-sample proportion Z-test, using the Foursquare data collected with a coverage radius of 4000m, the proportion of bookstores that are within 250m of a coffee shop in San Francisco (.707) is not significantly different from that in New York (.706), z = 0.01, p = .99.
This analysis has provided a brief look at the venue profile of bookstores in San Francisco. While the data collected are sampled proportions, I would argue that we may use the proportion values as estimates of the probability of finding another bookstore or a coffee shop nearby. For business owners, I think the proportion values are useful indicators of:
1. the level of competition from neighboring stores offering the same service.
2. the level of complementary services around a target location.
If we had profit/loss data and foot traffic data to go with the data collected, we could investigate further what level of competition is beneficial to the service provided, or what kinds of complementary services are good to have around.
In this analysis, I've only provided the proportion of bookstores near other bookstores or coffee shops. It is clear that the analysis need not be just bookstores and coffee shops, but can be expanded to include other choices of store type.
Comparing San Francisco and New York, I learn through the statistical tests that it's a mixed bag in terms of the similarity in the venue profiles of bookstores in these two cities. It should be noted that this analysis only holds under the conditions of a 4000m coverage radius and a proximity distance of 250m. It is also subject to the limitations of the Foursquare data.
While this project concludes here, users are strongly encouraged to vary the parameters used and to extend the comparison exercise to include other cities or localities. This may present many opportunities to discover new and interesting information and insights.
Thank you for checking out this project. I hope it has been enjoyable.
-This project has been done by We You Toh.-