Your work as an analyst has been noticed. You have been asked to join the Governor's re-election effort!
For a re-election campaign, the Governor wants to tell a story of the "Tale of Two States" (the Governor got this idea from New York City Mayor Bill de Blasio's campaign). She has asked you to brief the rest of the staff on the differences between eastern and western portions of Washington.
The primary goal of this notebook is to provide useful comparative analyses upon which to build the Governor's re-election campaign strategy.
A. Do you agree with the Governor that there are two Washingtons?
a. What are the characteristics of the two Washingtons? (or)
b. Why do you disagree?
B. What strategies might the Governor want to employ to address your findings and better her chances for re-election?
The secondary goal of this notebook is to explore the uses of Unsupervised Machine Learning and Hacker Statistics techniques in finding value in our data.
The tertiary goal of this notebook is to showcase data visualisation using the Bokeh visualisation library.
The Governor of Washington asked that we examine the differences between the Eastern and Western regions of Washington.
We need to find out if there are indeed two Washingtons when split between East and West.
After importing the data, and before performing any analyses, it is important to first explore the condition of our dataset.
Our dataset has 40 columns, 36 of which are numeric.
The non-numeric columns are:
1. ID (GEOID)
2. density_group
3. county
4. region
Notice that there are missing values. We will deal with these later.
The pandas module easily provides summary statistics of our dataset, which has been converted to a pandas DataFrame.
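The exploration steps described above can be sketched as follows. The tiny DataFrame and its numeric column names (`population`, `area_sqkm`) are made-up stand-ins for the real 40-column dataset, so treat this as an illustration of the calls rather than the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: IDs and values are invented for illustration.
df = pd.DataFrame({
    "GEOID": ["geo_1", "geo_2", "geo_3"],
    "county": ["Adams", "King", "Spokane"],
    "region": ["East", "West", "East"],
    "density_group": ["low", "high", "medium"],
    "population": [2500.0, 7200.0, np.nan],   # one missing value on purpose
    "area_sqkm": [1200.5, 1.9, 3.4],
})

# Separate numeric from non-numeric columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column.
missing_per_column = df.isna().sum()

# Summary statistics for the numeric columns.
summary = df.describe()

print(df.shape)
print(non_numeric_cols)
print(missing_per_column[missing_per_column > 0])
print(summary)
```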
Simply performing descriptive statistics doesn't tell us much, but a quick look tells us that:
It will be useful to visualize the distribution later.
This will probably have to be taken into serious consideration while comparing other features of the eastern and western regions.
Considering the differences in geographic size and population density, we will have to be careful when dealing with these values.
Area and Population
East Washington, despite having far fewer GEOIDs, is still geographically larger (58% of Washington's area).
East Washington also has a larger share of GEOIDs with large areas: roughly a fifth of them (20%) are greater than 300 sqkm in area. Only roughly 5% of the GEOIDs in the West (57 GEOIDs) have an area greater than 300 sqkm.
Looking at the total population, the distribution in the East is skewed towards large-population GEOIDs, but 78.53% of the total population of Washington resides in the West.
To take our analysis to another level, let us look at the relationship of these two features in each region.
Density = Population / Area
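A minimal sketch of this calculation, using invented numbers and hypothetical column names (`population`, `area_sqkm`). It also shows that a region's overall density is total population over total area, which is not the same as averaging the densities of its GEOIDs.

```python
import pandas as pd

# Invented figures for four GEOIDs; only the arithmetic matters here.
tracts = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "population": [3000, 4500, 6000, 5200],
    "area_sqkm": [600.0, 15.0, 2.0, 40.0],
})

# Density per GEOID: people per square kilometer.
tracts["density"] = tracts["population"] / tracts["area_sqkm"]

# Density per region: sum first, then divide.
region_totals = tracts.groupby("region")[["population", "area_sqkm"]].sum()
region_totals["density"] = region_totals["population"] / region_totals["area_sqkm"]

print(tracts)
print(region_totals)
```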
If you only look at the shape of the distribution and ignore the sheer number of GEOIDs assigned in the West, the ECDF and Bar chart for density groups reveal that the East is dominantly low-density and the West is dominantly high-density.
But, the visualizations also reveal that while most of the denser GEOIDs belong to the West, most of the less dense ones also belong to the West. Moreover, there is still a portion of very dense populations that belong to the East.
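For reference, an ECDF like the one discussed above can be computed in a few lines. This is a generic sketch, not necessarily the helper used in the notebook; the sample values are invented.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Invented densities for a handful of GEOIDs.
x, y = ecdf([3.0, 1.0, 2.0])
print(x, y)
```

Plotting `y` against `x` for each region on the same axes compares the shapes of the two distributions without being dominated by the larger number of GEOIDs in the West.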
Lower = median < $19,000
Lower Middle = median from $19,000 to $35,000
Middle Middle = median from $35,000 to $100,000
Upper Middle = median from $100,000 to $350,000
Upper = median > $350,000
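One way to assign these income groups is `pandas.cut`. The sketch below assumes the thresholds 19,000, 35,000, 100,000 and 350,000 and treats each lower bound as inclusive; how values falling exactly on a boundary were handled in the original analysis is an assumption.

```python
import numpy as np
import pandas as pd

# Bin edges taken from the group definitions; left-closed intervals are an assumption.
bins = [-np.inf, 19_000, 35_000, 100_000, 350_000, np.inf]
labels = ["Lower", "Lower Middle", "Middle Middle", "Upper Middle", "Upper"]

# Invented median incomes, one per group.
median_income = pd.Series([15_000, 19_000, 62_500, 120_000, 400_000])

income_group = pd.cut(median_income, bins=bins, labels=labels, right=False)
print(income_group.tolist())
```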
Gap = Percent; Male - Percent; Female
- A positive value indicates that males, proportionally, have a higher rate of higher education.
- A negative value indicates that females, proportionally, have a higher rate of higher education.
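As a small illustration of the gap and its sign, assuming two hypothetical percentage columns (the real column names may differ):

```python
import pandas as pd

# Invented percentages of males and females with higher education in two GEOIDs.
edu = pd.DataFrame({
    "pct_higher_ed_male": [30.0, 22.5],
    "pct_higher_ed_female": [16.0, 25.0],
})

# Gap in percentage points: positive favors males, negative favors females.
edu["gap"] = edu["pct_higher_ed_male"] - edu["pct_higher_ed_female"]
print(edu)
```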
The GEOID with the highest poverty density is in King County, Seattle in the West (city).
Notice how they have nearly the same population size but one is packed in 0.65% of the area of the other.
They have almost the same poverty rate, with the Eastern one higher by just 6%. Notice, however, that in the Western one we do not know the poverty status (below or above the poverty line) of 34% of its population. There are 90.29 times more constituents below the poverty line per square kilometer (poverty density) in the Western one, while there are 71.10% more constituents under the poverty line in the Eastern one.
Both their median incomes belong to the Low income group, and they have almost similar education levels.
In the Eastern one, the rate of higher education (at least a Bachelor's degree) for males is higher than for females by 14%, while it's only higher by 0.8% in the Western one.
We must realize, however, that GEOIDs similar to the two above can be found on either side of the geographic divide.
Since we are still not sure about the problem at hand, even after exploring and processing our data, we need to return to the data, ideally from a different angle.
The question now is, how should we look at the data this time?
We want to eliminate geographic location from the equation.
The decision tree classifier is able to predict the clustering assigned by our k-means algorithm with very low false positive and very low false negative rates.
It is not an easy task to understand the result of a clustering when numerous features are involved.
We can use the set of rules used by the decision tree to guide us in exploring the intra-cluster commonalities and inter-cluster differences of our two clusters.
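The idea of explaining k-means clusters with a decision tree can be sketched with scikit-learn as below. The two synthetic features (`log_density`, `poverty_rate`), the tree depth, and the train/test split are all assumptions made for illustration; the notebook's actual features and settings are not shown in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: two loose groups of GEOIDs.
rng = np.random.default_rng(0)
low = rng.normal(loc=[1.0, 20.0], scale=[0.3, 3.0], size=(100, 2))
high = rng.normal(loc=[4.0, 10.0], scale=[0.3, 3.0], size=(100, 2))
X = np.vstack([low, high])
feature_names = ["log_density", "poverty_rate"]  # hypothetical feature names

# Step 1: cluster on standardized features, without any geographic label.
X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Step 2: train a shallow tree to predict the cluster from the raw features.
X_train, X_test, y_train, y_test = train_test_split(
    X, clusters, test_size=0.25, random_state=0, stratify=clusters
)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Step 3: check false positives/negatives and read the rules.
cm = confusion_matrix(y_test, tree.predict(X_test))
rules = export_text(tree, feature_names=feature_names)
print(cm)
print(rules)
```

Keeping the tree shallow trades a little accuracy for rules that are short enough to read, which is the point of using it as an explanation tool.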
Note that in comparing clusters, we cannot refer to the specific cluster labels, as the labels of the clusters change every time the algorithm is run. What is important, however, is that the compositions of the clusters remain the same. The clusters are 0 and 1, staying true to Python's zero-based indexing system wherein the first is always index 0.
Area and Density
Poverty
**The high-density cluster has higher poverty densities and more constituents below poverty line.**
However, it has a lower poverty rate, which is the percentage of the population below poverty line.
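The difference between poverty rate and poverty density is easy to miss, so here is a small worked example with invented numbers. A compact, populous area can have a lower rate and still a far higher density and head count.

```python
def poverty_rate(below_poverty, population_with_known_status):
    """Percentage of the population (with known status) below the poverty line."""
    return 100.0 * below_poverty / population_with_known_status

def poverty_density(below_poverty, area_sqkm):
    """Constituents below the poverty line per square kilometer."""
    return below_poverty / area_sqkm

# Invented figures for one dense and one sparse area.
dense = {"below": 900, "known": 6000, "area": 2.0}
sparse = {"below": 600, "known": 3000, "area": 300.0}

dense_rate = poverty_rate(dense["below"], dense["known"])
sparse_rate = poverty_rate(sparse["below"], sparse["known"])
dense_density = poverty_density(dense["below"], dense["area"])
sparse_density = poverty_density(sparse["below"], sparse["area"])

print(dense_rate, sparse_rate)        # rate is lower in the dense area
print(dense_density, sparse_density)  # density is far higher in the dense area
```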
Looking at them separately, the low-density cluster has more low-income GEOIDs than high-income GEOIDs,
while the high-density cluster has more high-income GEOIDs than low-income GEOIDs.
Unemployment
This is where it gets interesting. While **the cluster with most of the Low-income GEOIDs also has a higher distribution of unemployment rate, the other cluster, with most of the High-income GEOIDs, has most of the unemployed in Washington (54%),** although it could be because the latter has a higher total population.
Higher Education
**The high-density, high-income cluster also has the higher rates of higher education and higher total number of those with higher education.**
Women in the low-density cluster seem to have a higher rate of higher education relative to men than women in the high-density cluster do.
Transportation
**The high-density cluster has a higher rate of public transportation usage as well as a higher number of total carpoolers and commuters.**
The low-density cluster might have a slightly higher rate of driving alone, but the high-density cluster includes the huge majority of solo drivers in Washington.
NYC Mayor De Blasio's Tale of Two Cities contextualized his fight against inequality.
Which narrative best frames our battle?
East and West
The most glaring difference between the two is that the overwhelming majority of the affluent (middle middle, upper middle) GEOIDs are in the West.
Also, the East is much more vast than the West, despite the West being home to a huge majority of the population, which is also dominantly middle middle class. Another way to describe it is that a huge majority of Washington's voter base is middle middle class, and almost all of them are in the West.
However, one must remember that the East vs. West narrative is not just an arbitrary divide but a profoundly political one. It is a not-so-subtle clash between the Conservative East, with the city of Spokane at its core, and the Liberal West, strongest in the Seattle Metropolitan Area. This must be taken into great consideration as political or even ideological leaning significantly affects a voter's preferences for economic (macro and micro) policies.
West: Overpopulation, inflation in the housing market, huge number of poor and unemployed, transportation
East: High poverty and unemployment rate, low rate of higher education, transportation
If this narrative is to be utilized, there are two ways to go about it:
Strategy: Capital Movement
One way is presenting the ideological divide as something that need not prevent innovation and progress.
Generate hype with big, transformational, and mutually-beneficial projects that will require full cooperation between the two sides.
The following order of solutions can be offered in the campaign platform:
1. Development: take advantage of unused space in the East. Improve infrastructure and provide incentives like tax breaks to attract investment. The campaign could be to further expand Washington's aerospace manufacturing and software development industries in the East, as they are already among the top drivers of Washington's economy.
2. Job creation: made possible by investment and industrial expansion in the East. This will attract the huge number of unemployed and even the highly educated from the West. This should also help the unemployed already living in the East.
This will draw and further produce a more skilled workforce and ensure that they stay, relieving pressure on the West. Note, however, that the increase in demand for gentrified housing must be due to poverty reduction, and not due to the rich getting richer. Furthermore, the increase in housing prices and the strain on the provision of public services must not be to the detriment of the poor already living in the areas. This will be discussed further in the chapter on densification.
The premise of the strategy is that recreating the successes of the West in the East entails benefits to Washington as a whole.
Remember that the West, while home to the most affluent, is still largely composed of a massive middle middle class. Reducing the population in the West will also reduce inflationary pressures, particularly in the housing market. It also provides new investment opportunities for the burgeoning middle class in the West eager to let their money work for them.
One way of marketing this is bringing up the spirit of cooperation without the need to break down boundaries and values.
Overall, this dichotomy risks ignoring these groups of constituents:
By focusing only on the lower-income constituents in the East, for example, the dire needs of the lower-income constituents in the West will be set aside in agenda-setting. Not only that, but the very conditions that lead to poverty in the West will be ignored entirely. Simply luring the less fortunate out of the West into the East is a short-term fix. If these problems endemic to the West are not tackled directly, then there won't be a solution when the same problems start to arise in the East.
Not only that, but it clashes with sentiments for preserving the forests and farmlands in the East.
Concerning the governor's political career: this is a long-term project and thus final results will not yet be visible by the end of the governor's term.
Strategy: Targeted Campaigning
Another is simply to acknowledge the divide and make reconciliations or adjustments agreeable to both parties.
What is amicable for both would probably involve promising changes that wouldn't impose on either side. It can be acknowledged that Eastern Washington travels a separate economic road.
The focus, however, must still be on fighting inequality.
Tailor-made campaigns for each side:
In the East, the lack of higher education is not necessarily a problem. In fact, the low rates of higher education could be a reflection of the needs of the job market in the East. Most jobs in the agricultural sector do not require a college education. However, if more advanced technologies are to be used, relevant higher education qualifications will probably be necessary, and easier access to such education, meaning making it accessible in the East itself, is certainly an attractive improvement.
Improving average household income in the East could be a matter of improving the returns of working in the agricultural sector to its workers. Policies that protect and improve the lives of their workers come to mind.
Overall, this caters to conservative desire for autonomy, as they paradoxically do not care for how the liberal world imposes its values on them. It is important to take note that conservatives tend to be against socialist economic policies.
Furthermore, an important line of investigation that must be pursued is finding out the reasons behind high unemployment rates in the East.
Regions that hold the highest potential for a thriving solar energy industry are in East Washington.
As home to Washington's top manufacturing, technological, professional, medical, educational, and financial companies, West Washington needs its high population. It is the fuel that runs this engine of the economy. It is the government's job to take care of this population, the main problems of which are caused by crowding.
With a higher population comes higher housing prices due to higher demand, exacerbated by the redevelopment of many areas to cater to middle class tastes (Gentrification and urban sprawl), rendering previously accessible housing unaffordable to some. This is one of the problems faced in New York, which has been tackled by regulation of land development in order to keep some housing affordable. This will be discussed further in the chapter on densification.
Ease of Transport
The improvement of traffic and public transportation is vital not only to the economy, but also to the wellbeing of citizens.
Air pollution, noise pollution, and garbage disposal. Initiatives to address these issues are sure to win the support of constituents.
The growth of industry requires a population that can meet its growing demands for skilled and professional labor. Job openings are useless to the populace if they are not qualified for them. Take advantage of the huge population by making them more skilled. Its Liberal constituents believe that it is the responsibility of the government to secure this.
While fighting crime is high on the agenda, it must not be seen as a war against those in poverty. Crime must be fought at the roots, as Liberals would argue. It is due to rising inequality and the failure of social support. Increased policing is reactive and only really helps those already financially secure. While a majority of West Washington is relatively affluent, their liberal values must be considered.
It taps into the pre-existing political divide. The targeted responses to differentiated demands maximize the likelihood of gaining support from both sides without compromising ideological leanings. However, making everyone happy will require careful and transparent resource allocation.
This dichotomy still risks ignoring this subset:
Only looking at median income distributions discounts the varying densities within each side.
Middle middle class constituents, who comprise the biggest chunk of our voters, are situated in different contexts, and therefore have differing needs and concerns.
Note that we are not literally using the labels assigned by the clustering as the two cities. We are simply using the results of the clustering algorithm to guide our data analysis.
While the West, as a whole, is denser than the East, the aggregative nature of the East vs. West narrative does not strongly reflect distributions of different population densities within the two divides.
When our Machine Learning algorithm was "asked" to divide Washington, given our limited data, it gave us two different mixtures of the East and the West. Differences are magnified when geographic location is ignored. Inequalities are more visible when more similar parts from each side are put together.
It showed us more significant differences in population density, income, poverty, unemployment, education and means of transportation.
Instead of utilizing and thus reinforcing the politicized tension between Eastern and Western Washington, one can instead aim to break down this barrier and claim that party politics is hindering progress and is not really what matters. The flagship of the campaign can then be not just cooperation, as in our first strategy, but the ditching of party politics altogether. This is a bit of a stretch, especially given how conservative the East is, but it is definitely not impossible.
Our two cities now become the rural and the urban. But like in the East vs. West narrative, there are two ways to go about it:
Strategy: Urbanization
This strategy is similar to our Capital Movement strategy, but now disregarding the East vs. West narrative and instead adopting a more politically neutral sense of inequality.
The promise is to spread out jobs while also providing affordable housing, mostly for the middle middle class, without shunning the lower middle and lower classes, because the liberal middle class would most likely be concerned with such affairs as well.
Population density is a reflection of underlying conditions.
While increasing density seems to increase high-income households, does it also translate to more households escaping poverty or does it just mean that higher-income households tend to flock together in the city?
Which brings us to the important question: why do people flock to the city?
The key to this is employment. What our clustering data ultimately reveals is the difference between the Rural and the Urban when it comes to unemployment.
Rural: High unemployment rate. Further research points to a lack of jobs.
Urban: Lower unemployment rate but more of the unemployed. The unemployed move to the city in hopes of a better life. Poverty also leads to a lack of education and thus difficulty finding employment.
While urban sprawl, which is the spread of population outwards to suburban areas, does reduce excessive inflationary pressures in the housing market, it creates a transportation problem, as people in the suburbs still working in the city have to travel longer distances. The lack of public transportation increases private vehicle use, and thus also carbon emissions. In other words, making more people live in the suburbs simply transfers costs to transportation.
Promoting and encouraging new industrial activity and thus job creation outside the current centers (Seattle area and Spokane) is therefore the better option in the long run.
As mentioned earlier, this has to be backed up by a capable work force. Ultimately, the goal is to decrease poverty by combating unemployment and improving income through education and social support.
Akin to the earlier targeted campaigning strategy, the strategy is to avoid major changes in industry and infrastructure, and instead focus on appeasing the existing segments in society.
This strategy utilizes dichotomies within our primary dichotomy.
1. Rural
a. Rural upper class
b. Rural working class
2. Urban
a. Urban upper class
b. Urban working class
Let constituents self-identify to which category they feel they belong. What's important is that there are satisfying promises ready for each niche. Also note that socialist policies that help the lower class may also please the liberal middle class, as liberal attitudes towards poverty suggest.
The premise is to cater to different needs of different segments of the population. The campaign platform should be able to cater to both the poor and the affluent by identifying that they have different needs.
Plays on the growing discourse of Urban Intellectuals versus Rural Morals.
Before diving into the importance of density, let's take a look at factors that correlate with income.
Let us now isolate population density and its relationship to other factors.
Density and poverty rate have a rather weak positive (+) correlation.
Density and unemployment rate have a very weak negative (-) correlation. (-0.0260)
Using scatter plots, we can see that visually our GEOIDs together barely resemble the best-fit lines.
The R-squared is not merely low, it is negative. This means that the fitted line performs worse than a horizontal line at the mean. Density alone is a very poor predictor of income.
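To make the correlation and R-squared figures concrete, here is a sketch on synthetic data in which density and income are unrelated by construction. It is not the notebook's computation. Note that for an ordinary least-squares line scored on the same data it was fitted to, R-squared equals the squared Pearson correlation and cannot be negative, so a negative value suggests the predictions were scored some other way (for example, on held-out data); that reading is an assumption.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: income does not depend on density at all.
rng = np.random.default_rng(42)
density = rng.uniform(0, 10, size=200)
income = rng.normal(50, 10, size=200)

# Pearson correlation coefficient.
r = np.corrcoef(density, income)[0, 1]

# Best-fit line, scored on the data it was fitted to.
slope, intercept = np.polyfit(density, income, deg=1)
r2_fit = r_squared(income, slope * density + intercept)

# A horizontal line at the mean scores exactly zero.
r2_mean = r_squared(income, np.full_like(income, income.mean()))

# Predictions worse than the mean give a negative score.
r2_bad = r_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])

print(r, r2_fit, r2_mean, r2_bad)
```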
Null Hypothesis: **Low-density and high-density cases have similar distributions.**
The permutation test on poverty rates indicates that low-density and high-density GEOIDs have considerably different distributions.
(Reject Null Hypothesis)
The permutation test on unemployment rates finds no significant difference between the distributions for low-density and high-density GEOIDs.
(Fail to reject Null Hypothesis)
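A permutation test of this kind can be sketched as follows. The test statistic (absolute difference of means) and the number of permutations are assumptions; the notebook may have used a different statistic.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the pooled data and split it into two groups of the original sizes.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (count + 1) / (n_permutations + 1)

# Invented samples: clearly separated groups versus identical groups.
p_different = permutation_p_value(np.arange(50), np.arange(50) + 100)
p_identical = permutation_p_value(np.arange(50), np.arange(50))
print(p_different, p_identical)
```

A small p-value is evidence against the null hypothesis; a large one only means the data are compatible with it, not that the two distributions are proven equal.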
The bootstrap test on poverty rates indicates that low-density and high-density GEOIDs have considerably different distributions.
(Reject Null Hypothesis)
The bootstrap test on unemployment rates finds no significant difference between the distributions for low-density and high-density GEOIDs.
(Fail to reject Null Hypothesis)
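A bootstrap version of the test can be sketched like this. It assumes the common recipe of shifting both samples to a shared mean (to simulate the null hypothesis) and then resampling each with replacement; whether the notebook followed exactly this recipe is an assumption.

```python
import numpy as np

def bootstrap_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided bootstrap test for a difference of means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()

    # Shift both samples so they share the pooled mean (the null hypothesis).
    pooled_mean = np.concatenate([a, b]).mean()
    a_shifted = a - a.mean() + pooled_mean
    b_shifted = b - b.mean() + pooled_mean

    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample each shifted group with replacement.
        a_star = rng.choice(a_shifted, size=len(a), replace=True)
        b_star = rng.choice(b_shifted, size=len(b), replace=True)
        diffs[i] = a_star.mean() - b_star.mean()

    # Fraction of simulated differences at least as extreme as the observed one.
    return float(np.mean(np.abs(diffs) >= abs(observed)))

# Invented samples: clearly separated groups versus identical groups.
p_different = bootstrap_p_value(np.arange(50), np.arange(50) + 100)
p_identical = bootstrap_p_value(np.arange(50), np.arange(50))
print(p_different, p_identical)
```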
Conclusion on Density
Instead, what density provides is the context necessary for targeted policy-making.
Most notable are the effects on housing for the poor, which is explained next.
Time series data Time series data would allow us to detect specific trends in Washington's history that lead to growth. Multivariate time series data, for example, would reveal links or dependencies between key factors or industries.
Data for measuring inequality Data that allows the measurement of income inequality (GINI coefficient) should be made available in order to get a more detailed picture of the situation that would let us more directly tackle the problem of inequality.
Crime statistics This would allow an informed plan to combat crime in a less reactionary manner. Paired with the rest of the data (especially time series data), it would help shed light on the deeper causes of crime epidemics.
Climate data Economic development policy formulation can largely benefit from climate data, as the feasibility of heavily-profitable industries can be dependent on an area's climate.
Housing market data Housing market fluctuations heavily affect the less fortunate, as they are the ones less able to adapt to inflation.
Industry sector data + available jobs Paired with geography, such data is essential to the formulation of a tangible economic plan.
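For the inequality item in the list above: if household-level income data were available, a Gini coefficient could be computed roughly as follows. This is a generic textbook formula applied to invented incomes, not something derived from the current dataset.

```python
import numpy as np

def gini(incomes):
    """Gini coefficient of non-negative incomes (0 = perfect equality)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Standard formula based on the rank-weighted sum of sorted incomes.
    return 2.0 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

equal = gini([10, 10, 10, 10])    # everyone earns the same
unequal = gini([0, 0, 0, 10])     # one household earns everything
print(equal, unequal)
```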
Its goals are:
1. To look for properties that show as much variation across instances as possible
2. To look for properties that allow us to predict or "reconstruct" the original characteristics
In doing this, it effectively reduces statistical noise, or unexplained variations in the sample.
It practically made the clustered data useless.
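Assuming the technique described here is principal component analysis (PCA), which the two goals above match, both goals can be illustrated with scikit-learn on synthetic data. The three correlated columns below are invented; `explained_variance_ratio_` speaks to the first goal and the reconstruction error to the second.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three noisy columns driven by a single hidden factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + rng.normal(scale=0.1, size=(200, 3))

# Standardize so that no column dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep a single component.
pca = PCA(n_components=1).fit(X_scaled)
reduced = pca.transform(X_scaled)

# Goal 1: how much of the variation the kept component captures.
explained = pca.explained_variance_ratio_[0]

# Goal 2: how well the original columns can be reconstructed from it.
reconstructed = pca.inverse_transform(reduced)
reconstruction_error = np.mean((X_scaled - reconstructed) ** 2)

print(reduced.shape, explained, reconstruction_error)
```

Dropping components discards whatever variation they carried, which may be part of why the reduced data proved less useful for clustering here.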
Explanations and illustrations of machine learning techniques were from: 1. Datacamp 2. Data Science for Business by Foster Provost and Tom Fawcett
Tinsely, Karen and Bishop, Matt (2006). Poverty and Population Density: Implications for Economic Development Policy.
Whitehead, Christine. The Density Debate.