To see, or not to see,




Re-election Campaign Strategy


Unsupervised Learning and Hacker Statistics


Author: Jan Erish Baluca
Portfolio on Github

This analysis is based on a presentation prompt from my time at General Assembly in London, provided by Joana Wang:

1 Your work as an analyst has been noticed. You have been asked to join the Governor's re-election effort!

2 For a re-election campaign, the Governor wants to tell a story of the "Tale of Two States" (the Governor got this idea from New York City Mayor Bill de Blasio's campaign). She has asked you to brief the rest of the staff on the differences between eastern and western portions of Washington.

The primary goal of this notebook is to provide useful comparative analyses upon which to build the Governor's re-election campaign strategy.

General prompts:

A. Do you agree with the Governor that there are two Washingtons?

a. What are the characteristics of the two Washingtons? (or)

b. Why do you disagree?

B. What strategies might the Governor want to employ to address your findings and better her chances for re-election?

  • The data is from the American Community Survey.
  • Each row ID represents a geographic area in the state of Washington, identified by its unique GEOID.
  • According to the government website, "GEOIDs are numeric codes that uniquely identify all administrative/legal and statistical geographic areas for which the Census Bureau tabulates data."
  • Each row belongs either to east or west.

The secondary goal of this notebook is to explore the uses of Unsupervised Machine Learning and Hacker Statistics techniques in finding value in our data.

The tertiary goal of this notebook is to showcase data visualisation using the Bokeh visualisation library.

A combination of markdown, HTML, and a little bit of Javascript was used to style this notebook.

Exploratory Data Analysis (EDA)


After importing the data, and before performing analyses on the data, it is important to explore the very condition of our dataset first.

Our dataset has 40 columns, 36 of which are numeric.
The non-numeric columns are:
1. ID (GEOID) 2. density_group 3. county 4. region

Notice that there are missing values. We will deal with these later.
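
A minimal sketch of this first inspection, assuming the data has been loaded into a pandas DataFrame. The toy frame below is a hypothetical stand-in for the real 40-column dataset: only the four non-numeric column names come from the text above, and `total_population` and `area_sqkm` are made-up names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only.
df = pd.DataFrame({
    "ID": ["g1", "g2", "g3", "g4"],
    "density_group": ["low", "high", "high", "low"],
    "county": ["Whitman", "King", "King", "Spokane"],
    "region": ["east", "west", "west", "east"],
    "total_population": [4200.0, 5100.0, np.nan, 3900.0],
    "area_sqkm": [310.5, 1.9, 2.4, np.nan],
})

# Split the columns by type.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column, keeping only columns that have any.
missing = df.isna().sum()
missing = missing[missing > 0]

print(len(df.columns), "columns,", len(numeric_cols), "numeric")
print("Non-numeric:", non_numeric_cols)
print(missing)
```

On the real data, the same three steps would report the 40 columns, the 36 numeric ones, and which features contain gaps.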

Descriptive statistics


The pandas module easily provides summary statistics of our dataset, which has been converted to a pandas DataFrame.
Simply performing descriptive statistics doesn't tell us much, but a quick look tells us that:

  • The size difference between the geographically smallest area and the largest area is huge.

    It will be useful to visualize the distribution later.

  • Population densities vary greatly.

    This will probably have to be taken into serious consideration while comparing other features of the eastern and western regions.

  • There are features with minimum values of 0.00.

    Considering the differences in geographic size and population density, we will have to be careful when dealing with these values.
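
The three observations above can be read directly from the output of `DataFrame.describe()`. The sketch below uses a small hypothetical frame (column names and values are assumptions) to show the kind of checks involved.

```python
import pandas as pd

# Hypothetical features for four GEOIDs; names and values are illustrative only.
df = pd.DataFrame({
    "area_sqkm": [0.4, 2.1, 310.5, 1500.0],
    "density": [9000.0, 2400.0, 13.5, 0.0],
    "poverty_rate": [0.21, 0.08, 0.15, 0.0],
})

# One row per feature: count, mean, std, min, quartiles, max.
summary = df.describe().T

# Spread between the largest and smallest value; left undefined when the minimum is 0.
positive_min = summary["min"].where(summary["min"] > 0)
summary["max_to_min"] = summary["max"] / positive_min

# Features whose minimum is exactly 0 deserve a second look.
zero_min_features = summary.index[summary["min"] == 0].tolist()

print(summary[["min", "max", "max_to_min"]])
print(zero_min_features)
```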


Empirical Cumulative Distribution Function (ECDF)


Bokeh plots: Move It, Groove It

  • For this notebook, I have chosen the Bokeh visualization library for creating dynamic and interactive dashboards.
  • Hovering over a data point in the graph will reveal its X and Y axis values.
  • There is also a toolbox on the right side of the graphs that allows panning and zooming.
  • The axes can be rescaled by hovering over the respective edge and using the activated wheel zoom tool.
  • Each plot on a dashboard can be viewed by clicking on its respective tab.

East vs. West


Area and Population

East Washington, despite having far fewer GEOIDs, is still geographically larger (58% of Washington's area).
East Washington also has a larger share of GEOIDs with large areas: roughly a fifth of them (20%) exceed 300 sq km. Only about 5% of GEOIDs in the West (57 GEOIDs) exceed 300 sq km.

Looking at the total population, the distribution in the East is skewed towards large-population GEOIDs, but 78.53% of the total population of Washington resides in the West.


To take our analysis to another level, let us look at the relationship of these two features in each region.

Density = Population / Area

If you only look at the shape of the distribution and ignore the sheer number of GEOIDs assigned in the West, the ECDF and Bar chart for density groups reveal that the East is dominantly low-density and the West is dominantly high-density.

But, the visualizations also reveal that while most of the denser GEOIDs belong to the West, most of the less dense ones also belong to the West. Moreover, there is still a portion of very dense populations that belong to the East.
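
For readers unfamiliar with the ECDF, here is a minimal sketch of how one can be computed, together with the density formula above. The populations and areas are hypothetical values, not figures from the dataset.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values x and their cumulative proportions y."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical populations and areas (sq km) for five GEOIDs.
population = np.array([4000, 5200, 3100, 6100, 2500])
area = np.array([2.0, 400.0, 1.0, 305.0, 50.0])

# Density = Population / Area
density = population / area

x, y = ecdf(area)

# Share of GEOIDs larger than 300 sq km: one minus the ECDF evaluated at 300.
share_over_300 = 1 - np.mean(x <= 300)
print(share_over_300)
```

Plotting `y` against `x` for each region gives curves like the ones described above. Because the ECDF works in proportions, it compares the shapes of the two distributions regardless of how many GEOIDs each region has.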



  • Lower = median < $19,000

  • Lower Middle = $19,000 to $35,000

  • Middle Middle = $35,000 to $100,000

  • Upper Middle = $100,000 to $350,000

  • Upper = median > $350,000
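
A sketch of how GEOIDs could be assigned to these income groups with `pandas.cut`. The incomes are hypothetical, and the way values sitting exactly on a boundary are handled (each boundary belongs to the higher group) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Boundaries taken from the list above. With right=False each bin is [low, high),
# so a value exactly on a boundary falls into the higher group (an assumption).
bins = [-np.inf, 19_000, 35_000, 100_000, 350_000, np.inf]
labels = ["Lower", "Lower Middle", "Middle Middle", "Upper Middle", "Upper"]

# Hypothetical median household incomes for five GEOIDs.
median_income = pd.Series([15_500, 28_000, 72_000, 140_000, 400_000])

income_group = pd.cut(median_income, bins=bins, labels=labels, right=False)
print(income_group.tolist())
```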


Higher Education

Gap = Percent; Male - Percent; Female

  • A positive value indicates that males, proportionally, have a higher rate of higher education.
  • A negative value indicates that females, proportionally, have a higher rate of higher education.


The West has a higher proportion of workers who use public transportation

Highest Poverty vs. Highest Poverty Density


The GEOID with the most constituents below poverty line is in Whitman County in the East (country).

The GEOID with the highest poverty density is in King County, Seattle in the West (city).
Notice how they have nearly the same population size but one is packed in 0.65% of the area of the other.

They have almost the same poverty rate, with the Eastern one higher by just 6%. Note, however, that in the Western one we do not know the poverty status (below or above the poverty line) of 34% of the population. The Western one has 90.29 times more constituents below the poverty line per square kilometer (poverty density), while the Eastern one has 71.10% more constituents below the poverty line in total.
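
To make the difference between the two measures concrete, here is a small sketch with hypothetical figures (not the actual values of the two GEOIDs). It takes poverty rate as the share of the population below the poverty line and poverty density as the number of constituents below the line per square kilometer.

```python
def poverty_metrics(below_poverty, population, area_sqkm):
    """Return (poverty rate in percent, poverty density in people per sq km)."""
    rate = 100.0 * below_poverty / population
    density = below_poverty / area_sqkm
    return rate, density

# Hypothetical figures: a large rural GEOID and a compact urban one
# with the same population.
rural_rate, rural_density = poverty_metrics(3000, 6000, 200.0)
urban_rate, urban_density = poverty_metrics(2700, 6000, 2.0)

density_ratio = urban_density / rural_density
print(rural_rate, urban_rate, density_ratio)
```

In this toy case the two poverty rates differ by only five points, yet the poverty density of the compact area is 90 times higher, which is the same kind of contrast described above.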

Both of their median incomes belong to the Low income group, and they have similar education levels.

The Western one has an older population. Its unemployment rate is also 6.6% higher, and it has 62% more unemployed constituents. This could be because of its older population, as the unemployment rate also seems to count those aged 65 and over.

In the Eastern one, the rate of higher education (at least a Bachelor's degree) for males is higher than for females by 14%, while it's only higher by 0.8% in the Western one.

We must realize, however, that GEOIDs similar to the two above can be found on either side of the geographic divide.

East and west are not the same, but--


Let's go back


Since we are still not sure about the problem at hand, even after exploring and processing our data, we need to return to the data, ideally from a different angle.

The question now is, how should we look at the data this time?

Let's ask Artificial Intelligence!


What are the two cities in Washington's tale?


Unsupervised Learning: KMeans Clustering


Removing unwanted features


We want to eliminate geographic location from the equation.
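
A minimal sketch of this step, assuming scikit-learn's `KMeans`. The toy data, the numeric column names, and the decision to standardise features before clustering are all assumptions of this sketch. Dropping `density_group` along with the identifier and location columns is also an assumption, made here because k-means needs numeric input.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 60

# Hypothetical toy data: two loose groups of GEOIDs spread over both regions.
df = pd.DataFrame({
    "ID": [f"geo{i}" for i in range(n)],
    "county": ["A"] * n,
    "region": ["east", "west"] * (n // 2),
    "density_group": ["low"] * n,
    "density": np.r_[rng.normal(50, 10, n // 2), rng.normal(3000, 300, n // 2)],
    "median_income": np.r_[rng.normal(40_000, 4_000, n // 2),
                           rng.normal(90_000, 8_000, n // 2)],
})

# Drop identifiers and everything that encodes geographic location.
features = df.drop(columns=["ID", "county", "region", "density_group"])

# Standardise so that no single feature dominates the distance metric.
X = StandardScaler().fit_transform(features)

# Ask for two clusters, one for each "city" in the tale.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X)

# Each cluster mixes East and West, because location was never shown to the model.
print(pd.crosstab(df["cluster"], df["region"]))
```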


Can I see?


t-SNE: t-distributed stochastic neighbor embedding
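
t-SNE projects high-dimensional points onto two dimensions while trying to keep neighbouring points close together, which makes it possible to look at the clusters on a flat plot. The sketch below assumes scikit-learn's `TSNE`. The synthetic data and the `perplexity` value are illustrative choices, not the settings used for the plots in this notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical data: 60 observations with 8 features, drawn from two groups.
X = np.vstack([
    rng.normal(0, 1, (30, 8)),
    rng.normal(6, 1, (30, 8)),
])

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

# Each row of `embedding` is an (x, y) position that can be scatter-plotted
# and coloured by cluster label.
print(embedding.shape)
```

Note that the axes of a t-SNE plot have no direct interpretation; only the relative grouping of the points is meaningful.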


Understanding the results of the clustering


Decision Tree


Decision Tree Performance


Looks like our classifier performs very well.


The decision tree classifier is able to predict the clustering assigned by our k-means algorithm with very low false positive and very low false negative rates.
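
A sketch of how such an evaluation could be set up, assuming scikit-learn. The synthetic data, the train/test split, and the tree depth are assumptions; the confusion matrix is what yields the false positive and false negative rates.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical feature matrix with two well-separated groups.
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])

# The cluster labels from k-means become the target for the classifier.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Rows are true clusters, columns are predicted clusters.
cm = confusion_matrix(y_test, tree.predict(X_test))
tn, fp, fn, tp = cm.ravel()

false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
print(cm)
print(false_positive_rate, false_negative_rate)
```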

New Rules


It is not an easy task to understand the result of a clustering when numerous features are involved.
We can use the set of rules used by the decision tree to guide us in exploring the intra-cluster commonalities and inter-cluster differences of our two clusters.


Extracting the rules:


The following is the complete set of rules that define our decision tree, formatted as a Python function:

  • Total; Estimate; Workers 16 years and over <= 2199.5
  • Total with Bachelor's Degree <= 2444.89599609375
  • Total; Estimate; AGE - 18 to 64 years <= 3705.0
  • Median income (dollars); Estimate; Households <= 95534.5
  • Total; Estimate; Percent bachelor's degree or higher <= 0.6549999713897705

Clustering: Composition


Note that in comparing clusters, we cannot refer to the specific cluster labels, as the labels of the clusters change every time the algorithm is run. What is important, however, is that the compositions of the clusters remain the same. The clusters are 0 and 1, staying true to Python's zero-based indexing system wherein the first is always index 0.
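
One way to make comparisons repeatable despite the changing labels is to renumber the clusters by a property of their members, for example mean density. The helper below is a hypothetical sketch of that idea and is not part of the original analysis.

```python
import numpy as np

def relabel_by_feature(labels, feature):
    """Renumber clusters so that cluster 0 has the lowest mean of `feature`."""
    labels = np.asarray(labels)
    feature = np.asarray(feature, dtype=float)
    ids = np.unique(labels)
    means = np.array([feature[labels == k].mean() for k in ids])
    order = ids[np.argsort(means)]  # cluster ids from lowest to highest mean
    mapping = {old: new for new, old in enumerate(order)}
    return np.array([mapping[k] for k in labels])

# Hypothetical densities and two runs that found the same partition
# but assigned the labels the other way round.
density = np.array([10.0, 2500.0, 30.0, 4000.0])
run_a = np.array([0, 1, 0, 1])
run_b = np.array([1, 0, 1, 0])

print(relabel_by_feature(run_a, density))
print(relabel_by_feature(run_b, density))
```

After relabelling, both runs agree that cluster 0 is the low-density one.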


Area and Density


Unsupervised clustering stats


Poverty

The high-density cluster has higher poverty densities and more constituents below the poverty line.

However, it has a lower poverty rate, which is the percentage of the population below the poverty line.


Looking at them separately, the low-density cluster has more low-income GEOIDs than high-income GEOIDs,
while the high-density cluster has more high-income GEOIDs than low-income GEOIDs.

Unemployment

This is where it gets interesting. While the cluster with most of the low-income GEOIDs also has unemployment rates that skew higher, the other cluster, with most of the high-income GEOIDs, has most of the unemployed in Washington (54%), although this could be because the latter has a higher total population.

Higher Education

The high-density, high-income cluster also has the higher rate of higher education and the higher total number of people with higher education.

Women in the low-density cluster seem to have a higher rate of higher education relative to men than women in the high-density cluster do.

Transportation

The high-density cluster has a higher rate of public transportation usage as well as a higher number of total carpoolers and commuters.

The low-density cluster might have a slightly higher rate of driving alone, but the high-density cluster includes the huge majority of solo drivers in Washington.

So, how shall we spin this tale?


NYC Mayor De Blasio's Tale of Two Cities contextualized his fight against inequality.

Which narrative best frames our battle?

East and West

The most glaring difference between the two is that the overwhelming majority of the affluent (middle middle, upper middle) GEOIDs are in the West.

Also, the East is much more vast than the West, despite the West being home to a huge majority of the population, which is also dominantly middle middle class. Another way to describe it is that a huge majority of Washington's voter base is middle middle class, and almost all of them are in the West.

However, one must remember that the East vs. West narrative is not just an arbitrary divide but a profoundly political one. It is a not-so-subtle clash between the Conservative East, with the city of Spokane at its core, and the Liberal West, strongest in the Seattle Metropolitan Area. This must be taken into great consideration as political or even ideological leaning significantly affects a voter's preferences for economic (macro and micro) policies.

West: Overpopulation, inflation in the housing market, huge number of poor and unemployed, transportation
East: High poverty and unemployment rate, low rate of higher education, transportation

If this narrative is to be utilized, there are two ways to go about it:

Strategy: Capital Movement

One way is presenting the ideological divide as something that need not prevent innovation and progress.
Generate hype with big, transformational, and mutually-beneficial projects that will require full cooperation between the two sides.

The following order of solutions can be offered in the campaign platform:

1. Development: take advantage of unused space in the East

    Improve infrastructure and provide incentives like tax breaks to attract investment. The campaign could be to further expand Washington's aerospace manufacturing and software development industries in the East, as they are already among the top drivers of Washington's economy.

2. Job creation: made possible by investment and industrial expansion in the East

    This will attract the huge number of unemployed and even the highly educated from the West. This should also help the unemployed already living in the East.

3. Densification leading to Gentrification: improve access to education and conditions of living

    This will draw and further produce a more skilled workforce and ensure that they stay, relieving pressure on the West. Note, however, that the increase in demand for gentrified housing must be due to poverty reduction, and not due to the rich getting richer. Furthermore, the increase in housing prices and the strain on the provision of public services must not come at the expense of the poor already living in the areas. This will be discussed further in the chapter on densification.

The premise of the strategy is that recreating the successes of the West in the East entails benefits to Washington as a whole.



Remember that the West, while home to the most affluent, is still largely composed of a massive middle middle class. Reducing the population in the West will also reduce inflationary pressures, particularly in the housing market. It also provides new investment opportunities for the burgeoning middle class in the West, eager to let their money work for them.

One way of marketing this is bringing up the spirit of cooperation without the need to break down boundaries and values.



Overall, this dichotomy risks ignoring these groups of constituents:

  • Dense populations in the East like Spokane, unless they are emphasized as the centers of growth in the East
  • Low-density populations in the West (a huge number)
  • Low-income and unemployed constituents in the West who do not have the capacity or willingness to relocate
  • Constituents that rely on and prefer Agriculture and would resist changes.

By focusing only on the lower-income constituents in the East, for example, the dire needs of the lower-income constituents in the West will be set aside in agenda-setting. Not only that, but the very conditions that lead to poverty in the West will be ignored entirely. Simply luring the less fortunate out of the West into the East is a short-term fix. If these problems endemic in the West are not tackled directly, then there won't be a solution when the same problems start to arise in the East.

Not only that, but it clashes with sentiments for preserving the forests and farmlands in the East.

Concerning the governor's political career: this is a long-term project and thus final results will not yet be visible by the end of the governor's term.

Strategy: Targeted Campaigning

Another way is simply to acknowledge the divide and make reconciliations or adjustments amicable to both parties.

What is amicable for both would probably involve promising changes that wouldn't impose on either side. It can be acknowledged that Eastern Washington travels a separate economic road.

The goal, however, must still be on fighting inequality.

Tailor-made campaigns for each side:


Agricultural Might

In the East, the lack of higher education is not necessarily a problem. In fact, the low rates of higher education could be a reflection of the needs of the job market in the East. Most jobs in the agricultural sector do not require a college education. However, if more advanced technologies are to be used, relevant higher education qualifications will probably be necessary, and easier access to such education, meaning making it accessible in the East itself, is certainly an attractive improvement.

Improving average household income in the East could be a matter of improving the returns of working in the agricultural sector to its workers. Policies that protect and improve the lives of these workers come to mind.

Overall, this caters to conservative desire for autonomy, as they paradoxically do not care for how the liberal world imposes its values on them. It is important to take note that conservatives tend to be against socialist economic policies.

Furthermore, an important line of investigation that must be pursued is finding out the reasons behind high unemployment rates in the East.

Photovoltaic Powerhouse

Regions that hold the highest potential for a thriving solar energy industry are in East Washington.


As home to Washington's top manufacturing, technological, professional, medical, educational, and financial companies, West Washington needs its high population. It is the fuel that runs this engine of the economy. It is the government's job to take care of this population, whose main problems are caused by crowding.

Affordable Housing

With a higher population comes higher housing prices due to higher demand, exacerbated by the redevelopment of many areas to cater to middle class tastes (Gentrification and urban sprawl), rendering previously accessible housing unaffordable to some. This is one of the problems faced in New York, which has been tackled by regulation of land development in order to keep some housing affordable. This will be discussed further in the chapter on densification.

Ease of Transport

The improvement of traffic and public transportation is vital not only to the economy, but also to the wellbeing of citizens.

Pollution Reduction

Air pollution, noise pollution, and garbage disposal. Initiatives to address these issues are sure to win the support of constituents.

Public Education

The growth of industry requires a population that can meet its growing demands for skilled and professional labor. Job openings are useless to the populace if they are not qualified for them. Take advantage of the huge population by making them more skilled. Its Liberal constituents believe that it is the responsibility of the government to secure this.


While fighting crime is high on the agenda, it must not be seen as a war against those in poverty. Crime must be fought at the roots, as Liberals would argue. It is due to rising inequality and the failure of social support. Increased policing is reactive and only really helps those already financially secure. While a majority of West Washington is relatively affluent, their liberal values must be considered.



This taps into a pre-existing political divide. The targeted responses to differentiated demands maximize the likelihood of gaining support from both sides without compromising ideological leanings. However, making everyone happy will require careful and transparent resource allocation.



This dichotomy still risks ignoring these subsets:

  • Dense populations in the East
  • Low-density populations in the West (a huge number)

Only looking at median income distributions discounts the varying densities within each side.
Middle middle class constituents, who comprise the biggest chunk of our voters, are situated in different contexts, and therefore have differing needs and concerns.

Rural and Urban

Note that we are not literally using the labels assigned by the clustering as the two cities. We are simply using the results of the clustering algorithm to guide our data analysis.

While the West, as a whole, is denser than the East, the aggregative nature of the East vs. West narrative does not strongly reflect distributions of different population densities within the two divides.

When our Machine Learning algorithm was "asked" to divide Washington, given our limited data, it gave us two different mixtures of the East and the West. Differences are magnified when geographic location is ignored. Inequalities are more visible when more similar parts from each side are put together.

It showed us more significant differences in population density, income, poverty, unemployment, education and means of transportation.

Instead of utilizing and thus reinforcing the politicized tension between Eastern and Western Washington, one can aim to break down this barrier and claim that party politics is hindering progress and is not really what matters. The flagship of the campaign can then be not just cooperation, as in our first strategy, but the ditching of party politics. This is a bit of a stretch, especially given how conservative the East is, but definitely not impossible.

"Economic reality doesn't care about red and blue. What matters is how the state overall performs against its international competition."

Our two cities now become the rural and the urban. But like in the East vs. West narrative, there are two ways to go about it:

Strategy: Urbanization

This strategy is similar to our Capital Movement strategy, but now disregarding the East vs. West narrative and instead adopting a more politically-neutral sense of inequality.

The promise is to spread out jobs while also providing affordable housing, mostly for the middle middle class, without shunning the lower middle and lower classes, because the liberal middle class would most likely be concerned with such affairs as well.

Population density is a reflection of underlying conditions.

While increasing density seems to increase high-income households, does it also translate to more households escaping poverty or does it just mean that higher-income households tend to flock together in the city?

Which brings us to the important question: why do people flock to the city?

The key to this is employment. What our clustering data ultimately reveals is the difference between the Rural and the Urban when it comes to unemployment.

Rural: High unemployment rate. Further research points to a lack of jobs.

Urban: Lower unemployment rate but more of the unemployed. The unemployed move to the city in hopes of a better life. Poverty also leads to a lack of education and thus difficulty finding employment.

While urban sprawl, which is the spread of population outwards to suburban areas, does reduce excessive inflationary pressures in the housing market, it creates a transportation problem, as people in the suburbs still working in the city have to travel longer distances. The lack of public transportation increases private vehicle use, and thus also carbon emissions. In other words, making more people live in the suburbs simply transfers costs to transportation.

Promoting and encouraging new industrial activity and thus job creation outside the current centers (Seattle area and Spokane) is therefore the better option in the long run.

As mentioned earlier, this has to be backed up by a capable work force. Ultimately, the goal is to decrease poverty by combating unemployment and improving income through education and social support.

Rural Washington needs special attention.

Strategy: Targeted Campaigning

Akin to the earlier targeted campaigning strategy, the strategy is to avoid major changes in industry and infrastructure, and instead focus on appeasing the existing segments in society.

This strategy utilizes dichotomies within our primary dichotomy.

1. Rural
    a. Rural upper class
    b. Rural working class
2. Urban
    a. Urban upper class
    b. Urban working class

Let constituents self-identify to which category they feel they belong. What's important is that there are satisfying promises ready for each niche. Also note that socialist policies that help the lower class may also please the liberal middle class, as liberal attitudes towards poverty suggest.

The premise is to cater to different needs of different segments of the population. The campaign platform should be able to cater to both the poor and the affluent by identifying that they have different needs.



Plays on the growing discourse of Urban Intellectuals versus Rural Morals.


  • Lack of data means that some factors like political differences are not taken into account.
  • Government might find roadblocks to policy not because the policy is not sound but because some other non-economic issue is being leveraged or taken into account in the power struggle.



To do or not to do?


So, why is considering population density in economic policy analyses of utmost importance?


Mean Income and its Correlations


Before diving into the importance of density, let's take a look at factors that correlate with income.

Let us now isolate population density and its relationship to other factors.


Population Density and its Correlations


Pearson r correlation: Density vs. Poverty and Unemployment Rates


Density and poverty rate have a rather weak positive (+) correlation.

Density and unemployment rate have a very weak negative (-) correlation. (-0.0260)

Scatter plots confirm this visually: our GEOIDs barely follow the best-fit lines.
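
For reference, the Pearson correlation coefficient can be computed with `np.corrcoef`. The sketch below uses hypothetical, independently generated samples, so a value close to zero is expected by construction; it says nothing about the actual data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(42)

# Hypothetical samples: density is heavily right-skewed, and the poverty rate
# is drawn independently of it.
density = rng.lognormal(mean=5, sigma=2, size=500)
poverty_rate = rng.uniform(0, 40, size=500)

r = pearson_r(density, poverty_rate)
print(round(r, 4))

# Sanity check: a perfectly linear relationship gives r = 1.
print(pearson_r(density, 2 * density + 1))
```

Keep in mind that Pearson's r only measures linear association, so a weak r does not by itself rule out a non-linear relationship.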


Linear Regression Predictor: Density and Mean Income


Not only is the R-squared low, it is negative. This means that the model fits worse than a horizontal line at the mean. Density alone is a very poor predictor of income.
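
To see how R-squared can drop below zero, recall that it is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow. A negative value typically appears when a model is scored on data it was not fitted to; whether that was the setup here is an assumption.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean line
    return 1 - ss_res / ss_tot

# Hypothetical incomes (in thousands of dollars).
y_true = np.array([40.0, 50.0, 60.0])

baseline = np.full(3, y_true.mean())    # horizontal line at the mean
bad_model = np.array([60.0, 50.0, 40.0])  # predicts the trend backwards

print(r_squared(y_true, baseline))   # 0.0
print(r_squared(y_true, bad_model))  # -3.0
```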


Comparing Low-density and High-density Cases


Null Hypothesis: Low-density and high-density cases have similar distributions.


Hacker Statistics: Permutation Test for Identical Distributions


Permutation Test Results:


The permutation test for distributions of poverty rates confirms that low-density and high-density GEOIDs have considerably different distributions.
(Reject Null Hypothesis)

The permutation test for distributions of unemployment rates finds no significant difference between low-density and high-density GEOIDs.
(Fail to reject Null Hypothesis)
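
A minimal sketch of a permutation test in the hacker-statistics style: pool the two samples, shuffle, split again, and see how often the shuffled difference is at least as large as the observed one. The difference of means as test statistic, the number of permutations, and the synthetic samples are all assumptions of this sketch.

```python
import numpy as np

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Under the null hypothesis the group labels are interchangeable.
        shuffled = rng.permutation(pooled)
        diffs[i] = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
    # Fraction of shuffles at least as extreme as what was observed.
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_value

rng = np.random.default_rng(1)

# Hypothetical poverty rates (%) for low-density and high-density GEOIDs.
low_density = rng.normal(12.0, 4.0, 200)
high_density = rng.normal(16.0, 4.0, 200)

observed, p_value = permutation_test(low_density, high_density)
print(observed, p_value)
```

A small p-value means that a difference this large is rarely produced by shuffling alone, which is the basis for rejecting the null hypothesis.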


Hacker Statistics: Bootstrap Test for Identical Distributions


Bootstrap Test Results:


The bootstrap test for distributions of poverty rates confirms that low-density and high-density GEOIDs have considerably different distributions.
(Reject Null Hypothesis)

The bootstrap test for distributions of unemployment rates finds no significant difference between low-density and high-density GEOIDs.
(Fail to reject Null Hypothesis)
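
A sketch of one common variant of the bootstrap test for identical distributions: resample the pooled data with replacement, split each resample into two groups of the original sizes, and compare the resulting differences with the observed one. The test statistic, the number of replicates, and the synthetic samples are assumptions, and the procedure actually used in this notebook may differ.

```python
import numpy as np

def bootstrap_test(a, b, n_boot=2000, seed=0):
    """Two-sided bootstrap test of identical distributions (difference of means)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Under the null hypothesis both groups come from the pooled distribution,
        # so draw a resample of it (with replacement) and split it in two.
        resampled = rng.choice(pooled, size=len(pooled), replace=True)
        diffs[i] = resampled[:len(a)].mean() - resampled[len(a):].mean()
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_value

rng = np.random.default_rng(1)

# Hypothetical poverty rates (%) for low-density and high-density GEOIDs.
low_density = rng.normal(12.0, 4.0, 200)
high_density = rng.normal(16.0, 4.0, 200)

observed, p_value = bootstrap_test(low_density, high_density)
print(observed, p_value)
```

The only difference from the permutation test is that values are drawn with replacement, so a bootstrap replicate may repeat some observations and omit others.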


Conclusion on Density

Instead, what density provides is context necessary to targeted policy-making.

Policy Implications: Why is density important, then?

Most notable are the effects on housing for the poor, which is explained next.

Closer Look: Densification and Housing

  • More households forming from a given population
  • Average size of households falling
  • Well-organized agglomerations with diverse activities increase productivity and improve competitiveness of the economy.
  • Extension and urban sprawl inefficiently add to demands for infrastructure services in outer areas and leave existing infrastructure underused.

Additional Data Recommendations


Time series data

Time series data would allow us to detect specific trends in Washington's history that lead to growth. Multivariate time series data, for example, would reveal links or dependencies between key factors or industries.

Data for measuring inequality

Data that allows the measurement of income inequality (Gini coefficient) should be made available in order to get a more detailed picture of the situation that would let us more directly tackle the problem of inequality.

Crime statistics

This would allow an informed plan to combat crime in a less reactionary manner. Paired with the rest of the data (especially time series data), it would help shed light on the deeper causes of crime.

Climate data

Economic development policy formulation can benefit greatly from climate data, as the feasibility of highly profitable industries can depend on an area's climate.

Housing market data

Housing market fluctuations heavily affect the less fortunate, as they are the ones least able to adapt to inflation.

Industry sector data + available jobs

Paired with geography, such data is essential to the formulation of a tangible economic plan.

How about Principal Component Analysis (PCA)?


Its goals are:

1. To look for properties that show as much variation across instances as possible
2. To look for properties that allow us to predict or "reconstruct" the original characteristics

In doing this, it effectively reduces statistical noise, or unexplained variations in the sample.
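
A small sketch of both goals, assuming scikit-learn's `PCA`. The synthetic data (six features driven by two hidden factors plus noise) is hypothetical and only meant to show the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two underlying factors, four features derived from them,
# and a little noise on everything.
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))])
X = X + rng.normal(scale=0.1, size=X.shape)

X_scaled = StandardScaler().fit_transform(X)

# Goal 1: find the directions along which the data varies the most.
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Goal 2: reconstruct the original features from the reduced representation.
X_restored = pca.inverse_transform(X_reduced)
reconstruction_error = np.mean((X_scaled - X_restored) ** 2)

print(X_reduced.shape, round(explained, 3), round(reconstruction_error, 4))
```

The variance that the retained components do not explain is what gets discarded as noise. When that discarded part carries the differences between groups, clustering on the reduced data can suffer.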

Why wasn't it used?


It practically made the clustered data useless.


  • East and West Washington have mostly different geographies, but the dichotomy remains more of a political divide between conservatives and liberals.
  • Unsupervised machine learning gives us two clusters that emphasize differences in density, unemployment and poverty when the East and West divide is ignored.
  • Our re-election campaign strategy can either take into heavy consideration the East and West political divides or prioritize density differences between the Urban and the Rural, while either promoting new centers of growth or reinforcing existing industries where they are currently located.
  • Isolating the more extreme cases of population density and comparing their unemployment and poverty statistics reveals inconsistencies with analyses that include all GEOIDs, which suggests that aggregated data hides trends.
  • The degree of population density itself does not provide a guide for achieving our targets (improving income, reducing poverty and unemployment). Instead, it serves as context, the knowledge of which is necessary for effective policy-making.
  • Policy-makers must consider the effects of expansion and density reduction caused by improvement of income, especially the adverse effects to the lower-income households.



Explanations and illustrations of machine learning techniques were from:

1. Datacamp
2. Data Science for Business by Foster Provost and Tom Fawcett

Tinsely, Karen and Bishop, Matt (2006). Poverty and Population Density: Implications for Economic Development Policy.

Whitehead, Christine. The Density Debate.