Read in the data


Read in the surveys


Add DBN columns
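
A minimal sketch of this step, assuming a dataset that stores the district number and the school code in separate columns. The column names `CSD` and `SCHOOL CODE`, the helper column `padded_csd`, and the toy rows are assumptions made for illustration, not taken from the data itself.

```python
import pandas as pd

# Toy stand-in for a dataset without a DBN column.
# "CSD" and "SCHOOL CODE" are assumed column names.
class_size = pd.DataFrame({
    "CSD": [1, 2, 31],
    "SCHOOL CODE": ["M015", "M047", "R080"],
})

# Assumption: a DBN is the two-digit district number followed by the school code,
# so single-digit districts need a leading zero.
class_size["padded_csd"] = class_size["CSD"].astype(str).str.zfill(2)
class_size["DBN"] = class_size["padded_csd"] + class_size["SCHOOL CODE"]

print(class_size[["CSD", "SCHOOL CODE", "DBN"]])
```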


Convert columns to numeric
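
One way this conversion might look, sketched on made-up rows. The three section column names and the `"s"` placeholder for suppressed values are assumptions; only `sat_score` is a name used later in this analysis.

```python
import pandas as pd

# Toy rows; scores arrive as strings, and "s" stands in for a suppressed value (assumed).
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448", "01M450"],
    "SAT Math Avg. Score": ["404", "423", "s"],
    "SAT Critical Reading Avg. Score": ["355", "383", "s"],
    "SAT Writing Avg. Score": ["363", "366", "s"],
})

score_cols = [
    "SAT Math Avg. Score",
    "SAT Critical Reading Avg. Score",
    "SAT Writing Avg. Score",
]

# errors="coerce" turns anything that is not a number into NaN instead of raising.
for col in score_cols:
    sat_results[col] = pd.to_numeric(sat_results[col], errors="coerce")

# skipna=False keeps the total missing when any section score is missing.
sat_results["sat_score"] = sat_results[score_cols].sum(axis=1, skipna=False)

print(sat_results[["DBN", "sat_score"]])
```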


Condense datasets


Convert AP scores to numeric


Combine the datasets
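
A hedged sketch of combining two of the datasets on `DBN`. The toy frames and the choice of a left join are assumptions; the join strategy actually used for each dataset may differ.

```python
import pandas as pd

# Toy frames keyed on DBN; values are invented for illustration.
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448", "02M047"],
    "sat_score": [1122.0, 1172.0, 1182.0],
})
demographics = pd.DataFrame({
    "DBN": ["01M292", "01M448"],
    "frl_percent": [88.6, 80.1],
})

# A left join keeps every school that has SAT results,
# leaving NaN where the other dataset has no matching DBN.
combined = sat_results.merge(demographics, on="DBN", how="left")

print(combined)
```

A left join preserves schools with SAT scores at the cost of missing values, while an inner join would drop unmatched schools instead.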


Add a school district column for mapping
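
As a sketch, the district can be taken from the first two characters of the DBN. The column name `school_dist` is a hypothetical name chosen for this example.

```python
import pandas as pd

combined = pd.DataFrame({"DBN": ["01M292", "02M047", "31R080"]})

# Assumption: the first two characters of a DBN are the school district number.
combined["school_dist"] = combined["DBN"].str[:2]

print(combined)
```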


New York City has published data on student SAT scores by high school, along with additional demographic data sets. Above, we combined the following data sets into a single, clean pandas dataframe:

  • SAT scores by school - SAT scores for each high school in New York City
  • School attendance - Attendance information for each school in New York City
  • Class size - Information on class size for each school
  • AP test results - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
  • Graduation outcomes - The percentage of students who graduated, and other outcome information
  • Demographics - Demographic information for each school
  • School survey - Surveys of parents, teachers, and students at each school

New York City has a significant immigrant population and is very diverse, so comparing demographic factors such as race, income, and gender with SAT scores is a good way to determine whether the SAT is a fair test. For example, if certain racial groups consistently perform better on the SAT, we would have some evidence that the SAT is unfair.

Find correlations
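
A minimal sketch of how the correlations with `sat_score` could be computed. The toy values are invented so the example is self-contained; they are not results from the real data.

```python
import pandas as pd

# Invented values; only the column names come from the analysis.
combined = pd.DataFrame({
    "sat_score": [1100, 1200, 1300, 1400, 1500],
    "white_per": [2, 5, 10, 20, 40],
    "frl_percent": [90, 80, 70, 50, 30],
})

# Pearson correlation of every column with sat_score, sorted from most negative
# to most positive; the trivial self-correlation is dropped.
correlations = combined.corr()["sat_score"].drop("sat_score").sort_values()

print(correlations)
```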


From the result above, the percentage of white students (the white_per column) has the strongest positive correlation with sat_score, followed by the percentage of Asian students (the asian_per column).

The strongest negative correlation is with the percentage of students eligible for the free and reduced-price lunch (FRL) program (the frl_percent column). The more FRL-eligible students a school has, the lower its average SAT score tends to be.

It seems almost undeniable that ethnicity and family wealth come into play, which is consistent with public opinion.

Less obvious is that saf_t_11 and saf_s_11, which measure how teachers and students perceive safety at school, also correlate highly with sat_score. Let's look into this further.

Plotting survey correlations


As you can see from the above, students at schools they perceive as safer generally do better on the SAT.

Average Safety Score by District
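
A sketch of the per-district average, assuming a district column named `school_dist` (a hypothetical name) alongside the student safety score `saf_s_11`. The rows are invented.

```python
import pandas as pd

combined = pd.DataFrame({
    "school_dist": ["01", "01", "02", "02"],
    "saf_s_11": [6.0, 7.0, 8.0, 6.5],
})

# Mean student-reported safety score for each district.
district_safety = combined.groupby("school_dist")["saf_s_11"].mean()

print(district_safety)
```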


From the above graph, we can see that schools with a higher percentage of white and Asian students tend to do better on the SAT, while the opposite is true for schools with a higher percentage of black and Hispanic students. We will investigate this further.

Let's look at the Hispanic case first.


As expected, there is a negative correlation. Let's look at the schools that have a high percentage of Hispanic students.
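
One way to pull out those schools is a boolean filter. The column names `hispanic_per` and `SCHOOL NAME`, the 95% cutoff, and the rows below are all assumptions for the sake of a runnable example.

```python
import pandas as pd

# Invented schools; "hispanic_per" mirrors the naming of white_per and asian_per.
combined = pd.DataFrame({
    "SCHOOL NAME": ["School A", "School B", "School C"],
    "hispanic_per": [99.8, 45.0, 96.7],
    "sat_score": [951.0, 1300.0, 1058.0],
})

# Keep only schools where more than 95% of students are Hispanic (assumed cutoff).
high_hispanic = combined[combined["hispanic_per"] > 95]

print(high_hispanic[["SCHOOL NAME", "hispanic_per", "sat_score"]])
```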


Many of them are schools that mainly serve immigrants.


Most of them are STEM schools, which are probably associated with higher school fees.



From the graph, schools with more female than male students do slightly better on the SAT.


It is not very obvious from the graph, but higher scores lean toward schools with a higher percentage of female students. However, all-girls schools do not do noticeably better than mixed-gender schools.


In the U.S., high school students take Advanced Placement (AP) exams to earn college credit. There are AP exams for many different subjects. Let's see whether the share of AP test takers correlates with SAT score.


There seems to be a correlation, but the data appear to split into two groups: an upper group with a strong correlation between the percentage of AP test takers and SAT score, and a lower group with no apparent correlation. The groups divide at an overall SAT score of around 1300.
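
A rough way to check this split is to compute the correlation separately on each side of the 1300 mark. The column name `ap_per` and the values below are assumptions, constructed only to show the mechanics.

```python
import pandas as pd

# Invented values; "ap_per" is a hypothetical column for the share of AP test takers.
combined = pd.DataFrame({
    "sat_score": [1000, 1100, 1200, 1250, 1400, 1600, 1800, 2000],
    "ap_per": [0.30, 0.10, 0.40, 0.20, 0.20, 0.40, 0.60, 0.80],
})

threshold = 1300  # approximate dividing line read off the scatter plot

upper = combined[combined["sat_score"] >= threshold]
lower = combined[combined["sat_score"] < threshold]

# Pearson correlation within each group.
upper_corr = upper["sat_score"].corr(upper["ap_per"])
lower_corr = lower["sat_score"].corr(lower["ap_per"])

print(f"upper group: {upper_corr:.2f}, lower group: {lower_corr:.2f}")
```

If the two coefficients differ sharply on the real data, that would support treating the groups separately rather than reporting a single overall correlation.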