*2018 CFA Level 2 Study Session 3 Reading 9* is an introduction to covariance, linear regression, and scatter plots. For ease I've broken it up into several parts, with part 1 exploring scatter plots and correlation coefficients. The math for this lesson can be handled using either NumPy or pandas, and the plotting with matplotlib. As the complexity of the examples increases, we bring in real-world data using pandas-datareader.

Note: to properly view this notebook, ensure that "show output" is selected from the dropdown, and if you want to see the wizard behind the curtain, click "show code".

**Scatter Plots**

Example 1-1 concerns **scatter plots**, pairing a money supply growth rate with an inflation rate, with the following data:

*(output not shown)*

With the resulting scatter plot:

*(output not shown)*

**Correlation analysis** describes the relationship between two data series using a *single number*, the **correlation coefficient**, measured on a scale from -1 to +1, with 0 indicating no linear relationship. The following graphs demonstrate this effect.

*(output not shown)*

Note that Figure 1-4 is purely random: there is no correlation, and the value of X tells us nothing about the value of Y.

**Calculating and Interpreting Correlation Coefficient / Covariance**

The formula for the sample **covariance** is:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

where: $n$ = sample size

$X_i$ = $i$th observation of variable X

$\bar{X}$ = mean of the observations of variable X

$Y_i$ = $i$th observation of variable Y

$\bar{Y}$ = mean of the observations of variable Y

On the **HP 12c calculator**, the correlation coefficient is generated when you use the $\hat{y},r$ key (followed by $x \leftrightarrows y$) to show the coefficient (see the HP 12c Platinum manual, pg. 97).

The formula for the (Pearson) **correlation coefficient** is given as follows:

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X s_Y}$$

where $s_X$ is the standard deviation (recall, the standard deviation is the square root of the variance) of $X$, and $s_Y$ is the standard deviation of $Y$.

In this way, you can multiply $r$ as calculated by your financial calculator by $s_X s_Y$ to arrive at the covariance.
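As a quick sanity check, here is a minimal NumPy sketch of that identity. The data values are made up for illustration (the Example 1-1 figures are not reproduced here):

```python
import numpy as np

# Hypothetical series (the Example 1-1 values are not reproduced here)
x = np.array([0.7, 1.2, 2.1, 2.9, 3.4, 4.0])
y = np.array([0.6, 1.0, 1.9, 3.1, 3.3, 4.2])

r = np.corrcoef(x, y)[0, 1]       # Pearson correlation coefficient
s_x = x.std(ddof=1)               # sample standard deviations
s_y = y.std(ddof=1)

cov_from_r = r * s_x * s_y        # back out the covariance from r
cov_direct = np.cov(x, y)[0, 1]   # sample covariance computed directly

print(np.isclose(cov_from_r, cov_direct))  # True
```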

Using **pandas**, calculating the covariance, variance, and correlation coefficient is straightforward, as each series has a method for each. Using the initial dataframe as an example:
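A sketch of those built-in methods, using hypothetical values standing in for the Example 1-1 money supply and inflation data:

```python
import pandas as pd

# Hypothetical stand-ins for the Example 1-1 money supply / inflation data
df = pd.DataFrame({
    "money_supply_growth": [0.7, 1.2, 2.1, 2.9, 3.4, 4.0],
    "inflation":           [0.6, 1.0, 1.9, 3.1, 3.3, 4.2],
})

cov = df["money_supply_growth"].cov(df["inflation"])    # sample covariance
var = df["money_supply_growth"].var()                   # sample variance (ddof=1)
corr = df["money_supply_growth"].corr(df["inflation"])  # Pearson by default

print(round(cov, 4), round(var, 4), round(corr, 4))
```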

The long way to do this, as described in the text, is to enumerate the table to calculate the cross products and the squared deviations...
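The long way might look like this; the column names are my own, and the data is the same hypothetical series as above:

```python
import pandas as pd

# Same hypothetical series; build the worksheet the text describes
df = pd.DataFrame({"x": [0.7, 1.2, 2.1, 2.9, 3.4, 4.0],
                   "y": [0.6, 1.0, 1.9, 3.1, 3.3, 4.2]})
n = len(df)

df["x_dev"] = df["x"] - df["x"].mean()    # deviations from the mean
df["y_dev"] = df["y"] - df["y"].mean()
df["cross"] = df["x_dev"] * df["y_dev"]   # cross products
df["x_sq"] = df["x_dev"] ** 2             # squared deviations
df["y_sq"] = df["y_dev"] ** 2

cov_xy = df["cross"].sum() / (n - 1)      # sample covariance
r = df["cross"].sum() / (df["x_sq"].sum() * df["y_sq"].sum()) ** 0.5

print(round(cov_xy, 4), round(r, 4))
```

Note the $(n-1)$ terms cancel in the correlation formula, so $r$ can be computed directly from the column sums.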

The (Pearson) correlation coefficient computed this way is

*(output not shown)*

still relatively easy to obtain, but unnecessary given pandas' built-in methods. It's important to note that the correlation coefficient formula given here by the CFAI is the Pearson correlation coefficient. Other options available to the pandas method are the Kendall tau correlation coefficient and the Spearman rank correlation coefficient.
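For illustration, the alternative `method` arguments to `Series.corr` (same hypothetical data as above):

```python
import pandas as pd

df = pd.DataFrame({"x": [0.7, 1.2, 2.1, 2.9, 3.4, 4.0],
                   "y": [0.6, 1.0, 1.9, 3.1, 3.3, 4.2]})

pearson = df["x"].corr(df["y"])                      # method="pearson" (default)
kendall = df["x"].corr(df["y"], method="kendall")    # Kendall tau
spearman = df["x"].corr(df["y"], method="spearman")  # Spearman rank

print(round(pearson, 3), round(kendall, 3), round(spearman, 3))
```

Because both series here are strictly increasing, the two rank-based measures come out at exactly 1.0 even though the Pearson coefficient does not.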

**Limitations**

- It's important to understand that correlation, measured this way, captures only *linear* relationships. For example, let's revisit Figure 1-2, but also look at power functions. In Figure 1-2, I used a random sequence from NumPy as the X variable, and a linear function of X for the Y column. Let's make Z a quadratic, using the equation $z = x^2$. For simplicity we will do this in a pandas dataframe:
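A sketch of how that dataframe might be built; the seed, sample size, and the coefficient in the linear equation are my own choices, not necessarily the author's:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # seed chosen for reproducibility
x = rng.uniform(-1, 1, 500)       # random X straddling zero

df = pd.DataFrame({"x": x})
df["y"] = 2 * df["x"]             # linear in x (coefficient is illustrative)
df["z"] = df["x"] ** 2            # the quadratic

print(round(df["x"].corr(df["y"]), 3))  # exactly linear, so r = 1.0
print(round(df["x"].corr(df["z"]), 3))  # near 0: Pearson's r misses the parabola
```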

*(output not shown)*

with the resulting graph:

*(output not shown)*

The Pearson correlation coefficient for x:z is expected to be low because a parabola is not easily explained by a line, as shown by a correlation coefficient near 0 (uncorrelated).

*Note:* if we take only the right or left half of the parabola, the correlation coefficient is quite good, as that half of the relationship appears somewhat linear. Consider price data: prices are always positive, so if you correlate real price levels, even a nonlinear relationship can produce a linear-looking correlation. On the other hand, if the dataset contains positive and negative values (such as daily or weekly price changes), the correlation with an even power function diminishes. For example, take only the right-side (positive) data:

*(output not shown)*
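A sketch of that comparison, with my own seed and sample size:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)    # arbitrary seed
x = rng.uniform(-1, 1, 500)
df = pd.DataFrame({"x": x, "z": x ** 2})

full = df["x"].corr(df["z"])      # whole parabola: near zero
pos = df[df["x"] > 0]             # right half only, as with price levels
right = pos["x"].corr(pos["z"])   # looks almost linear

print(round(full, 3), round(right, 3))
```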

Additionally, odd-numbered power functions are likely to return higher correlations, such as $y = x^3$:

*(output not shown)*
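A sketch with a cubic, again with an arbitrary seed of my own:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)    # arbitrary seed
x = rng.uniform(-1, 1, 500)
df = pd.DataFrame({"x": x, "cubic": x ** 3})

# An odd power is monotone, so a straight line fits it far better
r_cubic = df["x"].corr(df["cubic"])
print(round(r_cubic, 3))
```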

- The text also suggests that data with outliers are unlikely to correlate well, and uses a scatter plot of monthly S&P 500 returns and monthly US inflation rates, 1990-2013. This data is also fairly easy to pull into Python using "pandas-datareader". We will use SPY as a proxy for the S&P 500, with data provided by Google going back to 2001:

The data appears to have no linear relationship; however, the correlation coefficient implies a weak relationship. The outliers are probably giving more of an appearance of a linear trend than actually exists. The text goes on to hypothesize about possible relationships between stock market returns and CPI. I would add that since CPI is a lagging indicator, it might be more fruitful to lag the CPI by one month when correlating it with returns. One must be very cautious when analyzing datasets like these, however, as it is easy to fall prey to data-mining bias and other **spurious correlations**.

Even with random data, it's possible to find correlations that are meaningless. To illustrate this point, here is code that loops over random data until it finds a correlation greater than 0.7:
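A sketch of such a loop; the sample size of 20 is my own choice (smaller samples produce spurious hits sooner):

```python
import numpy as np

rng = np.random.default_rng()
n = 20          # small samples make spurious correlations easier to find
tries = 0

while True:
    tries += 1
    x = rng.standard_normal(n)   # pure noise
    y = rng.standard_normal(n)   # independent pure noise
    r = np.corrcoef(x, y)[0, 1]
    if abs(r) > 0.7:             # a strong-looking, meaningless correlation
        break

print(f"found |r| = {abs(r):.3f} after {tries} tries")
```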

**Testing the Significance of the Correlation Coefficient:**
We want to test whether our correlation is actually 0: the null hypothesis is $H_0: \rho = 0$ and our alternative hypothesis is $H_a: \rho \neq 0$. As we learned in Level 1, the t-test is a good way to evaluate the significance of a correlation, so for this we will use a two-tailed t-test statistic.

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

however, we aren't working with sample mean data in this case. The Institute provides this formula:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

which follows a t-distribution with $n - 2$ degrees of freedom.

*the origin of which is somewhat uncertain.*

The decision rule for using this formula is that we reject $H_0$ if $|t| > t_{\text{critical}}$.

Using this formula with our Example 1-4 data, we come up with a test statistic of:

And using a Student's t-table with $n - 2 = 18$ degrees of freedom, at the 0.05 threshold our t-critical is 2.1009. Since the test statistic is greater than the critical value, we can reject the null hypothesis of no correlation at the 5% significance level.
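A sketch of the computation; since the Example 1-4 correlation value is not reproduced here, the r = 0.60 below is purely illustrative:

```python
import math

def corr_t_stat(r, n):
    """t-statistic for H0: rho = 0; follows a t-distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# n = 20 gives df = 18, matching the t-critical of 2.1009 quoted above;
# r = 0.60 is illustrative (the Example 1-4 value is not shown here)
t = corr_t_stat(0.60, 20)
print(round(t, 4), abs(t) > 2.1009)
```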

The Institute notes that for a given r value, a sufficiently large sample should just meet the t-critical. Let's test this by modifying our random-data loop to reject the null hypothesis with a low correlation coefficient, something you wouldn't want to trade on. With a population of n = 32 (degrees of freedom = 30), the 5% two-tailed t-critical is 2.042, and a correlation coefficient of about 0.35 is just enough to clear it, so we can run the loop until our t-stat rejects the null hypothesis like so:
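A sketch of the modified loop, using the two-tailed 5% critical value for df = 30 (about 2.042):

```python
import math
import numpy as np

rng = np.random.default_rng()
n = 32                  # degrees of freedom = n - 2 = 30
t_crit = 2.042          # two-tailed 5% critical value for df = 30

while True:
    x = rng.standard_normal(n)   # independent random series
    y = rng.standard_normal(n)
    r = np.corrcoef(x, y)[0, 1]
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    if abs(t) > t_crit:          # noise that is 'statistically significant'
        break

print(f"r = {r:.3f}, t = {t:.3f}: null hypothesis rejected")
```

By construction, each trial has roughly a 5% chance of rejecting, so the loop terminates quickly and typically stops with |r| only a little above 0.35.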

As you can see, the scatter plot appears meaningless - but we've successfully rejected our null hypothesis!

This goes to show that you can lie with statistics. With our two examples we have found signals in the noise which are both 'statistically significant', based on two populations of random data.

These tools are a good starting point for evaluating relationships in datasets, but are limited and must be applied correctly to have meaning.

Going forward, it is important to remember that statistics is a very deep field of study, and Python contains a great wealth of statistics packages and modules. Be sure to check the assumptions of every module before you use it.

In part 2 we will look at linear regression!