
Chapter 9: Covariance & Correlation

In the previous chapter we asked:

We answered:

However, we made one critical simplification:

The New Problem

Real data is almost never one-dimensional; instead, we observe multiple variables simultaneously:

This raises a new question:

If each variable has its own distribution, how do variables behave together?

The core question of this chapter

When multiple variables vary, what structure governs their relationship?

Observing Joint Variation

Let’s start with a simple example.

[Figure: scatter plot of two variables observed together]
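A minimal sketch of how such a pair of related variables might be generated and plotted — the coefficients and sample size here are assumptions for illustration, not the notebook's exact values:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# x varies on its own; y is built from x plus independent noise,
# so the two variables move together but not perfectly.
x = rng.normal(loc=0.0, scale=1.0, size=200)
y = 2.0 * x + rng.normal(loc=0.0, scale=1.0, size=200)

plt.scatter(x, y, alpha=0.6)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Two variables observed together")
plt.savefig("joint_variation.png")
```

The key feature of the plot is that knowing `x` tells us something about `y` — the point cloud is tilted, not a shapeless blob.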

We observe:

Why mean and variance are NOT ENOUGH

From Chapter 8:

But now we have two variables, and there are questions we cannot answer using only mean and variance:

This leads to a new concept:

We need a way to measure joint variation.

Introducing Covariance

We define the (sample) covariance of $x$ and $y$ as:

$$\operatorname{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Intuition:

print(np.cov(x,y))
[[1.0645908  2.12158412]
 [2.12158412 5.21164557]]

Conceptually:

If x is above its mean and y is also above its mean -> positive contribution.

If one is above and the other below -> negative contribution.
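This sign logic can be checked directly: averaging the products of deviations (with the $n-1$ divisor that `np.cov` uses by default) reproduces NumPy's result. The toy data below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 1.5 * x + rng.normal(size=500)

# Products of deviations: positive when x and y sit on the same
# side of their means, negative when they sit on opposite sides.
dev_products = (x - x.mean()) * (y - y.mean())
cov_manual = dev_products.sum() / (len(x) - 1)

# np.cov returns the full 2x2 covariance matrix; the off-diagonal
# entry is Cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]
print(cov_manual, cov_numpy)  # the two values agree
```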

Key interpretation:

Why covariance alone is NOT ENOUGH

Covariance depends on scale.

For example:

This makes raw covariance difficult to compare across variables or datasets.
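The scale dependence is easy to demonstrate: measuring the same quantity in different units rescales the covariance by the same factor, even though the underlying relationship has not changed. The height/weight data below is a made-up illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
height_m = rng.normal(1.7, 0.1, 300)   # heights in meters (toy data)
weight = 50 + 30 * (height_m - 1.6) + rng.normal(0, 5, 300)

height_cm = height_m * 100             # the same heights, in centimeters

# Changing units multiplies the covariance by the same factor (100)...
print(np.cov(height_m, weight)[0, 1])
print(np.cov(height_cm, weight)[0, 1])

# ...but leaves the correlation unchanged.
print(np.corrcoef(height_m, weight)[0, 1])
print(np.corrcoef(height_cm, weight)[0, 1])
```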

Correlation: A standardized measure

To address this, we normalize covariance by the two standard deviations:

$$\rho_{xy} = \frac{\operatorname{Cov}(x, y)}{\sigma_x \, \sigma_y}$$

Properties:

np.corrcoef(x,y)
array([[1.       , 0.9007027],
       [0.9007027, 1.       ]])

Interpretation:

Visual Interpretation of correlation

We can compare different relationships:

[Figure: three scatter plots showing relationships of different correlation strengths]
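One way to produce such a three-panel comparison — the specific relationship strengths shown here (strong positive, near zero, negative) are assumptions about what the original figure displayed:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)

# Three relationships of different strength and sign.
cases = {
    "strong positive": 2.0 * x + rng.normal(scale=0.5, size=n),
    "near zero": rng.normal(size=n),
    "negative": -1.5 * x + rng.normal(scale=1.0, size=n),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (label, y) in zip(axes, cases.items()):
    r = np.corrcoef(x, y)[0, 1]
    ax.scatter(x, y, alpha=0.5)
    ax.set_title(f"{label}: r = {r:.2f}")
fig.savefig("correlation_panels.png")
```

The correlation coefficient summarizes each panel with a single number in $[-1, 1]$, regardless of the scales of the axes.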

Observation:

From pairwise relationships to structure

So far, we have considered only two variables. In real datasets, we often have many.

We can compute a covariance matrix:

data = np.random.multivariate_normal(
    mean=[0, 0, 0],
    cov=[[1, 0.8, 0.5],
         [0.8, 1, 0.3],
         [0.5, 0.3, 1]],
    size=1000
)

np.cov(data.T)
array([[1.07101946, 0.87138028, 0.51823156],
       [0.87138028, 1.07009019, 0.35387897],
       [0.51823156, 0.35387897, 0.97318215]])

This matrix summarizes:
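Each off-diagonal entry is a pairwise covariance; the diagonal holds the variances. The same matrix can be standardized into a correlation matrix by dividing each entry by the product of the two standard deviations — a sketch reproducing the setup above (with a fresh random draw, so the numbers will differ slightly):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.multivariate_normal(
    mean=[0, 0, 0],
    cov=[[1, 0.8, 0.5],
         [0.8, 1, 0.3],
         [0.5, 0.3, 1]],
    size=1000,
)

cov = np.cov(data.T)

# Dividing entry (i, j) by std_i * std_j turns covariance
# into correlation; the diagonal becomes exactly 1.
std = np.sqrt(np.diag(cov))
corr_manual = cov / np.outer(std, std)

# np.corrcoef performs the same standardization in one call.
corr_numpy = np.corrcoef(data.T)
print(np.round(corr_manual, 3))
```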

Why this matters in Machine Learning

Many machine learning methods rely on relationships between variables. For example:
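One such method (the notebook's own examples are elided here): principal component analysis works entirely from the covariance matrix, extracting the directions along which the variables co-vary most strongly. A minimal sketch using the same three-variable setup:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.multivariate_normal(
    mean=[0, 0, 0],
    cov=[[1, 0.8, 0.5],
         [0.8, 1, 0.3],
         [0.5, 0.3, 1]],
    size=1000,
)

cov = np.cov(data.T)

# PCA: eigenvectors of the covariance matrix are the principal
# directions; eigenvalues are the variance captured along each.
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]       # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print(explained)
```

Because the three variables are strongly correlated, most of the total variance concentrates in the first principal direction — the structure in the covariance matrix is exactly what PCA exploits.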

This leads to broader insight:

Machine learning is not only about individual variables,
but about the structure formed by their relationships.

Summary

We return to the central question:

The answer is:

Statistics is not only about variation, but about structured variation.