
Chapter 10: Statistical Models as Simplifications

In the previous chapter, we established two key ideas:

However, this leads to a practical problem:

So, in this chapter we will address a core idea in statistics:

Why do we need models

Consider a real world system:

The true data-generating process is:

In most cases, the true distribution is too complex to describe fully.

This leads to a fundamental constraint:

The idea of a Model

A statistical model is a simplified description of how data is generated.

Instead of modeling everything, we assume a structure.

Example:

import numpy as np

# True process (unknown in practice)
x = np.random.normal(0, 1, 1000)
y = 3 * x + 2 + np.random.normal(0, 1, 1000)

We do not know the exact mechanism; instead, we assume a model:

$y \approx \beta_0 + \beta_1 x$

This is a linear model.

Models as Approximations

A model is not reality; it is an approximation. **A model captures part of the structure while ignoring the rest.**

For example, each observation can be decomposed into a part the model captures and a part it ignores:

$y = \beta_0 + \beta_1 x + \varepsilon$

where $\varepsilon$ is the noise the model does not explain. This decomposition is fundamental.
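The decomposition can be checked numerically on simulated data. A minimal sketch (the seed and the coefficients 3 and 2 are arbitrary choices, matching the simulation above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
y = 3 * x + 2 + rng.normal(0, 1, 1000)   # structure plus noise

y_model = 3 * x + 2        # the part the model captures
residual = y - y_model     # the part the model ignores

# Residuals behave like the noise: mean near 0, standard deviation near 1
print(round(residual.mean(), 2), round(residual.std(), 2))
```

The residuals recover exactly the noise term, which is what "ignoring the rest" means here.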

Bias-Variance trade-off

When we build models, we face a trade-off:

Interpretation:

Key Idea:

This trade-off is central to both statistics and machine learning.
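A rough numerical sketch of the trade-off, using only `numpy.polyfit` (the degrees, sample sizes, and seed are arbitrary choices): a high-degree polynomial drives the training error down, but often does worse on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-2, 2, 30)
y_train = 3 * x_train + 2 + rng.normal(0, 1, 30)
x_test = rng.uniform(-2, 2, 30)
y_test = 3 * x_test + 2 + rng.normal(0, 1, 30)

errors = {}
for degree in (1, 15):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

The degree-15 fit always achieves a training error at most that of the degree-1 fit, since its basis contains the linear one; the interesting question is what happens on the test set.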

Example: Fitting a Simple Model

Let us fit a linear model:

from sklearn.linear_model import LinearRegression

X = x.reshape(-1, 1)   # sklearn expects a 2-D array of shape (n_samples, n_features)
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)

This produces an estimate of the underlying relationship. However, it does not recover the true process exactly; it only approximates it based on observed data.

What does a Model actually learn

This is a critical conceptual point.

A model does not learn:

Instead, it learns:

**A structured approximation of the relationship between variables.**

For regression:

$\text{model} \approx E[Y \mid X]$

For classification:

$\text{model} \approx P(Y \mid X)$

These are properties of the distribution, not arbitrary constructs.
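The claim that regression targets $E[Y \mid X]$ can be checked empirically: average $y$ over a narrow window of $x$ and compare with the true conditional mean. A sketch on simulated data (the window width and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 100_000)
y = 3 * x + 2 + rng.normal(0, 1, 100_000)

# Empirical E[Y | X near 0.5]: average y over a narrow window around x = 0.5
window = (x > 0.45) & (x < 0.55)
empirical_mean = y[window].mean()
print(round(empirical_mean, 2))   # true conditional mean is 3 * 0.5 + 2 = 3.5
```

The average exists in the distribution itself, before any model is chosen; a regression model is just a structured way of estimating it everywhere at once.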

Underfitting & Overfitting

We can illustrate model complexity:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

model_simple = LinearRegression()
model_complex = make_pipeline(PolynomialFeatures(5), LinearRegression())

model_simple.fit(X, y)
model_complex.fit(X, y)

print(model_simple.coef_, model_simple.intercept_)
print(model_complex[-1].coef_, model_complex[-1].intercept_)

Hence:

The goal is not a perfect fit, but a useful approximation.

Models as compression

A powerful way to think about a model: a model compresses data into a smaller set of parameters. Instead of storing 1000 observations, we store just a few numbers (for the linear model above, a slope and an intercept).

This compression captures the essential structure.
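The compression view can be sketched with numpy alone: 1000 (x, y) pairs are summarized by two fitted numbers, which reconstruct y up to the noise (simulated data; the true coefficients 3 and 2 and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000)
y = 3 * x + 2 + rng.normal(0, 1, 1000)

slope, intercept = np.polyfit(x, y, 1)   # 1000 observations -> 2 parameters
y_hat = slope * x + intercept
rmse = np.sqrt(np.mean((y - y_hat) ** 2))

# slope near 3, intercept near 2, rmse near 1 (the noise level)
print(round(slope, 2), round(intercept, 2), round(rmse, 2))
```

The reconstruction error settles at the noise level: the two parameters capture all the structure there is, and the rest is irreducible.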

Why this matters in Machine Learning

Machine Learning (ML) extends this idea:

All share the same principle: