import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
sns.set()

Motivation

The idea is that we have observed some data $D_n=\{(x_i, y_i)\}_{i=1}^n$, where $x_i$ contains the $p$ variables observed for observation $i$, and our model for the data is

$y_i=f(x_i) +\varepsilon_i$

where $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is some unknown function and $\varepsilon_i \sim N(0, \sigma^2)$.

The goal is to estimate $f$ with some function $\hat{f}$ such that $\hat{f}(x_i)$ is "close" to $y_i$ in some sense. More precisely, we want

$E\left[(f(X) - Y)^2\right]$

to be small. There are two components here that deserve discussion. The first is the expected value: we want the error to be small on average, which makes intuitive sense. The second is the quadratic distance measure. Why not use another norm, e.g. the absolute value? The reason is that the quadratic function is much nicer to work with than the absolute value (it is differentiable everywhere, for example). The next question to ask is whether there is a function $m^*$ which minimizes the above quantity, i.e.

$E\left[(m^*(X) - Y)^2\right] = \min_f E\left[(f(X) - Y)^2\right]$

One can show that the conditional mean $m^*(x) = E(Y \mid X=x)$ solves this minimization problem (a short derivation is sketched below), so our initial problem becomes

$y_i=m^*(x_i) + \varepsilon_i = E(Y| X=x_i) +\varepsilon_i$

This means $E(y_i \mid X = x_i) = m^*(x_i)$ and, viewing $X$ as random, $Var(Y) = Var(m^*(X)) + \sigma^2$.
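To see why the conditional mean is optimal, add and subtract $m^*(X)$ inside the square (see e.g. Györfi et al., 2006):

$E\left[(f(X) - Y)^2\right] = E\left[(f(X) - m^*(X))^2\right] + E\left[(m^*(X) - Y)^2\right]$

The cross term vanishes because, conditionally on $X$, the factor $m^*(X) - Y$ has mean zero by the very definition of $m^*$. The first term on the right is non-negative and equals zero exactly when $f = m^*$, so no other function can achieve a smaller expected squared error.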

Until now $m^*$ is a theoretical construct, since we do not observe the whole distribution but only a finite sample $D_n$. Hence we need some estimate $\hat{m}^*$, and the quantity we are now interested in is the error

$E\left[(\hat{m}^*(X) - Y)^2\right]$
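Assuming the noise in the new observation is independent of the sample $D_n$ used to construct $\hat{m}^*$, this error splits into a reducible and an irreducible part (James et al., 2013):

$E\left[(\hat{m}^*(X) - Y)^2\right] = E\left[(\hat{m}^*(X) - m^*(X))^2\right] + \sigma^2$

Only the first term can be made small by choosing a better estimate; the second term, the noise variance, is a lower bound on the achievable error. The simulation below draws data from one such model and plots it together with the true regression function.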

x = np.linspace(-1, 1, 200)

def f(x):
    """Piecewise-defined true regression function m*(x) on [-1, 1]."""
    y = x.copy()
    mask1 = (x >= -1) & (x < -.5)
    mask2 = (x >= -.5) & (x < 0)
    mask3 = (x >= 0) & (x < .5)
    mask4 = (x >= .5) & (x <= 1)

    y[mask1] = ((y[mask1] + 2) ** 2) / 2
    y[mask2] = y[mask2] / 2 + .875
    y[mask3] = - 5 * (y[mask3] - .2) ** 2 + 1.075
    y[mask4] = y[mask4] + .125

    return y

mean = f(x)
# np.random.normal expects the standard deviation (scale), not the variance
std = .2 - .1 * np.cos(2 * np.pi * x)
y = np.random.normal(mean, std, 200)

plt.scatter(x, y, alpha=.5, label='data')
plt.plot(x, f(x), color='r', label=r'$m^*(x)$')
plt.legend();
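Since only the sample is available in practice, here is a minimal sketch of one possible estimate $\hat{m}^*$ on the simulated data: a $k$-nearest-neighbour average, where $\hat{m}^*(x)$ is simply the mean of the $y_i$ whose $x_i$ lie closest to $x$. The estimator and the choice $k = 15$ are only illustrative assumptions, not part of the simulation above.

def knn_estimate(x_train, y_train, x_eval, k=15):
    """Local average: for each evaluation point, average the y of the k nearest x."""
    y_hat = np.empty_like(x_eval)
    for j, x0 in enumerate(x_eval):
        idx = np.argsort(np.abs(x_train - x0))[:k]  # indices of the k nearest x_i
        y_hat[j] = y_train[idx].mean()              # average their responses
    return y_hat

m_hat = knn_estimate(x, y, x)

plt.scatter(x, y, alpha=.5, label='data')
plt.plot(x, f(x), color='r', label=r'$m^*(x)$')
plt.plot(x, m_hat, color='k', label=r'$\hat{m}^*(x)$')
plt.legend();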


        

Goals

There are two reasons to perform statistical learning. The first is prediction: we want to use our estimated function $\hat{f}$ to make accurate predictions for new, possibly unseen data. The second is inference: here we are more concerned with how the explanatory variables influence $y$, so we may want to estimate a simple function which can be interpreted easily.

Classification

Bias Variance Tradeoff

Helper Functions

Plot for the Blog Post

Sources

References

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.
  2. Györfi, L., Kohler, M., Krzyzak, A., & Walk, H. (2006). A distribution-free theory of nonparametric regression. Springer Science & Business Media.