Motivation/Intuition

Sometimes you are in a situation where you want to fit a linear model, say a linear regression

$y_i=\beta_0+\beta_1 x_{i1}+...+\beta_p x_{ip} + \varepsilon_i$

where $\varepsilon_i$ is the error term, with $N$ observations and $p$ variables, but you have many more variables than observations, $p \gg N$. In this case the OLS estimator for $\beta$ is not unique. To see this, note that with $p > N$ the columns of $X$ are necessarily linearly dependent, so $X$ has a non-trivial null space: adding any null-space vector to a solution leaves the fitted values (the orthogonal projection of $y$ onto the column space of $X$) unchanged, so infinitely many coefficient vectors achieve the same fit.
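As a quick numerical illustration (a minimal sketch with arbitrary dimensions), when $p > N$ two different coefficient vectors can produce exactly the same fitted values:

import numpy as np

# p > N: the least squares solution is not unique
rng = np.random.default_rng(0)
N, p = 5, 8
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

beta_a = np.linalg.lstsq(X, y, rcond = None)[0]  # one least squares solution
_, _, Vt = np.linalg.svd(X)                      # rows of Vt beyond the rank span the null space of X
beta_b = beta_a + Vt[-1]                         # add a null-space vector: a different coefficient vector

print(np.allclose(X @ beta_a, X @ beta_b))       # True: identical fitted values
print(np.allclose(beta_a, beta_b))               # False: different coefficients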

One way of dealing with this problem is to add regularization to the model. Here we will take a look at the LASSO. (Most of the following information comes from (Hastie et al., 2015) and what I can remember from lectures I took.)

The lasso estimator is the solution to the following problem

$\hat{\beta}_{\text{Lasso}} = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2 N}\|Y-X\beta\|_2^2 + \lambda \|\beta\|_1$
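
where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the $\ell_1$ norm of the coefficients and $\|Y - X\beta\|_2^2 = \sum_{i=1}^{N} (y_i - x_i^\top \beta)^2$ is the residual sum of squares.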

The idea behind this type of penalization is that the first term wants to choose $\beta$ such that it best fits the data, while the second term wants to set $\beta = 0$. So there are two forces working against each other. $\lambda$ controls how expensive it is to have non-zero coefficients in the model. For $\lambda = 0$ the above reduces to the regular OLS estimator, and as $\lambda \to \infty$ we get $\beta \to 0$. More intuition can be drawn by looking at the constrained form of the problem (the penalized form above is often called the Lagrangian form)

$\min_{\beta \in \mathbb{R}^p} \|Y-X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq R$

So we want the $\beta$ which best fits the data, but we only have a budget to spend on non-zero coefficients. Note the inverse relationship between $R$ and $\lambda$: increasing $R$ allows more coefficients to be non-zero, while increasing $\lambda$ does the opposite.
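To make the effect of $\lambda$ concrete, here is a small self-contained sketch (with synthetic data, unrelated to the data used later in this post) that fits sklearn's Lasso for a few penalty values and counts the non-zero coefficients:

import numpy as np
from sklearn.linear_model import Lasso

# sparse ground truth: only 5 of 50 coefficients are non-zero
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 50))
beta_true = np.zeros(50)
beta_true[:5] = 3.0
y = X @ beta_true + rng.standard_normal(100)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha = alpha, fit_intercept = False).fit(X, y)
    print(alpha, np.sum(model.coef_ != 0))  # larger penalties leave fewer non-zero coefficients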

When I first saw this, with my economics background it reminded me of the household problem of maximizing the utility from consuming two goods subject to a budget constraint.


Not differentiable at 0

The optimization poses a challenge since the penalty term contains absolute value functions, which are not differentiable at $0$; in other words, we cannot find a unique tangent line at the point $(0, 0)$.

Introducing subgradients

Fortunately, for convex functions we can extend the notion of a derivative. But first let's have a look at the derivative of the absolute value function where it exists.

We see that for $x<0$ the derivative is simply $-1$ and for $x>0$ it is $1$.

Note that all the red points in the left plot correspond to a tangent of the absolute value function.

Not surprisingly, the derivative is $-1$ for $x < 0$ and $1$ for $x > 0$, and we also see the problematic part at $x=0$. The idea now is to define the subgradient of the absolute value function at the point $x=0$ to be the whole interval $[-1, 1]$.
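Collecting these cases, the subdifferential (the set of all subgradients) of the absolute value function is

$\partial |x| = \begin{cases} \{-1\} & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ \{+1\} & \text{if } x > 0 \end{cases}$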

Now, instead of characterizing an interior minimum of a differentiable convex function $f$ by requiring $f'(x^*)=0$, we require that $0$ is contained in the subdifferential, $0 \in \partial f(x^*)$, which gives optimality conditions that depend on the value of $x^*$.

Going back to our objective function (assuming a single variable, $p=1$), we first write down the optimality condition for $\beta < 0$:

$-\frac{1}{N} \sum_i y_i x_i + \frac{1}{N} \sum_i x_i^2 \beta - \lambda = 0$

Without loss of generality assume that the data has been standardized, in other words $\bar{x} = 0$ and $\frac{1}{N} \sum_i (x_i - \bar{x})^2 = 1$, and hence $\sum_i x_i^2 = N$. If $\beta < 0$ the optimality condition therefore becomes

$\beta = \frac{1}{N} \sum_i y_i x_i + \lambda$

which is consistent with $\beta < 0$ only if $\frac{1}{N} \sum_i y_i x_i < -\lambda$. Similarly, for $\beta > 0$ we get

$\beta = \frac{1}{N} \sum_i y_i x_i - \lambda \quad$ if $\quad \frac{1}{N} \sum_i y_i x_i > \lambda$
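If neither condition holds, i.e. $\left| \frac{1}{N} \sum_i y_i x_i \right| \leq \lambda$, then $0$ lies in the subdifferential of the objective at $\beta = 0$ and the solution is exactly zero. Putting the three cases together gives the soft-thresholding operator, which is exactly what the coordinate-wise updates in the code below apply:

$\hat{\beta} = S_\lambda\left( \frac{1}{N} \sum_i y_i x_i \right), \qquad S_\lambda(z) = \operatorname{sign}(z)\,(|z| - \lambda)_+$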

import numpy as np
import matplotlib.pyplot as plt
import gif
from itertools import cycle

# sparse ground truth: 100 zero coefficients and 10 non-zero ones, in random order
beta = np.concatenate([np.zeros(100), np.random.randn(10)])
np.random.shuffle(beta)

# simulate N = 200 observations on p = 110 variables
X = np.random.randn(200, 110)
Y = X @ beta + np.random.randn(200)

# random starting point for coordinate descent and the penalty parameter
beta_est = np.random.randn(110)
lam = 1


# one animation frame: scatter plot of the current coefficient estimates
@gif.frame
def my_func(i):
    fig, axes = plt.subplots(figsize = (20, 10), tight_layout = True)
    axes.plot(np.arange(0, 110), beta_est, 'o', c = 'black')
    axes.set_xlim([0, 112])
    axes.set_ylim([-2, 2])
    
frames = []
for i in cycle(range(0, 110)):

    # remember the estimate at the start of each full sweep
    if i == 0:
        beta_compare = beta_est.copy()

    # coordinate-wise update: soft-threshold the partial-residual correlation
    # (soft_threshold_v2 is defined in the Helper Functions section below)
    beta_est[i] = soft_threshold_v2(beta_est[i] + 1 / 200 * np.dot(X[:, i], Y - X @ beta_est), lam)

    # if the coefficient was already zero at the start of the sweep there is
    # no need to add a new frame -> more efficient
    if beta_compare[i] != 0:
        frame = my_func(i)
        frames.append(frame)

    # after a full sweep, check how much the coefficients changed and stop once
    # the total absolute change is negligible
    if i == 109:
        print(np.sum(np.abs(beta_compare - beta_est)))
        if np.sum(np.abs(beta_compare - beta_est)) < 0.000005:
            break

gif.save(frames, "sine-wave.gif", duration = 400)
# the printed values below are the total absolute change in the coefficients per sweep
85.58810250821061
2.171925192436089
0.07008592562616456
0.00436933197735434
0.0003931219394626151
2.4317437148146936e-05
2.3642846109783733e-06

Comparison with sklearn's lasso implementation

Let's see if our algorithm gives the same result as sklearn. Note that sklearn's Lasso minimizes the same $\frac{1}{2N}\|Y - X\beta\|_2^2 + \alpha \|\beta\|_1$ objective, so its alpha plays the role of our $\lambda$.

from sklearn.linear_model import Lasso

model = Lasso(alpha = 1, fit_intercept=False).fit(X, Y)
beta_my = lasso_solver(Y, X, 1)
np.allclose(model.coef_, beta_my)
True

Example of finding a minimum of a non-differentiable convex function

Let's find the minimum of $(x -1) ^2 + |x|$

Note the non-differentiable point (kink) at $x=0$, and that by visual inspection we see that the minimum is at $x=0.5$.

The derivative (for $x \neq 0$) is given by

$2x - 2 + \operatorname{sign}(x)$

Let's investigate now: maybe the minimum is at a point where $x < 0$? If so, it has to hold that $2x - 2 - 1 = 0 \Rightarrow x = \frac{3}{2}$, which contradicts our assumption that $x < 0$.

What about $x=0$? There the subgradient takes on values in $[2x - 3, 2x - 1] = [-3, -1]$. Is a tangent with slope $0$ in this interval? Clearly not: requiring $0 \in [2x - 3, 2x - 1]$ is equivalent to $\frac{1}{2} \leq x \leq \frac{3}{2}$, which contradicts our assumption that the minimum is at $x=0$.

So the last case is $x>0$, which translates to $2x - 1 = 0 \Rightarrow x = \frac{1}{2} > 0$, giving rise to the observed minimum.

(Figure: "Gradient/s" of $(x-1)^2 + |x|$)
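As a quick numerical sanity check, evaluating the function on a fine grid confirms the location of the minimum:

import numpy as np

# evaluate f(x) = (x - 1)^2 + |x| on a fine grid and locate its minimum
xs = np.linspace(-2, 2, 100001)
fs = (xs - 1) ** 2 + np.abs(xs)
print(xs[np.argmin(fs)])  # ~0.5, matching the subgradient argument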

Plot of the soft thresholding operator
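
The plot is easy to reproduce; here is a minimal sketch using the soft_threshold_v2 helper from the Helper Functions section (the grid and $\lambda = 1$ are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-3, 3, 300)
# soft_threshold_v2 works on scalars, so apply it elementwise
ys = [soft_threshold_v2(x, 1) for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, label = r'soft threshold, $\lambda = 1$')
ax.plot(xs, xs, '--', label = 'identity')
ax.legend()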

Lasso as a constrained problem

# generate data
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
beta = np.array([4, 1])
X = np.random.randn(100, 2)
y = X @ beta + np.random.randn(100)

grid = np.linspace(-5, 10, 100)
XX, YY = np.meshgrid(grid, grid)

# computations
# (ols_constraint and lasso_constraint are defined in the Helper Functions section below)
ll = []
lasso_list = []
for i in grid:

  for j in grid:

    ll.append(ols_constraint(i, j))
    lasso_list.append(lasso_constraint(i, j))

DD = np.array(ll).reshape(100, 100)
ZZ = np.array(lasso_list).reshape(100, 100)


# plotting
fig, axes = plt.subplots()

# sort the RSS values and pick a handful of increasing levels for the contour lines
AA = DD.flatten()
AA.sort()

# big plot: RSS contours plus one contour of the l1 ball
# note: with this loop order the x-axis shows beta_2 and the y-axis beta_1,
# so the coefficients beta = (4, 1) appear at the plotted point (1, 4)
axes.contour(XX, YY, DD, AA[np.arange(100, 2000, 300)])
axes.contour(XX, YY, ZZ, lasso_constraint(grid, grid)[35:36])
axes.annotate('OLS solution', [1, 4], [6, 0], arrowprops = {})
axes.scatter(1, 4, c = 'red')

# small plot: zoom in on the corner of the l1 ball where the lasso solution sits
axins = axes.inset_axes([-1, -1, 1, 1])
axins.contour(XX, YY, ZZ, lasso_constraint(grid, grid)[35:36])
axins.annotate("Lasso Solution", [0, 0.6], [-1, .5], arrowprops = {})
axins.contour(XX, YY, DD, AA[1690:1691])
axins.set_xlim([-1, 1])
axins.set_ylim([-1, 1])
axes.indicate_inset_zoom(axins);

Solution Path
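
One way to visualize the solution path is to solve the problem on a decreasing grid of penalties, warm-starting each fit at the previous solution; the lasso_solver defined below supports this through its warm_start argument. A minimal sketch, reusing the two-variable data generated above (the $\lambda$ grid is an arbitrary choice):

import numpy as np
import matplotlib.pyplot as plt

# decreasing penalty grid; warm starts make each successive fit cheap
lambdas = np.linspace(3, 0.05, 40)
path = []
beta_ws = np.zeros(X.shape[1])

for lam in lambdas:
    beta_ws = lasso_solver(y, X, lam, warm_start = beta_ws)
    path.append(beta_ws.copy())

path = np.array(path)

fig, ax = plt.subplots()
ax.plot(lambdas, path[:, 0], label = r'$\beta_1$')
ax.plot(lambdas, path[:, 1], label = r'$\beta_2$')
ax.set_xlabel(r'$\lambda$')
ax.set_ylabel('coefficient value')
ax.legend()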

Helper Functions

# l1 norm of the coefficients (defines the lasso constraint region)
def lasso_constraint(beta1, beta2):
  return np.abs(beta1) + np.abs(beta2)


# mean squared error of the two-variable example (uses the global y and X)
def ols_constraint(beta_1, beta_2):
  beta11 = np.array([beta_1, beta_2])
  return 1 / 100 * (np.sum((y - X @ beta11) ** 2))


# Cobb-Douglas utility and budget line for the household-problem analogy plot
def util(x, y):
  return x ** 0.3 * y ** 0.7


def budget(x):
  return - 2.7 * x + 1


# cyclic coordinate descent for the lasso (assumes roughly standardized columns of X)
def lasso_solver(y, X, lambda_param, warm_start = False):

  p = X.shape[1]
  N = X.shape[0]

  # start from a supplied warm start if given, otherwise from random values
  if isinstance(warm_start, np.ndarray):
    beta_est = warm_start
  else:
    beta_est = np.random.randn(p)

  for i in cycle(range(0, p)):

      # remember the estimate at the start of each full sweep
      if i == 0:
          beta_compare = beta_est.copy()

      # soft-threshold the partial-residual correlation of coordinate i
      beta_est[i] = soft_threshold_v2(beta_est[i] + 1 / N * np.dot(X[:, i], y - X @ beta_est), lambda_param)

      # after a full sweep, stop once the coefficients barely change
      if i == p - 1:
          if np.sum(np.abs(beta_compare - beta_est)) < 0.000005:
              return beta_est


# soft-thresholding operator: sign(x) * max(|x| - lambda, 0)
def soft_threshold_v2(x, lambda_param):
    return np.sign(x) * (np.abs(x) - lambda_param) if (np.abs(x) - lambda_param) > 0 else 0

  1. Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: the lasso and generalizations. CRC press.