
Checking normality of the dependent variable of a linear model with a histogram is a bad idea

I always believed that the process of ending up with a linear model goes like this: plot the distribution of $y$, see that it follows a normal distribution, or a distribution which is approximately normal after a transformation (e.g. a log transform), and then use the linear model. As it turns out this is not quite right. Take for example the case where $y$ is the wage and $X$ contains only a dummy for gender. Then the marginal distribution is probably bimodal because of the wage gap. Only if we condition on gender is $y$ normal, or, in the case of wages, transformable to normal!

$y_i \mid x_i = \text{woman} \sim N(\mu_i, \sigma^2)$
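A quick illustration of this point, with made-up numbers for the two group means:

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(1)

# made-up wages: two gender groups with different means (the "wage gap")
wage_women = rng.normal(loc=2500, scale=300, size=5000)
wage_men = rng.normal(loc=3200, scale=300, size=5000)

# the marginal distribution of y is bimodal ...
sns.kdeplot(np.concatenate([wage_women, wage_men]))
# ... but conditional on gender each distribution is normal
sns.kdeplot(wage_women)
sns.kdeplot(wage_men)
```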

General model description: the decomposition observed data = systematic component + stochastic/random component

$y_i \sim f(\theta_i , \alpha)$ stochastic component

$\theta_i=g(x_i, \beta)$ systematic component

  • $y_i$ is our dependent variable
  • $f$ is the density of our assumed distribution for $y$
  • $\theta_i$ is some feature of the distribution which we want to model. Often this is the mean, but it can be any moment of the distribution, say the variance, or some quantile. We model it as a function of our exogenous variables, so it of course varies over the data. (Question to myself: how would one model, say, the mean and the variance together?)
  • $\alpha$ is a feature of the distribution which stays constant; think of $\sigma$ in the linear model
  • $g$ is the link function
  • $x_i$ is the exogenous data
  • $\beta$ are the effect coefficients
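To make the decomposition concrete, here is a minimal sketch. I use a Poisson regression as the example, which is my own choice, not from the lecture; note that the Poisson has no constant feature $\alpha$, whereas in the linear model $\alpha$ would be $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(42)

# exogenous data and (assumed) true coefficients
n = 1000
x = rng.normal(size=n)
beta = np.array([0.5, 0.8])

# systematic component: theta_i = g(x_i, beta) = exp(beta_0 + beta_1 * x_i)
theta = np.exp(beta[0] + beta[1] * x)

# stochastic component: y_i ~ f(theta_i) = Poisson(theta_i)
y = rng.poisson(theta)
```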

Probability

The most interesting thing from this lecture for me was that it reminded me that you can not only use kernel density estimation to estimate the pdf, but that you can actually sample from it as well, of course. He showed how to do this "by hand"; to save time I will use the built-in functions in Python to do this.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# some data to estimate the density from
data = np.random.normal(size=200)
sns.kdeplot(data)
```
```python
# fit a Gaussian KDE with scipy and evaluate the estimated pdf on a grid
test = gaussian_kde(data)
x = np.linspace(-2, 2, 200)
y = test.evaluate(x)
plt.plot(x, y)

# resample() draws new observations from the estimated density;
# it returns an array of shape (1, n), hence the [0]
# (histplot replaces the now-deprecated distplot)
sns.histplot(test.resample(10000)[0], kde=True)
```
```python
# the same idea with scikit-learn: fit a KernelDensity model and sample from it
essti = KernelDensity()
essti.fit(data.reshape(-1, 1))
# sample() returns an (n, 1) array, hence the flattening
sns.histplot(essti.sample(10000)[:, 0], kde=True)
```

8. Statistical Simulation

Starting from the general model specification

$y_i \sim f(\theta_i , \alpha)$ stochastic component

$\theta_i=g(x_i, \beta)$ systematic component

Say we found the MLE estimates for $\beta$ and $\alpha$, stacked into $\hat{\gamma}$. For this we can estimate the variance $\hat{V}(\hat{\gamma})$, and we want to make inference for some data point $x_c$. We can approximate the asymptotic distribution of the estimates by $N[\hat{\gamma}, \hat{V}(\hat{\gamma})]$. From this we draw $\gamma_{sim} = (\beta_{sim}, \alpha_{sim})$ and calculate $\theta_{sim} = g(x_c, \beta_{sim})$; with this and the draw of $\alpha_{sim}$ we can use the stochastic component to simulate $y$.

Example: simple linear model

$y_i \sim N(\mu_i, \sigma^2)$, where the systematic component is $\mu_i = \beta_0 + \beta_1 x_i$. The MLE gives estimates for $\beta_0$, $\beta_1$ and $\sigma^2$, and the corresponding covariance matrix for the estimates. Then draw the two betas and a sigma from the multivariate normal, calculate the mean for some $x$, and with the calculated mean and the simulated sigma draw a $y$. Do this, say, 10000 times and look at the resulting distribution of $y$. This is the uncertainty around the point estimate.
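A minimal sketch of this procedure, assuming statsmodels is available. The data, the covariate value `x_c` and the number of draws are made up for illustration, and for simplicity I only draw the betas from their estimated asymptotic normal while holding $\sigma$ at its estimate; a fuller treatment would simulate $\sigma$ as well.

```python
import numpy as np
import seaborn as sns
import statsmodels.api as sm

rng = np.random.default_rng(0)

# made-up data from a known linear model
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# MLE (= OLS under normality) for the betas, plus sigma
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
beta_hat = fit.params            # point estimates (beta_0, beta_1)
V_hat = fit.cov_params()         # estimated covariance of the estimates
sigma_hat = np.sqrt(fit.scale)   # residual standard deviation

# estimation uncertainty: draw coefficients from N(beta_hat, V_hat)
n_sims = 10000
beta_sim = rng.multivariate_normal(beta_hat, V_hat, size=n_sims)

# systematic component at a chosen covariate value x_c
x_c = np.array([1.0, 0.5])       # intercept and x = 0.5
mu_sim = beta_sim @ x_c

# fundamental uncertainty: draw y from the stochastic component
y_sim = rng.normal(loc=mu_sim, scale=sigma_hat)
sns.histplot(y_sim)              # the uncertainty around the point estimate
```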

10. More on Simulation

https://gking.harvard.edu/files/gking/files/making.pdf (King, Tomz & Wittenberg, "Making the Most of Statistical Analyses")

  • To simulate predicted values: fix $x$ and draw the coefficients (estimation uncertainty), then draw from the model (fundamental uncertainty).

  • To simulate expected values: for one draw of the coefficients, draw $m$ predicted values and take their mean to wash out the fundamental uncertainty (see the sketch below).
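Continuing the linear-model sketch above, a minimal illustration of the expected-value recipe; `m` is arbitrary. In the linear model the inner mean just recovers $x_c' \beta_{sim}$, so the averaging only really matters for nonlinear models, but the mechanics are the same.

```python
# expected values: average m predicted values per coefficient draw
# to average away the fundamental uncertainty (continues the sketch above)
m = 100
expected_sim = np.array([
    rng.normal(loc=b @ x_c, scale=sigma_hat, size=m).mean()
    for b in beta_sim
])

# expected_sim varies only through estimation uncertainty;
# y_sim from above additionally contains the fundamental uncertainty
print(expected_sim.std(), y_sim.std())
```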

Sources

  • King, Gary, Michael Tomz, and Jason Wittenberg. 2000. "Making the Most of Statistical Analyses: Improving Interpretation and Presentation." American Journal of Political Science 44(2): 347–361. https://gking.harvard.edu/files/gking/files/making.pdf