Gary King G2001 Course Notes
My notes on his video lecture series.
- 2. Statistical Models
- Probability
- 8. Statistical Simulation
- 10. More on Simulation
- Helper Functions
- Plot for the Blog Post
- Sources
- References
Checking normality of the dependent variable of a linear model with a histogram is a bad idea
I always believed that the way you end up with a linear model is to plot the distribution of $y$, see that it follows a normal distribution (or one that is approximately normal after a transformation, e.g. a log transform), and then use the linear model. As it turns out, this is not quite right. Take the case where $y$ is the wage and $X$ contains only a dummy for gender: the marginal distribution of $y$ is probably bimodal because of the wage gap. Only if we condition on gender is $y$ normal (or, in the case of wages, transformable to normal):
$y \mid X = \text{women} \sim N(\mu_{\text{women}}, \sigma^2)$
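A quick sketch of why the marginal histogram misleads here, with invented numbers for the two group means:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# hypothetical wages: two groups with different means (invented numbers)
wage_women = rng.normal(loc=15, scale=2, size=1000)
wage_men = rng.normal(loc=25, scale=2, size=1000)

fig, axes = plt.subplots(1, 2)
axes[0].hist(np.concatenate([wage_women, wage_men]), bins=40)
axes[0].set_title("marginal y: bimodal")
axes[1].hist(wage_women, bins=40)
axes[1].set_title("y | X = women: normal")
plt.show()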
General model description: the decomposition observed data = systematic component + stochastic (random) component
$y_i \sim f(\theta_i , \alpha)$ stochastic component
$\theta_i=g(x_i, \beta)$ systematic component
- $y_i$ is our dependent variable
- $f$ is the density of our assumed distribution for $y$
- $\theta_i$ is some feature of the distribution which we want to model. Often this is the mean, but it can be any moment of the distribution, say the variance or some quantile, which we model as dependent on our exogenous variables and which of course varies over the data (question to myself: how to model, say, mean and variance together?)
- $\alpha$ is a feature of the distribution which stays constant; think of $\sigma$ in the linear model
- $g$ is the link function
- $x_i$ is the exogenous data
- $\beta$ are the effect coefficients (a concrete instance of the decomposition is sketched right after this list)
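As a concrete instance of this decomposition (a minimal sketch with invented coefficients), take a Poisson model: $f$ is the Poisson density, $\theta_i = \lambda_i$ is the mean, there is no constant $\alpha$, and $g$ is the exponential inverse-link:

import numpy as np

rng = np.random.default_rng(1)
beta = np.array([0.5, 1.2])                # invented coefficients
x = np.column_stack([np.ones(500), rng.normal(size=500)])

lam = np.exp(x @ beta)                     # systematic component: theta_i = g(x_i, beta)
y = rng.poisson(lam)                       # stochastic component: y_i ~ f(theta_i)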
Probability
The most interesting thing from this lecture for me was the reminder that you can not only use kernel density estimation to estimate the pdf, but that you can actually sample from it, of course. He showed how to do this "by hand"; to save time I will use the built-in functions in Python:
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

data = np.random.normal(size=200)
sns.kdeplot(data)

# scipy: fit a Gaussian KDE and evaluate it on a grid
kde_scipy = gaussian_kde(data)
x = np.linspace(-2, 2, 200)
y = kde_scipy.evaluate(x)
plt.plot(x, y)

# draw new samples from the estimated density
sns.distplot(kde_scipy.resample(10000).ravel())

# sklearn: same idea; note it expects/returns (n_samples, n_features) arrays
kde_sklearn = KernelDensity()
kde_sklearn.fit(data.reshape(-1, 1))
sns.distplot(kde_sklearn.sample(10000).ravel())
8. Statistical Simulation
Starting from the general model specification
$y_i \sim f(\theta_i , \alpha)$ stochastic component
$\theta_i=g(x_i, \beta)$ systematic component
Say we found the MLE for $\beta$ and $\alpha$, stacked into $\hat{\gamma}$. For this we can estimate the variance $\hat{V}(\hat{\gamma})$, and we want to make inference for some data point $x_c$. We can approximate the asymptotic distribution of the estimator by $N[\hat{\gamma}, \hat{V}(\hat{\gamma})]$. From this we draw $\gamma_{sim} = (\beta_{sim}, \alpha_{sim})$ and calculate $\theta_{sim} = g(x_c, \beta_{sim})$. With $\theta_{sim}$ and the drawn $\alpha_{sim}$ we can then use the stochastic component to simulate $y$.
Example: simple linear model
$y_i \sim N(\mu_i, \sigma^2)$ where the systematic component is $\mu_i = \beta_0 + \beta_1 x_i$. The MLE gives estimates for $\beta_0$, $\beta_1$ and $\sigma^2$ and the corresponding covariance matrix of the estimates. Then draw the two betas and a sigma from the multivariate normal, calculate the mean for some $x$, and draw a $y$ from the normal with that mean and the simulated sigma. Do this, say, 10000 times and look at the resulting distribution of $y$: this is the uncertainty around the point estimate.
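A minimal sketch of this recipe on simulated data (OLS via statsmodels; as a shortcut I draw $\sigma^2$ from its exact scaled inverse-$\chi^2$ sampling distribution instead of stacking it into $\hat{\gamma}$):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# simulate a data set with known truth (invented values)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

beta_hat = fit.params                # (beta0_hat, beta1_hat)
V_hat = fit.cov_params()             # estimated covariance of beta_hat
df = fit.df_resid
sigma2_hat = fit.scale               # residual variance estimate

# simulate y at a chosen point x_c
x_c = np.array([1.0, 0.5])           # constant + x value of interest
n_sim = 10000
beta_sim = rng.multivariate_normal(beta_hat, V_hat, size=n_sim)
sigma2_sim = sigma2_hat * df / rng.chisquare(df, size=n_sim)  # uncertainty in sigma^2
mu_sim = beta_sim @ x_c                          # systematic component
y_sim = rng.normal(mu_sim, np.sqrt(sigma2_sim))  # stochastic component

print(np.percentile(y_sim, [2.5, 50, 97.5]))

The spread of y_sim combines the estimation uncertainty in the coefficients with the fundamental uncertainty from the normal draw, which is exactly the distinction the next section is about.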
10. More on Simulation
https://gking.harvard.edu/files/gking/files/making.pdf
- Simulated predicted value: fix $x_c$ and draw coefficients (estimation uncertainty), then draw $y$ from the model (fundamental uncertainty).
- Simulated expected value: for one draw of the coefficients, draw $m$ predicted values and take their mean to wash out the fundamental uncertainty; repeat over many coefficient draws.
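A side-by-side sketch of the two algorithms, self-contained with invented estimates standing in for an actual MLE fit:

import numpy as np

rng = np.random.default_rng(3)

# pretend these came from an MLE fit (invented values)
beta_hat = np.array([1.0, 2.0])
V_hat = np.array([[0.02, 0.0], [0.0, 0.03]])
sigma_hat = 1.5
x_c = np.array([1.0, 0.5])

n_sim, m = 1000, 100

# predicted values: one coefficient draw, one draw from the model
beta_sim = rng.multivariate_normal(beta_hat, V_hat, size=n_sim)
y_pred = rng.normal(beta_sim @ x_c, sigma_hat)      # both uncertainties

# expected values: per coefficient draw, average m model draws
y_exp = np.array([
    rng.normal(b @ x_c, sigma_hat, size=m).mean()   # fundamental uncertainty averaged away
    for b in beta_sim
])

print(y_pred.std(), y_exp.std())                    # expected values vary much less

The expected values vary much less because the model noise has been averaged away; what remains is (mostly) the estimation uncertainty in the coefficients.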