QQ-Plot
My reference notebook for the quantile-quantile plot.
The QQ-plot is a commonly used tool for checking whether a sample can be modeled as if it were drawn from a normal distribution. Given a sample $X_1, \dots, X_n$, one first calculates the empirical quantiles $q_1, \dots, q_n$ and then plots them against the theoretical quantiles of the standard normal distribution. A linear relationship between these quantiles indicates that the sample comes from a normal distribution. This works because a linear transformation of a normally distributed random variable is again normally distributed. So if we assume that our data $Y$ was drawn from some normal distribution with mean $\mu$ and standard deviation $\sigma$, then the following relationship holds:
$X = \frac{Y-\mu}{\sigma} \iff \mu + \sigma X = Y$
where $X$ is standard normal. Since $\sigma > 0$, the map $x \mapsto \mu + \sigma x$ is increasing, so the quantiles satisfy the same linear relationship: the $p$-quantile of $Y$ equals $\mu$ plus $\sigma$ times the $p$-quantile of $X$.
Example: Consider a sample drawn from $N(3, 4^2)$, i.e. a normal distribution with mean 3 and standard deviation 4. One would expect the quantiles $Y$ of this normal distribution and the quantiles $X$ of the standard normal to follow approximately
$Y \approx 3 + 4X$
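The relationship can be checked numerically before plotting anything. The following is a minimal sketch, assuming that 4 is the standard deviation (the `scale` parameter in `scipy.stats.norm`); the probability grid `p` is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 3.0, 4.0                      # assumed: mean 3, standard deviation 4
p = np.linspace(0.01, 0.99, 9)            # arbitrary probability levels

z = norm.ppf(p)                           # quantiles of the standard normal
q = norm.ppf(p, loc=mu, scale=sigma)      # quantiles of the normal with mean mu, std sigma

# The exact quantiles lie on the line q = mu + sigma * z
print(np.allclose(q, mu + sigma * z))
```

For exact quantiles the relationship holds with equality; the approximation sign only comes into play once the quantiles are estimated from a finite sample.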
#collaps
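import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
fig, axes = plt.subplots()
# theoretical quantiles of the standard normal
norm1 = norm.ppf(np.linspace(0.01, .99, 100))
# empirical quantiles of a sample with mean 3 and standard deviation 4
norm2 = norm.rvs(size = 100, loc = 3, scale = 4)
norm2 = np.quantile(norm2, np.linspace(0.01, .99, 100))
axes.scatter(norm1, norm2)
# reference line y = x (what a standard normal sample would follow)
axes.plot(np.linspace(-2, 2), np.linspace(-2, 2), label = r"$y=x$")
# least-squares line through the quantile pairs
model = LinearRegression(fit_intercept = True).fit(norm1[:, np.newaxis], norm2)
axes.plot(np.linspace(-2, 2), model.intercept_ + model.coef_ * np.linspace(-2, 2), label = fr"$Y={model.intercept_.round(3)}+{model.coef_[0].round(2)} X$")
axes.set_xlabel("Theoretical Quantiles")
axes.set_ylabel("Empirical Quantiles")
axes.legend();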
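As a cross-check, SciPy offers `scipy.stats.probplot`, which pairs the ordered sample with theoretical quantiles and fits a least-squares line in one call. The sketch below is only an illustration and not the method used in this notebook: the seed and the sample size of 2000 are assumptions chosen to reduce noise, and `probplot` uses its own plotting positions rather than the evenly spaced grid from 0.01 to 0.99.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                     # fixed seed, an arbitrary choice
sample = rng.normal(loc=3, scale=4, size=2000)     # mean 3, standard deviation 4

# probplot returns the (theoretical, ordered sample) pairs and the fitted line
(theoretical, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

The fitted slope estimates $\sigma$ and the intercept estimates $\mu$, so for this sample they should land close to 4 and 3.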
To understand what a nonlinear relationship implies about the sample distribution, it helps to plot the sample distribution, the standard normal, and their quantiles side by side. See the plot below.
#hide-collaps
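import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import ConnectionPatch
from scipy.stats import gamma, norm
from sklearn.linear_model import LinearRegression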
fig, axes = plt.subplots(2,2, tight_layout = False, sharex = False, sharey = False, figsize = (10, 10))
fig.subplots_adjust(wspace = 0, hspace = 0)
empirical_data = gamma.rvs(size = 100, a = 1)
empirical_data_quantiles = np.quantile(empirical_data, np.linspace(0.01, .99, 10))
theoretical_quantiles = norm.ppf(np.linspace(0.01, .99, 10))
axes[0, 1].scatter(theoretical_quantiles, empirical_data_quantiles)
x = np.linspace(-3, 3, 100)
y = norm.pdf(x)
axes[1,1].plot(x, y)
# connect each quantile pair across the panels
for emp, the in zip(empirical_data_quantiles, theoretical_quantiles):
    # vertical line from the normal density panel up to the scatter point
    con = ConnectionPatch(xyA = (the, 0), xyB = (the, emp),
                          axesA = axes[1, 1], axesB = axes[0,1],
                          coordsA = 'data', coordsB = 'data',
                          color = 'red', alpha = .5)
    fig.add_artist(con)
    # horizontal line from the sample distribution panel to the scatter point
    con = ConnectionPatch(xyA = (0, emp), xyB = (the, emp),
                          axesA = axes[0, 0], axesB = axes[0,1],
                          coordsA = 'data', coordsB = 'data',
                          color = 'red', alpha = .5)
    fig.add_artist(con)
model = LinearRegression(fit_intercept = True).fit(theoretical_quantiles[:, np.newaxis], empirical_data_quantiles)
axes[0,1].plot(np.linspace(-3, 3),
model.intercept_ + model.coef_ * np.linspace(-3, 3),
label = fr"$Y={model.intercept_.round(3)}+{model.coef_[0].round(2)} X$")
axes[0 , 1].set_xlim([-3, 3])
axes[1 , 1].set_xlim([-3, 3])
axes[0, 0].set_ylim([empirical_data_quantiles.min() - 1, empirical_data_quantiles.max() + 1])
axes[0, 1].set_ylim([empirical_data_quantiles.min() - 1, empirical_data_quantiles.max() + 1])
# sns.distplot is deprecated; histplot with y=... draws the same density histogram and KDE sideways
sns.histplot(y = empirical_data_quantiles, kde = True, stat = "density", ax = axes[0,0])
# aesthetics
axes[1, 0].axis('off')
axes[0,0].set_xticks([])
axes[1,1].set_yticks([])
axes[0,1].set_xticks([])
axes[0,1].set_yticks([])
axes[0, 0].set_ylabel("Empirical Distribution")
axes[1,1].set_xlabel("Theoretical Distribution")
fig.savefig("QQ-Plot.png");
The top-left plot shows the distribution of a sample drawn from a gamma distribution. Note that it is right-skewed. The bottom-right plot shows the standard normal density together with its quantiles. Note that the quantiles are packed more closely around zero because most of the probability mass lies there. The top-right plot shows a scatter plot of the empirical quantiles against the theoretical quantiles, with a linear model fitted to the points. By construction, the sample distribution is not normal, and the scatter plot reflects this: the points bend away from the fitted line instead of following it. At the lower end, a high-density region of the sample distribution meets the low-density left tail of the normal, so each step along the normal quantile axis barely moves us in sample quantile space. This explains the bend.
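The bend can also be read off numerically. The sketch below is an illustration under one stated assumption: it uses the exact quantiles of the gamma distribution with `a = 1` from `gamma.ppf` instead of a random sample, so that sampling noise does not obscure the pattern. It computes the slope of the QQ curve between neighbouring quantile pairs on the same probability grid as the figure.

```python
import numpy as np
from scipy.stats import gamma, norm

p = np.linspace(0.01, 0.99, 10)     # same probability grid as the figure
z = norm.ppf(p)                     # standard normal quantiles
g = gamma.ppf(p, a=1)               # exact gamma(a=1) quantiles, not a noisy sample

# Slope of the QQ curve between neighbouring quantile pairs
slopes = np.diff(g) / np.diff(z)
print(slopes.round(2))
```

The slopes start close to zero, where the dense part of the gamma distribution meets the thin left tail of the normal, and grow steadily toward the right tail. A sample from a normal distribution would instead give roughly constant slopes.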