Singular Value Decomposition
My reference notebook for the SVD.
- Motivation
- The Model
- computing svds
- Missing Value imputation
- Helper Functions
- Plot for the Blog Post
- Sources
- References
So yesterday I asked you all what you wanted to hear about from me this week, and one answer stood out from all the others: the SINGULAR VALUE DECOMPOSITION (SVD).https://t.co/9Nyqp2Kr9k
— Women in Statistics and Data Science (@WomenInStat) July 21, 2020
1/
The Model
$X=UDV^T$
$X \in \mathbb{R}^{n \times p}$, w.log $n > p$ in the statsitics context we can think of X as the data matrix, here in most cases we indeed have more observations then variables. $U$ is an $n \times p$ matrix, and $D$ and $V$ have dimension $p \times p.
decomposiition is unuqie
$U$ and $V$ are orthorgonal so it holds that $U^T U = I$, $V V^T = I$ and $V^T V = I$ (she writes $U U^T \neq I$ why? need to check this) the columns of $U$ and $V$ are called the left and right singular vectors
$D$ is diagonal with nonnegative and decreasing elements: $D_{11} \geq D_{22} \geq \ldots \geq D_{pp} \geq 0$ these are called singular values
In words the idea behind SVD is as follows. Given some matrix Recall that this is a mapping which takes a p dimensional vector as input and and transforms it into a n dimensional vector. Graphically it stretches the vector and rotates it. The SVD decomposes this operations in its seperate parts. To do this we take a orthonormal basis in the p dimensional space, apply the X mapping to get a new basis in n.
number non zero singular values is the rank
high colinearity if d_11 / d_pp is "large"
Missing Value imputation
See gives a recipe to fill in missing values, assuming they are missing at random. The procedure is as follows:
- Fill in missing values with mean of column
- calc svd
- get rank k approximation
- replace the entries at the missing values with the rank k approxmation
- repeat until convergence
def rank_k_approx(A, k):
"""A is matrix to be approximated."""
k = k -1 #for python indexing, if i want rank 1 approx need to index with 0
U, D, V_T = svd(A,
full_matrices=False,
compute_uv=True,
hermitian=False)
return U[:, :k] @ np.diag(D)[:k, :k] @ V_T[:k, :]
def svd_impute(df, k):
df_work = df.copy()
#impute with mean
df_work.fillna(df_work.mean(), axis = 0, inplace = True)
err = []
while True:
approx = rank_k_approx(df_work, k)
approx = pd.DataFrame(approx)
df_work_proposal = df.combine_first(approx)
current_error = np.linalg.norm(df_work_proposal - df_work, ord = 'fro')
err.append(current_error)
if current_error < .005:
return df_work_proposal, err
else:
df_work = df_work_proposal