
Ordinary Least Squares - Linear Regression

I discuss ordinary least squares (OLS), aka linear regression, a common parametric model that estimates regression coefficients by minimizing the sum of squared residuals.

Suppose we have a dataset of $n$ observations, each consisting of $p$ explanatory (independent) variables ${\bf x}_{i}$ (e.g. a row vector) and a response/target (dependent) variable $y_i$, for $i = 1,2,\dots, n$. We are interested in modeling the relationship between the independent and dependent variables. The simplest relationship we can assume is that the response variable is a linear function of the predictor variables:

\[\begin{equation}\label{lr1} y_{i} = \beta_1\, x_{i,1} + \beta_2\, x_{i,2} + \dots + \beta_p\, x_{i,p} + \epsilon_i, \end{equation}\]

where the $\beta$’s are the unknown parameters/coefficients of our hypothesis and the $\epsilon_i$’s are the irreducible scalar error terms of each observation, which can arise from measurement error, randomness of the underlying process, or the modeling error introduced by assuming the linear relationship \eqref{lr1}. Here, we should think of approximating the true relationship $y = f(X) + \epsilon$ by a linear function, and $\epsilon$ is the catch-all term that we use to characterize what we miss with this simple model, including modeling errors, measurement errors during data collection and/or other variables that we did not include in our dataset but that can cause variation in the response $y$.

We can summarize the information contained in the independent variables into a matrix called the design matrix ${\bf X}: n \times p$ and vectorize the corresponding response $\vec{y} = [y_1,y_2,\dots, y_n]^{T}$ and errors $\vec{\epsilon} = [\epsilon_1, \epsilon_2, \dots, \epsilon_n]^{T}$ to write the linear regression model \eqref{lr1} compactly as

\[\begin{equation}\label{lrmf} \vec{y} = {\bf X} \vec{\beta} + \vec{\epsilon}, \end{equation}\]

by defining the $p \times 1$ coefficient vector $\vec{\beta} = [\beta_1,\beta_2,\dots, \beta_p]^T$. From \eqref{lrmf}, it is clear that each component of $\vec{\beta}$ represents the average change in the response $\vec{y}$ associated with a unit increase in the corresponding column of ${\bf X}$. In linear regression, it is also customary to add a constant intercept term $\beta_0$, representing the expected value of $\vec{y}$ when ${\bf X} = 0$. In the formulation above, this can be achieved e.g. by adding a vector with unit components as the first column of ${\bf X}$. I will discuss this issue in more detail later on.

Errors vs residuals. Given the hypothesis in \eqref{lrmf}, our goal is to estimate the optimal coefficients $\hat{\vec{\beta}}$ such that the resulting response predictions $\hat{\vec{y}}$ are as close as possible to the observed values $\vec{y}$. These predictions will not be perfect, so we should quantify their error. The quantities used to do so are called the residuals, which we will denote by a vector $\vec{\mathrm{e}} = [\mathrm{e}_1, \mathrm{e}_2, \dots, \mathrm{e}_n]^T$:

\[\begin{align} \vec{\mathrm{e}} &\equiv \vec{y} - \hat{\vec{y}},\\ &= \vec{y} - {\bf X}\hat{\vec{\beta}}, \end{align}\]

where $\hat{\vec{y}} = {\bf X} \hat{\vec{\beta}}$ is the estimated response arising from the estimated linear regression coefficients $\hat{\vec{\beta}}$. An important distinction here is that the residuals are not the same as the error terms $\vec{\epsilon}$ in \eqref{lrmf}. The latter quantify the difference between the observed response $\vec{y}$ and the “true” model ${\bf X}\vec{\beta}$ that we assume to hold: $\vec{\epsilon} = \vec{y} - {\bf X}\vec{\beta}$. In other words, they are the errors associated with the true data-generating process ${\bf X}\vec{\beta}$, while the residuals are the ones associated with the estimated model ${\bf X}\hat{\vec{\beta}}$. In principle, we can make perfect predictions $\vec{y} = \hat{\vec{y}}$ with zero residuals, but this does not mean that the true errors are zero; in fact, they remain unknown.
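
To make the distinction concrete, here is a minimal NumPy sketch (the variable names and the chosen “true” coefficients are mine, purely for illustration) that generates data from a known process and compares the fitted residuals with the unobservable errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Synthetic data: we only know the "true" beta because we generate it ourselves.
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
eps = rng.normal(scale=0.3, size=n)      # unobservable errors
y = X @ beta_true + eps

# OLS fit and the resulting residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Residuals approximate, but do not equal, the true errors.
print(np.max(np.abs(residuals - eps)))   # small, but not zero
```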

Least Squares Criterion and Normal Equations


In our model, making a prediction for the response variable amounts to making estimates for the coefficients $\beta$. What criterion should we use to determine the optimal coefficients? In ordinary least squares (OLS) regression, we simply use the sum of squared residuals as the loss function to be minimized. The best-fit coefficients $\vec{\beta} = [\beta_1,\beta_2, \dots, \beta_p]^T$ are then those that minimize this loss function:

\[\begin{align} \nonumber \hat{\vec{\beta}} &= \textrm{argmin}_{\vec{\beta}}\,\,L(\vec{\beta}),\\ &= \textrm{argmin}_{\vec{\beta}}\,\, (\vec{y} - {\bf X} \vec{\beta})^{T}(\vec{y} - {\bf X} \vec{\beta}). \label{lsc} \end{align}\]

The optimization problem is then solved by setting the first derivative of the loss function with respect to the coefficients $\vec{\beta}$ to zero:

\[\begin{equation}\label{lfd} \frac{\partial L(\vec{\beta})}{\partial \vec{\beta}} = - 2 {\bf X}^{T} \left(\vec{y} - {\bf X} \vec{\beta}\right) \quad \Longrightarrow \quad {\bf X}^{T} \left(\vec{y} - {\bf X} \vec{\beta}\right) = 0. \end{equation}\]

The expression on the right is called the normal equations, which are in fact a set of $p$ equations in disguise. Notice from \eqref{lfd} that the second derivative of the loss function, $2\,{\bf X}^T {\bf X}$, is positive semi-definite (positive definite when ${\bf X}$ has full column rank), ensuring that the following solution is a minimum

\[\begin{equation}\label{sol} \hat{\vec{\beta}} = ({\bf X}^{T}\, {\bf X})^{-1}\, {\bf X}^{T}\, \vec{y}. \end{equation}\]

For a unique solution of the form \eqref{sol} to exist, the $p \times p$ matrix ${\bf X}^T {\bf X}$ must be invertible. For any vector $\vec{v}$, we have $\vec{v}^T {\bf X}^T {\bf X} \vec{v} = \|{\bf X}\vec{v}\|^2 \geq 0$, with equality only when ${\bf X}\vec{v} = 0$. Therefore ${\bf X}^T {\bf X}$ is positive definite, and hence invertible, precisely when ${\bf X}$ has full column rank, or in other words, when its columns are linearly independent! Thus, for a unique solution to exist there should be no perfect multicollinearity in the design matrix ${\bf X}$.
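
As a sketch of how \eqref{sol} looks in code (assuming a full-column-rank design matrix; names are illustrative), the explicit inverse works for small problems, although in practice `np.linalg.solve` or `np.linalg.lstsq` is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))             # random Gaussian columns: full column rank
y = rng.normal(size=50)

# Closed-form normal-equations solution (fine for illustration).
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalents.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_solve, beta_lstsq))
```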

Hat matrix. Let’s have a look at the predictions, which are sometimes called the fitted values. Given the optimal linear regression coefficients, our model makes predictions according to:

\[\begin{equation} \hat{\vec{y}} = {\bf X}\, \hat{\vec{\beta}} = {\bf X}\, ({\bf X}^{T} {\bf X})^{-1}\, {\bf X}^{T} \vec{y}, \end{equation}\]

where we simply plugged in the solution \eqref{sol} to reach the second equality. Notice that in the equation above, the operator acting on the target variable $\vec{y}$ outputs the corresponding predictions $\hat{\vec{y}}$. As the preceding statement suggests, we can therefore define the $n \times n$ “hat matrix” as follows:

\(\begin{equation} {\bf H} = {\bf X}\, ({\bf X}^{T}\, {\bf X})^{-1}\, {\bf X}^{T}. \end{equation}\) The hat matrix has interesting properties that allow us to categorize it as an orthogonal projection operator. Its symmetry, ${\bf H}^T = {\bf H}$, is what makes the projection orthogonal, while its projective (idempotent) property is implied by ${\bf H}^2 = {\bf H}$: when a vector is projected onto a subspace, a second projection has no further effect. We prove these relations in Appendix A. Using the hat matrix ${\bf H}$, we can also define a residual maker matrix ${\bf M}$:

\[\vec{\mathrm{e}} = \vec{y} - \hat{\vec{y}} = ({\bf I} - {\bf H})\, \vec{y} \equiv {\bf M}\, \vec{y}.\]

The residual maker matrix is also an orthogonal projection matrix (see Appendix A). Finally, it is not hard to see that ${\bf M}$ and ${\bf H}$ are mutually orthogonal: ${\bf H}{\bf M} = {\bf H}({\bf I} - {\bf H}) = {\bf H} - {\bf H}^2 = {\bf H} - {\bf H} = 0$.

Now, reconsidering the normal equations, we gain more insight into the OLS regressor:

\[\begin{equation} {\bf X}^{T} \left(\vec{y} - {\bf X} \hat{\vec{\beta}}\right) = {\bf X}^{T} \vec{\mathrm{e}} = 0. \end{equation}\]

This implies that the residual vector is orthogonal to the space spanned by the columns of the design matrix ${\bf X}$. In other words, the residuals are uncorrelated with the explanatory variables (features)! The projection implied by the hat matrix, i.e. $\hat{\vec{y}} = {\bf H} \vec{y}$, also has a nice geometric interpretation. To understand this, we can think of ${\bf H}$ as consisting of three parts. First, ${\bf X}^T \vec{y}$ gives us a $p \times 1$ “overlap” vector that measures the correlation between $\vec{y}$ and the columns of the design matrix. Then the $p\times p$ operator $({\bf X}^T {\bf X})^{-1}$ rescales the resulting vector, taking into account the correlations (${\bf X}^T {\bf X}$) between the columns of ${\bf X}$. The rescaled $p \times 1$ vector is then mapped back to the space ($\in \mathbb{R}^n$) spanned by the columns of ${\bf X}$ via the final operator ${\bf X}$. In other words, ${\bf H}$ outputs a weighted combination of the columns of ${\bf X}$ to form $\hat{\vec{y}}$. These arguments make sense, as the linear model $\vec{y} = {\bf X} \vec{\beta}$ is constrained to live in the space of linear combinations of the columns of the design matrix. Furthermore, since this projection is orthogonal, it throws away only the information that is perpendicular to the space spanned by the columns of ${\bf X}$. In this sense, the projection is the closest point to the response $\vec{y}$ in Euclidean distance. By these arguments, we can also conclude that the residual vector lives in the subspace complementary to the column space of the design matrix. This explains intuitively why the OLS regressor does not leave any useful information behind.
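
These properties are easy to check numerically. Below is a small sketch (random data, illustrative names) verifying the symmetry and idempotence of ${\bf H}$ and ${\bf M}$, their mutual orthogonality, and the orthogonality of the residuals to the columns of ${\bf X}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
M = np.eye(30) - H                       # residual maker

print(np.allclose(H, H.T), np.allclose(H @ H, H))   # symmetric and idempotent
print(np.allclose(M, M.T), np.allclose(M @ M, M))
print(np.allclose(H @ M, 0))                        # H and M are mutually orthogonal
print(np.allclose(X.T @ (M @ y), 0))                # residuals orthogonal to columns of X
```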

Adding the intercept


So far we have focused on linear regression without an intercept. However, adding one can substantially affect the accuracy of the fit. As mentioned before, we can incorporate it by appending to the predictors a column vector with unit entries:

\[\begin{equation} \vec{y} = \begin{bmatrix} \bf 1\,\, {\bf X}_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \vec{\beta}_2 \end{bmatrix} + \epsilon, \end{equation}\]

where we formed the design matrix by stacking horizontally an $n \times 1$ column vector ${\bf 1} = [1,1,\dots,1]^T$ alongside ${\bf X}_2$, while stacking the coefficients vertically, a scalar $\beta_1$ on top of a $p$-vector $\vec{\beta}_2$. The normal equations \eqref{lfd} can then be partitioned as

\[\begin{align} \nonumber {\bf 1}^T\,{\bf 1} \beta_1 + {\bf 1}^T\,{\bf X}_2 \vec{\beta}_2 &= {\bf 1}^T\, \vec{y},\\ {\bf X}_2^T\,{\bf 1} \beta_1 + {\bf X}_2^T\,{\bf X}_2 \vec{\beta}_2 &= {\bf X}_2^T\, \vec{y} \label{neqi} \end{align}\]

We can then solve for $\beta_1$ and $\vec{\beta}_2$. Starting with $\beta_1$, the first equation above gives

\[\begin{equation}\label{b1} \hat{\beta}_1 = \left({\bf 1}^T\,{\bf 1}\right)^{-1}\, {\bf 1}^T \left(\vec{y} - {\bf X}_2 \hat{\vec{\beta}}_2\right) = \frac{ {\bf 1}^T}{n} \left(\vec{y} - {\bf X}_2 \hat{\vec{\beta}}_2\right), \end{equation}\]

noting that $\left({\bf 1}^T\,{\bf 1}\right)^{-1} = n^{-1}$. Plugging this into the second equation in \eqref{neqi} then gives

\[\begin{equation}\label{ib2} \hat{\vec{\beta}}_2 = \left({\bf X}_2^T\,{\bf M}_1\, {\bf X}_2\right)^{-1}\, {\bf X}_2^T\, {\bf M}_1 \vec{y}, \end{equation}\]

where we defined the hat matrix and the residual maker matrix for ${\bf 1}$ as

\[\begin{align} \nonumber {\bf H}_1 = {\bf 1}\, \left({\bf 1}^T\,{\bf 1}\right)^{-1}\, {\bf 1}^T,\\ {\bf M}_1 = {\bf I} - {\bf H}_1. \end{align}\]

Notice that the form of the solution for $\hat{\vec{\beta}}_2$ in \eqref{ib2} resembles the solution without an intercept if we replace $\vec{y} \to {\bf M}_1 \vec{y}$ and ${\bf X}_2 \to {\bf M}_1\, {\bf X}_2$, recalling the projective property ${\bf M}_1^2 = {\bf M}_1$ (and the symmetry ${\bf M}_1^T = {\bf M}_1$). In some sense, \eqref{ib2} implies that we can obtain the optimal regression coefficients by regressing ${\bf M}_1 \vec{y}$ onto ${\bf M}_1\, {\bf X}_2$.

Next, we note that the hat matrix associated with ${\bf 1}$ is a $n\times n$ matrix with all entries equal to $1/n$:

\[\begin{equation} {\bf H}_1 = {\bf 1}\, \left({\bf 1}^T\,{\bf 1}\right)^{-1}\, {\bf 1}^T = \frac{1}{n} {\bf 1}\, {\bf 1}^T = \begin{bmatrix} 1/n & 1/n & \dots & 1/n \\ 1/n & 1/n & \dots & 1/n \\ \vdots & & \ddots & \vdots\\ 1/n & \dots & \dots & 1/n \end{bmatrix} \end{equation}\]

Therefore, when it acts on a column vector, it outputs a column vector whose entries are all equal to the average of the components of the original vector. This implies that ${\bf M}_1\, \vec{y} = \vec{y} - \bar{\vec{y}}$. Similarly, its action on a matrix results in a matrix where each column is filled with the average of the corresponding column. This is useful for simplifying the first term in \eqref{ib2} as

\[\begin{align} {\bf X}_2^T\,{\bf M}_1\, {\bf X}_2 = {\bf X}_2^T\,({\bf I} - {\bf H}_1)\, {\bf X}_2 = {\bf X}_2^T\,({\bf X}_2 - \bar{\bf X}_2), \end{align}\]

where $\bar{\bf X}_2$ is a matrix where each column is an $n$-vector with the mean of that respective column in ${\bf X}_2$ repeated $n$ times. With these considerations in mind, we can re-write \eqref{ib2} as

\[\begin{equation} \hat{\vec{\beta}}_2 = \left({\bf X}_2^T\,({\bf X}_2 - \bar{\bf X}_2)\right)^{-1}\, {\bf X}_2^T\, (\vec{y} - \bar{\vec{y}}) \end{equation}\]

Therefore, for an OLS regression with an intercept, the optimal coefficients $\hat{\vec{\beta}}_2$ associated with the predictors in the design matrix are just the result of the normal equations after mean-centering our targets and predictors. Intuitively, this means that the “slope” (which is now a hyperplane) defined by these coefficients passes through the origin of the mean-centered coordinates. It is the scalar $\hat{\beta}_1$ that lifts this hyperplane up or down. From \eqref{b1}, we obtain this bias as

\[\begin{equation} \hat{\beta}_1 = \frac{ {\bf 1}^T}{n} \left(\vec{y} - {\bf X}_2 \hat{\vec{\beta}}_2\right) = \bar{y} - \bar{\bf x}_2 \hat{\vec{\beta}}_2, \end{equation}\]

where $\bar{\bf x}_2$ is a $1 \times p$ row vector that contains the mean of each predictor and $\bar{y}$ is the scalar average of all responses.
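
The equivalence between the partitioned solution and a plain regression on mean-centered data is easy to confirm numerically. Here is a short sketch (synthetic data, illustrative names) comparing the fit with an explicit column of ones against the centering shortcut:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 2
X2 = rng.normal(loc=5.0, size=(n, p))
y = 2.0 + X2 @ np.array([1.0, -3.0]) + rng.normal(scale=0.5, size=n)

# Full solution with an explicit column of ones for the intercept.
X = np.column_stack([np.ones(n), X2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Shortcut: slopes from mean-centered data, intercept from the means.
Xc = X2 - X2.mean(axis=0)
yc = y - y.mean()
beta2, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
beta1 = y.mean() - X2.mean(axis=0) @ beta2

print(np.allclose(beta_full, np.concatenate([[beta1], beta2])))   # True
```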

A probabilistic look at the linear regression


An interesting question regarding OLS is why we choose to minimize the sum of squared residuals to arrive at the optimal solution. In seeking an answer to this question, we actually learn more about the statistical properties of the generative model. Recall that in linear regression, we assume that the generative process of the data follows \eqref{lrmf}:

\[\begin{equation}\label{gmlr} \vec{y} = {\bf X}\, \vec{\beta} + \vec{\epsilon}, \end{equation}\]

where $\vec{\epsilon}$ is the irreducible noise that we cannot model (i.e. the stuff we cannot explain). What are some nice properties we would like the errors to have? First, we want the linear term in \eqref{gmlr} to capture all the necessary information about the response, so that the errors do not carry any useful information about the process \eqref{gmlr}. In statistical terms, we want the errors of the individual observations to be unbiased:

\[\mathbb{E}[\epsilon_{i}] = 0.\]

On the other hand, since we assume the observations to be independent, we would expect the generative processes that give rise to them to be independent as well. This implies the independence of errors:

\[\mathbb{E}[\epsilon_i\, \epsilon_j] = \mathbb{E}[\epsilon_i] \mathbb{E}[\epsilon_j] = 0, \quad\quad \textrm{for} \quad i \neq j.\]

Independence automatically implies that the errors of different observations are uncorrelated. Finally, we would like to specify the “noise” level of the errors. We know that they should average to zero, but this does not tell us how much they spread from one observation to another. The simplest assumption we can make here is a constant variance of errors:

\[\mathbb{V}[\epsilon_i] = \mathbb{E}[\epsilon_i^2] = \sigma^2\]

which is commonly referred to as homoscedasticity of errors. Our favorite symmetric distribution (although not the only option here) with these properties is the normal distribution: $\epsilon_i \sim N(0,\sigma^2)$. The assumption of normality of errors has an interesting consequence that we will explore below. First, the distribution of the individual response variable becomes a conditional one based on the design matrix ${\bf X}$ and the model parameters $\vec{\beta}, \sigma^2$:

\[\begin{equation} \mathbb{P}\left(y_i\, \big |\, \vec{x}_i ;\, \vec{\beta}, \sigma^2\right) = N\left(\vec{x}_{i}^T \vec{\beta}, \sigma^2\right), \end{equation}\]

where we have written the regression problem \eqref{gmlr} in the vector form as in \eqref{lr1}

\[y_i = \vec{x}_{i}^T \vec{\beta} + \epsilon_i\]

using a feature vector for each observation as

\[\vec{x}_{i} = [1,\, x_{i1},\, x_{i2},\, \dots,\, x_{ip}]^T.\]

Since we assume i.i.d. data, the joint conditional distribution of the response vector can be written as

\[\begin{align} \nonumber \mathbb{P}\left(\vec{y}\, \big |\, {\bf X} ;\, \vec{\beta}, \sigma^2\right) &= \prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi \sigma^2}}\, \mathrm{e}^{-(y_i - \vec{x}_i^T \vec{\beta})^2 / (2\sigma^2)},\\ &= \frac{1}{(2\pi \sigma^2)^{n/2}}\, \exp \left[- \frac{1}{2} \left(\vec{y} - {\bf X} \vec{\beta}\right)^T (\sigma^2 \mathbb{1}_{n\times n})^{-1} \left(\vec{y} - {\bf X} \vec{\beta}\right)\right]\label{lf} \end{align}\]

Equation \eqref{lf} is called the joint likelihood function, as it provides the likelihood of the response given the design matrix and the model parameters $\vec{\beta}, \sigma^2$. Ideally, we would like to pick the parameters that maximize this likelihood. The estimates derived in this way are called the maximum likelihood estimates (MLE) of the parameters. Noticing that it is easier to work with the log of the expression \eqref{lf}, we seek the parameters that maximize the log-likelihood:

\[\begin{align}\label{llf} \ln \mathbb{P}(\vec{y}\, \big | \, {\bf X};\, \vec{\beta}, \sigma^2) \equiv l(\vec{\beta}, \sigma^2) = -\frac{n}{2} \ln(2\pi \sigma^2) - \frac{1}{2\sigma^2} \left(\vec{y} - {\bf X} \vec{\beta}\right)^T \left(\vec{y} - {\bf X} \vec{\beta}\right). \end{align}\]

The regression coefficients that maximize the likelihood are found by setting the partial derivative of \eqref{llf} to zero, which is mathematically equivalent to minimizing the sum of squared residuals in \eqref{lsc}:

\[\begin{align} \hat{\vec{\beta}}_{\rm MLE} = \textrm{argmax}_{\beta}\left(-\left(\vec{y} - {\bf X} \vec{\beta}\right)^T \left(\vec{y} - {\bf X} \vec{\beta}\right)\right). \end{align}\]

As a result, the MLE estimates of the parameters agree completely with those implied by the normal equations, see e.g. \eqref{sol}. This equivalence provides us with a generative view of OLS regression, in which the best-fit coefficients of the regression problem are found by MLE under the assumption that the data (i.e. the response) follow the normal distribution \eqref{lf}. In this sense, we are seeking the parameters $\vec{\beta}$ that make the observed response most probable under the hypothesis that it is generated by the process \eqref{gmlr} with $\vec{\epsilon} \sim N(0,\sigma^2\,{\bf I})$. Once we have the MLE estimates of the regression coefficients, we can obtain the MLE estimate of the error variance as

\[\begin{align} \hat{\sigma}_{\rm MLE}^2 = \textrm{argmax}_{\sigma^2} \,l(\hat{\vec{\beta}}, \sigma^2), \quad \rightarrow \quad \frac{\partial l}{\partial \sigma^2} = 0 \quad \rightarrow \quad \hat{\sigma}_{\rm MLE}^2 = \frac{1}{n} \left(\vec{y} - {\bf X} \hat{\vec{\beta}}\right)^T \left(\vec{y} - {\bf X} \hat{\vec{\beta}}\right). \end{align}\]
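
As a sanity check (a sketch only; the variable names and the use of scipy.optimize.minimize are my own choices, not part of the derivation above), we can maximize the log-likelihood \eqref{llf} numerically and compare the result with the closed-form OLS coefficients and $\hat{\sigma}^2_{\rm MLE}$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, p = 150, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.7, size=n)

def neg_log_likelihood(theta):
    # Parametrize sigma^2 through its log to keep it positive.
    beta, sigma2 = theta[:p], np.exp(theta[p])
    r = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * (r @ r) / sigma2

res = minimize(neg_log_likelihood, x0=np.zeros(p + 1))

# Closed-form counterparts.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_mle = np.sum((y - X @ beta_ols) ** 2) / n

print(np.abs(res.x[:p] - beta_ols).max())        # ~0: coefficients agree
print(abs(np.exp(res.x[p]) - sigma2_mle))        # ~0: variances agree
```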

Now, let’s assume \(\vec{\beta}_{\star}\) and ${\sigma}_{\star}^2$ to be the “true” generative parameters of the process \eqref{gmlr}. Then the conditional expectation of the response given the design matrix reads

\[\begin{align} \nonumber \mathbb{E}[\vec{y}\,\big |\, {\bf X}] &= \mathbb{E}[{\bf X}\vec{\beta}_{\star}\, \big |\, {\bf X}] + \mathbb{E}[\vec{\epsilon}\, \big |\, {\bf X}],\\ \nonumber &= {\bf X}\vec{\beta}_{\star} + \mathbb{E}[\vec{\epsilon}],\\ &= {\bf X}\vec{\beta}_{\star}, \label{ce} \end{align}\]

where we used the linearity of expectation, as well as the independence of the errors from the design matrix. On the other hand, using the same property, the conditional variance of the response can be obtained as

\[\begin{align} \nonumber \mathbb{V}[\vec{y}\,\big |\, {\bf X}] &= \mathbb{V}[{\bf X}\vec{\beta}_{\star}\, \big |\, {\bf X}] + \mathbb{V}[\vec{\epsilon}\, \big |\, {\bf X}],\\ \nonumber &= \mathbb{V}[\,\vec{\epsilon}\,],\\ &= \sigma_{\star}^2\, {\bf I}. \label{cv} \end{align}\]

In the context of OLS, extracting meaning from these expressions requires understanding that the conditional expectation in \eqref{ce} is the minimizer of the mean squared loss (see Appendix B). Therefore, given our model \eqref{gmlr}, ${\bf X}\vec{\beta}_{\star}$ is the best estimate we can have. The conditional variance \eqref{cv}, on the other hand, gives us the smallest expected squared error.

References


1. Topics in Mathematics with Applications in Finance, Lecture on Regression Analysis, Dr. Peter Kempthorne, MIT Sloan School of Management, 2013.

2. CS229: Machine Learning, Lecture, Dr. Christopher Ré, Stanford University, 2022.

Appendix A: Hat matrix and residual maker as orthogonal projection operators

We first prove symmetry, which is what makes these projections orthogonal. For the hat matrix, this is simple:

\[\begin{align} \nonumber {\bf H}^T &= ({\bf X}\, ({\bf X}^{T}\, {\bf X})^{-1} {\bf X}^{T})^{T},\\ \nonumber &= {\bf X}\, ({\bf X}\, ({\bf X}^{T}\, {\bf X})^{-1})^T,\\ \nonumber &= {\bf X}\, (({\bf X}^{T}\, {\bf X})^{-1} {\bf X}^{T}),\\ &= {\bf H}, \end{align}\]

where we utilized $({\bf A}\, {\bf B})^{T} = {\bf B}^T {\bf A}^T$ for matrices and the fact that the inverse of a symmetric matrix is symmetric, which follows since transposition commutes with inversion: $({\bf A}\, {\bf A}^{-1})^T = ({\bf A}^{-1})^T {\bf A}^T = {\bf I}$, so $({\bf A}^{-1})^T = ({\bf A}^{T})^{-1}$. The projective (idempotent) property follows directly from the associativity of matrix multiplication: \(\begin{align} \nonumber {\bf H}^2 &= {\bf X}\, ({\bf X}^{T}\, {\bf X})^{-1} {\bf X}^{T} \, {\bf X}\, ({\bf X}^{T}\, {\bf X})^{-1} {\bf X}^{T},\\ \nonumber &= {\bf X}\, ({\bf X}^{T}\, {\bf X})^{-1}\, [({\bf X}^{T} \, {\bf X})\,({\bf X}^{T}\, {\bf X})^{-1}]\, {\bf X}^{T},\\ &= {\bf X}\, ({\bf X}^{T}\, {\bf X})^{-1} {\bf X}^{T}. \end{align}\)

Given the orthogonal projection matrix ${\bf H}$, it is easier to prove that the residual maker is also an orthogonal projection matrix:

\[\begin{align} \nonumber {\bf M}^{2} &= ({\bf I} - {\bf H}) ({\bf I} - {\bf H}),\\ \nonumber &= {\bf I} - {\bf H} - {\bf H} + {\bf H}^2,\\ \nonumber &= {\bf I} - {\bf H},\\ &= {\bf M}. \end{align}\]

Symmetry also follows directly from the linearity of the transpose together with the symmetry of the $n \times n$ identity matrix ${\bf I}$ and of ${\bf H}$:

\[{\bf M}^{T} = ({\bf I} -{\bf H})^{T} = ({\bf I}^T - {\bf H}^T) = ({\bf I} - {\bf H}) = {\bf M}.\]

Appendix B: Conditional expectation as the best predictor

We usually think of expectation values as averages of a random variable $y$ over repeated experiments. However, there is also a nice interpretation of the expectation of $y$ as the minimizer of the mean squared error (MSE) loss function. This helps us understand why MSE is such a commonly used loss function in optimization problems.

We can start by considering a simple example. We claim that, given the random variable $y$, its expected value is the minimizer of the expected squared loss:

\[\begin{equation}\label{p1} y^{(*)}_{p} \equiv \mathbb{E} [y] = \textrm{argmin}_{y_p} \, \mathbb{E}\left[(y - y_p)^2\right]. \end{equation}\]

To prove this we focus on the right-hand side using the oldest trick in town:

\[\begin{align} \nonumber \mathbb{E}\left[(y-y_p)^2\right] &= \mathbb{E}\left[(y-\mathbb{E}[y] + \mathbb{E}[y] - y_p)^2\right], \\ &= \mathbb{E}\left[(y-\mathbb{E}[y])^2\right] + \mathbb{E}\left[(\mathbb{E}[y]-y_p)^2\right] + 2 \mathbb{E}\left[(y-\mathbb{E}[y])(\mathbb{E}[y] - y_p)\right].\label{rhs} \end{align}\]

Noticing that the factor $\mathbb{E}[y] - y_p$ is a non-random constant, the last expectation value in the expression above vanishes:

\[\mathbb{E}\left[(y-\mathbb{E}[y])(\mathbb{E}[y] - y_p)\right] = \mathbb{E}\left[(y-\mathbb{E}[y])\right] \, (\mathbb{E}[y] - y_p) = (\mathbb{E}[y]-\mathbb{E}[y]) \, (\mathbb{E}[y] - y_p) = 0.\]

Notice also that the first term in \eqref{rhs} has nothing to do with the optimization problem we are interested in. Therefore, the right-hand side of \eqref{p1} is essentially equivalent to the following

\[\begin{equation}\label{p1f} \textrm{argmin}_{y_p} \, \mathbb{E}\left[(\mathbb{E}[y]-y_p)^2\right]. \end{equation}\]

Since the function inside the expectation is convex in $y_p$, we therefore get what we claimed initially \eqref{p1}:

\[\begin{equation} \textrm{argmin}_{y_p} \, \mathbb{E}\left[(\mathbb{E}[y]-y_p)^2\right] \quad \Longrightarrow \quad \frac{\partial (\mathbb{E}[y]-y_p)^2}{\partial y_p} = 0 \quad \rightarrow \quad y_p^{(*)} = \mathbb{E}[y]. \end{equation}\]
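
A quick Monte Carlo illustration of this claim (a sketch; the exponential distribution is an arbitrary choice) shows that, among all constant predictions, the sample mean minimizes the empirical mean squared error:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.exponential(scale=2.0, size=100_000)   # any distribution would do

# Evaluate the empirical MSE over a grid of candidate constant predictions.
candidates = np.linspace(0.0, 5.0, 501)
mse = np.array([np.mean((y - c) ** 2) for c in candidates])

print(candidates[np.argmin(mse)], y.mean())    # the minimizer is (close to) the mean
```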

Now, in the context of a regression problem, we can think of a vector-valued random variable $\vec{y} \in \mathbb{R}^n$ and a model $f({\bf X})$ as the predictor. Note that the map $f$ also outputs a vector in $\mathbb{R}^n$, e.g. $f({\bf X}) = \vec{y}_p$. We then ask: what is the best predictor function $f$ that minimizes the mean of the residual sum of squares, or equivalently the MSE loss function?

\[\begin{equation}\label{p2} f^{(*)} = \textrm{argmin}_{f} \, \mathbb{E}\left[(\vec{y} - f({\bf X}))^T (\vec{y} - f({\bf X}))\right]. \end{equation}\]

Defining

\[\vec{\mu}({\bf X}) \equiv \mathbb{E}[\vec{y}\, | \, {\bf X}],\]

we apply the oldest trick to both terms in \eqref{p2}:

\[\begin{align} \nonumber \mathbb{E}\left[(\vec{y} - f)^T (\vec{y} - f)\right] &= \mathbb{E}\left[(\vec{y} - \vec{\mu} + \vec{\mu} - f)^T (\vec{y} - \vec{\mu} + \vec{\mu} - f)\right],\\ \nonumber &= \mathbb{E}\left[(\vec{y} - \vec{\mu}({\bf X}))^T (\vec{y} - \vec{\mu}({\bf X}))\right] + \mathbb{E}\left[(\vec{\mu}({\bf X}) - f({\bf X}))^T (\vec{\mu}({\bf X}) - f({\bf X}))\right]\\ &\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad + 2 \mathbb{E}\left[(\vec{y} - \vec{\mu}({\bf X}))^T (\vec{\mu}({\bf X}) - f({\bf X}))\right]. \label{loss} \end{align}\]

Notice again that the first term above has nothing to do with the optimization problem of interest. Similar to the previous argument, the last term can be shown to vanish by invoking the tower law (law of total expectation)

\[\mathbb{E}[A] = \mathbb{E}[\mathbb{E}[A\, |\, X]].\]

We thus write

\[\begin{align} \nonumber \mathbb{E}\left[(\vec{y} - \vec{\mu}({\bf X}))^T (\vec{\mu}({\bf X}) - f({\bf X}))\right] &= \mathbb{E}\left[\mathbb{E}\left[(\vec{y} - \vec{\mu}({\bf X}))^T (\vec{\mu}({\bf X}) - f({\bf X}))\, \big|\, {\bf X}\right]\right],\\ \nonumber &= \mathbb{E}\left[\mathbb{E}\left[(\vec{y} - \vec{\mu}({\bf X}))^T \,\big|\, {\bf X}\right]\, (\vec{\mu}({\bf X}) - f({\bf X}))\right], \\ \nonumber &= \mathbb{E}\left[\,0 \cdot (\vec{\mu}({\bf X}) - f({\bf X}))\,\right],\\ &= 0. \end{align}\]

The optimization problem \eqref{p2} is then equivalent to:

\[\textrm{argmin}_{f} \, \mathbb{E}\left[(\vec{y} - f({\bf X}))^T (\vec{y} - f({\bf X}))\right] = \textrm{argmin}_{f}\, \mathbb{E}\left[(\vec{\mu}({\bf X}) - f({\bf X}))^T (\vec{\mu}({\bf X}) - f({\bf X}))\right]\]

Setting the first derivative of the expression inside the expectation to zero, we get

\[f^{(*)}({\bf X}) = \vec{\mu}({\bf X}) \equiv \mathbb{E}\left[\vec{y}\, | \, {\bf X}\right].\]