Gauss-Markov Theorem
I discuss a powerful theorem that has important implications for the theory of linear models and its applications to regression problems.
At an informal level, the Gauss-Markov theorem tells us that the ordinary least squares (OLS) estimator is the best we can have among all linear unbiased estimators under the standard OLS assumptions. We will dwell on the meaning of the “best” estimator later on, but I find it important to first state the assumptions of OLS regression, since the proof of the Gauss-Markov theorem relies on them.
The assumptions of OLS regression are as follows:
1) Linearity. Well, this is the most obvious assumption of OLS: the data generating process exhibits a linear relationship between the independent and the dependent variables. In terms of the $n \times p$ design matrix ${\bf X}$, consisting of $p$ regressors, and the $n \times 1$ vector of dependent targets $\vec{y}$, linearity mathematically translates into
\[\begin{equation}\label{glm} \vec{y} = {\bf X}\, \vec{\beta} + \vec{\epsilon}, \end{equation}\]where $\vec{\beta}$ is the vector of true regression parameters and $\vec{\epsilon}$ is the vector of irreducible errors. Note that although the relationship between the regressor matrix and the targets is linear, the regressors themselves can be non-linear transformations of the underlying features (e.g. polynomial or interaction terms). The rest of the assumptions build upon linearity by imposing certain conditions on the design matrix ${\bf X}$ and the errors.
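Before listing them, here is a minimal sketch of this data-generating process in numpy (all numbers below, the sample size, the coefficients and the noise level, are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 200, 3                              # hypothetical sample size and number of regressors
X = rng.normal(size=(n, p))                # design matrix (columns could be non-linear feature transforms)
beta_true = np.array([1.5, -2.0, 0.5])     # "true" regression parameters, chosen arbitrarily
sigma = 1.0                                # standard deviation of the irreducible errors

eps = rng.normal(scale=sigma, size=n)      # irreducible errors with zero mean and constant variance
y = X @ beta_true + eps                    # y = X beta + eps
```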
2) No multi-collinearity. Recall that the OLS estimate for the regression coefficients
\[\begin{equation}\label{olse} \hat{\vec{\beta}} = ({\bf X}^T {\bf X})^{-1}\, {\bf X}^T \vec{y} \end{equation}\]requires the matrix ${\bf M} = {\bf X}^T {\bf X}$ to be invertible, i.e. $\textrm{det}({\bf M}) \neq 0$. Since ${\bf M}$ is a $p \times p$ Gram matrix, this holds if and only if ${\bf X}$ has full column rank, that is, its columns are linearly independent. In other words, in order to have well-defined and stable estimates of the regression coefficients, we need to assume the absence of multi-collinearity between the data features/columns of the design matrix. The rest of the OLS assumptions impose conditions on the error term in \eqref{glm}.
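Before moving on to the error-term assumptions, here is a quick continuation of the simulated example above: the estimate \eqref{olse} can be computed directly; in practice, solving the normal equations (or using `np.linalg.lstsq`) is preferred over forming the explicit inverse:

```python
# The normal equations require X to have full column rank (no multi-collinearity)
M = X.T @ X
assert np.linalg.matrix_rank(X) == p       # columns of X are linearly independent

beta_hat = np.linalg.solve(M, X.T @ y)     # equivalent to (X^T X)^{-1} X^T y, but numerically more stable
print(beta_hat)                            # should be close to beta_true for the simulated data
```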
3) Strict exogeneity. This assumption states that the error terms have zero mean, and that this holds regardless of any information contained in the regressors:
\[\begin{equation}\label{sexo} \mathbb{E}[\epsilon_i\, |\, {\bf X}] = 0, \quad \textrm{for} \quad i = 1,2,\dots,n. \end{equation}\]An exogenous variable is one that is not determined by the other variables or parameters in the model. Conditioning on the whole design matrix ${\bf X}$ in the relation above means that knowledge of the regressors does not shift the mean of the errors away from zero, which is exactly what strict exogeneity requires.
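As a short side calculation, strict exogeneity (together with linearity) is what makes the OLS estimator \eqref{olse} unbiased: substituting \eqref{glm} into \eqref{olse} and taking the conditional expectation gives
\[\begin{align} \nonumber \mathbb{E}[\hat{\vec{\beta}}\, |\, {\bf X}] &= \mathbb{E}[({\bf X}^T {\bf X})^{-1} {\bf X}^T ({\bf X}\,\vec{\beta} + \vec{\epsilon})\, |\, {\bf X}],\\ \nonumber &= \vec{\beta} + ({\bf X}^T {\bf X})^{-1} {\bf X}^T\, \mathbb{E}[\vec{\epsilon}\, |\, {\bf X}],\\ \nonumber &= \vec{\beta}. \end{align}\]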
4) Spherical errors. This assumption can be unpacked into two more digestible assumptions. The first one is about the homoscedasticity of errors which implies that the individual errors exhibit constant variance
\[\begin{equation} \mathbb{V}[\epsilon_i\, |\, {\bf X}] = \sigma^2, \quad \textrm{for} \quad i = 1,2,\dots,n. \end{equation}\]The second is the absence of correlation between distinct error terms:
\[\begin{equation} \mathbb{E}[\epsilon_i \epsilon_j\, |\, {\bf X}] = 0, \quad \textrm{for} \quad i,j \in \{1,2,\dots,n\}, \quad i\neq j, \end{equation}\]i.e. the errors are uncorrelated (a weaker requirement than full independence). These two assumptions combined give spherical errors, in the sense that the error covariance matrix is a constant multiple of the identity, with no cross terms:
\[\begin{align} \mathbb{V}[\vec{\epsilon}\, |\, {\bf X}] = \sigma^2\, {\bf I}_{n\times n}. \end{align}\]5) Normality of errors. Finally, the statistical properties of the errors listed above are commonly wrapped up by assuming that the errors follow a normal distribution. Note that this is not strictly necessary, in the sense that i) the normal distribution is not the only distribution that exhibits the properties above, and ii) OLS theory does not require a distributional assumption on the errors. However, normality provides a framework for testing the accuracy of the OLS estimates, e.g. via confidence intervals and hypothesis tests on the coefficients.
\[\begin{equation} \vec{\epsilon}\, |\, {\bf X} \sim N(\vec{0}, \sigma^2 {\bf I}_{n\times n}). \end{equation}\]Under the first four assumptions (normality is not needed for what follows), the Gauss-Markov theorem formally states that
Theorem. For a linear regression problem with response $\vec{y}$ and design matrix ${\bf X}$, the OLS estimator \eqref{olse} is the minimum variance estimator among all linear unbiased estimators of the model parameter $\vec{\beta}$.
Here, we understand “best” to mean the estimator that exhibits the smallest variance. In other words, the theorem states that the OLS estimator is BLUE (Best Linear Unbiased Estimator)! Of course, a non-linear method, or one optimizing a different metric on our data, may outperform OLS regression, but if the OLS assumptions hold for our data and we want to use an unbiased linear model to characterize it, then we should just use OLS.
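To make the claim concrete, here is a small Monte-Carlo sketch, continuing the simulated example from above. It compares the OLS map with one arbitrary alternative linear unbiased estimator, namely a weighted least-squares map ${\bf A} = ({\bf X}^T {\bf W} {\bf X})^{-1} {\bf X}^T {\bf W}$ with made-up weights ${\bf W}$, which satisfies ${\bf A}{\bf X} = {\bf I}$ by construction (the notation anticipates the proof below):

```python
# Compare the OLS estimator O y against an alternative linear unbiased estimator A y.
W = np.diag(rng.uniform(0.2, 5.0, size=n))       # arbitrary positive weights, not the true error structure
O = np.linalg.solve(X.T @ X, X.T)                # OLS map:         (X^T X)^{-1} X^T
A = np.linalg.solve(X.T @ W @ X, X.T @ W)        # alternative map: (X^T W X)^{-1} X^T W, so A X = I

reps = 5000
b_ols = np.empty((reps, p))
b_alt = np.empty((reps, p))
for r in range(reps):
    y_sim = X @ beta_true + rng.normal(scale=sigma, size=n)
    b_ols[r] = O @ y_sim
    b_alt[r] = A @ y_sim

print(b_ols.mean(axis=0), b_alt.mean(axis=0))    # both close to beta_true: both estimators are unbiased
print(b_ols.var(axis=0), b_alt.var(axis=0))      # the OLS variances are expected to be the smaller ones
```

Any ${\bf W}$ that is not a multiple of the identity makes ${\bf A}$ differ from the OLS map, and the theorem guarantees that its coefficient variances can only be at least as large as those of OLS.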
Proof of Gauss-Markov Theorem
Now consider an alternative linear and unbiased estimator
\[\hat{\vec{\beta}}_a = {\bf A}\, \vec{y},\]where ${\bf A}$ is a $p \times n$ matrix. Similar in spirit to the OLS estimator, it is linear because it provides a linear map from the response to the coefficient estimates, and hence to the predictions through
\[\hat{\vec{y}} = {\bf X}\, \hat{\vec{\beta}}_a = {\bf X}{\bf A}\, \vec{y}.\]On the other hand, we require it to be unbiased, like the OLS estimator, which translates into the requirement that the expected value of the estimator conditioned on the data is just the “true” model parameter $\vec{\beta}$:
\[\begin{align} \mathbb{E}[\hat{\vec{\beta}}_a\, |\, {\bf X}] = \mathbb{E}[{\bf A}\, \vec{y}\, |\, {\bf X}] = \mathbb{E}[{\bf A}{\bf X}\,\vec{\beta} + {\bf A}\vec{\epsilon}\, |\, {\bf X}] = \vec{\beta}, \end{align}\]where we have used \eqref{glm}. Given the strict exogeneity assumption \eqref{sexo}, unbiasedness of the alternative estimator requires
\[\begin{equation}\label{ubne} {\bf A}{\bf X} = {\bf I}_{p \times p}, \quad\quad\quad \hat{\vec{\beta}}_a = {\bf A}{\bf X}\,\vec{\beta} + {\bf A}\vec{\epsilon}. \end{equation}\]Now, the variance of the new estimator reads as
\[\begin{align} \nonumber \mathbb{V}[\hat{\vec{\beta}}_a\, |\, {\bf X}] &= \mathbb{V}[{\bf A}{\bf X}\,\vec{\beta} + {\bf A}\vec{\epsilon}\, |\, {\bf X}],\\ \nonumber &= \mathbb{V}[{\bf A}\vec{\epsilon}\, |\, {\bf X}],\\ \nonumber &= {\bf A}\mathbb{V}[\vec{\epsilon}\, |\, {\bf X}]{\bf A}^{T},\\ &= \sigma^2 {\bf A} {\bf A}^{T}, \label{vara} \end{align}\]where we used the constancy of the vector $\vec{\beta}$ and the matrix ${\bf A}$ (conditional on ${\bf X}$) together with the 4th OLS assumption above: spherical errors. Finally, we define the $p \times n$ difference map (matrix) between the alternative and the original OLS estimator as
\[{\bf D} = {\bf A} - {\bf O},\]where ${\bf O} = ({\bf X}^T {\bf X})^{-1} {\bf X}^T$ is the analogue of ${\bf A}$ for the OLS estimator, i.e. $\hat{\vec{\beta}} = {\bf O}\, \vec{y}$ by \eqref{olse}. We then simply re-write the variance \eqref{vara} of the new estimator as
\[\begin{align} \nonumber \mathbb{V}[\hat{\vec{\beta}}_a\, |\, {\bf X}] &= \sigma^2 ({\bf D} + {\bf O}) ({\bf D} + {\bf O})^T,\\ &= \sigma^2 ({\bf D}{\bf D}^T + {\bf O}{\bf D}^T + {\bf D}{\bf O}^T + {\bf O}{\bf O}^T).\label{vara2} \end{align}\]To finalize the proof, first notice that the original map is orthogonal to the difference map:
\[\begin{equation} {\bf D}{\bf O}^T = ({\bf O}{\bf D}^T)^T = {\bf D} {\bf X}\, ({\bf X}^T {\bf X})^{-1} = 0 \end{equation}\]due to the fact that the new estimator is unbiased by virtue of \eqref{ubne} and ${\bf O}{\bf X} = {\bf I}_{p \times p}$, implying
\[{\bf D}{\bf X} = {\bf A} {\bf X} - {\bf O}{\bf X} = {\bf I}_{p \times p} - {\bf I}_{p \times p} = 0.\]This leads us to drop the two middle terms in \eqref{vara2}. Note that the orthogonality between the difference map and the original map implicitly assumes the existence of the inverse of ${\bf X}^T{\bf X}$, which points back to the 2nd OLS assumption: no multi-collinearity. In \eqref{vara2}, we therefore have
\[\begin{align}\label{vara3} \nonumber \mathbb{V}[\hat{\vec{\beta}}_a\, |\, {\bf X}] = \sigma^2 {\bf D}{\bf D}^T + \sigma^2 {\bf O}{\bf O}^T. \end{align}\]Here the last term is just the variance of the original OLS estimator in disguise
\[\mathbb{V}[\hat{\vec{\beta}}\, |\, {\bf X}] = \mathbb{V}[{\bf O}\vec{y}\, |\, {\bf X}] = {\bf O} \mathbb{V}[\vec{y}\, |\, {\bf X}] {\bf O}^T = \sigma^2 {\bf O} {\bf O}^T.\]Finally, since ${\bf D}{\bf D}^T$ is a positive semi-definite matrix, we obtain
\[\begin{align} \nonumber \mathbb{V}[\hat{\vec{\beta}}_a\, |\, {\bf X}] &= \sigma^2 {\bf D}{\bf D}^T + \sigma^2 {\bf O}{\bf O}^T,\\ \nonumber &= \mathbb{V}[\hat{\vec{\beta}}\, |\, {\bf X}] + \sigma^2 {\bf D}{\bf D}^T,\\ &\geq \mathbb{V}[\hat{\vec{\beta}}\, |\, {\bf X}]. \end{align}\]Here the inequality is understood in the matrix (positive semi-definite) sense: the difference $\mathbb{V}[\hat{\vec{\beta}}_a\, |\, {\bf X}] - \mathbb{V}[\hat{\vec{\beta}}\, |\, {\bf X}] = \sigma^2 {\bf D}{\bf D}^T$ is positive semi-definite, so in particular each individual coefficient of the OLS estimator has a variance no larger than that of the alternative estimator. That is it: we have proved that under the OLS assumptions of linearity, no multi-collinearity, strict exogeneity and spherical errors, the OLS estimator is BLUE!
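Continuing the simulation sketch from before (with the same hypothetical `X`, `A`, `O` and `sigma`), the conclusion can also be checked numerically: the difference of the two covariance matrices equals $\sigma^2 {\bf D}{\bf D}^T$ and is positive semi-definite.

```python
# Numerical check of the proof: D X = 0 and V_alt - V_ols = sigma^2 D D^T is positive semi-definite
D = A - O
print(np.allclose(D @ X, np.zeros((p, p))))                  # True: A X = O X = I implies D X = 0
V_ols = sigma**2 * O @ O.T
V_alt = sigma**2 * A @ A.T
print(np.allclose(V_alt - V_ols, sigma**2 * D @ D.T))        # True: the cross terms vanish
print(np.linalg.eigvalsh(V_alt - V_ols).min() >= -1e-10)     # True: eigenvalues are non-negative (up to round-off)
```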