Linear regression is usually the first model taught in a machine learning course, thanks to its simplicity and popularity. While the idea seems straightforward and easy to understand, there are many details around linear regression that are often overlooked but are important for ML practitioners to know. This article aims to provide a couple of different perspectives on linear regression, discuss its assumptions and why they are necessary, and go over topics such as multicollinearity and confidence intervals.

The basics

Let’s start with the basics. Suppose we have a set of $n$ data points, each with $p$ features, and we denote it as $X\in R^{n\times p}$. We also have a target vector we would like to predict, denoted as $Y\in R^n$. A concrete example: $Y$ could be the sale prices of $n$ different homes, and $X$ the features collected for each home (sqft, number of bedrooms, etc.).

Assume that the relationship between $Y$ and $X$ is a function
of the form $$Y = f(X) + \epsilon,$$
where $\epsilon\in R^n$ is a random variable with zero mean that captures the random noise in the data. We can approximate $f$ with a linear function of the form
$$Y = \alpha + X\beta + \epsilon,$$
where $\alpha\in R$ is the intercept term and $\beta\in R^{p}$ is the vector of coefficients. What we have here is the linear regression model. We can choose different $\alpha$’s and $\beta$’s to form different linear regression models, and then generate a prediction for a new data point $x = [x_1, \dots, x_p] \in R^p$ by $$\hat{y} = \alpha + x\beta.$$ Obviously, some choices of $\alpha$ and $\beta$ work better than others. Linear regression chooses the “best” set of coefficients such that a loss function is minimized.
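
A minimal sketch (with made-up numbers, assuming the coefficients have somehow already been chosen) of how a prediction is generated:

```python
import numpy as np

# Hypothetical fitted parameters for a 3-feature model (illustrative values only)
alpha = 50.0                        # intercept
beta = np.array([0.2, 15.0, -3.0])  # one coefficient per feature

# A new data point x = [x_1, ..., x_p], e.g. sqft, number of bedrooms, age
x = np.array([1500.0, 3.0, 20.0])

# Prediction: y_hat = alpha + x beta
y_hat = alpha + x @ beta
print(y_hat)  # 50 + 300 + 45 - 60 = 335
```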

The Loss Function

The loss function, or objective function, is a quantitative measure of the model’s performance. Linear regression uses ordinary least squares (OLS), whose loss is the residual sum of squares (RSS), written as
$$\sum\limits_{i=1}^n(Y_i-\hat{Y}_i)^2.$$
The RSS measures how far the predictions deviate from the true target values. Linear regression selects the $\alpha$ and $\beta$ that minimize the RSS.
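
As a tiny sketch of the loss (toy numbers, purely illustrative):

```python
import numpy as np

def rss(y_true, y_pred):
    """Residual sum of squares: sum of squared differences between targets and predictions."""
    residuals = y_true - y_pred
    return np.sum(residuals ** 2)

# Toy targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.3, 6.9])
print(rss(y_true, y_pred))  # 0.04 + 0.09 + 0.01 ≈ 0.14
```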

Determining Coefficients

First let’s consider a simpler case with just one feature. The loss function then can be written as
$$\sum\limits_{i=1}^n(Y_i-\alpha-\beta X_i)^2,$$
which, if we take derivatives with respect to $\alpha$ and $\beta$ and solve for where they evaluate to 0, gives
$$\hat\beta = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}$$
$$\hat\alpha = \bar{y} - \hat\beta\bar{x}.$$
What’s interesting about this case is that $\hat\beta$ can be written as
$$\hat\beta = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)},$$
which reminds us that the coefficient from regressing $Y$ on $X$ is not the same as the coefficient from regressing $X$ on $Y$ (unless the variances of $X$ and $Y$ are equal), contrary to a common misconception.
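
A quick numerical check of this identity on synthetic data (the numbers are made up; `np.cov` and `np.var` are used with matching degrees of freedom):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # roughly linear relationship with slope 2

# Slope and intercept from the closed-form solution for one feature
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()
print(beta_hat, alpha_hat)  # beta_hat should be close to 2, alpha_hat close to 0

# Regressing x on y instead gives Cov(x, y) / Var(y), a different slope
beta_reverse = np.cov(x, y, ddof=1)[0, 1] / np.var(y, ddof=1)
print(beta_reverse)
```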

Now, for the general case with multiple features, things are cleaner in matrix notation, and that’s what we will use.
The loss function is
$$(Y-X\beta)^T(Y-X\beta),$$
here we concatenated $\alpha$ onto $\beta$ (so $\beta = [\alpha \quad\beta_{old}^T]^T$) and prepended a column of ones to $X$ (so $X = [1 \quad X_{old}]$).
Taking the derivative of the loss function with respect to $\beta$, we get the first derivative
$$\frac{\partial RSS}{\partial \beta} = -2X^T(Y - X\beta),$$
and second derivative as
$$\frac{\partial^2 RSS}{\partial\beta\partial\beta^T} = 2X^TX.$$
The second derivative (the Hessian) is positive semi-definite, which shows that the loss function is convex in $\beta$. So if we solve for $\beta$ by setting the first derivative to zero, we obtain the optimal solution that minimizes the loss function:
$$X^T(Y - X\beta) = 0,$$
and $$\hat\beta = (X^TX)^{-1}X^TY.$$
Note that here we assume $X$ has full column rank (the columns of $X$ are linearly independent); otherwise $X^TX$ may not be invertible. We will revisit this point later on. So to get the coefficients we need to compute a matrix inverse, but we could also leverage gradient descent (steepest descent, which uses the negative gradient as the search direction) to bypass that calculation. Both approaches have their own pros and cons, so we won’t get into them here; a small sketch of both is shown below.
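
Here is a minimal sketch of both routes on synthetic data, assuming nothing beyond numpy: solving the normal equations directly, and running plain gradient descent on the scaled loss RSS/n (the data, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # prepend a column of ones for the intercept
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Route 1: solve the normal equations X^T X beta = X^T Y
# (numerically nicer than forming the inverse explicitly)
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Route 2: gradient descent on RSS / n, whose gradient is -(2/n) X^T (Y - X beta)
beta_gd = np.zeros(p + 1)
lr = 0.1
for _ in range(2000):
    grad = -(2.0 / n) * X.T @ (Y - X @ beta_gd)
    beta_gd -= lr * grad

print(beta_closed)  # both should be close to true_beta
print(beta_gd)
```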

Another interpretation

The above derivation is the usual explanation of how the coefficients are calculated: define the loss function, then solve the equation where the first derivative is zero.

There is another perspective on how the coefficients are determined, which has a lot to do with linear algebra. To introduce this viewpoint, let us take a simple case where we can predict the target perfectly. Suppose we want to predict the net profit of a company for each quarter, using revenue and cost as features. Assume net profit = revenue - cost, and that we have two quarters’ worth of historical data arranged in a $2\times 2$ matrix $X$, where the first column is the revenue, the second column is the cost, and the target $Y$ is the net profit. If we build a linear regression model on this data, we need to determine the $\beta$ in
$$Y = X\beta.$$
In this case, since $X$ is square and invertible, we can directly get $$\beta = X^{-1}Y.$$ But why can’t we do this in the general case? Because in general $X$ will usually not even be square, let alone invertible, so we cannot find a $\beta$ such that $X\beta$ equals $Y$ exactly.

However, you can still ask for a $\beta$ that gets you a $\hat{Y}$ as close to $Y$ as possible. In linear algebra, when a linear system of equations has no exact solution, you can still use the “pseudo inverse”, denoted $$ A^+ = (A^TA)^{-1}A^T.$$ If the matrix has full column rank, then the pseudo inverse gives the solution to the system $\hat{Y} = X\beta$ for which $\hat{Y}$ is closest to $Y$ in the least-squares sense.
So back to the linear regression case: we want to find $\beta$ such that $$Y = X\beta,$$ and in general $X$ is not invertible, so we use the pseudo inverse instead and get
$$\hat\beta = (X^TX)^{-1}X^TY,$$
which is the same solution as the previous approach.
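
A small numerical sketch of this viewpoint (the revenue/cost figures below are hypothetical, chosen only so that net profit = revenue - cost holds exactly):

```python
import numpy as np

# Two quarters of (revenue, cost); target is net profit = revenue - cost
X = np.array([[10.0, 6.0],
              [12.0, 7.0]])
Y = X[:, 0] - X[:, 1]
print(np.linalg.inv(X) @ Y)  # X is square and invertible: beta = X^{-1} Y = [1, -1]

# With more quarters than features, X is no longer square, so use the pseudo inverse
X_tall = np.array([[10.0, 6.0],
                   [12.0, 7.0],
                   [ 9.0, 5.0]])
Y_tall = X_tall[:, 0] - X_tall[:, 1]
beta_pinv = np.linalg.pinv(X_tall) @ Y_tall
beta_normal = np.linalg.solve(X_tall.T @ X_tall, X_tall.T @ Y_tall)
print(beta_pinv, beta_normal)  # both recover [1, -1]
```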

What I like about this interpretation is that it helps me memorize the expression for $\hat\beta$: just take the pseudo inverse on both sides, and I don’t need to memorize the form or take derivatives in my head.

Yet Another Interpretation

We have one more interpretation, this time from a statistical standpoint. What linear regression really predicts can be seen as the conditional expectation of $Y$ given the data $X$. In other words, it is $$\mathrm{E}[Y|X] = \alpha + \sum\limits_{i=1}^pX_i\beta_i,$$
where $X_i$ is the $i$-th feature. Each coefficient can then be interpreted as how much $Y$ moves if the corresponding feature increases by one unit, keeping all other features fixed. The key phrase to focus on is “keeping all other features fixed”. This interpretation will be extremely helpful later on in understanding issues such as multicollinearity.
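
A quick sanity check of this interpretation on synthetic data, where the true coefficients are known by construction (all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
Y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)  # true coefficients: 2 and -3

X = np.column_stack([np.ones(n), x1, x2])  # intercept column plus the two features
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
# Roughly [1, 2, -3]: holding x2 fixed, a one-unit increase in x1 moves E[Y|X] by about 2
print(beta_hat)
```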

Assumptions of Linear Regression

Now that we are familiar with different perspectives on the linear regression model, let’s delve into its assumptions. If you search for these assumptions online, you will likely see different versions that are similar to each other: some say there are 4 assumptions, others say 5 or even up to 10. I dug around and realized that the main source of confusion comes from the Gauss-Markov theorem, which states that if certain assumptions are met, OLS has the lowest sampling variance within the class of linear unbiased estimators, i.e., it is the best linear unbiased estimator (BLUE). However, these assumptions are not all required for one to use linear regression and achieve reasonable results. As a matter of fact, the assumptions required largely depend on what you want to use your linear regression for. I will touch on a couple of these assumptions below, go over why they are important, and what happens if they are violated.

  1. The residuals have mean zero: The residuals usually capture the noise or measurement error in $Y$. It is easy to see that if the residuals have a positive/negative mean, there is likely a measurement error that consistently over-estimates/under-estimates $Y$, which leads to a positively/negatively biased estimate of the intercept term ($\alpha$). For linear regression to be an unbiased estimator, this assumption is required.

  2. Homoscedasticity: The residuals need to have the same constant variance. The intuition here is that linear regression treats each data point with equal weight, so if some data points have a larger variance than others, assigning them equal importance may not be ideal. A good example is predicting family income: lower-income families tend to have smaller variance than better-off families, so a regression line fit using only the well-off families will have a wider confidence interval than one fit over the lower-income families. Violating this assumption may cause the standard errors of the coefficient estimates to be biased, leading to issues with hypothesis testing and potentially a failure to reject the null hypothesis. However, not meeting this assumption does not cause the OLS estimates of the coefficients themselves to be biased.

  3. No autocorrelation: This means the residuals should be uncorrelated with each other. A similar argument to the homoscedasticity case applies: OLS assigns equal weight to each data point, but if the residuals are correlated, then correlated observations do not provide much new information and should carry less weight. This usually happens in time series data, where observations in consecutive time periods tend to be correlated, and knowing one observation adds little value once the previous one is known. If autocorrelation exists, it leads to underestimated standard errors of the coefficient estimates, as well as artificially low p-values.

  4. No multicollinearity: Multicollinearity means some features are highly correlated with each other; this can be a pair of features or a combination of several. Why is this a problem? From the linear algebra perspective, $X$ does not have full column rank, so the pseudo inverse will not yield a unique solution to the linear system. From the statistical perspective, it is impossible to “hold all other features fixed”: if the features are correlated, moving one feature means the other features will also move, so the model cannot attribute the contribution of each feature correctly. A common indicator for detecting this problem is the Variance Inflation Factor (VIF); a sketch of computing it by hand follows after this list. Once multicollinearity is detected, removing the correlated features will usually make the problem go away.
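
As a rough sketch of how VIF can be computed by hand (the VIF of feature $j$ is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features); this is a plain numpy illustration rather than a library call, and the data below are synthetic:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X: 1 / (1 - R^2_j), where R^2_j
    is the R-squared from regressing column j on all other columns (plus an intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: x3 is almost a copy of x1, so both should show a large VIF
rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.05, size=500)
print(vif(np.column_stack([x1, x2, x3])))  # x1 and x3 large, x2 near 1
```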

So far I have covered some of the major assumptions, with intuition for why they are needed and what happens if they are not satisfied. For a more in-depth view of how these assumptions are used to prove that OLS is BLUE, I strongly recommend this great article, which goes into much more detail. The econometrics course by Professor Mark is also a great one to watch.

Hopefully this article gives you a better understanding of linear regression and its assumptions. For further reading, I recommend the following: