Introduction to Linear Regression
This section provides a high-level, intuitive overview of linear regression and the problem it solves, serving as brief context for the gradient calculation sections. Some familiarity with linear regression is therefore helpful. Advanced readers can skip this section and part of the optimization section.
Problem statement
To set the stage, we examine the common example of predicting house prices. It illustrates the first requirement for using linear regression - a dataset. An example dataset is shown below:
| Size (sqft) | Bedroom count | Price (USD) |
| --- | --- | --- |
| 1000 | 2 | 200000 |
| 1500 | 3 | 300000 |
| 1200 | 2 | 250000 |
| 2000 | 4 | 400000 |
| … | … | … |
The dataset holds the inputs (sqft, bedroom count) and the accompanying outputs (price) with respect to the price prediction problem. Let us consider all the input values of an example inside the dataset as a vector $x = (x_1, x_2, \dots, x_n)$ and the corresponding output as a scalar $y$.
Every method of solving a problem carries some assumptions with its use. Linear regression is a method that assumes that the mapping from inputs to outputs is a linear mapping.
Let $f$ be a linear mapping. Then for all vectors $u, v$ and any scalar $c$:

$$f(u + v) = f(u) + f(v), \qquad f(cu) = c f(u)$$

equivalently:

$$f(cu + v) = c f(u) + f(v)$$
With this in mind, we want to define a rough form of functional dependence between inputs and outputs. As seen, we are allowed to multiply and add inputs inside the function. Multiplying inputs with a scalar is also allowed, and we will denote these scalars as parameters $\theta_j$. For the two inputs of our housing example, the function takes the form:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
This function is the fundamental view of the problem that we operate on under the linear regression lens. Next, the goal of linear regression is to estimate the unknown parameters $\theta_0, \theta_1, \theta_2$ using the dataset.
Why is $\theta_0$ allowed? Strictly speaking, adding a constant makes the mapping affine rather than linear. The usual trick is to introduce a fixed input $x_0 = 1$ so that the constant term $\theta_0 x_0$ fits the linear form.
More generally, if we had $n$ input features, the function would take the form:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
In a summation form, while respecting the convention $x_0 = 1$:

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j$$
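The summation form maps directly to a dot product in code. A minimal sketch - the parameter values below are made up for illustration, not learned from data:

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2]; values chosen
# only to illustrate the formula.
theta = np.array([10000.0, 150.0, 20000.0])

# One example from the table, with the fixed input x_0 = 1 prepended
# so the intercept theta_0 fits the summation form.
x = np.array([1.0, 1000.0, 2.0])  # [x_0, size in sqft, bedroom count]

# h_theta(x) = sum_j theta_j * x_j, i.e. a dot product.
prediction = theta @ x
print(prediction)
```

For this particular example the sum happens to be $10000 + 150 \cdot 1000 + 20000 \cdot 2 = 200000$.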
Cost
How wrong are we?
Assuming we don’t know how to go about estimating the parameters in our function, we take a wild guess by choosing a vector of parameters, e.g. $\theta = (0, 100, 0)$.
If the dataset contains an entry for a house with $x_1 = 1000$ sqft and $x_2 = 2$ bedrooms sold at $y = 200000$ USD, our guess predicts $h_\theta(x) = 0 + 100 \cdot 1000 + 0 \cdot 2 = 100000$, so the model is off by $100000 - 200000 = -100000$ USD.
The error was computed as the difference between the prediction $h_\theta(x)$ and the actual output $y$:

$$e = h_\theta(x) - y$$
What if the guess was $\theta = (0, 300, 0)$? The prediction becomes $300000$ and the error $+100000$ - the same magnitude, but the opposite sign.
Successful measurement of the model’s error per example is now possible. Our interest is not in the error the model makes for a single example, because we desire a broader generalization ability. To express the error across the entire dataset, an average of the errors over all examples is computed in the form of a cost function:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} e^{(i)}$$

where $m$ is the number of examples and $e^{(i)}$ is the error on the $i$-th example.
Should we treat each error the same?
What we have derived is close to the actual cost function used in linear regression. The linear regression cost function additionally squares the residuals - so errors of opposite signs cannot cancel out and larger errors are penalized more - and divides by $2$, which simplifies the derivative later:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Expressing the cost function w.r.t. inputs, outputs, and parameters we obtain:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right)^2$$
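As a sanity check, the cost function can be evaluated on the example dataset. A sketch assuming the four table rows above and a hypothetical guess for $\theta$:

```python
import numpy as np

# Dataset rows from the table, with x_0 = 1 prepended to each example.
X = np.array([
    [1.0, 1000.0, 2.0],
    [1.0, 1500.0, 3.0],
    [1.0, 1200.0, 2.0],
    [1.0, 2000.0, 4.0],
])
y = np.array([200000.0, 300000.0, 250000.0, 400000.0])
m = len(y)

def cost(theta):
    """J(theta) = 1/(2m) * sum_i (theta^T x_i - y_i)^2."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

# A hypothetical guess: price = 200 * size, ignoring bedrooms.
guess = np.array([0.0, 200.0, 0.0])
print(cost(guess))
```

Only the third example is mispredicted by this guess (by 10000 USD), so the cost is $10000^2 / (2 \cdot 4) = 12500000$.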
Origin of the cost function
Even though we arrived at the cost function intuitively, the choice is not arbitrary: it can be rigorously derived from statistical principles using maximum likelihood estimation. Maximum likelihood estimation is outside the scope of this blog post; for more information see page 263 in [1].
Optimization
The best parameters $\theta^*$ are those that minimize the cost:

$$\theta^* = \arg\min_{\theta} J(\theta)$$

Both of the approaches below find them with the help of the derivative of the cost function.
Knowing the slope at each point with the help of the derivative means knowing in which direction to move should we want to go downhill (minimize the cost). The slope also reveals when a local minimum has been reached, as it will equal zero. The iterative approach uses this to make small steps in the downhill direction, while the analytical approach directly solves for the derivative being zero.
Iterative solution
As briefly mentioned above, the iterative approach tries to find the minimum by moving downhill in this landscape. When we calculate the derivative for our choice of $\theta$: if the slope is negative we should continue in that direction, and if the slope is positive we should move in the opposite direction. We repeat this process until we reach a minimum.
This iterative type of solution is called gradient descent. To perform gradient descent we can start with an arbitrary choice of parameters $\theta$ and repeatedly update every parameter by stepping against the slope:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

where $\alpha$ is the learning rate that controls the step size.
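A minimal sketch of the update loop on a tiny made-up dataset (the data, learning rate, and iteration count are assumptions for illustration); it uses the gradient expression that the rest of this post derives:

```python
import numpy as np

# Toy dataset generated from y = 2 + 3*x (assumed for illustration),
# with x_0 = 1 prepended to each example.
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1.0, 2.0, 5.0, 8.0])
m = len(y)

theta = np.zeros(2)  # arbitrary starting parameters
alpha = 0.1          # learning rate: how large each downhill step is

for _ in range(2000):
    gradient = X.T @ (X @ theta - y) / m  # slope of the cost at theta
    theta -= alpha * gradient             # small step in the downhill direction

print(theta)  # approaches [2, 3]
```

Because the toy data is exactly linear, the loop converges to the generating parameters $(2, 3)$.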
There are considerations that were left out of this section, specifically the concept of global/local minima, convergence, overfitting, etc. As the goal of this post is to showcase how the gradient is derived, we will leave these considerations aside.
Calculating the derivatives/gradient
Having provided enough context on linear regression we proceed with one of the main goals of this post - the linear regression gradient calculation. A step-by-step approach to gradient calculation follows, along with every important calculus or linear algebra rule that we may require.
Non-vectorized derivative calculation
A non-vectorized calculation of derivatives is performed in this section. First, the derivation rules we may need:

- Sum rule: $(f + g)' = f' + g'$
- Constant factor rule: $(c f)' = c f'$
- Power rule: $(x^k)' = k x^{k-1}$
- Chain rule: $(f(g(x)))' = f'(g(x)) \, g'(x)$

We proceed with the step-by-step derivation of the cost function. Since the cost function is multivariable, our derivatives turn into partial derivatives that need to be calculated for each parameter $\theta_j$. Applying the chain rule to the squared term, the inner derivative of $\sum_{k} \theta_k x_k^{(i)} - y^{(i)}$ w.r.t. $\theta_j$ equals $x_j^{(i)}$ (all other terms are constants w.r.t. $\theta_j$), and the factor $2$ from the power rule cancels against the $\frac{1}{2m}$ normalization:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{k=0}^{n} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$$
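The derived partial derivatives can be checked numerically against finite differences of the cost. A sketch with made-up data:

```python
import numpy as np

# Small made-up dataset: rows are [x_0 = 1, x_1].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])
m, n = X.shape
theta = np.array([0.5, -1.0])  # arbitrary point to evaluate the gradient at

# Non-vectorized gradient: one partial derivative per parameter theta_j.
grad = np.zeros(n)
for j in range(n):
    for i in range(m):
        residual = X[i] @ theta - y[i]     # h_theta(x^(i)) - y^(i)
        grad[j] += residual * X[i, j] / m  # contribution to dJ/dtheta_j

def cost(t):
    r = X @ t - y
    return (r @ r) / (2 * m)

# Central finite differences approximate the same partial derivatives.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * np.eye(n)[j]) - cost(theta - eps * np.eye(n)[j])) / (2 * eps)
    for j in range(n)
])
print(grad, numeric)
```

The two vectors agree up to the finite-difference approximation error.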
Vectorizing the derivation result
Vectorization of our derivation result is performed here. The inner sum can be rewritten as a dot product (a trivial matrix multiplication):

$$\sum_{k=0}^{n} \theta_k x_k^{(i)} = \theta^T x^{(i)}$$

giving:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}$$
Calculating all the partial derivatives at once can be expressed as calculating the vector of partial derivatives, i.e. the gradient:

$$\nabla_\theta J(\theta) = \left( \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \dots, \frac{\partial J}{\partial \theta_n} \right)$$
The last sum to vectorize is the summation over the dataset. To vectorize it, we first introduce a matrix $X$ whose $i$-th row is the (transposed) input vector $x^{(i)}$ of the $i$-th example, and a vector $y$ holding all the outputs. With them, the residuals of all examples can be written as a single vector:

$$X\theta - y$$
Let us first understand what is done in the previous equation with the product $X\theta$: multiplying the dataset matrix by the parameter vector computes the model's prediction for every example at once.
This way, for example, the first row of $X\theta$ equals $\theta^T x^{(1)}$ - the prediction for the first example.
Summing $\left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}$ over the dataset for every $j$ simultaneously is exactly the multiplication of $X^T$ with the residual vector. Our final equation takes the form:

$$\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y)$$
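The vectorized expression can be checked against the explicit per-example summation. A sketch with made-up values:

```python
import numpy as np

# Made-up data: rows of X are examples with x_0 = 1 prepended.
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]])
y = np.array([3.0, 7.0, 11.0])
m = len(y)
theta = np.array([1.0, 1.0])  # arbitrary parameter vector

# Vectorized gradient: (1/m) * X^T (X theta - y).
grad_vec = X.T @ (X @ theta - y) / m

# The same quantity as an explicit sum over the dataset.
grad_sum = np.zeros_like(theta)
for i in range(m):
    grad_sum += (X[i] @ theta - y[i]) * X[i] / m

print(grad_vec)
```

Both computations produce the same vector; the matrix form simply delegates the loop to optimized linear algebra routines.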
Vectorized derivative calculation
Vectorization is usually done before any derivation rules are applied. In this section we vectorize immediately and apply new derivation rules for the vectorized form:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$

We will focus on the term without the normalization factor and account for it at the end. The square operation will be split as follows:

$$(X\theta - y)^T (X\theta - y) = \theta^T X^T X \theta - 2 y^T X \theta + y^T y$$
Before calculating the gradient, we list the calculus / linear algebra rules that will be useful:

- $\nabla_\theta \left( \theta^T A \theta \right) = (A + A^T) \theta$
- $\nabla_\theta \left( b^T \theta \right) = b$
- $\nabla_\theta c = 0$ for any $c$ that does not depend on $\theta$
Gradient calculation follows:

$$\nabla_\theta \left[ \theta^T X^T X \theta - 2 y^T X \theta + y^T y \right] = \nabla_\theta \left( \theta^T X^T X \theta \right) - \nabla_\theta \left( 2 y^T X \theta \right) + \nabla_\theta \left( y^T y \right)$$
The term $y^T y$ does not depend on $\theta$, so its gradient is zero.
We take a look at each remaining gradient individually. Since $X^T X$ is symmetric, the first rule gives:

$$\nabla_\theta \left( \theta^T X^T X \theta \right) = \left( X^T X + (X^T X)^T \right) \theta = 2 X^T X \theta$$

and, using the second rule with $b = 2 X^T y$:

$$\nabla_\theta \left( 2 y^T X \theta \right) = 2 X^T y$$

Combining the two and restoring the $\frac{1}{2m}$ normalization:

$$\nabla_\theta J(\theta) = \frac{1}{2m} \left( 2 X^T X \theta - 2 X^T y \right) = \frac{1}{m} X^T (X\theta - y)$$

which matches the non-vectorized result.
Analytical Solution
The analytical solution to linear regression does not perform gradient descent. Instead, it searches for a solution to the normal equation, i.e. it sets the cost function gradient to zero and solves for $\theta$:

$$X^T (X\theta - y) = 0 \iff X^T X \theta = X^T y$$
How can we be sure that this exact solution exists? This refers to the concept of local and global minima. Some functions have multiple minima, and finding the global one is hard. Iterative approaches shine in this case (under considerable adjustments) and are the backbone of every major neural network model's training. What can be said about our function? It would be ideal if our function had a single global minimum. A function of that type is called a convex function.
A function $f$ is convex if for all $x_1, x_2$ in its domain and all $\lambda \in [0, 1]$:

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$$
If $f$ is a twice differentiable function of a single variable, it is convex if and only if $f''(x) \ge 0$ for all $x$ in its domain.
For vector inputs, if the function is twice differentiable - meaning the Hessian exists for all values in the domain of $x$ - then the function $f(x)$ is convex if and only if the Hessian is positive semidefinite everywhere:

$$\nabla^2 f(x) \succeq 0$$
A symmetric matrix $A$ is positive definite if for all non-zero vectors $z$: $z^T A z > 0$.
A symmetric matrix $A$ is positive semidefinite if for all vectors $z$: $z^T A z \ge 0$.
It turns out our function is indeed convex - let us prove that. Differentiating the gradient once more gives the Hessian of our cost:

$$\nabla^2_\theta J(\theta) = \frac{1}{m} X^T X$$
If we plug $A = X^T X$ into the positive semidefinite definition:

$$z^T X^T X z = (Xz)^T (Xz) = \lVert Xz \rVert_2^2$$
The term we obtain is a squared Euclidean ($L_2$) norm, which is non-negative for any $z$. The Hessian is therefore positive semidefinite, our cost function is convex, and a solution of the normal equation is a global minimum.
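The argument can also be checked numerically: for an arbitrary matrix $X$ and vector $z$, the quadratic form $z^T X^T X z$ equals $\lVert Xz \rVert_2^2$ and is never negative. A small sketch with random values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # an arbitrary dataset matrix
A = X.T @ X                   # the matrix appearing in the Hessian of J

# For any vector z, the quadratic form collapses to a squared L2 norm.
z = rng.normal(size=3)
quad = z @ A @ z
print(quad, np.linalg.norm(X @ z) ** 2)
```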
We have arrived at the closed-form equation that gives us the solution:

$$\theta = (X^T X)^{-1} X^T y$$
This solution is showcased in most textbooks and online courses but is rarely used in practice. The reason is that the inverse of $X^T X$ does not always exist - for example when features are linearly dependent - and computing it is numerically unstable and expensive for large matrices.
Libraries like NumPy, inside their least-squares implementations (e.g. np.linalg.lstsq), use a different kind of inverse to handle these edge cases - the pseudoinverse. The pseudoinverse is always defined and is obtained via a standard matrix factorization technique called singular value decomposition (SVD). The pseudoinverse is outside the scope of this post.
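A sketch comparing the closed-form solution with NumPy's SVD-based least-squares solver on the example dataset (where $X^T X$ happens to be invertible, so the two agree):

```python
import numpy as np

# The example dataset with x_0 = 1 prepended to each row.
X = np.array([
    [1.0, 1000.0, 2.0],
    [1.0, 1500.0, 3.0],
    [1.0, 1200.0, 2.0],
    [1.0, 2000.0, 4.0],
])
y = np.array([200000.0, 300000.0, 250000.0, 400000.0])

# Closed-form normal-equation solution: theta = (X^T X)^{-1} X^T y.
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# SVD-based least-squares solution; robust even when X^T X is singular.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_normal, theta_lstsq)
```

On ill-conditioned or rank-deficient data the normal-equation path breaks down, while the SVD-based solver still returns a sensible answer.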
References
Useful resources for exploring the topic of linear regression in more detail are listed below.
- [1] Book, Mathematics for Machine Learning by Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong
- [2] Notes, CS229 lecture notes by Stanford
- [3] Course, Machine Learning course by Stanford on Coursera
- [4] Course, Mathematics for Machine Learning by Imperial College London on Coursera
- [5] Notes, Computing Neural Network Gradients by Kevin Clark