Gradient Descent vs Ordinary Least Squares in Linear Regression

Math&Stats

Introduction

Gradient descent is a popular method applied almost everywhere in machine learning. But back when traditional mathematics ruled the world, ordinary least squares was the fundamental way of solving linear problems. My motivation for writing this blog is therefore to figure out the similarities and differences between these two methods.

Hopefully after this, we will have a better understanding of both the theory and the experiments.

Theory

While I was reviewing Machine Learning by Andrew Ng, the course skipped the proof of OLS, assuming by default that we have sufficient matrix knowledge (do we? lol). So, I’d like to include the proof here.

First, let’s introduce some terminology and notation. Consider a multiple linear regression:

$$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_n x_n^{(i)} = \theta^T x^{(i)} = x^{(i)T}\theta$$

where
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}, \quad
x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$

$y$ is the target, $h_\theta(x^{(i)})$ is the hypothesis function, $m$ is the number of training samples / observations, and $n$ is the number of features; $x^{(i)}$ is the $i$-th training sample, and $x_j^{(i)}$ is the $j$-th feature value of the $i$-th training sample.

Given the cost function:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right)^2$$

The objective is to estimate the parameters $\theta$ so that the hypothesis function achieves the minimum cost.

Gradient Descent

Since the cost function is a convex, bowl-shaped surface, let’s use a 2-dimensional projection of $J$ against $\theta_i$ to explain how the algorithm works.

The 2-dimensional projection looks like this: [figure: $J$ plotted against $\theta_i$]

In order to reach the minimum point, the algorithm starts from a higher point $\theta_i$ and uses the learning rate and the derivative / slope over many iterations until it gets there. Both the derivative $\frac{\partial J}{\partial \theta}$ and the learning rate $\alpha$ affect how fast the algorithm reaches the bottom. If $\alpha$ is too large, it can also cause divergence. Because of this, feature normalization and choosing $\alpha$ wisely become important.

Now the gradient descent update becomes:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

Furthermore we have:

$$\frac{\partial J(\theta)}{\partial \theta_j}
= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right)\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right)\frac{\partial\, x^{(i)T}\theta}{\partial \theta_j}
= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right) x_j^{(i)}$$

This holds for all $\theta_j$ except $\theta_0$, whose derivative part equals $1$, since $\theta_0$ multiplies the constant feature $x_0^{(i)} = 1$.

Finally, the algorithm becomes (repeat until convergence):

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right)$$

$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right) x_j^{(i)}, \quad j = 1, \dots, n$$

Note that the parameters $\theta_0, \theta_1, \dots, \theta_n$ must be updated simultaneously in gradient descent.
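
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression; the function name, default learning rate, and iteration count are illustrative choices of mine rather than anything prescribed by the course:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=10_000):
    """Batch gradient descent for linear regression.

    X : (m, n+1) design matrix whose first column is all ones
    y : (m,)     target vector
    Returns the estimated parameter vector theta, shape (n+1,).
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = X @ theta - y              # h_theta(x^(i)) - y_i for all i
        gradient = (X.T @ error) / m       # (1/m) * sum_i error_i * x_j^(i)
        theta = theta - alpha * gradient   # simultaneous update of all theta_j
    return theta
```

Updating `theta` with a single vectorized expression is exactly the simultaneous update mentioned above; if `alpha` is too large, the `error` term grows instead of shrinking and the iteration diverges.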

Ordinary Least Squares

We have

$$\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right)^2
= \begin{bmatrix} h_\theta(x^{(1)}) - y_1 & \cdots & h_\theta(x^{(m)}) - y_m \end{bmatrix}
\begin{bmatrix} h_\theta(x^{(1)}) - y_1 \\ \vdots \\ h_\theta(x^{(m)}) - y_m \end{bmatrix}
= \begin{bmatrix} \theta^T x^{(1)} - y_1 & \cdots & \theta^T x^{(m)} - y_m \end{bmatrix}
\begin{bmatrix} \theta^T x^{(1)} - y_1 \\ \vdots \\ \theta^T x^{(m)} - y_m \end{bmatrix}$$

We also have

$$\begin{bmatrix} \theta^T x^{(1)} - y_1 \\ \vdots \\ \theta^T x^{(m)} - y_m \end{bmatrix}
= \begin{bmatrix} \theta^T x^{(1)} \\ \vdots \\ \theta^T x^{(m)} \end{bmatrix} - y
= \begin{bmatrix} x^{(1)T}\theta \\ \vdots \\ x^{(m)T}\theta \end{bmatrix} - y
= X\theta - y,$$

where

$$X = \begin{bmatrix} x^{(1)T} \\ \vdots \\ x^{(m)T} \end{bmatrix}
= \begin{bmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)}
\end{bmatrix}_{m \times (n+1)}$$

Therefore we have:

$$\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y_i\right)^2 = (X\theta - y)^T(X\theta - y)$$

Ignoring the constant factor $\frac{1}{2m}$, which does not affect the minimizer, the cost function can be simplified as:

$$J(\theta) = (X\theta - y)^T(X\theta - y)$$

$$J(\theta) = \left((X\theta)^T - y^T\right)(X\theta - y) = \theta^T X^T X\theta - (X\theta)^T y - y^T X\theta + y^T y$$

Note that $(X\theta)^T y = y^T X\theta$: both are scalars (the inner product of two vectors of the same dimension), and a scalar equals its own transpose.

Then:

$$J(\theta) = \theta^T X^T X\theta - 2(X\theta)^T y + y^T y$$

Therefore, to get the minimum cost, take the derivative with respect to $\theta$ and set it to zero, treating $X$ and $y$ as constants:

$$\frac{\partial J}{\partial \theta}
= \frac{\partial}{\partial \theta}\left(\theta^T X^T X\theta - 2(X\theta)^T y + y^T y\right)
= \frac{\partial}{\partial \theta}\left(\theta^T X^T X\theta\right) - \frac{\partial}{\partial \theta}\left(2\theta^T X^T y\right)
= 0,$$

where $(X\theta)^T y = \theta^T X^T y$ is used to rewrite the middle term.

Furthermore, recall from matrix calculus that $\frac{\partial}{\partial \theta}\left(\theta^T A\theta\right) = (A + A^T)\theta = 2A\theta$ for a symmetric matrix $A$, and $\frac{\partial}{\partial \theta}\left(b^T\theta\right) = b$. Here $A = X^T X$ is symmetric and $b = 2X^T y$.

Therefore we have:

$$\frac{\partial J}{\partial \theta} = 2X^T X\theta - 2X^T y = 0$$

Finally, we obtain the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$
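
As a minimal sketch (assuming the same design-matrix convention as above, with a leading column of ones), the normal equation can be evaluated directly with NumPy; solving the linear system instead of forming the explicit inverse is a standard numerical choice:

```python
import numpy as np

def normal_equation(X, y):
    """OLS estimate: solve (X^T X) theta = X^T y.

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,)     target vector
    """
    # Equivalent to (X^T X)^{-1} X^T y, but numerically more stable
    return np.linalg.solve(X.T @ X, X.T @ y)
```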

Experiments

Since these two methods solve the same problem from different perspectives, the first question is: will they end up with the same estimation result?

Short answer – yes, and it should not be a surprise…

Here’s an article that ran this experiment with a linear regression fitted on a small set of random points: OLS vs Gradient Descent Experiment.

According to the article, the OLS estimate of the intercept $\theta_0$ is 6.089, and the estimate of the slope $\theta_1$ is 0.767. [Figure from the article: overview of the GD fit]

In addition, the article compared the simulation results of mini-batch GD vs OLS, and concluded that mini-batch GD did not converge there, due to non-shuffled data during training and an inappropriate learning rate.
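
As a quick, self-contained sanity check of the “same result” claim, here is a sketch on a hypothetical synthetic dataset of my own (not the referenced article’s data):

```python
import numpy as np

# Hypothetical synthetic data: y = 6 + 0.8*x + noise (not the article's data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 6.0 + 0.8 * x + rng.normal(0.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept column

# OLS via the normal equation
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the same data
theta_gd = np.zeros(2)
alpha, m = 0.01, len(y)
for _ in range(50_000):
    theta_gd = theta_gd - alpha * (X.T @ (X @ theta_gd - y)) / m

print(theta_ols)   # intercept and slope from the normal equation
print(theta_gd)    # should match the OLS estimate to several decimals
```

With a sufficiently small learning rate and enough iterations, the GD estimate approaches the closed-form OLS solution, consistent with the conclusion above.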

Similarities and differences

In most situations, you can choose either ordinary least squares or gradient descent. With a small dataset or a relatively small number of features (roughly $< 10^4$), OLS is preferred. However, if the number of features is very large, consider using gradient descent, since it computes faster.

Here are the main differences between OLS and GD:

  • GD requires choosing $\alpha$ and iterating, but OLS doesn’t
  • OLS requires computing $(X^T X)^{-1}$; in rare cases that matrix is non-invertible, so you may have to reduce dimensions or remove multicollinearity (see the sketch after this list)
  • OLS is slower than GD when the number of features is large, since inverting $X^T X$ costs roughly $O(n^3)$
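
For the non-invertible case mentioned above, one common workaround (my own illustration, not part of the original comparison) is the Moore–Penrose pseudoinverse, which still returns a least-squares solution when $X^T X$ is singular, e.g. under perfect multicollinearity:

```python
import numpy as np

def ols_pinv(X, y):
    """Minimum-norm least-squares estimate via the pseudoinverse.

    Works even when X^T X is singular (e.g. duplicated or
    perfectly collinear feature columns).
    """
    return np.linalg.pinv(X) @ y
```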

Finally, normal equations also exist for nonlinear least squares, which I suppose could be another discussion to compare with gradient descent later.

Written on June 5, 2020