Before we dive into the technical aspects of Ridge and Lasso, let’s try to understand what regularization actually is. Let’s take a toy example: imagine a kid preparing for his exam. You give him all the time in the world and say, “whenever you think you are ready, come and appear for the exam”. We have given him an ample amount of time. Unlimited time lets him memorize every concept in his book, so he can answer anything you ask him straight from the book. But how do you think he will perform when asked something different, or when we twist the question? He won’t be able to answer, right?
In the example above the student is said to have overfitted himself. Now, what would happen if you applied a lot of constraints to his studies, say he is allowed only 2 hours to prepare?
Now he won’t be able to answer questions correctly even when they are asked directly from the book; this is a typical case of under-fitting. The case with machines is similar. For machines we use the terms bias and variance to understand overfitting and underfitting. Let’s discuss them in detail.
1. Bias and Variance
1.1 Theoretical definition
Bias: This is the error calculated as the difference between the average prediction of our model and the correct value. With a single model there is obviously no concept of an average prediction, but let’s imagine we train multiple models by repeating the model-building process on new data each time (as in K-Fold). Due to the randomness in the data we will get a range of predictions. Bias measures how far off, on average, these predictions are from the correct value.
Variance: As the name suggests, this is the variation of the model’s predictions across different realizations of the model. Again, we can imagine repeating the model-building process multiple times and looking at how much the predictions for a fixed point spread out.
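To make these two definitions concrete, here is a minimal Python sketch (the sine toy function, sample sizes, noise level and seed are illustrative choices of mine, not part of the original discussion) that repeats the model-building process on freshly drawn data and estimates the bias and variance of the predictions at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

# Repeat the model-building process on fresh data each time and
# collect the predictions at a single test point x0.
x0, n_repeats, n_samples = 0.3, 200, 30
preds = []
for _ in range(n_repeats):
    x = rng.uniform(0, 1, n_samples)
    y = true_f(x) + rng.normal(0, 0.3, n_samples)
    coeffs = np.polyfit(x, y, deg=1)      # a deliberately simple (high-bias) straight-line model
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias = preds.mean() - true_f(x0)          # how far the average prediction is from the truth
variance = preds.var()                    # how much the predictions vary across realizations
print(f"bias = {bias:.3f}, variance = {variance:.3f}")
```

Swapping the straight line for a high-degree polynomial would flip the picture: the bias shrinks while the variance grows, which is exactly the trade-off discussed below.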
1.2 Graphical definition
We will use a bulls-eye diagram to describe the bias-variance concept. Imagine the centre of the target is a model that predicts the value accurately, and each hit represents a prediction from a model trained on a different realization of the data.
1.3 Mathematical definition
Let’s imagine Y is the actual value we are trying to predict and X is a matrix of variables on which Y depends. We may assume some relationship such as
Y = f(X) + e,
where the error term e is normally distributed with mean zero.
Let’s estimate a model f̂ (X) of f(X) using linear regression or any other modelling technique.
The error of our estimate at a point x is then
Err(x) = E[(Y − f̂(x))²].
This can be further decomposed into
Err(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ²ₑ = Bias² + Variance + Irreducible error.
So we really need to hit the sweet spot where the bias error is low enough that the model is not under-fitting, and the variance is also under control so that overfitting is kept in check.
Overfitting mostly happens for two reasons (a small sketch illustrating both follows this list):
- High dimensionality: as the number of dimensions increases the data becomes sparse, hence it can easily be overfitted.
- Too little data.
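Here is a quick illustration of both effects, on synthetic data with arbitrary sizes of my own choosing: with only 25 observations and 20 mostly irrelevant features, plain OLS fits the training data almost perfectly yet generalises poorly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Few observations, many (mostly irrelevant) features -> easy to overfit.
n_train, n_test, n_features = 25, 500, 20
true_beta = np.zeros(n_features)
true_beta[0] = 2.0                                   # only one feature actually matters

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ true_beta + rng.normal(0, 1, n_train)
y_test = X_test @ true_beta + rng.normal(0, 1, n_test)

ols = LinearRegression().fit(X_train, y_train)
print("train R^2:", round(ols.score(X_train, y_train), 3))  # near 1: the noise has been memorised
print("test  R^2:", round(ols.score(X_test, y_test), 3))    # much lower: poor generalisation
```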
2. Why add a penalty to the coefficients?
According to the Gauss-Markov theorem, the beta coefficients obtained through OLS (Ordinary Least Squares) are BLUE (Best Linear Unbiased Estimator), provided the usual OLS assumptions hold.
Let’s focus on the words ‘Best’ and ‘Unbiased’. The former means that, among all linear unbiased estimators, the OLS coefficients have the smallest variance; the latter means that if we ran OLS many times on different random samples, the expected value of the coefficients would equal the true population parameters.
That means that if we are willing to accept a biased estimator, we can potentially get an even smaller variance, and that is exactly what we want in order to avoid overfitting. If there are many highly correlated explanatory variables, the coefficients end up with abnormally high variance when we run the model on different data. So how are we going to control this variance in the coefficients? The real issue is that when explanatory variables are correlated, the coefficients tend to get abnormally big, and that is what we want to control. We don’t want the coefficients to be unbounded.
Therefore we add a constraint on the coefficients, in the form of Ridge and Lasso (based primarily on the L2 and L1 norms of the coefficient vector). Let’s discuss them in detail.
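Before doing so, here is a minimal sketch of the instability described above (the data is synthetic, and the correlation level, sample size and ridge penalty are arbitrary choices of mine): two almost identical predictors make the OLS coefficients swing wildly from sample to sample, while a small L2 penalty keeps them bounded.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)

def fit_on_fresh_sample(model, n=50):
    """Draw a new sample with two highly correlated predictors and return the fitted coefficients."""
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(0, 0.01, n)       # x2 is almost an exact copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + rng.normal(0, 1, n)           # the true relationship uses only x1
    return model.fit(X, y).coef_

ols_coefs = np.array([fit_on_fresh_sample(LinearRegression()) for _ in range(200)])
ridge_coefs = np.array([fit_on_fresh_sample(Ridge(alpha=1.0)) for _ in range(200)])

print("OLS   coefficient std dev:", ols_coefs.std(axis=0))    # huge: the coefficients are nearly unbounded
print("Ridge coefficient std dev:", ridge_coefs.std(axis=0))  # small: the penalty keeps them in check
```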
3. Ridge Regression
We add a constraint on the L2 norm of the coefficients and then solve for the loss function as shown in the proof below. We will be using the concept of Lagrange multipliers. Let’s first analyse it using contour plots.
In the figure above we are adding a constraint on the coefficients so that they can only take values that lie inside a sphere (a circle, in two dimensions) of radius c. If there were no constraint, the minimum of the OLS loss would be at point ‘A’, but if we follow the constraint then the new minimum is found at θ.
Now let’s solve the problem using a Lagrange multiplier. Minimising ‖y − Xβ‖² subject to ‖β‖² ≤ c is equivalent to minimising ‖y − Xβ‖² + λ‖β‖², which gives the closed-form solution β̂ = (XᵀX + λI)⁻¹Xᵀy.
We can clearly see that the coefficients shrink as λ grows: as λ becomes very large, the coefficients tend towards (but never exactly reach) zero.
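As a sanity check, here is a small NumPy sketch of that closed-form solution (synthetic data and arbitrary λ values of my own choosing); the printed coefficients shrink towards zero as λ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1, 100)

def ridge_closed_form(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 10.0, 1000.0]:
    # The printed coefficients move steadily towards zero as lam grows.
    print(f"lambda = {lam:>6}:", np.round(ridge_closed_form(X, y, lam), 3))
```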
4. Lasso Regression
We add a constraint on the L1 norm of the coefficients and then solve for the loss function as shown in the proof below. We will again be using the concept of Lagrange multipliers. Let’s first analyse it using contour plots.
From the image above we can clearly see one major difference between Ridge and Lasso. We have the unconstrained minimum of the loss function at ‘A’ and the constrained minimum at ‘c’. Because the Ridge constraint region is spherical, it is very unlikely that the loss contours first touch it exactly on an axis; but since the Lasso constraint region is diamond-shaped, with corners sitting on the axes, it is fairly easy for the loss contours to touch it at an axis. This phenomenon leads to feature elimination, where some of the coefficients become exactly zero.
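Here is a minimal sketch of this difference using scikit-learn (synthetic data; the alpha values are arbitrary choices of mine): Ridge merely shrinks the coefficients of the irrelevant features, while Lasso sets them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[:3] = [3.0, -2.0, 1.5]                 # only the first three features matter
y = X @ beta + rng.normal(0, 1, 200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge:", np.round(ridge.coef_, 3))   # small but non-zero weights on the irrelevant features
print("Lasso:", np.round(lasso.coef_, 3))   # exact zeros: the irrelevant features are eliminated
```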
This brings us to the end of a fairly exhaustive and detailed blog post on regularization. I have tried to be verbose here so that we get a complete picture of the maths behind regularization and of why we apply constraints on the coefficients.
Please feel free to add your valuable suggestions to it. 😊