
Overfitting is not an algorithm in Machine Learning. It is a common, undesirable phenomenon, and anyone building Machine Learning models needs to know the techniques to avoid it.


## 1. Introduction

This is a story of myself when I first learned about Machine Learning.

In my third year of college, a teacher introduced my class to Neural Networks. When we first heard this concept, we asked him what its purpose was. He said, basically, from the given data, we need to find a function to turn the input points into corresponding output points, no need to be exact, just an approximation.

At that time, being a math student who had worked a lot with polynomials in high school, I replied confidently right away that the Lagrange Interpolation Polynomial could do it, as long as the input points were pairwise distinct! He said that “what we know is only a tiny fraction of what we do not know”. And that is what I want to start this post with.

A quick reminder about Lagrange Interpolation Polynomials: given \(N\) pairs of data points \((x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\) with pairwise distinct \(x_i\), one can always find a polynomial \(P(.)\) of degree at most \(N-1\) such that \(P(x_i) = y_i, ~\forall i = 1, 2, \dots, N\). Isn’t this similar to finding a model that fits the data in a Supervised Learning problem? It even seems better, because in Supervised Learning we only need an approximation.
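As a quick illustration, here is a minimal sketch of Lagrange interpolation in pure Python; the specific points below are my own toy example, not data from this article:

```python
# A minimal sketch of Lagrange interpolation (pure Python, no libraries).
# Given N points with pairwise-distinct x_i, the polynomial of degree at
# most N-1 built below passes through every given point exactly.

def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial L_i(x): equals 1 at x_i and 0 at every other x_j.
        li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += yi * li
    return total

# Three points taken from y = x^2: the degree-2 interpolant recovers it exactly.
xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
print(lagrange_eval(xs, ys, 3.0))  # 9.0, since P(x) = x^2 here
```

Note that interpolating the training points exactly is precisely what the rest of this post argues against.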

The truth is that a model that fits the data too closely will backfire! This phenomenon is called overfitting, and it is something we always need to avoid when building models. For a first look at overfitting, see the figure below. There are 50 data points generated from a cubic polynomial plus noise. The dataset is split in two: 30 red points for training data and 20 yellow points for test data. The graph of the cubic polynomial is the green line. Our problem: assuming we do not know the original model but only the data points, find a “good” model that describes the given data.

With what we know from Linear Regression, we can apply Polynomial Regression to this kind of data. Polynomial Regression can be solved by Linear Regression on extended data: each pair \((x, y)\) becomes \((\mathbf{x}, y)\) with \(\mathbf{x} = [1, x, x^2, x^3, \dots, x^d]^T\) for a polynomial of degree \(d\). The key question is how to choose the degree \(d\) of the polynomial we are looking for.


Clearly, a polynomial of degree at most 29 can fit the 30 training points perfectly. Let’s look at some values \(d = 2, 4, 8, 16\). With \(d = 2\), the model is not good: the predicted model is too different from the real one. In this case, we say the model is underfitting. With \(d = 8\), inside the range of the training data, the prediction model and the real model are quite similar; to the right, however, the degree-8 polynomial goes in exactly the opposite direction of the trend in the data. The same happens for \(d = 16\). The degree-16 polynomial fits the training points very closely yet oscillates wildly, i.e. it is not smooth, even within the training data range. This is bad because the model is trying to describe the noise rather than the data. These two high-degree cases are called overfitting.

If you know about Lagrange Interpolation Polynomials, you can understand why errors become large at points outside the range of the given points. That is why the method carries the word “interpolation”; for “extrapolation”, the results are often wrong.

With \(d = 4\), the predicted model is quite close to the real one. The highest-order coefficient found is very close to 0 (see the results in the source code), so this degree-4 polynomial is quite close to the original degree-3 polynomial. This is a good model.
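We can sanity-check this with a small numpy sketch; the cubic below is again an assumed stand-in for the article’s data. Fitting degree 4 to data generated from a cubic should give a near-zero \(x^4\) coefficient:

```python
# Sketch: fit a degree-4 polynomial to data from a cubic and check that
# the leading (x^4) coefficient comes out near zero.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 30)
y = x**3 - 2 * x + 0.1 * rng.normal(size=30)   # cubic plus noise

X = np.vander(x, 5, increasing=True)            # columns 1, x, ..., x^4
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(abs(w[4]))  # small: the fit nearly collapses to the true cubic
```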

Overfitting is the phenomenon in which the found model fits the training data too closely. This can cause the model to predict the noise as well, so its quality is no longer good on test data. Test data is assumed to be unknown in advance and is not used to build the Machine Learning model.

Basically, overfitting occurs when the model is too complex to simulate the training data. This especially happens when the amount of training data is too small while the complexity of the model is too high. In the example above, the complexity of the model can be thought of as the order of the polynomial to be found. In the Multi-layer Perceptron, model complexity can be thought of as the number of hidden layers and the number of units in the hidden layers.

So, are there techniques to help avoid Overfitting?

First of all, we need some metrics to evaluate the quality of the model on training data and test data. Here are two simple quantities, where \(\mathbf{y}\) is the actual output (possibly a vector) and \(\mathbf{\hat{y}}\) is the output predicted by the model:

Train error: usually a loss function applied to the training data. This loss function needs a factor \(\frac{1}{N_{\text{train}}}\) so that it measures the average loss per data point. For Regression, it is usually defined as

\[\text{train error} = \frac{1}{N_{\text{train}}} \sum_{\text{training set}} \|\mathbf{y} - \mathbf{\hat{y}}\|_p^2\]

where \(p\) is usually 1 or 2.

With Classification, the average of cross entropy can be used.
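For reference, the average cross entropy might be computed as in this small sketch; the probabilities and labels are made-up toy values:

```python
# Minimal sketch of average cross entropy as a classification error:
# the mean of -log(probability the model assigns to the true class).
import numpy as np

def avg_cross_entropy(probs, labels):
    """probs: (N, C) predicted class probabilities; labels: (N,) true classes."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 0])
print(avg_cross_entropy(probs, labels))  # -(ln 0.9 + ln 0.8 + ln 0.6) / 3
```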

Test error: defined the same way, but with the found model applied to the test data. Note that when building the model we must not use any information from the test set; the test data is only used to evaluate the model. For Regression, it is usually defined as

\[\text{test error} = \frac{1}{N_{\text{test}}} \sum_{\text{test set}} \|\mathbf{y} - \mathbf{\hat{y}}\|_p^2\]

with the same \(p\) as in the train error above.

Averaging is important because the amount of data in the two training and test sets can vary greatly.
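Both error formulas above can be wrapped in one small helper; the sketch below assumes the outputs are stored as \((N, k)\) numpy arrays:

```python
# Sketch of the train/test error defined above: the average (squared)
# p-norm discrepancy per data point, so it is comparable across sets
# of different sizes.
import numpy as np

def avg_error(y_true, y_pred, p=2):
    """(1/N) * sum_i ||y_i - y_hat_i||_p^2, for (N, k)-shaped outputs."""
    diffs = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.linalg.norm(diffs, ord=p, axis=1) ** 2)

y = np.array([[1.0], [2.0], [3.0]])
y_hat = np.array([[1.5], [2.0], [2.0]])
print(avg_error(y, y_hat))  # (0.25 + 0 + 1) / 3
```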

A model is considered good if both the train error and the test error are low. If the train error is low but the test error is high, we say the model is overfitting. If both the train error and the test error are high, we say the model is underfitting. If the train error is high but the test error is low, I don’t know a name for this model; it would take extraordinary luck for this to happen, perhaps only when the test set is too small.

Let’s move on to the first method.

## 2. Validation

### 2.1. Validation

We are used to dividing the dataset into two parts: training data and test data. And one thing I want to reiterate is that when building the model, we must not use the test data. So how can we know the quality of the model on unseen data?


The simplest method is to extract a small subset of the training set and evaluate the model on it. This small subset is called the validation set; the training set is then the remainder of the original training set. Train error is computed on this new training set, and a new quantity, the validation error, is defined analogously as the error computed on the validation set.
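Such a split might look like the following numpy sketch; the 20/10 sizes follow this article’s running example (30 training points), and the random-shuffle scheme is my own assumption:

```python
# Sketch of carving a validation set out of the training data with a
# random shuffle, keeping the two subsets disjoint.
import numpy as np

def train_val_split(x, y, n_val, seed=0):
    """Shuffle indices, put n_val points in the validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

x = np.linspace(0, 1, 30)
y = x**3
x_tr, y_tr, x_val, y_val = train_val_split(x, y, n_val=10)
print(len(x_tr), len(x_val))  # 20 10
```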

This is like studying for an exam. Suppose you don’t know what the exam questions will be, but you have 10 sets of questions from previous years. To gauge your level before the exam, one way is to set one of those sets aside and not study it at all; you study from the other 9 sets. After studying, you take out the set you put aside and test yourself on it, which is “objective”, just like the real exam. The 10 sets of past questions are the “entire” training set you have. To avoid merely memorizing all 10 sets, you split off 9 sets as the actual training set and keep the remaining one as a validation set. This lets you assess whether your studying is really good, or whether you are just rote learning. In this sense, overfitting can also be compared to human learning.

With this new concept, we look for a model such that both the train error and the validation error are small, which suggests that the test error will also be small. A common approach is to try several models; the model with the smallest validation error is considered the good one.

Usually, we start from a simple model and gradually increase its complexity. As soon as the validation error tends to increase, we pick the model just before that point. Note that the more complex the model, the smaller the train error tends to be.
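The selection rule just described can be sketched with numpy’s `polyfit`; the data below is an assumed cubic plus noise standing in for the article’s dataset:

```python
# Sketch of model selection by validation error: fit polynomials of
# increasing degree and keep the degree whose validation error is smallest.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
y = x**3 - x + 0.05 * rng.normal(size=30)
x_tr, y_tr = x[:20], y[:20]          # training set
x_val, y_val = x[20:], y[20:]        # validation set

def val_mse(deg):
    """Fit on the training set, return mean squared error on validation."""
    w = np.polyfit(x_tr, y_tr, deg)
    return np.mean((np.polyval(w, x_val) - y_val) ** 2)

errors = {d: val_mse(d) for d in range(1, 9)}
best = min(errors, key=errors.get)
print(best)  # expected to land near the true degree, 3
```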


The figure below shows the example above with the degree of the polynomial increasing from 1 to 8. The validation set consisting of 10 points is extracted from the initial training set.