The linear regression algorithm solves problems with real output values, for example: predicting house prices, predicting stock prices, predicting age, etc.


You work at a real estate company, you have data on the area and house price, now there is a new house you want to estimate how much that house will cost. In fact, house prices depend on many factors: area, number of rooms, proximity to a shopping mall, .. but for the sake of simplicity, let’s assume the house price depends only on the area of ​​the house. You have data on the area and selling price of 30 houses as follows:

Area(m2) Selling price (million VND) 30 448,524 32,4138 509,248 34,8276 535,104 37,2414 551,432 39,6552 623,418 … …

Figure 2: Estimated house price of 50 m^2

Programmatically also need to do 2 similar things:

Training: find the line (model) closest to the above points. People can immediately draw a line describing the data from Figure 1, but the computer can’t, it has to find it using the Gradient descent algorithm at the bottom. (The words model and line may be used interchangeably for the rest of the article.) Inference: predict how much a house of 50 m^2 will cost based on the line found above.

Formula setting


The equation of the line has the form y = ax + b.

Watching: What is gradient descent

The difference at the point x = 42 of the line model y = x and the actual value in table 1

It is clear that the y = x line is not near the points or is not the line we are looking for. For example, at the point x = 42 (house 42 m^2) the real price is 625 million, but the price predicted by the model is only 42 million.

So now we need a function to evaluate whether a straight line with parameter set (w_0, w_1) = (0, 1) is good or not. For each data point (x_i, y_i) the difference between the actual price and the predicted price is calculated by: displaystyle frac{1}{2} * (hat{y_i} – y_i)^2. And the difference across the entire data is the sum of the differences of each point:

READ MORE  Reviews, Features, Price List, Comparison

displaystyle J = frac{1}{2} * frac{1}{N} * (sum_{i=1}^{N} (hat{y_i} – y_i)^2) (N is the number of data points). Comment:

The smaller the non-negative J, the closer the line is to the data point. If J = 0, then the line passes through all the data points.

=> The problem of finding the line closest to the data points becomes finding w_0, w_1 such that the function J is the smallest. But since finding the minimum value of J is the same as finding the smallest value of k*J (k is a positive, real number), we will look for the minimum value of

displaystyle J = frac{1}{2} *(sum_{i=1}^{N} (hat{y_i} – y_i)^2)

Summary: first from finding the line (model) -> find w_0, w_1 so that the function J is the smallest. Now we need an algorithm to find the minimum value of the function J. That is gradient descent.

Gradient descent

What is a derivative

I believe there are many people who can calculate the derivative of the function f(x) = x^2 or f(x) = sin(cos(x)) but still do not know what the derivative really is.

In Chinese, the path is the road, the function is the function, so the derivative refers to the change of the function or has a more affectionate name than the slope of the graph.

Step 1: Initialize a random value x = -2 (point A).

See also: What is Ic3 – International Certificate of Informatics

Step 2: Because at A, the graph decreases, so f”(x=-2) = 2*(-2) = -4 When assigning x = x – learning_rate * f”(x), x increases, so the next step graph at point C. Continue to perform step 2, assign x = x – learning_rate * f”(x), the graph is at point D,… => the function gradually decreases towards the minimum value.

Has anyone noticed that the absolute value of sputum at A is greater than at C and at C is greater than at D. When approaching the point where the minimum value x = 0, the derivative is approximately 0 until the function reaches the minimum value at x = 0, then the derivative is 0, so at the point near the minimum value, step 2 assign x = x – learning_rate * f”(x) is negligible and almost keeps the value of x.

READ MORE  Learn About Bilirubin

Similarly, if the initial value is at x = 2 (at B), then the derivative at B is positive, so because x = x – learning_rate * f”(x) decreases -> the graph is at point E -> then continues to assign x= x -learning_rate * f”(x), the function f(x) will also gradually decrease to the minimum value.


The algorithm works very well in cases where it is not possible to find the smallest value using linear algebra. The most important part of the algorithm is to calculate the derivative of the function with respect to each variable then repeat step 2.

Choosing the learning_rate coefficient is extremely important, there are 3 cases:

If the learning_rate is small: each time the function decreases very little, so it takes a lot of steps to get the function to the minimum value If the learning_rate is reasonable: after a moderate number of iterations of step 2, the function will reach a small enough value .If learning_rate is too large: will cause overshoot and never reach the minimum value of the function.

Matrix A (3 * 4)

The matrix index is the row before the following column for example A<1, 2> = -5 (row 1, column 2); A<3, 1> = 4 (row 3, column 1)

Matrix multiplication

The matrix multiplication A * B is only possible when the number of columns of A is equal to the number of rows of B, or A has size m*n and B has size n*k.

Matrix C = A * B then C has size m * k and C = sum_{k=1}^{n} A*B

Problem representation

Because for each point x_i, y_i we need to calculate (w_{0} + w_{1} * x_i – y_i), so instead of calculating for each data point one by one, we will represent it as a ghost match, X size n * 2, Y size n * 1 (n is the number of data points in the dataset that we have).

READ MORE  Meaning of Overlap in Vietnamese

See also: Bearings (What is a bearing, Guide to Choosing a Bearing, Bearing


New people can skip this part

Quadratic function

Assuming the problem is still the same as above, but when you graph the data, it will look like this

You can guess the model might be a quadratic function now you’ll set up the formula differently

Now you need to calculate the derivative of J with respect to w_0, w_1, w_2 to apply gradient descent

Gaussian Process (GP)

What if you graph the data and you can’t guess what the model is? The answer is to use Gaussian Process. The model above you noticed, I have parameters to look for w_0, w_1, w_2 but in GP, ​​there are no parameters at all (or can be considered as an infinite number of parameters). This article is a good introduction to GP, everyone can read more.

Python code

Data and code you can get here

# -*- coding: utf-8 -*-“””Created on Mon Feb 18 22:06:34 2019

author: DELL”””import numpy as npimport pandas as pdimport matplotlib.pyplot as plt#numOfPoint = 30#noise = np.random.normal(0.1,numOfPoint).reshape(-1,1)#x = np. linspace(30, 100, numOfPoint).reshape(-1,1)#N = x.shape<0>#y = 15*x + 8 + 20*noise#plt.scatter(x, y)data = pd. read_csv(“data_linear.csv”).valuesN = data.shape<0>x = data .reshape(-1, 1)y = data .reshape(-1, 1)plt.scatter(x, y)plt.xlabel (“square meter”)plt.ylabel(“price”)x = np.hstack((np.ones((N, 1)), x))w = np.array(<0.,1.>). reshape(-1,1)numOfIteration = 100cost = np.zeros((numOfIteration,1))learning_rate = 0.000001for i in range(1, numOfIteration): r =, w) – y cost = 0.5* np.sum(r*r) w<0> -= learning_rate*np.sum(r) # correct the shape dimension w<1> -= learning_rate*np.sum(np.multiply(r, x .reshape(- 1,1))) print(cost)predict =, w)plt.plot((x<0><1>, x <1>),(predict<0>, predict ), “r “) = 50y1 = w<0> + w<1> * 50print(“The price of a house for 50m^2 is : “, y1) So the end of the first post, I know it can take a long time. already people haven’t touched math, maybe first time programming. Nothing is too easy to achieve, as long as you pass the level of lesson 1, the following lessons will be lighter.

If you don’t understand something or have anything to contribute, just comment below and I will answer. Thank you very much !!!