In this article, I will introduce one of the most basic algorithms in Machine Learning: Linear Regression, which belongs to the supervised learning group. Linear regression is a very simple method, but it has proven useful in a large number of situations. In this article, you will discover exactly how linear regression works. In data analysis, you will come across the term “Regression” very often, so before diving into Linear Regression, let’s first understand the concept of regression. Basically, regression is a statistical method for establishing a relationship between a dependent variable and a set of independent variables. For example:
Age = 5 + Height * 10 + Weight * 13
Here we are establishing a relationship between a person’s Height and Weight (the independent variables) and his/her Age (the dependent variable). This is a very basic example of regression.
Simple Linear Regression
“Linear regression” is a statistical method for modeling data in which the dependent variable has continuous values, while the independent variables can have either continuous or categorical values. In other words, “Linear Regression” is a method to predict the dependent variable (Y) based on the value of the independent variable (X). It can be used whenever we want to predict a continuous quantity, for example the traffic at a retail store, how long a user will stay on a certain page, or the number of pages visited on a certain website.
To get started with Linear Regression, let’s go over some mathematical concepts about statistics.
Correlation (r) – Explains the relationship between two variables; its value ranges from -1 to +1
Variance (σ^2) – Measures the dispersion in your data
Standard Deviation (σ) – Measures the dispersion in your data (square root of the variance)
Normal distribution
Error (residual) – The difference between the actual value and the predicted value
Assumptions
There is no one-size-fits-all method, and the same is true for linear regression. To apply linear regression, the data should satisfy several important assumptions. If your data does not follow these assumptions, your results can be wrong as well as misleading.
Linearity & Additivity: There should be a linear relationship between the independent variables and the dependent variable, and the effects of changes in the different independent variables on the dependent variable should be additive.
Normality of error distribution: The differences between the true values and the predicted values (errors) should be normally distributed.
Homoscedasticity: The variance of the errors should be constant with respect to the predicted values, time, and the values of the independent variables.
Statistical independence of errors: The errors (residuals) should not be correlated with each other. Example: in the case of time series data, there should be no correlation between consecutive errors.
Linear regression line
When using linear regression, our goal is to find the straight line that lies closest to most of the points, that is, the line that minimizes the distance (error) between the data points and the line.
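As a minimal sketch of what "closest" means here, the snippet below scores two candidate lines by their sum of squared errors on a few made-up points. The toy data and the helper name `sum_squared_errors` are illustrative assumptions and are not part of the original example.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared vertical distances between the points and the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data that roughly follows y = 2 * x
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

sse_close = sum_squared_errors(xs, ys, b0=0.0, b1=2.0)  # line near the points
sse_far = sum_squared_errors(xs, ys, b0=0.0, b1=1.0)    # line far from the points
print(sse_close, sse_far)
```

The line with the smaller sum of squared errors is the better fit; linear regression looks for the line that makes this quantity as small as possible.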
For example, in the figure above, the points (left) represent the individual data points, and the line (right) represents an approximation that can explain the relationship between the x and y axes. Through linear regression we try to find such a line. If we have a dependent variable Y and an independent variable X, the relationship between X and Y can be represented by the following equation:
Y = B0 + B1 * X
Y = Dependent variable
X = Independent variable
B0 = Constant (intercept)
B1 = Coefficient of the relationship between X and Y (slope)
Some properties of linear regression
The regression line always passes through the mean of the independent variable (x) and the mean of the dependent variable (y).
The regression line minimizes the sum of the squared errors. That is why this linear regression method is called “Ordinary Least Squares (OLS)”.
B1 explains the change in Y for a one-unit change in X. In other words, if we increase the value of X by one unit, B1 is the resulting change in the value of Y.
Finding the Linear Regression Line
Using statistical tools such as Excel, R or SAS, you get the constants (B0 and B1) directly as the result of a linear regression function. As in the theory above, these tools work on the OLS concept: their software packages calculate the constants that minimize the sum of squared errors.
For example, let’s say we want to predict y from x in the following table and assume that our regression equation will be something like y = B0 + B1 * x
x     y     Predicted “y”
1     2     B0+B1*1
2     1     B0+B1*2
3     3     B0+B1*3
4     6     B0+B1*4
5     9     B0+B1*5
6     11    B0+B1*6
7     13    B0+B1*7
8     15    B0+B1*8
9     17    B0+B1*9
10    20    B0+B1*10
Standard Deviation x     3.02765
Standard Deviation y     6.617317
Mean x                   5.5
Mean y                   9.7
Correlation x and y      0.989938
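The summary statistics above can be reproduced with a few lines of plain Python. This is only a sketch: the helper names are made up for illustration, and the use of the sample standard deviation (dividing by n - 1) is an assumption inferred from the values in the table.

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 6, 9, 11, 13, 15, 17, 20]

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    spread = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return cov / spread

print("std x:", sample_std(x))
print("std y:", sample_std(y))
print("mean x:", mean(x))
print("mean y:", mean(y))
print("correlation:", correlation(x, y))
```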
If we differentiate the Residual Sum of Squares (RSS) with respect to B0 and B1 and set the results equal to zero, we get the following equations:
B1 = Correlation * (Standard Deviation of y / Standard Deviation of x)
B0 = Mean (Y) - B1 * Mean (X)
Putting values from table 1 into the above equations,
B1 = 0.989938 * (6.617317 / 3.02765) = 2.1636
B0 = 9.7 - 2.1636 * 5.5 = -2.2
Therefore, the best-fit regression equation becomes:
Y = -2.2 + 2.1636 * x
Let’s see how we predict using this equation:
x     Y (true value)    Y (prediction)
1     2                 -0.04
2     1                 2.13
3     3                 4.29
4     6                 6.45
5     9                 8.62
6     11                10.78
7     13                12.95
8     15                15.11
9     17                17.27
10    20                19.44
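The calculation above can be checked with a short script. This is a sketch in plain Python that applies the two closed-form formulas from the text to the ten data points; the variable names are illustrative, and the sample standard deviation (dividing by n - 1) is assumed.

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 6, 9, 11, 13, 15, 17, 20]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-deviations
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

std_x = math.sqrt(sxx / (n - 1))
std_y = math.sqrt(syy / (n - 1))
r = sxy / math.sqrt(sxx * syy)

# Closed-form OLS coefficients, exactly as in the two formulas in the text
b1 = r * (std_y / std_x)
b0 = mean_y - b1 * mean_x

predictions = [b0 + b1 * xi for xi in x]
print("B0 =", b0, "B1 =", b1)
print([round(p, 2) for p in predictions])

# The fitted line passes through the point (mean of x, mean of y)
assert abs((b0 + b1 * mean_x) - mean_y) < 1e-9
```

The final assertion checks the property mentioned earlier: the regression line always passes through the means of x and y.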
With only 10 data points to fit a straight line, our predictions will not be perfectly accurate, but if we look at the correlation between the true Y values and the predicted Y values, it is very high, so both series move together. Here is the chart showing the predicted values:
Once you have built the model, the next question that comes to mind is whether your model is good enough to predict the future, or whether the relationship you have built between the dependent and independent variables is good enough.
For this purpose, there are several metrics that we need to look at:
R – Square (R^2)
R^2 is calculated as:
R^2 = (TSS - RSS) / TSS
Total Sum of Squares (TSS): TSS is a measure of the total variation in the response/dependent variable Y and can be thought of as the amount of variation inherent in the response before the regression is performed.
Residual Sum of Squares (RSS): RSS measures the amount of variation that remains unexplained after performing the regression.
(TSS - RSS) measures how much of the variation in the response is explained (or removed) by performing the regression.
For simple linear regression, R^2 can equivalently be computed as the square of the correlation between x and y:
R^2 = [ (1/N) * Σ (x_i - Mean(x)) * (y_i - Mean(y)) / (σx * σy) ]^2
Where N is the number of observations used to fit the model, σx is the standard deviation of x, and σy is the standard deviation of y (both computed with the same 1/N convention).
R^2 ranges from 0 to 1.
An R^2 of 0 means that the dependent variable cannot be predicted from the independent variable.
An R^2 of 1 means that the dependent variable can be predicted without error from the independent variable.
An R^2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R^2 of 0.20 means that 20 percent of the variance in Y can be predicted from X; an R^2 of 0.40 means 40 percent is predictable; and so on.
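To make the TSS and RSS definitions concrete, here is a small sketch that computes R^2 for the ten-point example. The function name `r_squared` is illustrative. The slope is computed as Sxy / Sxx, which is algebraically the same as Correlation * (Standard Deviation of y / Standard Deviation of x).

```python
def r_squared(actual, predicted):
    """R^2 = (TSS - RSS) / TSS for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    tss = sum((a - mean_actual) ** 2 for a in actual)            # total variation
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # unexplained variation
    return (tss - rss) / tss

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 6, 9, 11, 13, 15, 17, 20]

# Fit the OLS line to the example data
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * xi for xi in x]

r2 = r_squared(y, predicted)
print(round(r2, 4))
```

For a simple linear regression fitted by OLS, this value equals the square of the correlation between x and y.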
Root Mean Square Error (RMSE)
RMSE indicates how far the predicted values are scattered from the actual values. The formula for calculating RMSE is:
RMSE = sqrt( (1/N) * Σ (y_i - ŷ_i)^2 )
y_i: Actual value
ŷ_i: Predicted value
N: Total number of observations
Although the RMSE is a good estimate of the error, the problem with it is that it is very easily influenced by the range of your dependent variable. If your dependent variable has a narrow range, your RMSE will be low, and if your dependent variable has a wide range, your RMSE will be high. Therefore, RMSE is a good metric for comparing different iterations of the same model, but not for comparing models whose dependent variables have different ranges.
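The range dependence described above can be seen in a small sketch. The numbers are hypothetical: the second data set is just the first one multiplied by 100, so the relative errors are identical while the range is much wider.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between paired actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical values with a narrow range
narrow_actual = [1.0, 2.0, 3.0]
narrow_predicted = [1.1, 2.2, 3.3]

# The same values scaled by 100: same relative error, wider range
wide_actual = [a * 100 for a in narrow_actual]
wide_predicted = [p * 100 for p in narrow_predicted]

rmse_narrow = rmse(narrow_actual, narrow_predicted)
rmse_wide = rmse(wide_actual, wide_predicted)
print(rmse_narrow, rmse_wide)
```

The second RMSE is about 100 times the first, even though the predictions are equally good in relative terms.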
Mean Absolute Percentage Error (MAPE)
To overcome the limitations of RMSE, analysts often prefer MAPE. MAPE expresses the error as a percentage and is therefore comparable between models. The formula for calculating MAPE can be written as follows:
MAPE = (100 / N) * Σ | (y_i - ŷ_i) / y_i |
y_i: Actual value
ŷ_i: Predicted value
N: Total number of observations
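As a sketch of why MAPE is comparable across ranges, the snippet below computes it for two hypothetical data sets that differ only by a factor of 100. The function name `mape` is illustrative, and the function assumes that no actual value is zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

# Hypothetical values with a narrow range
narrow_actual = [1.0, 2.0, 3.0]
narrow_predicted = [1.1, 2.2, 3.3]

# The same values scaled by 100
wide_actual = [a * 100 for a in narrow_actual]
wide_predicted = [p * 100 for p in narrow_predicted]

mape_narrow = mape(narrow_actual, narrow_predicted)
mape_wide = mape(wide_actual, wide_predicted)
print(mape_narrow, mape_wide)
```

Both values come out at about 10 percent, because every prediction is off by 10 percent of its actual value regardless of scale.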
Multivariate Linear Regression
Up to now, we have discussed the scenario where we have only one independent variable. If we have more than one independent variable, the most suitable method is “Multiple Linear Regression”.
There is essentially no difference between “simple” and “multivariate” linear regression. Both work according to the OLS principle, and the algorithm for finding the best-fit regression line is similar. In the latter case, the regression equation has the following form:
Y = B0 + B1*X1 + B2*X2 + ... + Bn*Xn
Bi: Different coefficients
Xi: Different independent variables
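To illustrate that the same OLS idea carries over to several independent variables, here is a minimal plain-Python sketch that solves the normal equations (X^T X) B = X^T y with Gaussian elimination. The helper names and the toy data are assumptions made for illustration; statistical packages typically use more robust numerical methods.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # Move the row with the largest pivot into place for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple_ols(rows, targets):
    """Return [B0, B1, ..., Bn] that minimize the sum of squared errors."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the constant B0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2
rows = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5), (6, 2)]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_multiple_ols(rows, targets)
print([round(c, 6) for c in coefficients])
```

Because the toy targets were generated without noise, the fitted coefficients recover the constants 1, 2 and 3 used to create them.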
Running linear regression using Python scikit-learn
Above, you learned that linear regression is a popular technique, and you saw its mathematical equations. But do you know how to do a linear regression in Python? There are several ways to do it: you can use statsmodels, numpy, scipy or scikit-learn. In this article we will use scikit-learn to perform linear regression.
Scikit-learn is a powerful Python module for machine learning. It contains functions for regression, classification, clustering, model selection and dimensionality reduction. We will explore the sklearn.linear_model module that contains “methods to perform a regression, where the target value will be a linear combination of the input variables”.
In this post, we will use the Boston Housing dataset, which contains information about home values in suburban Boston. This dataset was originally obtained from the StatLib library maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.
Explore the Boston home dataset
The Boston Housing dataset includes home prices in different parts of Boston. Along with the price, the dataset also provides information such as the per capita crime rate (CRIM), the proportion of non-retail business acres per town (INDUS), the proportion of owner-occupied homes built before 1940 (AGE), and several other attributes. The remaining attributes are listed in the dataset description. The dataset can be downloaded separately, but since we are using scikit-learn, we can import it directly from scikit-learn.
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import statsmodels.api as sm
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from matplotlib import rcParams
First, we will import the Boston Housing dataset and store it in a variable called boston. To import it from scikit-learn, we will need to run this code.
# Note: load_boston was removed in scikit-learn 1.2, so this requires an older version
from sklearn.datasets import load_boston
boston = load_boston()
The boston variable is a dictionary-like object, so we can check its keys by running print(boston.keys()). The keys include data, target, feature_names and DESCR, which we will use below.
Next, we can easily check the shape of the data by calling boston.data.shape, which returns the size of the dataset as (rows, columns).
As we can see, it returns (506, 13), which means there are 506 rows of data with 13 columns. To find out what the 13 columns are, we can print boston.feature_names.
You can use the print(boston.DESCR) command to check the description of the data instead of opening the web to read it.
Next, convert the data to a pandas DataFrame. It’s very simple: call pd.DataFrame() and pass it boston.data. We can check the first 5 rows using bos.head().
bos = pd.DataFrame(boston.data)
print(bos.head())
So far the columns are only numbered. You can use the following code to set the column names and show them:
bos.columns = boston.feature_names
print(bos.head())
Looks like there is no column named PRICE yet. We will add it using the following code:
bos["PRICE"] = boston.target
print(bos.head())
If you want to see summary statistics for each column, call bos.describe().
Split data for train-test
Basically, before dividing the data into training and test sets, we need to separate it into two parts: the target value and the predictor values. Let’s call the target value Y and the predictor values X. Thus,
Y = Boston Housing Price
X = All other features
X = bos.drop("PRICE", axis=1)
Y = bos["PRICE"]
Now we can split the data into training and test sets with the following snippet.
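# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)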
If we check the shape of each variable, we see that the data has been split in a ratio of roughly 67% for the training data and 33% for the test data.
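To show what such a split does conceptually, here is a simplified plain-Python sketch. It is not scikit-learn's implementation: the helper `simple_train_test_split` is hypothetical, and rounding the test-set size up is an assumption made for this illustration.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.33, seed=5):
    """Shuffle the row indices and hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(506))  # stand-in for the 506 rows of the dataset
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, and fixing the seed plays the same role as random_state in the snippet above: the same split is produced on every run.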
Next, we will run a linear regression.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, Y_train)  # fit the model on the training data
Y_pred = lm.predict(X_test)  # predict prices for the test data
plt.scatter(Y_test, Y_pred)
plt.xlabel("Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")
The above code fits a model based on X_train and Y_train. Now that we have a linear model, we use it to predict the prices for X_test, and the predicted values are stored in Y_pred. To visualize the difference between the actual prices and the predicted values, we also create a scatter plot.
Ideally, the points in the graph above would lie on a straight line, as we discussed in the theory above. However, the model does not fit the data 100%, so the points do not form a perfectly straight line.
Mean squared error
To check the error level of a model, we can use the Mean Squared Error (MSE). It measures the average of the squared errors, that is, the average squared difference between the actual values and the predicted values. We can compute it with scikit-learn's mean_squared_error function by running this code:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, Y_pred)
print(mse)