# Basic concept of Regression Analysis

In this article, we discuss the basic concept of regression analysis along with its assumptions. We also discuss correlation analysis, which is needed to check whether the independent variables are correlated with each other.

## Introduction

Regression analysis is a tool for building a model that predicts the effect of one or more variables or factors on a particular phenomenon.

Before carrying out a regression analysis, it is useful to examine the significance of the correlation between each pair of independent variables.

### Correlation Analysis

Correlation is a technique for investigating the extent of the linear relationship between two quantitative, continuous variables. Pearson's correlation coefficient (ρ) is a measure of the strength of the linear association between the two variables. Pearson's sample correlation coefficient (r), which is an estimate of ρ, ranges from -1 to +1 and is given by

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means of the n paired observations. The nearer the scatter of points is to a straight line, the higher the strength of association between the variables, irrespective of the measurement units used.
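The sample correlation coefficient can be computed directly from its definition. The following is a minimal sketch in plain Python; the function name `pearson_r` and the five data points are made up for illustration and do not come from any real study.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sums of squared deviations about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Illustrative (made-up) data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

Note that the sketch does not guard against a constant variable, for which the denominator is zero and r is undefined.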

The Student's t-test is used to test whether the population correlation coefficient is significantly different from zero, which would imply that there is a linear association between the two variables.

The hypothesis to be tested here is:
H0: The population correlation coefficient between the two variables is equal to zero, against
H1: The population correlation coefficient between the two variables is not equal to zero.

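Under the null hypothesis, the test statistic

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

follows the Student's t-distribution with (n-2) degrees of freedom, where n is the number of paired observations.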
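As a rough sketch of this test, the statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$, which has (n-2) degrees of freedom, can be computed and compared with a tabulated critical value. The function name, the sample values r = 0.7746 and n = 5, and the choice of a two-sided 5% test are assumptions made for illustration. The critical value 3.182 is the standard table entry for 3 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0: population correlation = 0."""
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if abs(r) >= 1:
        raise ValueError("statistic is undefined for |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Illustrative values (assumed, not from real data)
r_example = 0.7746
n_example = 5
t0 = correlation_t_statistic(r_example, n_example)

# Two-sided 5% critical value of Student's t with n - 2 = 3 degrees of freedom
t_critical = 3.182
reject_h0 = abs(t0) > t_critical
print(round(t0, 3), reject_h0)  # 2.121 False
```

With so few observations, even a fairly large sample correlation is not significant at the 5% level.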

If the correlations turn out to be significant, then the problem of multicollinearity exists. This is a deviation from one of the assumptions of the multiple linear regression model, and hence regression analysis cannot be applied directly.

If the correlations turn out to be nonsignificant, then we may use regression analysis.

### Regression analysis

The variable that is to be explained, or for which we need to develop a prediction model, is called the dependent variable. The variables (or factors) that explain the dependent variable are called the independent variables.

If there is only one independent variable, the model is termed a simple regression model. If there is more than one independent variable, the model is termed a multiple regression model.
The simplest regression model is the linear regression model, which is of the form

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon $$

where the $X_i$'s are the independent variables, Y is the dependent variable, and ε is the random error term. The coefficients $\beta_i$ reflect the degree of influence that the $X_i$'s have on Y.

Our objective is to determine the unknown coefficients in such a way that the sum of squared errors in estimating Y by the regression curve is minimized. We use the method of least squares for this purpose.
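For a single independent variable, minimizing the sum of squared errors has a well-known closed-form solution: the slope is the sum of cross-products divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below assumes this simple (one-predictor) case; the function name and the data are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares intercept and slope for the model y = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx                  # minimizes the sum of squared errors
    intercept = y_bar - slope * x_bar    # line passes through (x_bar, y_bar)
    return intercept, slope

# Illustrative (made-up) data
b0, b1 = fit_simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```

With several independent variables the same idea applies, but the coefficients are usually obtained by solving the normal equations in matrix form rather than with these scalar formulas.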

### Test of the Significance of the Regression model:

This test uses the variance of the observed data to determine whether a regression model can be applied to the data. In this approach we need to calculate the following terms:

a) Total Sum of Squares:
The total sum of squares (TSS, also written SST) is given by

$$ \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

It is the sum of the squared deviations of all the observations $y_i$ from their mean $\bar{y}$. The number of degrees of freedom associated with TSS is (n-1).

b) Total Mean Square:
The total mean square (abbreviated MST) is given by

$$ \mathrm{MST} = \frac{\mathrm{TSS}}{n-1} $$

It is the sample variance of all the observations. The number of degrees of freedom associated with MST is (n-1).

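c) Regression Sum of Squares:
The regression sum of squares (SSR) is the variation attributed to the relationship between the independent and dependent variables. It is given by

$$ \mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 $$

where $\hat{y}_i$ is the value predicted by the fitted model for the i-th observation. For a simple regression model with one independent variable, the number of degrees of freedom associated with SSR is 1.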

d) Sum of Squared Errors:
The residual sum of squares (RSS), or the sum of squared errors of prediction (SSE), is the sum of the squares of the residuals (the deviations of the predicted values from the actual observed values). It is given by

$$ \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

It is a measure of the discrepancy between the data and the estimated model. The number of degrees of freedom associated with it is (n-2).

The total variability of the observed data (i.e., the total sum of squares, SST) can be written as the sum of the portion of the variability explained by the model (SSR) and the portion unexplained by the model (SSE):
SST = SSR + SSE
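The decomposition can be checked numerically. The sketch below fits a one-predictor least-squares line and then computes the three sums of squares from their definitions; the function name and the data are hypothetical, and the identity SST = SSR + SSE holds here only up to floating-point rounding.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for a least-squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * xi for xi in x]            # fitted values
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # explained by the model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # left unexplained
    return sst, ssr, sse

# Illustrative (made-up) data
sst, ssr, sse = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(sst, 3), round(ssr, 3), round(sse, 3))  # 6.0 3.6 2.4
```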

e) Error Mean Square:
The error mean square (MSE) is obtained by dividing the error sum of squares by its degrees of freedom as follows:

$$ \mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} $$

It is an estimate of the variance of the random error term, ε. The number of degrees of freedom associated with it is (n-2).

f) Regression Mean Square:
The regression mean square (MSR) is obtained by dividing the regression sum of squares by its degrees of freedom as follows:

$$ \mathrm{MSR} = \frac{\mathrm{SSR}}{1} $$

The number of degrees of freedom associated with it is 1.

g) F-Statistic:
The F-test is used to infer about the significance of the fitted model.

The hypothesis that it tests is:
H0: The fit of the intercept-only model (with no predictors) and the chosen model are equal, against
H1: The fit of the intercept-only model (with no predictors) and the chosen model are not equal

The statistic used here is based on the F distribution. It can be shown that if the null hypothesis H0 is true, i.e., no linear relationship exists between X and Y, then the statistic

$$ F_0 = \frac{\mathrm{MSR}}{\mathrm{MSE}} $$

follows the F distribution with 1 degree of freedom in the numerator and (n-2) degrees of freedom in the denominator. H0 is rejected if the calculated statistic $F_0$ is such that

$$ F_0 > f_{\alpha,1,n-2} $$

where $f_{\alpha,1,n-2}$ is the tabulated value of the F distribution corresponding to a cumulative probability of (1-α), and α is the significance level.
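The F statistic for a simple regression model is the ratio of the regression mean square (SSR divided by 1) to the error mean square (SSE divided by n-2). The sketch below assumes one independent variable; the function name and the input values (SSR = 3.6, SSE = 2.4, n = 5) are illustrative only. The critical value 10.13 is the standard 5% table entry for 1 and 3 degrees of freedom.

```python
def regression_f_statistic(ssr, sse, n):
    """F statistic for a simple (one-predictor) linear regression."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    msr = ssr / 1          # regression mean square, 1 degree of freedom
    mse = sse / (n - 2)    # error mean square, n - 2 degrees of freedom
    return msr / mse

# Illustrative (assumed) sums of squares for a data set of n = 5 points
f0 = regression_f_statistic(ssr=3.6, sse=2.4, n=5)

# Tabulated 5% critical value of F with (1, 3) degrees of freedom
f_critical = 10.13
reject_h0 = f0 > f_critical
print(round(f0, 3), reject_h0)  # 4.5 False
```

In the one-predictor case this F statistic equals the square of the t statistic used to test the correlation coefficient, so the two tests lead to the same conclusion.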

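### The ANOVA table

The quantities above are summarized in the ANOVA table for a simple regression model:

| Source of variation | Degrees of freedom | Sum of squares | Mean square | F |
|---|---|---|---|---|
| Regression | 1 | SSR | MSR = SSR/1 | F0 = MSR/MSE |
| Error | n-2 | SSE | MSE = SSE/(n-2) | |
| Total | n-1 | SST | MST = SST/(n-1) | |

However, the linear regression model is based on the following five key assumptions: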

i) The first assumption, linearity, requires the relationship between the independent and dependent variables to be linear.
ii) The second assumption, independence, requires the values of the dependent variable (or the residuals) to be independent of each other.
iii) The third assumption, non-multicollinearity, requires all the predictors to be independent of each other.
iv) The fourth assumption, homoscedasticity, requires the residuals to be equally spread about the fitted regression line.
v) The fifth assumption, normality, requires the residuals to be normally distributed.