Multiple Linear Regression
Definition
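Multiple linear regression is a statistical model for analyzing the relationship between multiple independent variables and a single dependent variable. It can be written as follows:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$
where
- $\beta_0$ indicates the intercept, the predicted value of $y$ when every $x_i$ is 0.
- $\beta_1$, $\beta_2$, …, $\beta_n$ indicate the regression coefficients; $\beta_i$ is the rate of change of $y$ with respect to $x_i$ while the other independent variables are held fixed.
- $\epsilon$ indicates the model error, the variation in $y$ that the regression terms do not account for.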
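To make the roles of the intercept, the regression coefficients, and the error term concrete, here is a minimal sketch that simulates data from a known model and fits it with `lm()`. The parameter values (2, 3, -1.5), the sample size, and the variable names are made up for illustration and have nothing to do with the real estate data used later.

```r
set.seed(42)  # simulated data with made-up parameters

n <- 500
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
epsilon <- rnorm(n, mean = 0, sd = 1)  # error term

# "True" model: intercept 2, coefficients 3 and -1.5
y <- 2 + 3 * x1 - 1.5 * x2 + epsilon

# The fitted coefficients should land close to 2, 3 and -1.5
fit <- lm(y ~ x1 + x2)
coef(fit)
```

The estimates will not match the true values exactly, because the error term adds noise to every observation.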
Usage
Multiple linear regression is useful when we want to analyze the relationship between multiple factors and a dependent variable and figure out which independent variables have the most influence on the dependent variable.
Hypotheses
$H_0$: None of the independent variables affects the dependent variable.
$H_1$: At least one independent variable affects the dependent variable.
Examples
We will again use the real estate price prediction data from the Simple Linear Regression post.
Data preprocessing
We will be using the three following independent variables.
- House age
- Distance to the nearest MRT station
- Number of convenience stores
All of the above independent variables are treated as continuous variables.
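    # Load the data set
    data <- read.csv("real_estate.csv")
    # Drop columns 1, 2, 6 and 7, which are not used in this analysis
    data <- data[, -c(1, 2, 6, 7)]
    # Give the remaining columns readable names
    colnames(data) <- c("house_age", "distance_nearest_mrt", "num_conv_stores", "house_price")
    head(data)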
    > head(data)
      no house_age distance_nearest_mrt num_conv_stores house_price
    1  1      32.0             84.87882              10        37.9
    2  2      19.5            306.59470               9        42.2
    3  3      13.3            561.98450               5        47.3
    4  4      13.3            561.98450               5        54.8
    5  5       5.0            390.56840               5        43.1
    6  6       7.1           2175.03000               3        32.1
Data Visualization
Plot
Histogram
We can see that most of the houses are close to MRT stations.
We can see that most of the houses have at least one convenience store nearby.
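The plotting code itself is not shown in this post, so the following is only a sketch of how a scatter plot matrix and histograms like these could be drawn with base R. It assumes a data frame with the four renamed columns. To stay self-contained it rebuilds just the six rows printed by `head(data)`, so the shapes will not match the full data set.

```r
# Stand-in data: only the six rows shown by head(data), not the full data set
data <- data.frame(
  house_age            = c(32.0, 19.5, 13.3, 13.3, 5.0, 7.1),
  distance_nearest_mrt = c(84.87882, 306.59470, 561.98450, 561.98450, 390.56840, 2175.03000),
  num_conv_stores      = c(10, 9, 5, 5, 5, 3),
  house_price          = c(37.9, 42.2, 47.3, 54.8, 43.1, 32.1)
)

# Scatter plot matrix of every pair of variables
plot(data)

# Histograms of two of the independent variables
hist(data$distance_nearest_mrt,
     main = "Distance to the nearest MRT station",
     xlab = "distance_nearest_mrt")
hist(data$num_conv_stores,
     main = "Number of convenience stores",
     xlab = "num_conv_stores")
```

With the real data loaded through `read.csv()`, the first block can be dropped and the plotting calls used as they are.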
Multiple Regression
We will perform multiple regression where
- the independent variables ($x$) are `house_age`, `distance_nearest_mrt`, and `num_conv_stores`, and
- the dependent variable ($y$) is `house_price`.
    # Fit the multiple regression model and print its summary
    result <- lm(data$house_price ~ data$house_age + data$distance_nearest_mrt + data$num_conv_stores)
    summary(result)
Result
    Call:
    lm(formula = data$house_price ~ data$house_age + data$distance_nearest_mrt +
        data$num_conv_stores)

    Residuals:
        Min      1Q  Median      3Q     Max
    -37.304  -5.430  -1.738   4.325  77.315

    Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
    (Intercept)               42.977286   1.384542  31.041  < 2e-16 ***
    data$house_age            -0.252856   0.040105  -6.305 7.47e-10 ***
    data$distance_nearest_mrt -0.005379   0.000453 -11.874  < 2e-16 ***
    data$num_conv_stores       1.297443   0.194290   6.678 7.91e-11 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 9.251 on 410 degrees of freedom
    Multiple R-squared:  0.5411,  Adjusted R-squared:  0.5377
    F-statistic: 161.1 on 3 and 410 DF,  p-value: < 2.2e-16
Analysis
Overall
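- The adjusted $R^2$ value is 0.5377. This means our multiple regression model explains roughly 54% of the variation in house price after accounting for the number of independent variables.
$R^2$ never decreases when more independent variables are added, even if they are irrelevant. Therefore, it is better practice to use the adjusted $R^2$ when we evaluate a multiple regression model.
- The p-value of the F-test is below 2.2e-16, which is far lower than 0.05. This means we can reject the null hypothesis: at least one independent variable is related to house price, so the model as a whole is statistically significant.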
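Both overall figures can be re-derived from the numbers printed in the summary output. The sketch below uses the textbook formulas for adjusted R-squared and for the overall F statistic. The variable names are ours, and small differences from the printed values can arise because the inputs are rounded.

```r
# Values copied from the summary output above
r_squared <- 0.5411    # Multiple R-squared
p <- 3                 # number of independent variables
df_resid <- 410        # residual degrees of freedom
n <- df_resid + p + 1  # number of observations

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)  # 0.5377

# Overall F statistic rebuilt from R-squared (close to the printed 161.1)
f_from_r2 <- (r_squared / p) / ((1 - r_squared) / df_resid)
f_from_r2

# p-value of the overall F-test: upper tail of the F distribution
f_p_value <- pf(161.1, df1 = p, df2 = df_resid, lower.tail = FALSE)
f_p_value < 2.2e-16  # TRUE, which is why R prints "< 2.2e-16"
```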
Non-standardized Coefficients
house_age
- The p-value is 7.47e-10, which is lower than 0.05. Therefore, we can reject the null hypothesis, which means house age affects house price.
- The coefficient is -0.252856. This means house age has a negative relationship with house price: as house age increases by 1 unit, the predicted house price decreases by 0.252856, holding the other variables constant.
distance_nearest_mrt
- The p-value is below 2e-16, which is lower than 0.05. Therefore, we can reject the null hypothesis, which means the distance to the nearest MRT station affects house price.
- The coefficient is -0.005379. This means the distance to the nearest MRT station has a negative relationship with house price: as the distance increases by 1 unit, the predicted house price decreases by 0.005379, holding the other variables constant.
num_conv_stores
- The p-value is 7.91e-11, which is below 0.05. Therefore, we can reject the null hypothesis, which means the number of convenience stores affects house price.
- The coefficient is 1.297443. This means the number of convenience stores has a positive relationship with house price: as the number of convenience stores increases by 1 unit, the predicted house price increases by 1.297443, holding the other variables constant.
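To see how these coefficients combine, here is a small sketch that plugs them into the regression equation by hand. The coefficients come from the summary output. The house itself (age 10, distance 500, 5 stores) is a hypothetical example, not a row from the data set.

```r
# Coefficients copied from the summary output above
b0       <- 42.977286  # intercept
b_age    <- -0.252856  # house_age
b_mrt    <- -0.005379  # distance_nearest_mrt
b_stores <-  1.297443  # num_conv_stores

# Hypothetical house (made-up values, not taken from the data set)
house_age <- 10
distance_nearest_mrt <- 500
num_conv_stores <- 5

predicted <- b0 + b_age * house_age + b_mrt * distance_nearest_mrt +
  b_stores * num_conv_stores
predicted  # 44.24644

# Making the same house one unit older shifts the prediction by b_age
older <- b0 + b_age * (house_age + 1) + b_mrt * distance_nearest_mrt +
  b_stores * num_conv_stores
older - predicted
```

The last line shows what "increases by 1 unit" means in practice: only the term of the changed variable moves, and everything else stays fixed.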
Meanwhile, the coefficients above are non-standardized coefficients. Therefore, analysis with non-standardized coefficients amounts to answering the following question:
What if {variable} is included in this model? And how does this single {variable} affect the dependent variable?
In other words, we can analyze whether each independent variable affects the dependent variable and, if so, by how much when it is included. However, each coefficient is expressed in the units of its own variable, so the coefficients cannot be compared directly with one another. They do not give us a holistic view of how all the independent variables affect the dependent variable.
Standardized Coefficients
In order to compare how much influence each independent variable has on the dependent variable, we need to use standardized coefficients.
    # Refit the model after scaling every variable to mean 0 and standard deviation 1
    result2 <- lm(scale(data$house_price) ~ scale(data$house_age) + scale(data$distance_nearest_mrt) + scale(data$num_conv_stores))
    summary(result2)
    Call:
    lm(formula = scale(data$house_price) ~ scale(data$house_age) +
        scale(data$distance_nearest_mrt) + scale(data$num_conv_stores))

    Residuals:
        Min      1Q  Median      3Q     Max
    -2.7417 -0.3991 -0.1277  0.3178  5.6822

    Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)
    (Intercept)                      -4.883e-16  3.342e-02   0.000        1
    scale(data$house_age)            -2.117e-01  3.358e-02  -6.305 7.47e-10 ***
    scale(data$distance_nearest_mrt) -4.990e-01  4.202e-02 -11.874  < 2e-16 ***
    scale(data$num_conv_stores)       2.809e-01  4.206e-02   6.678 7.91e-11 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.6799 on 410 degrees of freedom
    Multiple R-squared:  0.5411,  Adjusted R-squared:  0.5377
    F-statistic: 161.1 on 3 and 410 DF,  p-value: < 2.2e-16
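Notice that the t values, p-values, and R-squared are identical to the non-standardized fit: scaling changes only the units. A standardized coefficient is the raw coefficient multiplied by sd(x) / sd(y). The sketch below checks that relationship on simulated data; the variables `x1`, `x2`, `y` and their parameters are made up and are not the real estate data.

```r
set.seed(1)  # simulated data, not the real estate data set

n <- 200
x1 <- rnorm(n, mean = 20, sd = 10)
x2 <- rnorm(n, mean = 1000, sd = 500)
y <- 40 - 0.25 * x1 - 0.005 * x2 + rnorm(n, sd = 5)

raw <- lm(y ~ x1 + x2)                       # non-standardized fit
std <- lm(scale(y) ~ scale(x1) + scale(x2))  # standardized fit

# Standardized coefficient = raw coefficient * sd(x) / sd(y)
converted <- coef(raw)[-1] * c(sd(x1), sd(x2)) / sd(y)

unname(converted)
unname(coef(std)[-1])  # same values as the line above
```

Because of this relationship, a standardized coefficient reads as "standard deviations of y per standard deviation of x", which puts all variables on a common scale.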
Below is the table of standardized coefficients and p-values for each independent variable.
Variable | Standardized coefficient | P-value |
---|---|---|
house_age | -2.117e-01 | 7.47e-10 |
distance_nearest_mrt | -4.990e-01 | < 2e-16 |
num_conv_stores | 2.809e-01 | 7.91e-11 |
- To begin with, the p-values for all independent variables are lower than 0.05. We can reject the null hypothesis for each of them and conclude that all three factors affect the house price.
- If we compare the absolute values of the standardized coefficients, `distance_nearest_mrt` has the highest value (0.4990), followed by `num_conv_stores` (0.2809) and `house_age` (0.2117). This means that the distance to the nearest MRT station is the factor with the most influence on the dependent variable.
- The standardized coefficients of `house_age` and `distance_nearest_mrt` are negative, while that of `num_conv_stores` is positive. That means house price tends to decrease as house age or distance to the nearest MRT station increases, and to increase as the number of convenience stores increases.
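The comparison of magnitudes and signs can be checked directly from the values in the table. This is a small sketch using a named vector; the name `std_coef` is ours.

```r
# Standardized coefficients copied from the table above
std_coef <- c(
  house_age            = -0.2117,
  distance_nearest_mrt = -0.4990,
  num_conv_stores      =  0.2809
)

# Rank by magnitude; the sign only tells the direction of the relationship
ranked <- sort(abs(std_coef), decreasing = TRUE)
ranked
names(ranked)[1]  # "distance_nearest_mrt"

# Direction of each relationship: -1 is negative, 1 is positive
sign(std_coef)
```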