Skip to Content
StatisticsSimple Linear Regression

Simple Linear Regression

Definition

The statistical model to analyze the correlation between a single independent variable (normally marked as XX) and a single dependent variable (normally marked as YY). Simple linear regression can be written as below.


y=β0+β1X+ϵ\LARGE y = \beta_0 + \beta_1X + \epsilon

where

  1. β0\beta_0 indicates intercept, the predicted value of yy when xx is 0.
  2. β1\beta_1 indicates regression coefficients, the changing rate of yy according to xx.
  3. ϵ\epsilon indicates error of the estimate, showing how much variation there is in our estimate of the regression coefficient.

Simple linear regression chart

Source: GeeksForGeeks


Usage

Simple linear regression could be useful when we want to analyze the relationship between two quantitative variables where one variable (XX) directly affects the other (YY).

Conditions

  1. Linearity: There should be a linear relationship between independent variable and dependent variable.
  2. Independence: Errors (residuals) from each data should be independent, not affecting others.
    • For the simple linear regression, this condition is not effective as there is only one independent variable.
  3. Homoscedasticity: The spread of data points around the regression line should be roughly the same throughout.
  4. Normality: The distribution of the errors should be approximately normal (bell-curve shaped).
  5. Both independent variable and dependent variable should be continuous variable.

Hypotheses


H0\LARGE H_0: Independent variable does NOT affect dependent variable.

H1\LARGE H_1: Independent variable DOES affect dependent variable.

Examples

We will be using Real estate price prediction data from Kaggle.

In dataset, I was wondering if the house age will affect house price or not because normally the older the house age, the lower the house price.

Hypotheses


H0\LARGE H_0: House age will NOT affect house price.

H1\LARGE H_1: House age WILL affect house price.


Code

Data pre-processing

I changed my current working directory to point the folder where my data file is lying.


statistics.R
getwd()
setwd("[Your data folder location]")
data <- read.csv("real_estate.csv")
head(data)
> head(data)
No X1.transaction.date X2.house.age X3.distance.to.the.nearest.MRT.station X4.number.of.convenience.stores
1 1 2012.917 32.0 84.87882 10
2 2 2012.917 19.5 306.59470 9
3 3 2013.583 13.3 561.98450 5
4 4 2013.500 13.3 561.98450 5
5 5 2012.833 5.0 390.56840 5
6 6 2012.667 7.1 2175.03000 3
X5.latitude X6.longitude Y.house.price.of.unit.area
1 24.98298 121.5402 37.9
2 24.98034 121.5395 42.2
3 24.98746 121.5439 47.3
4 24.98746 121.5439 54.8
5 24.97937 121.5425 43.1
6 24.96305 121.5125 32.1

Next, set the variables. For this, I set house_price as YY dependant variable and house_age as XX independent variable.


statistics.R
house_price <- data$Y.house.price.of.unit.area
house_age <- data$X2.house.age

Data Visualization

Plot
statistics.R
plot(house_price ~ house_age, data=data)

R Plot

It seems that there is no correlation between house_price and house_age.

Histogram (House age)

Histogram for house age

Simple Linear Regression

In order to get more specific result, let’s actually perform simple linear regression.


statistics.R
result <- lm(house_price ~ house_age, data=data)
summary(result)
summary(result)$r.squared

Result


> summary(result)
Call:
lm(formula = house_price ~ house_age, data = data)
Residuals:
Min 1Q Median 3Q Max
-31.113 -10.738 1.626 8.199 77.781
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.43470 1.21098 35.042 < 2e-16 ***
house_age -0.25149 0.05752 -4.372 1.56e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.32 on 412 degrees of freedom
Multiple R-squared: 0.04434, Adjusted R-squared: 0.04202
F-statistic: 19.11 on 1 and 412 DF, p-value: 1.56e-05
> summary(result)$r.squared
[1] 0.04433848

Analysis

Overall

  1. R2R^2 value is 0.04433848 (4.43%). This means 4.43% of variation in the YY values is accounted for by the XX values.
  2. p-value of this model is 0.000015605 (1.56e-05) which is lower than 0.05. This means that we can reject the null hypothesis (H0H_0). Therefore, we can say that this statistical model is effective.

Coefficients

  1. Estimate of house_age has p-value of 0.000015605 which is lower than 0.05. This means that we can reject the null hypothesis (H0H_0). Therefore, we can say that house_age DOES affect house_price.
  2. Estimate of house_age is -0.25149. This means, house_price will decrease 0.25149 per 1 increase in value of house_age. In other words, the older the house, the lower the house price becomes by small percentage.
Last updated on