Simple Linear Regression

Simple linear regression is a statistical model for analyzing the relationship between a single independent variable (usually denoted $X$) and a single dependent variable (usually denoted $Y$). It can be written as follows.

$$\LARGE y = \beta_0 + \beta_1 X + \epsilon$$


  1. $\beta_0$ is the intercept, the predicted value of $y$ when $x$ is 0.
  2. $\beta_1$ is the regression coefficient (slope), the expected change in $y$ for a one-unit increase in $x$.
  3. $\epsilon$ is the error term, the part of $y$ that the straight line does not explain, i.e. the random deviation of each observation from the line.
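To make the roles of $\beta_0$ and $\beta_1$ concrete, here is a minimal sketch that estimates both by ordinary least squares and compares them with what `lm()` returns. The data are simulated; the seed, sample size, and true values 2 and 0.5 are assumptions made up for this illustration and have nothing to do with the dataset used later.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # true beta_0 = 2, true beta_1 = 0.5

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# The same estimates from lm()
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

Both printed vectors should agree up to floating-point error, and each should be close to (but not exactly) the true values because of the noise term.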

Simple linear regression chart

Source: GeeksForGeeks


Simple linear regression is useful when we want to analyze the relationship between two quantitative variables where one variable ($X$) is expected to directly affect the other ($Y$).


  1. Linearity: There should be a linear relationship between the independent variable and the dependent variable.
  2. Independence: The errors (residuals) of the observations should be independent of one another.
    • This condition concerns the errors, so it still applies to simple linear regression. What does not apply with only one independent variable is the check for multicollinearity between independent variables.
  3. Homoscedasticity: The spread of the data points around the regression line should be roughly the same throughout.
  4. Normality: The distribution of the errors should be approximately normal (bell-curve shaped).
  5. Both the independent variable and the dependent variable should be continuous variables.
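Several of these conditions can be inspected from the residuals of a fitted model. The sketch below uses simulated data (the seed, sample size, and coefficients are assumptions for illustration) and only base R functions; it is one possible set of checks, not a complete diagnostic procedure.

```r
# Simulated data (an assumption for illustration only)
set.seed(42)
x <- runif(200, min = 0, max = 40)
y <- 40 - 0.3 * x + rnorm(200, sd = 5)
fit <- lm(y ~ x)
res <- resid(fit)

# Linearity and homoscedasticity: residuals vs fitted values should show
# no curve and a roughly constant vertical spread
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should lie close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
print(sw$p.value)
```

Independence is harder to judge from a plot alone and usually depends on how the data were collected (for example, repeated measurements or time order).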


$\LARGE H_0$: Independent variable does NOT affect dependent variable.

$\LARGE H_1$: Independent variable DOES affect dependent variable.
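In terms of the model's parameters, these hypotheses are usually stated about the slope. A standard formulation is sketched below; $\hat{\beta}_1$ (the estimated slope), $\mathrm{SE}$ (its standard error), and $n$ (the number of observations) are notation introduced here for illustration.

```latex
H_0:\ \beta_1 = 0 \qquad H_1:\ \beta_1 \neq 0

t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{under } H_0
```

$H_0$ is rejected at the 5% level when the two-sided p-value of this $t$ statistic is below 0.05.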


We will be using Real estate price prediction data from Kaggle.

With this dataset, I wanted to know whether house age affects house price, because normally the older the house, the lower its price.


$\LARGE H_0$: House age will NOT affect house price.

$\LARGE H_1$: House age WILL affect house price.


Data pre-processing

I changed my current working directory to the folder where my data file is located.

setwd("[Your data folder location]")
data <- read.csv("real_estate.csv")
> head(data)
No X4.number.of.convenience.stores
1 1 2012.917 32.0 84.87882 10
2 2 2012.917 19.5 306.59470 9
3 3 2013.583 13.3 561.98450 5
4 4 2013.500 13.3 561.98450 5
5 5 2012.833 5.0 390.56840 5
6 6 2012.667 7.1 2175.03000 3
X5.latitude X6.longitude
1 24.98298 121.5402 37.9
2 24.98034 121.5395 42.2
3 24.98746 121.5439 47.3
4 24.98746 121.5439 54.8
5 24.97937 121.5425 43.1
6 24.96305 121.5125 32.1

Next, set the variables. I set house_price as the dependent variable $Y$ and house_age as the independent variable $X$.

house_price <- data$
house_age <- data$
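The column names to use after `data$` depend on the header row of the CSV file. One way to look them up before assigning is sketched below on a toy data frame; `toy`, `age_col`, and `price_col` are hypothetical stand-ins, not the real column names of the dataset.

```r
# Toy data frame with hypothetical column names (not the real dataset)
toy <- data.frame(age_col = c(5, 12, 30), price_col = c(50, 44, 35))

names(toy)   # character vector of column names
str(toy)     # column types plus a preview of the values

# Pick columns by name once they are known
toy_age   <- toy$age_col
toy_price <- toy[["price_col"]]
```

Running `names(data)` on the real data frame lists the exact names that `read.csv()` produced.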

Data Visualization

plot(house_price ~ house_age, data=data)

R Plot

At first glance, the scatter plot shows no obvious strong correlation between house_price and house_age.
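A scatter plot can hide a weak trend, so a correlation coefficient is a useful numeric check alongside it. The sketch below uses simulated data (the seed, sample size, and coefficients are assumptions, not the real dataset) and also shows that, in simple linear regression, $R^2$ equals the squared Pearson correlation.

```r
# Simulated data (an assumption for illustration only)
set.seed(7)
age   <- runif(300, min = 0, max = 40)
price <- 42 - 0.25 * age + rnorm(300, sd = 13)

r   <- cor(age, price)        # Pearson correlation coefficient
ct  <- cor.test(age, price)   # test of H0: true correlation is 0
fit <- lm(price ~ age)
r2  <- summary(fit)$r.squared

print(c(r = r, r_squared = r^2, lm_r_squared = r2))
print(ct$p.value)
```

The second and third printed values should be identical, which is why a weak correlation always goes together with a low $R^2$ when there is only one predictor.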

Histogram (House age)

Histogram for house age

Simple Linear Regression

To get a more specific result, let’s actually perform simple linear regression.

result <- lm(house_price ~ house_age, data=data)


> summary(result)

Call:
lm(formula = house_price ~ house_age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-31.113 -10.738   1.626   8.199  77.781

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.43470    1.21098  35.042  < 2e-16 ***
house_age   -0.25149    0.05752  -4.372 1.56e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.32 on 412 degrees of freedom
Multiple R-squared:  0.04434,	Adjusted R-squared:  0.04202
F-statistic: 19.11 on 1 and 412 DF,  p-value: 1.56e-05

> summary(result)$r.squared
[1] 0.04433848



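  1. The $R^2$ value is 0.04433848 (4.43%). This means that 4.43% of the variation in the $Y$ values (house_price) is accounted for by the $X$ values (house_age).
  2. The p-value of this model is 0.000015605 (1.56e-05), which is lower than 0.05. This means that we can reject the null hypothesis ($H_0$). Therefore, we can say that this statistical model is statistically significant, although the low $R^2$ shows that house_age alone explains only a small share of the variation in house_price.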
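With a single predictor, the overall F-test and the t-test on the slope are the same test: the F-statistic equals the squared t value and the two p-values coincide, which is why both are reported as 1.56e-05 in the summary output. A quick check using the rounded values from that output (so small rounding differences are expected):

```r
t_value  <- -4.372   # t value for house_age, from the summary output
f_value  <- 19.11    # F-statistic, from the summary output
df_resid <- 412      # residual degrees of freedom

# Squared t value, to compare with the reported F-statistic
print(t_value^2)

# Two-sided p-value from the t distribution and p-value from the F distribution
p_from_t <- 2 * pt(-abs(t_value), df = df_resid)
p_from_f <- pf(t_value^2, df1 = 1, df2 = df_resid, lower.tail = FALSE)
print(c(p_from_t = p_from_t, p_from_f = p_from_f))
```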


  1. The estimate for house_age has a p-value of 0.000015605, which is lower than 0.05. This means that we can reject the null hypothesis ($H_0$). Therefore, we can say that house_age DOES affect house_price (more precisely, the two are significantly associated).
  2. The estimate for house_age is -0.25149. This means that house_price is expected to decrease by 0.25149 units for each one-unit increase in house_age. In other words, the older the house, the lower the house price, though only by a small amount per unit of age.
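The fitted line can also be used to compute a predicted price for a given age. The sketch below plugs the two coefficients from the summary output into the regression equation; `predict_price` is a hypothetical helper written for this illustration, and the ages 0, 10, and 30 are arbitrary example inputs.

```r
# Coefficients copied from the summary output
intercept <- 42.43470
slope     <- -0.25149

# Hypothetical helper: predicted house_price for a given house_age
predict_price <- function(age) intercept + slope * age

# Predictions for three example ages: about 42.43, 39.92 and 34.89
print(predict_price(c(0, 10, 30)))
```

Given the low $R^2$, individual houses can be far from these predicted values; the residual standard error of 13.32 in the output gives a sense of the typical deviation.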
