Simple Linear Regression
Definition
Simple linear regression is a statistical model for analyzing the relationship between a single independent variable (normally denoted $x$) and a single dependent variable (normally denoted $y$). It can be written as:
$$y = \beta_0 + \beta_1 x + \epsilon$$
where
- $\beta_0$ indicates the intercept, the predicted value of $y$ when $x$ is 0.
- $\beta_1$ indicates the regression coefficient, the rate of change of $y$ with respect to $x$.
- $\epsilon$ indicates the error term, the part of $y$ that the linear relationship with $x$ does not explain.
[Figure: Simple linear regression chart]
Source: GeeksForGeeks
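To make the roles of the intercept and the regression coefficient concrete, here is a minimal R sketch on simulated data. The true intercept (2), slope (0.5), and noise level are arbitrary assumptions chosen for illustration, not values from this post. It computes the least-squares estimates by hand and compares them with what lm() returns.

```r
# Simulated data; all parameter values here are illustrative assumptions.
set.seed(42)
n <- 200
x <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)   # error term
y <- 2 + 0.5 * x + epsilon              # intercept 2, slope 0.5

# Least-squares estimates computed by hand
slope_hat <- cov(x, y) / var(x)
intercept_hat <- mean(y) - slope_hat * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
print(coef(fit))
print(c(intercept_hat, slope_hat))
```

Both print statements should show the same pair of numbers, close to but not exactly equal to 2 and 0.5, because the error term adds noise.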
Usage
Simple linear regression is useful when we want to analyze the relationship between two quantitative variables where one variable ($x$) directly affects the other ($y$).
Conditions
- Linearity: There should be a linear relationship between the independent variable and the dependent variable.
- Independence: The errors (residuals) of the observations should be independent of one another.
  - The related requirement that independent variables not be highly correlated with each other does not apply to simple linear regression, as there is only one independent variable.
- Homoscedasticity: The spread of data points around the regression line should be roughly the same throughout.
- Normality: The distribution of the errors should be approximately normal (bell-curve shaped).
- Both the independent variable and the dependent variable should be continuous variables.
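The conditions above can be checked after fitting a model. The following is a minimal sketch, assuming R's built-in cars dataset as a stand-in (it is not the dataset used later in this post); the particular choice of checks is one common option, not the only one.

```r
# Built-in 'cars' data used as a stand-in (an assumption, not the post's dataset).
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Linearity and homoscedasticity: residuals vs. fitted values should show
# no curve and a roughly constant vertical spread.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points on the Q-Q plot should fall close to the line.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test; a small p-value suggests non-normal residuals.
sw <- shapiro.test(res)
print(sw$p.value)
```

Independence usually cannot be verified from a plot alone; it depends mostly on how the data were collected.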
Hypotheses
$H_0$: The independent variable does NOT affect the dependent variable.
$H_1$: The independent variable DOES affect the dependent variable.
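In practice, these hypotheses are tested with a t-test on the slope: the null hypothesis corresponds to a slope of zero. The sketch below, again assuming the built-in cars dataset as a stand-in, recomputes the t statistic and two-sided p-value by hand and compares them with the values that summary() reports.

```r
fit <- lm(dist ~ speed, data = cars)    # stand-in data (assumption)
coefs <- summary(fit)$coefficients

estimate <- coefs["speed", "Estimate"]
std_error <- coefs["speed", "Std. Error"]

# t statistic for H0: slope = 0, and its two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df.residual(fit))

print(c(t = t_value, p = p_value))
print(coefs["speed", c("t value", "Pr(>|t|)")])   # same numbers from summary()
```

If the p-value is below the chosen significance level (0.05 in this post), the null hypothesis is rejected.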
Examples
We will be using the Real estate price prediction dataset from Kaggle.
In this dataset, I wondered whether house age affects house price, because normally the older the house, the lower its price.
Hypotheses
$H_0$: House age will NOT affect house price.
$H_1$: House age WILL affect house price.
Code
Data pre-processing
I changed my current working directory to point to the folder where my data file is located.
getwd()
setwd("[Your data folder location]")
data <- read.csv("real_estate.csv")
head(data)
> head(data)
  No X1.transaction.date X2.house.age X3.distance.to.the.nearest.MRT.station X4.number.of.convenience.stores
1  1            2012.917         32.0                               84.87882                              10
2  2            2012.917         19.5                              306.59470                               9
3  3            2013.583         13.3                              561.98450                               5
4  4            2013.500         13.3                              561.98450                               5
5  5            2012.833          5.0                              390.56840                               5
6  6            2012.667          7.1                             2175.03000                               3
  X5.latitude X6.longitude Y.house.price.of.unit.area
1    24.98298     121.5402                       37.9
2    24.98034     121.5395                       42.2
3    24.98746     121.5439                       47.3
4    24.98746     121.5439                       54.8
5    24.97937     121.5425                       43.1
6    24.96305     121.5125                       32.1
Next, set the variables. For this, I set house_price as the dependent variable and house_age as the independent variable.
house_price <- data$Y.house.price.of.unit.area
house_age <- data$X2.house.age
Data Visualization
Plot
plot(house_price ~ house_age, data=data)
From the plot, there seems to be no clear correlation between house_price and house_age.
Histogram (House age)
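The original code for this step is not shown here, so the following is only a sketch of how such a histogram could be drawn with base R's hist(). To keep it runnable on its own, house_age is filled with the six values printed by head(data) above; in the post's session, house_age already holds the full column and that assignment should be skipped. The title and axis label are my own choices.

```r
# Stand-in: the six house ages printed by head(data) earlier in this post.
# In the post's session, house_age already holds the full column.
house_age <- c(32.0, 19.5, 13.3, 13.3, 5.0, 7.1)

h <- hist(house_age,
          main = "Histogram of house age",
          xlab = "House age")
print(h$counts)   # number of observations in each bin
```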
Simple Linear Regression
To get a more specific result, let’s actually perform the simple linear regression.
result <- lm(house_price ~ house_age, data=data)
summary(result)
summary(result)$r.squared
Result
> summary(result)

Call:
lm(formula = house_price ~ house_age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-31.113 -10.738   1.626   8.199  77.781

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.43470    1.21098  35.042  < 2e-16 ***
house_age   -0.25149    0.05752  -4.372 1.56e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.32 on 412 degrees of freedom
Multiple R-squared:  0.04434,	Adjusted R-squared:  0.04202
F-statistic: 19.11 on 1 and 412 DF,  p-value: 1.56e-05

> summary(result)$r.squared
[1] 0.04433848
Analysis
Overall
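- $R^2$ value is 0.04433848 (4.43%). This means 4.43% of the variation in the $y$ values (house_price) is accounted for by the $x$ values (house_age).
- The p-value of this model is 0.000015605 (1.56e-05), which is lower than 0.05. This means that we can reject the null hypothesis ($H_0$). Therefore, we can say that this model is statistically significant, although the low $R^2$ shows that house_age alone explains little of the variation in house_price.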
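The overall figures in the output are tied together by a few identities that hold for regression with a single predictor. The sketch below checks them using only numbers copied from the summary(result) output above; small differences are expected because the printed values are rounded.

```r
# Values copied from the summary(result) output in this post
f_stat <- 19.11
df_resid <- 412
t_slope <- -4.372
r_squared <- 0.04433848

# With one predictor, R^2 = F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_resid)

# The F statistic equals the squared t statistic of the slope
f_from_t <- t_slope^2

# The correlation has the sign of the slope and magnitude sqrt(R^2)
r <- sign(t_slope) * sqrt(r_squared)

print(c(r2_from_f = r2_from_f, f_from_t = f_from_t, r = r))
```

The implied correlation of roughly -0.21 is a weak negative relationship, which fits with the scatter plot showing no obvious pattern.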
Coefficients
- The estimate of house_age has a p-value of 0.000015605, which is lower than 0.05. This means that we can reject the null hypothesis ($H_0$). Therefore, we can say that house_age DOES affect house_price.
- The estimate of house_age is -0.25149. This means house_price will decrease by 0.25149 for each 1-unit increase in house_age. In other words, the older the house, the lower the house price, though only by a small amount.
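As a worked example of what these coefficients imply, the fitted line can be evaluated directly. This sketch uses only the intercept and slope copied from the output above; predict_price is a hypothetical helper name introduced for illustration, and the ages 0, 10, and 30 are arbitrary.

```r
# Coefficients copied from the summary(result) output in this post
intercept <- 42.43470
slope <- -0.25149

# Hypothetical helper: predicted house price of unit area for a given house age
predict_price <- function(age) {
  intercept + slope * age
}

print(predict_price(c(0, 10, 30)))
```

For a house age of 10, the prediction is 42.43470 - 0.25149 × 10 = 39.9198, about 2.5 lower than for a house age of 0. Given the low $R^2$, individual prices will scatter widely around these predicted values.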