Simple Linear Regression
Linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables).
The Mathematical Model
In simple linear regression, we predict a quantitative response $$Y$$ on the basis of a single predictor variable $$X$$. It assumes that there is approximately a linear relationship between $$X$$ and $$Y$$. Mathematically, we can write this relationship as:
$$ Y = \beta_0 + \beta_1 X + \epsilon $$
Here, $$\beta_0$$ represents the intercept, and $$\beta_1$$ represents the slope in the linear model. The term $$\epsilon$$ is the error term, representing the deviation of the observation from the true linear relationship.
Definition 1.2
Residuals ($$e_i$$): The difference between the observed value of the dependent variable ($$y_i$$) and the predicted value ($$\hat{y}_i$$) is called the residual. $$ e_i = y_i - \hat{y}_i $$
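As a quick illustrative sketch (the data points here are made up for demonstration), residuals follow directly from this definition once fitted values are available:

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative only)
y = np.array([2.0, 3.5, 2.8])       # observed y_i
y_hat = np.array([2.1, 3.2, 3.0])   # predicted y_hat_i

# Residual: e_i = y_i - y_hat_i
e = y - y_hat
print(e)
```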
Least Squares Method
The goal is to find estimates $$\hat{\beta}_0$$ and $$\hat{\beta}_1$$ that minimize the Residual Sum of Squares (RSS):
$$ \mathrm{RSS} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$
Theorem: OLS Estimators
The least squares estimates for the regression coefficients are given by: $$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} $$ $$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$
▶ View Derivation (Calculus)
We minimize the Loss Function $$L$$:
$$ L(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 $$
Take partial derivatives with respect to $$\beta_0$$ and $$\beta_1$$ and set them to zero:
$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0 $$
$$ \frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i) = 0 $$
Solving this system of normal equations yields the OLS estimators defined above.
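As a numerical sanity check on this derivation (a sketch using a small made-up dataset), we can confirm that both partial derivatives vanish when evaluated at the closed-form estimates:

```python
import numpy as np

# Small illustrative dataset (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 3.1, 4.2, 6.0])

# Closed-form OLS estimates from the theorem above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Evaluate both partial derivatives of L at (b0, b1);
# each should be zero up to floating-point error
dL_db0 = -2 * np.sum(y - b0 - b1 * x)
dL_db1 = -2 * np.sum(x * (y - b0 - b1 * x))
print(abs(dL_db0) < 1e-9, abs(dL_db1) < 1e-9)  # True True
```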
Python Implementation
We can calculate these coefficients manually using NumPy, or use `scikit-learn`. Here is the manual calculation to demonstrate the math above.
import numpy as np
# 1. Sample Data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 3.5, 2.8, 4.6, 5.5])
# 2. Calculate Means
x_bar = np.mean(X)
y_bar = np.mean(Y)
# 3. Calculate Beta_1 (Slope)
# numerator: sum((x - x_bar)(y - y_bar))
numerator = np.sum((X - x_bar) * (Y - y_bar))
# denominator: sum((x - x_bar)^2)
denominator = np.sum((X - x_bar)**2)
beta_1 = numerator / denominator
# 4. Calculate Beta_0 (Intercept)
beta_0 = y_bar - (beta_1 * x_bar)
print(f"y = {beta_0:.2f} + {beta_1:.2f}x")
# Output: y = 1.25 + 0.81x
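For comparison, the same coefficients can be obtained with `scikit-learn` (a sketch assuming scikit-learn is installed; note that `LinearRegression` expects a 2-D feature array, hence the `reshape`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same sample data as above; reshape X to shape (n_samples, 1)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([2, 3.5, 2.8, 4.6, 5.5])

model = LinearRegression().fit(X, Y)
print(f"y = {model.intercept_:.2f} + {model.coef_[0]:.2f}x")
# Output: y = 1.25 + 0.81x
```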