
Simple Linear Regression

MATH-201

Linear regression models the relationship between a scalar response (the dependent variable) and one or more explanatory variables (the independent variables) as a linear function.

The Mathematical Model

In simple linear regression, we predict a quantitative response $Y$ on the basis of a single predictor variable $X$. It assumes that there is approximately a linear relationship between $X$ and $Y$. Mathematically, we can write this relationship as:

$$ Y = \beta_0 + \beta_1 X + \epsilon $$

Here, $\beta_0$ represents the intercept, and $\beta_1$ represents the slope in the linear model. The term $\epsilon$ is the error term, representing the deviation of the observation from the true linear relationship.
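One way to build intuition for this model is to simulate data from it. The sketch below draws $X$, adds Gaussian noise for $\epsilon$, and forms $Y$; the true coefficient values and noise scale are illustrative assumptions, not values from the text:

```python
import numpy as np

# Assumed (illustrative) true parameters -- not from the text
beta_0_true, beta_1_true = 1.0, 0.8
rng = np.random.default_rng(42)

X = rng.uniform(0, 10, size=100)                      # predictor
epsilon = rng.normal(loc=0.0, scale=1.0, size=100)    # error term
Y = beta_0_true + beta_1_true * X + epsilon           # the model above
```

Regression then asks the reverse question: given only `X` and `Y`, recover estimates of the two coefficients.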

Definition 1.2

Residuals ($e_i$): The difference between the observed value of the dependent variable ($y_i$) and the predicted value ($\hat{y}_i$) is called the residual. $$ e_i = y_i - \hat{y}_i $$
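The definition translates directly into code. In this sketch the observed values come from the sample data used later in the article, while the fitted values are hypothetical, chosen only to illustrate the subtraction:

```python
import numpy as np

y = np.array([2.0, 3.5, 2.8, 4.6, 5.5])      # observed y_i
y_hat = np.array([2.1, 3.0, 3.9, 4.8, 5.7])  # hypothetical fitted y_hat_i

e = y - y_hat  # residuals e_i = y_i - y_hat_i
```

Each entry of `e` measures how far one observation lies above or below the fitted line.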

Least Squares Method

The goal is to find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the Residual Sum of Squares (RSS).

Theorem: OLS Estimators

The least squares estimates for the regression coefficients are given by: $$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} $$ $$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$

Derivation (Calculus)

We minimize the Loss Function $L$:

$$ L(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 $$

Take partial derivatives with respect to $\beta_0$ and $\beta_1$ and set to zero:

$$ \frac{\partial L}{\partial \beta_0} = -2 \sum (y_i - \beta_0 - \beta_1 x_i) = 0 $$ $$ \frac{\partial L}{\partial \beta_1} = -2 \sum x_i (y_i - \beta_0 - \beta_1 x_i) = 0 $$

Solving this system yields the OLS estimators defined above.
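The two first-order conditions are the normal equations: the first says the fitted residuals sum to zero, the second that they are orthogonal to the predictor. A quick numerical check of both facts, using the sample data from the Python listing below:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3.5, 2.8, 4.6, 5.5])

# OLS estimates from the theorem above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)  # fitted residuals

# First normal equation  => sum of residuals is zero
print(np.isclose(e.sum(), 0.0))        # True
# Second normal equation => residuals are orthogonal to x
print(np.isclose(np.sum(x * e), 0.0))  # True
```

Both identities hold for any dataset fitted by OLS with an intercept, up to floating-point error.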

Python Implementation

We can calculate these coefficients manually using NumPy, or use `scikit-learn`. Here is the manual calculation to demonstrate the math above.

ols_calculation.py
import numpy as np

# 1. Sample Data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 3.5, 2.8, 4.6, 5.5])

# 2. Calculate Means
x_bar = np.mean(X)
y_bar = np.mean(Y)

# 3. Calculate Beta_1 (Slope)
numerator = np.sum((X - x_bar) * (Y - y_bar))
denominator = np.sum((X - x_bar)**2)

beta_1 = numerator / denominator

# 4. Calculate Beta_0 (Intercept)
beta_0 = y_bar - (beta_1 * x_bar)

print(f"y = {beta_0:.2f} + {beta_1:.2f}x")
# Output: y = 1.25 + 0.81x
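As a cross-check (an addition, not part of the original listing), `np.polyfit` with `deg=1` fits the same least-squares line in one call:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 3.5, 2.8, 4.6, 5.5])

# A degree-1 polynomial fit is exactly the OLS line; polyfit
# returns coefficients from highest degree down: [slope, intercept]
slope, intercept = np.polyfit(X, Y, deg=1)
print(f"y = {intercept:.2f} + {slope:.2f}x")  # y = 1.25 + 0.81x
```

Agreement between the manual formulas and a library routine is a useful sanity check before moving to multiple predictors.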