Linear regression can be performed using several approaches, including Ordinary Least Squares (OLS) with a single variable, OLS with multiple variables (the closed-form solution), and gradient descent.
In this post, we discuss Ordinary Least Squares (OLS) for performing simple linear regression. The topics include:
- Simple Linear regression
- Objective of Simple Linear Regression
- Example
- Mathematical Intuition
- Interpretation of Coefficients
- Applications of Simple Linear Regression
- Numerical Example Using Ordinary Least Squares (OLS)
- Model Evaluation
- Steps and Python code for Implementing Simple Linear Regression using OLS
- Limitations of Ordinary Least Squares (OLS)
1. Simple Linear Regression
Simple Linear Regression is a type of supervised learning algorithm in machine learning that predicts a continuous outcome based on one input feature. It models the relationship between two variables by fitting a straight line (regression line) through the data points. The idea is to understand how a change in the independent variable (X) leads to a change in the dependent variable (Y).
The model assumes that the relationship between X and Y is linear, meaning it can be described by a straight line. This can be expressed mathematically as:

$$Y = \beta_0 + \beta_1 X$$

Where:
- $Y$ is the predicted output (dependent variable).
- $X$ is the input feature (independent variable).
- $\beta_0$ is the intercept (the value of Y when X = 0).
- $\beta_1$ is the slope (the rate at which Y changes with X).
2. Objective of Simple Linear Regression
The objective of simple linear regression is to find the best-fit line by estimating the values of $\beta_0$ (intercept) and $\beta_1$ (slope). The "best-fit" line minimizes the error between the actual observed values and the predicted values.
In machine learning, this is achieved by minimizing a loss function, typically the Mean Squared Error (MSE), which calculates the average squared difference between the actual output and the predicted output:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$

Where:
- $Y_i$ is the actual value.
- $\hat{Y}_i$ is the predicted value based on the current model.
- $n$ is the number of data points.
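As a quick illustration, here is a minimal NumPy sketch of this loss. The `Y` values anticipate the study-hours example introduced below, while `Y_hat` comes from an arbitrary candidate line (not the fitted model), chosen only to show the computation:

```python
import numpy as np

# Actual values and the predictions of some arbitrary candidate line
Y = np.array([50, 55, 65, 70, 80])
Y_hat = np.array([48, 56, 64, 72, 80])

# Mean Squared Error: average of the squared differences
mse = np.mean((Y - Y_hat) ** 2)
print(mse)  # 2.0
```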
3. Example
Imagine you want to predict the test score of students based on how many hours they studied. You collect data about 5 students:
Hours Studied (X) | Test Score (Y) |
---|---|
1 | 50 |
2 | 55 |
3 | 65 |
4 | 70 |
5 | 80 |
4. Mathematical Intuition
In simple linear regression, the key is to determine the line of best fit for the data. This line is determined by two parameters: the slope $\beta_1$ and the intercept $\beta_0$.
- The slope $\beta_1$ tells us how much Y (output) changes for each unit change in X (input).
- If $\beta_1$ is positive, it indicates that as X increases, Y also increases.
- If $\beta_1$ is negative, it indicates that as X increases, Y decreases.
- The intercept $\beta_0$ represents the value of Y when X = 0.
The task of the machine learning algorithm is to minimize the prediction error by adjusting $\beta_0$ and $\beta_1$. The ordinary least squares (OLS) method, commonly used in linear regression, minimizes the Mean Squared Error (MSE), which calculates the average squared difference between the actual output and the predicted output.
5. Interpretation of Coefficients
Once the model is trained and the best-fit line is determined, the coefficients $\beta_0$ and $\beta_1$ provide key insights:
Intercept ($\beta_0$):
The intercept tells us the predicted value of Y when the input X is zero. For example, if $\beta_0 = 45$ in a study-hours model, it means that a student who doesn't study at all (X = 0) is expected to score 45.
Slope ($\beta_1$):
The slope represents the change in Y for a one-unit increase in X. For example, if $\beta_1 = 7$, it means that for each additional hour of studying, the test score is expected to increase by 7 points.
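Putting these two illustrative numbers together (the hypothetical $\beta_0 = 45$ and $\beta_1 = 7$ above, not the coefficients fitted later in this post), the prediction for a student who studies 3 hours would be:

$$\hat{Y} = 45 + 7 \times 3 = 66$$

i.e., a predicted score of 66.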
6. Applications of Simple Linear Regression
- Predicting Prices: Predict house prices based on the size of the house (e.g., square footage).
- Sales Forecasting: Predict future sales based on past sales or marketing spend.
- Risk Assessment: Estimate insurance premiums based on factors like age or driving history.
- Medical Applications: Predict health outcomes based on variables like age, blood pressure, or cholesterol levels.
- Advertising Effectiveness: Understand how changes in advertising budget impact revenue or brand awareness.
7. Numerical Example Using Ordinary Least Squares (OLS)
Let's walk through the Ordinary Least Squares (OLS) method to compute the best-fit line for the data:
Hours Studied (X) | Test Score (Y) |
---|---|
1 | 50 |
2 | 55 |
3 | 65 |
4 | 70 |
5 | 80 |
Step 1: Calculate the Mean of X and Y
First, we calculate the means ($\bar{X}$ and $\bar{Y}$) of X and Y:
$$\bar{X} = \frac{1+2+3+4+5}{5} = 3, \qquad \bar{Y} = \frac{50+55+65+70+80}{5} = 64$$
Step 2: Calculate the Slope ($\beta_1$)
The slope is calculated as:
$$\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
We calculate each part:
$$(X_i - \bar{X}) = (-2, -1, 0, 1, 2), \qquad (Y_i - \bar{Y}) = (-14, -9, 1, 6, 16)$$
Now, calculate the products:
$$(X_i - \bar{X})(Y_i - \bar{Y}) = (28, 9, 0, 6, 32)$$
Sum these values:
$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 28 + 9 + 0 + 6 + 32 = 75$$
Now, calculate $\sum (X_i - \bar{X})^2$:
$$\sum (X_i - \bar{X})^2 = 4 + 1 + 0 + 1 + 4 = 10$$
Finally, calculate the slope:
$$\beta_1 = \frac{75}{10} = 7.5$$
Step 3: Calculate the Intercept ($\beta_0$)
The intercept is calculated as:
$$\beta_0 = \bar{Y} - \beta_1 \bar{X}$$
Substitute the known values:
$$\beta_0 = 64 - 7.5 \times 3 = 64 - 22.5 = 41.5$$
Step 4: Write the Final Equation
The final regression line equation is:
$$\hat{Y} = 41.5 + 7.5X$$
Using the regression equation, we can now predict test scores for any given number of hours studied. For example, if a student studies for 6 hours:
$$\hat{Y} = 41.5 + 7.5 \times 6 = 41.5 + 45 = 86.5$$
So, the predicted test score for 6 hours of study is 86.5.
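The same coefficients can be reproduced programmatically. Below is a minimal NumPy sketch of the OLS formulas above; the variable names are illustrative:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([50, 55, 65, 70, 80])

# OLS formulas for simple linear regression
slope = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
intercept = Y.mean() - slope * X.mean()

print(slope, intercept)        # 7.5 41.5
print(intercept + slope * 6)   # 86.5
```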
8. Model Evaluation
Evaluating a simple linear regression model involves checking how well the model fits the data, how accurate its predictions are, and whether the assumptions of the linear regression model are satisfied. Let's go step-by-step to evaluate the model derived from the above example.
Model Equation: $\hat{Y} = 41.5 + 7.5X$
Evaluation Metrics:
We can use various metrics and tests to evaluate the performance of the model. Some common methods include:
- Residual Analysis (Checking Errors)
- R-squared ($R^2$) (Goodness-of-Fit)
- Mean Squared Error (MSE) (Error Magnitude)
- Adjusted $R^2$
- Residual Standard Error (RSE)
- Assumptions Validation (Linear Regression Assumptions)
1. Residual Analysis
The residuals are the differences between the actual values ($Y_i$) and the predicted values ($\hat{Y}_i$). By analyzing the residuals, we can assess whether the model is underfitting or overfitting, and whether the assumptions of linearity and homoscedasticity (constant variance of residuals) are valid.
For each data point:

| Hours Studied (X) | Actual Score (Y) | Predicted Score Ŷ (41.5 + 7.5X) | Residual (Y - Ŷ) |
|---|---|---|---|
| 1 | 50 | 49 | 1 |
| 2 | 55 | 56.5 | -1.5 |
| 3 | 65 | 64 | 1 |
| 4 | 70 | 71.5 | -1.5 |
| 5 | 80 | 79 | 1 |
Residual Interpretation: The residuals are small, and there's no clear pattern of bias, suggesting the model fits the data well.
2. R-squared ($R^2$)
The R-squared value indicates the proportion of the variance in the dependent variable (Y) that is predictable from the independent variable (X). It ranges between 0 and 1:
- $R^2 = 1$ indicates a perfect fit.
- $R^2 = 0$ indicates the model explains none of the variability of the data.

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

Where:
- $Y_i$ is the actual value.
- $\hat{Y}_i$ is the predicted value.
- $\bar{Y}$ is the mean of the actual values.
Let's calculate $R^2$:
- Total Sum of Squares (TSS): Measures the total variability in the dataset.
  $$TSS = \sum (Y_i - \bar{Y})^2 = 196 + 81 + 1 + 36 + 256 = 570$$
- Residual Sum of Squares (RSS): Measures the unexplained variability (errors) from the model.
  $$RSS = \sum (Y_i - \hat{Y}_i)^2 = 1 + 2.25 + 1 + 2.25 + 1 = 7.5$$
- R-squared:
  $$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{7.5}{570} \approx 0.9868$$
Interpretation:
$R^2 \approx 0.9868$ means that 98.68% of the variance in test scores is explained by the hours studied. This indicates an excellent fit.
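A minimal NumPy check of this calculation (variable names are illustrative):

```python
import numpy as np

Y = np.array([50, 55, 65, 70, 80])
Y_hat = 41.5 + 7.5 * np.array([1, 2, 3, 4, 5])

tss = np.sum((Y - Y.mean()) ** 2)   # 570.0
rss = np.sum((Y - Y_hat) ** 2)      # 7.5
r_squared = 1 - rss / tss
print(round(r_squared, 4))          # 0.9868
```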
3. Mean Squared Error (MSE)
The Mean Squared Error (MSE) quantifies the average squared difference between the actual and predicted values. The formula is:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$
We already calculated the sum of squared residuals as $RSS = 7.5$, so the MSE is:
$$MSE = \frac{7.5}{5} = 1.5$$
Interpretation:
The MSE of 1.5 means that, on average, the squared prediction error is 1.5 (in squared score units). In real-world terms, this means the predictions are generally quite close to the actual values.
4. Adjusted $R^2$
Adjusted $R^2$ is a modified version of the $R^2$ metric that accounts for the number of predictors in the model. Unlike $R^2$, which can increase simply by adding more predictors (regardless of their relevance), Adjusted $R^2$ only increases if the new predictor improves the model fit. It penalizes the model for adding predictors that do not meaningfully contribute to explaining the variation in the dependent variable.
The formula for Adjusted $R^2$ is:
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where:
- $R^2$ is the regular coefficient of determination,
- $n$ is the number of observations,
- $p$ is the number of predictors.
Adjusted $R^2$ provides a more reliable measure of model performance when comparing models with different numbers of predictors. A higher Adjusted $R^2$ value suggests that the model explains a larger proportion of the variance in the dependent variable, taking into account model complexity.
With $R^2 \approx 0.9868$, $n = 5$, and $p = 1$, we can calculate Adjusted $R^2$:
$$\text{Adjusted } R^2 = 1 - \frac{(1 - 0.9868)(5 - 1)}{5 - 1 - 1} \approx 0.9825$$
Interpretation:
An Adjusted $R^2$ of about 0.98 means that roughly 98% of the variability in the dependent variable (the test scores) is explained by the model, even after considering model complexity.
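A quick sanity check of this formula in Python, using the rounded values from above:

```python
r_squared = 0.9868  # from the R-squared calculation above
n, p = 5, 1         # 5 observations, 1 predictor

adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adjusted_r_squared, 4))  # 0.9824 (≈ 0.98)
```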
5. Residual Standard Error (RSE)
Residual Standard Error (RSE) is a measure of the average amount by which the predicted values (from a regression model) differ from the actual observed values. It indicates the spread of the residuals, or errors, and provides insight into how well the regression model fits the data.
Mathematically, RSE is calculated as:
$$RSE = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n - 2}}$$
where:
- $Y_i$ is the actual observed value,
- $\hat{Y}_i$ is the predicted value,
- $n$ is the number of observations.
A lower RSE indicates that the model has a better fit, with residuals (differences between observed and predicted values) closer to zero. RSE is expressed in the same units as the dependent variable, making it easier to interpret in context.
The residuals are: $1, -1.5, 1, -1.5, 1$.
Step 1: Calculate the Sum of Squared Residuals
$$\sum (Y_i - \hat{Y}_i)^2 = 1^2 + (-1.5)^2 + 1^2 + (-1.5)^2 + 1^2 = 7.5$$
Step 2: Calculate RSE
$$RSE = \sqrt{\frac{7.5}{5 - 2}} = \sqrt{2.5} \approx 1.58$$
Interpretation:
The dependent variable here is the test score and the RSE is about 1.58, which means that, on average, the predictions are about 1.58 points off from the actual scores.
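A minimal sketch of the same computation in NumPy:

```python
import numpy as np

residuals = np.array([1, -1.5, 1, -1.5, 1])
n = len(residuals)

# Residual Standard Error: sqrt of RSS divided by (n - 2)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(round(rse, 2))  # 1.58
```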
6. Assumptions Validation (Linear Regression Assumptions)
To properly evaluate the model, it is essential to check the key assumptions underlying linear regression:
- Linearity
- Independence
- Homoscedasticity
- Normality of Residuals
Linearity
Definition: The relationship between the independent variable X (hours studied) and the dependent variable Y (test score) should be linear.
Numerical Example: Let's say we have the following data points:
Hours Studied (X) | Test Score (Y) |
---|---|
1 | 50 |
2 | 55 |
3 | 65 |
4 | 70 |
5 | 80 |
Scatter Plot
If we were to create a scatter plot of this data, we would plot the hours studied on the x-axis and the test scores on the y-axis.
- The points would roughly form a straight line, suggesting that as study hours increase, test scores also increase linearly.
Residuals Calculation
To confirm linearity with residuals, we calculate the predicted scores using the model $\hat{Y} = 41.5 + 7.5X$.
Now, calculate residuals (actual - predicted):
| Hours Studied (X) | Actual Score (Y) | Predicted Score Ŷ | Residual (Y - Ŷ) |
|---|---|---|---|
| 1 | 50 | 49 | 1 |
| 2 | 55 | 56.5 | -1.5 |
| 3 | 65 | 64 | 1 |
| 4 | 70 | 71.5 | -1.5 |
| 5 | 80 | 79 | 1 |
Independence
Definition: The residuals should not display any pattern when plotted against the predicted values or any other variable.
Residuals Check
Using the residuals calculated above:
| Residual (Y - Ŷ) |
|---|
| 1 |
| -1.5 |
| 1 |
| -1.5 |
| 1 |
- The plot should show a random scatter without any clear pattern.
Interpretation: If the residuals appear randomly scattered, we can assume that they are independent of each other.
Homoscedasticity
Definition: The variance of residuals should be constant across all levels of X.
Residuals Calculation
Using the same residuals:
| Hours Studied (X) | Residual (Y - Ŷ) |
|---|---|
| 1 | 1 |
| 2 | -1.5 |
| 3 | 1 |
| 4 | -1.5 |
| 5 | 1 |
- Absolute Residuals: $|1|, |-1.5|, |1|, |-1.5|, |1|$, which are $1, 1.5, 1, 1.5, 1$.
Interpretation: The residuals fluctuate between $-1.5$ and $1$, indicating that their variance does not change significantly with the value of $X$. This supports the assumption of homoscedasticity.
Normality of Residuals
Definition: The residuals should be approximately normally distributed.
Residuals Check
Using the same residuals from above:
| Hours Studied (X) | Residual (Y - Ŷ) |
|---|---|
| 1 | 1 |
| 2 | -1.5 |
| 3 | 1 |
| 4 | -1.5 |
| 5 | 1 |
Summary Statistics of Residuals:
Mean of Residuals
The mean is calculated by adding up all the residuals and dividing by the number of residuals:
$$\bar{e} = \frac{1 + (-1.5) + 1 + (-1.5) + 1}{5} = \frac{0}{5} = 0$$
Variance of Residuals
The variance is the average of the squared deviations from the mean. Since the mean is 0, we just square each residual and find the average:
$$\sigma^2 = \frac{1^2 + (-1.5)^2 + 1^2 + (-1.5)^2 + 1^2}{5} = \frac{7.5}{5} = 1.5$$
Standard Deviation of Residuals
The standard deviation is the square root of the variance:
$$\sigma = \sqrt{1.5} \approx 1.22$$
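A minimal NumPy sketch of these summary statistics (note that `var()` here uses the population variance, dividing by n, to match the hand calculation):

```python
import numpy as np

residuals = np.array([1, -1.5, 1, -1.5, 1])

print(residuals.mean())  # 0.0
print(residuals.var())   # 1.5
print(residuals.std())   # ~1.22
```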
To check the normality of residuals, you can use either a Q-Q (Quantile-Quantile) plot or a histogram.
Q-Q Plot
A Q-Q plot compares the quantiles of your residuals with the quantiles of a standard normal distribution. If the residuals are normally distributed, they should fall approximately along the 45-degree reference line. Deviations from this line suggest non-normality.
- Interpretation: Points that deviate from the line, especially in the tails, indicate potential departures from normality.
Histogram
A histogram provides a simple visualization of the distribution of residuals. By overlaying a normal distribution curve, you can visually inspect if the residuals have a bell-shaped, symmetric distribution around zero, which indicates normality.
- Interpretation: A symmetric, bell-shaped histogram centered around zero suggests normality. Skewness or kurtosis in the distribution indicates departures from normality.
9. Steps and Python code for Implementing Simple Linear Regression using OLS
Step 1: Create the dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

data = {
    'Hours_Studied': [1, 2, 3, 4, 5],
    'Test_Score': [50, 55, 65, 70, 80]
}
df = pd.DataFrame(data)
Output: This creates a DataFrame df that looks like this:

   Hours_Studied  Test_Score
0              1          50
1              2          55
2              3          65
3              4          70
4              5          80

Inference: A DataFrame df is created containing two columns: Hours_Studied and Test_Score. This dataset can be used to analyze the relationship between hours studied and corresponding test scores.
Step 2: EDA - Histograms of Hours_Studied and Test_Score
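The plotting code for this step is not shown in the original post; a minimal sketch using seaborn histograms (assuming the DataFrame df created above):

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df['Hours_Studied'], bins=5, ax=axes[0], color='skyblue')
axes[0].set_title('Distribution of Hours Studied')
sns.histplot(df['Test_Score'], bins=5, ax=axes[1], color='salmon')
axes[1].set_title('Distribution of Test Scores')
plt.tight_layout()
plt.show()
```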
Inference: The first histogram shows that the hours studied are evenly distributed between 1 and 5. The second histogram indicates that test scores are normally distributed, peaking towards the higher scores, which suggests that more hours of study correlate with better scores.
Step 2: EDA - Scatter plot to visualize the relationship
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Hours_Studied', y='Test_Score', data=df, color='purple', s=100)
plt.title('Scatter Plot: Hours Studied vs Test Score')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.show()
Output: A scatter plot showing points for each student.
Inference: The scatter plot illustrates a clear positive linear relationship between Hours_Studied
and Test_Score
, indicating that as the number of hours studied increases, test scores tend to increase as well.
Step 2: EDA - Correlation analysis
correlation = df.corr()
print(f"Correlation between Hours Studied and Test Score: {correlation.loc['Hours_Studied', 'Test_Score']}")
Output:
Correlation between Hours Studied and Test Score: 0.993399267798783
Inference: The correlation coefficient of approximately 0.993 indicates a very strong positive relationship between hours studied and test scores, confirming the trend observed in the scatter plot.
Step 3: Prepare the data for OLS
X = df['Hours_Studied']
Y = df['Test_Score']
X_with_const = sm.add_constant(X)
Output:
   const  Hours_Studied
0    1.0              1
1    1.0              2
2    1.0              3
3    1.0              4
4    1.0              5
Inference: The data is prepared for Ordinary Least Squares (OLS) regression by separating the independent variable (X
) from the dependent variable (Y
) and adding a constant to X
to account for the intercept in the model.
Step 4: Fit the OLS model
model = sm.OLS(Y, X_with_const).fit()
Inference: The OLS regression model is fitted to the data, establishing the relationship between hours studied and test scores.
Step 5: Get the summary of the regression
print(model.summary())
Output: A summary table of the OLS regression results, including the estimated coefficients, their standard errors, t-statistics, p-values, and fit statistics such as R-squared.
Inference: The coefficient for Hours_Studied is 7.5, which means that for every additional hour studied, the test score increases by an average of 7.5 points. The low p-values (less than 0.05) indicate that the coefficients are statistically significant.
Step 6: Make predictions with the fitted model
Output: No immediate output, but predicted_scores contains the predicted values.
Inference: The model predicts test scores for each student based on the number of hours studied, allowing us to evaluate model performance.
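The prediction call itself is not shown in the original post; a minimal sketch, assuming the standard statsmodels predict method applied to the same design matrix used for fitting:

```python
# Predict test scores for the observed hours studied
predicted_scores = model.predict(X_with_const)
print(predicted_scores)  # 49.0, 56.5, 64.0, 71.5, 79.0
```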
Step 8: Calculate residuals (actual - predicted values).
We calculate the residuals (differences between actual and predicted scores).
residuals = Y - predicted_scores
Inference: Analyzing residuals helps assess the model's accuracy and identify any systematic errors.
Step 9: Evaluation Metrics
mse = mean_squared_error(Y, predicted_scores)
rmse = np.sqrt(mse)
r_squared = model.rsquared
adjusted_r_squared = model.rsquared_adj
rse = np.sqrt(sum(residuals**2) / (len(Y) - 2))
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared: {r_squared}")
print(f"Adjusted R-squared: {adjusted_r_squared}")
print(f"Residual Standard Error (RSE): {rse}")
Output
Mean Squared Error (MSE): 1.5
Root Mean Squared Error (RMSE): 1.224744871391589
R-squared: 0.9868421052631579
Adjusted R-squared: 0.9824561403508772
Residual Standard Error (RSE): 1.5811388300841898
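The plotting code for the residual check below is not included in the original post; a minimal sketch of a residuals-vs-predicted-values plot, assuming matplotlib and the residuals and predicted_scores computed above:

```python
plt.figure(figsize=(8, 6))
plt.scatter(predicted_scores, residuals, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Predicted Values')
plt.xlabel('Predicted Test Score')
plt.ylabel('Residual')
plt.show()
```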
Inference: The plot should show residuals randomly distributed around zero, suggesting that the linearity assumption is met.
Step 13: Residuals vs Hours Studied (Homoscedasticity Check)
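The code for this step is missing from the original post; a minimal sketch, assuming matplotlib and the df and residuals defined earlier:

```python
plt.figure(figsize=(8, 6))
plt.scatter(df['Hours_Studied'], residuals, color='green')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Hours Studied')
plt.xlabel('Hours Studied')
plt.ylabel('Residual')
plt.show()
```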
Inference: The residuals should not display a pattern or funnel shape, indicating homoscedasticity (constant variance of residuals). A pattern might suggest issues with the model.
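The final check referenced below is a Q-Q plot of the residuals; its code is also missing, so here is a minimal sketch using scipy.stats (an assumed additional import):

```python
from scipy import stats

plt.figure(figsize=(8, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()
```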
Inference: If the points fall along the reference line, it indicates that the residuals are normally distributed. Deviations from this line suggest non-normality, which could affect hypothesis testing in the regression model.
10. Limitations of Ordinary Least Squares (OLS)
While Ordinary Least Squares (OLS) is a popular and widely used method for linear regression, it has several limitations that make it less suitable for certain types of data or scenarios. Here are some of the key limitations:
1. Assumption of Linearity
- Limitation: OLS assumes a linear relationship between the independent and dependent variables. However, many real-world relationships are not strictly linear.
- Impact: If the true relationship is non-linear, OLS will provide a poor fit and inaccurate predictions.
2. Sensitivity to Outliers
- Limitation: OLS is sensitive to outliers, as it minimizes the sum of squared residuals, which disproportionately weights larger errors.
- Impact: A few outliers can heavily influence the regression line, leading to misleading results.
3. Assumption of Homoscedasticity
- Limitation: OLS assumes that the variance of the errors (residuals) is constant across all levels of the independent variable (homoscedasticity).
- Impact: If the errors exhibit heteroscedasticity (i.e., variance changes with different values of the independent variable), the standard errors of the coefficients may be incorrect, leading to unreliable hypothesis tests and confidence intervals.
4. Normality of Errors
- Limitation: OLS assumes that the residuals are normally distributed, which is required for making valid inferences about the coefficients.
- Impact: When the residuals deviate from normality, statistical tests (like t-tests for coefficients) may become invalid, affecting the reliability of the model.
In the next post, we will dive into multiple linear regression using the closed-form solution. This method is a powerful extension of linear regression that allows us to model the relationship between a dependent variable and multiple independent variables.
The closed-form solution provides a direct way to compute the coefficients for all variables by solving the normal equations, making it an efficient approach for fitting linear models.
Stay tuned as we break down the mathematics behind the closed-form solution and explore its application in real-world scenarios.