Simple Linear Regression using OLS Made Easy: A Layman’s Perspective on Machine Learning

Linear regression can be performed using several approaches, including Ordinary Least Squares (OLS) with a single variable, OLS with multiple variables (the closed-form solution), and gradient descent.

In this post, we discuss Ordinary Least Squares (OLS) for performing simple linear regression. The topics include:

  1. Simple Linear regression
  2. Objective of Simple Linear Regression
  3. Example
  4. Mathematical Intuition
  5. Interpretation of Coefficients
  6. Applications of Simple Linear Regression
  7. Numerical Example Using Ordinary Least Squares (OLS)
  8. Model Evaluation
  9. Steps and Python code for Implementing Simple Linear Regression using OLS
  10. Limitations of Ordinary Least Squares (OLS)

1. Simple Linear Regression

Simple linear regression is a type of supervised learning algorithm in machine learning that predicts a continuous outcome based on one input feature. It models the relationship between two variables by fitting a straight line (regression line) through the data points. The idea is to understand how a change in the independent variable (X) leads to a change in the dependent variable (Y).

The model assumes that the relationship between X and Y is linear, meaning it can be described by a straight line. This can be expressed mathematically as:

Y = b_0 + b_1 X

Where:

  • Y is the predicted output (dependent variable).
  • X is the input feature (independent variable).
  • b_0 is the intercept (the value of Y when X = 0).
  • b_1 is the slope (rate at which Y changes with X).

2. Objective of Simple Linear Regression

The objective of simple linear regression is to find the best-fit line by estimating the values of b_0 (the intercept) and b_1 (the slope). The "best-fit" line minimizes the error between the actual observed values and the predicted values.

In machine learning, this is achieved by minimizing a loss function, typically the Mean Squared Error (MSE), which calculates the average squared difference between the actual output and the predicted output:

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

Where:

  • Y_i is the actual value.
  • \hat{Y}_i is the predicted value based on the current model.
  • n is the number of data points.
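
As a quick illustration (a minimal sketch, not part of the original walkthrough), the MSE can be computed in a few lines of Python. The values below are the actual and fitted scores from the worked example later in this post.

import numpy as np

# Actual scores and the fitted values from the worked example (Y_hat = 41.5 + 7.5X)
y_actual = np.array([50, 55, 65, 70, 80], dtype=float)
y_predicted = np.array([49.0, 56.5, 64.0, 71.5, 79.0])

mse = np.mean((y_actual - y_predicted) ** 2)  # average of squared errors
print(mse)  # 1.5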

3. Example 

Imagine you want to predict the test score of students based on how many hours they studied. You collect data about 5 students:

Hours Studied (X)    Test Score (Y)
        1                  50
        2                  55
        3                  65
        4                  70
        5                  80


You want to use this data to build a linear model that predicts test scores (Y) based on hours studied (X).

4. Mathematical Intuition

In simple linear regression, the key is to determine the line of best fit for the data. This line is determined by two parameters: the slope b_1 and the intercept b_0.

  • The slope b_1 tells us how much Y (output) changes for each unit change in X (input).
    • If b_1 is positive, it indicates that as X increases, Y also increases.
    • If b_1 is negative, it indicates that as X increases, Y decreases.
  • The intercept b_0 represents the value of Y when X = 0.

The task of the machine learning algorithm is to minimize the prediction error by adjusting b_0 and b_1. The ordinary least squares (OLS) method, commonly used in linear regression, minimizes the Mean Squared Error (MSE), which calculates the average squared difference between the actual output and the predicted output.
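
For reference, here is a short sketch of where the OLS estimates come from (standard algebra, added here for completeness): setting the partial derivatives of the MSE with respect to b_0 and b_1 to zero and solving yields the closed-form estimates used in the numerical example below.

b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}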

5. Interpretation of Coefficients

Once the model is trained and the best-fit line is determined, the coefficients b0 and b1 provide key insights:

  • Intercept b0:

    The intercept tells us the predicted value of Y when the input X is zero. For example, if b_0 = 45 in a study-hours model, it means that a student who doesn't study at all (X = 0) is expected to score 45.

  • Slope b1:

    The slope represents the change in Y for a one-unit increase in X. For example, if b_1 = 7, it means that for each additional hour of studying, the test score is expected to increase by 7 points.

With these illustrative values, the regression equation would look like: Y = 45 + 7X

6. Applications of Simple Linear Regression
  1. Predicting Prices:
    Predict house prices based on the size of the house (e.g., square footage).

  2. Sales Forecasting:
    Predict future sales based on past sales or marketing spend.

  3. Risk Assessment:
    Estimate insurance premiums based on factors like age or driving history.

  4. Medical Applications:
    Predict health outcomes based on variables like age, blood pressure, or cholesterol levels.

  5. Advertising Effectiveness:
    Understand how changes in advertising budget impact revenue or brand awareness.

7. Numerical Example Using Ordinary Least Squares (OLS)

Let’s walk through the Ordinary Least Squares (OLS) method to compute the best-fit line for the data:

Hours Studied (X)    Test Score (Y)
        1                  50
        2                  55
        3                  65
        4                  70
        5                  80


Step 1: Calculate the Mean of X and Y

First, we calculate the means \bar{X} and \bar{Y} of X and Y.

\bar{X} = \frac{1+2+3+4+5}{5} = 3
\bar{Y} = \frac{50+55+65+70+80}{5} = 64

Step 2: Calculate the Slope b_1

The slope b_1 is calculated as:

b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}

We calculate each part:

(X_i - \bar{X}) = [-2, -1, 0, 1, 2]
(Y_i - \bar{Y}) = [-14, -9, 1, 6, 16]

Now, calculate the products:

(X_i - \bar{X})(Y_i - \bar{Y}) = [28, 9, 0, 6, 32]

Sum these values:

\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 28 + 9 + 0 + 6 + 32 = 75

Now, calculate \sum (X_i - \bar{X})^2:

(X_i - \bar{X})^2 = [4, 1, 0, 1, 4]
\sum (X_i - \bar{X})^2 = 4 + 1 + 0 + 1 + 4 = 10

Finally, calculate the slope:

b_1 = \frac{75}{10} = 7.5

Step 3: Calculate the Intercept b_0

The intercept b_0 is calculated as:

b_0 = \bar{Y} - b_1 \bar{X}

Substitute the known values:

b_0 = 64 - 7.5 \times 3 = 64 - 22.5 = 41.5

Step 4: Write the Final Equation

The final regression line equation is:

Y = 41.5 + 7.5X

Step 5: Make Predictions

Using the regression equation, we can now predict test scores for any given number of hours studied. For example, if a student studies for 6 hours:

Y = 41.5 + 7.5 \times 6 = 41.5 + 45 = 86.5

So, the predicted test score for 6 hours of study is 86.5.
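
As a quick check on the hand calculation (a minimal NumPy sketch, illustrative only), the closed-form formulas can be evaluated directly:

import numpy as np

# Data from the example
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([50, 55, 65, 70, 80], dtype=float)

# Closed-form OLS estimates
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

print(b0, b1)       # 41.5 7.5
print(b0 + b1 * 6)  # 86.5, the predicted score for 6 hours of study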

8. Model Evaluation

Evaluating a simple linear regression model involves checking how well the model fits the data, how accurate its predictions are, and whether the assumptions of the linear regression model are satisfied. Let's go step-by-step to evaluate the model derived from the above example.

Model Equation:

The regression equation we derived is:

Y = 41.5 + 7.5X

Evaluation Metrics:

We can use various metrics and tests to evaluate the performance of the model. Some common methods include:

  1. Residual Analysis (Checking Errors)
  2. R-squared (R^2) (Goodness-of-Fit)
  3. Mean Squared Error (MSE) (Error Magnitude)
  4. Adjusted R^2
  5. Residual Standard Error (RSE)
  6. Assumptions Validation (Linear Regression Assumptions)

1. Residual Analysis

The residuals are the differences between the actual values Y_i and the predicted values \hat{Y}_i. By analyzing the residuals, we can assess whether the model is underfitting or overfitting, and whether the assumptions of linearity and homoscedasticity (constant variance of residuals) are valid.

For each data point:

Hours Studied (X)    Actual Score (Y)    Predicted Score Ŷ (41.5 + 7.5X)    Residual (Y - Ŷ)
        1                   50                      49.0                         1.0
        2                   55                      56.5                        -1.5
        3                   65                      64.0                         1.0
        4                   70                      71.5                        -1.5
        5                   80                      79.0                         1.0

The residuals are small, and there’s no clear pattern of bias, suggesting the model fits the data well.

Residual Interpretation

The small residual values (between -1.5 and 1) show that the predicted values are close to the actual values. A random scatter of residuals (no consistent pattern) implies a good model fit. In this case, the model is not underfitting or overfitting.

2. R-squared (R^2)

The R-squared value indicates the proportion of the variance in the dependent variable (Y) that is predictable from the independent variable (X). It ranges between 0 and 1:

  • R^2 = 1 indicates a perfect fit.
  • R^2 = 0 indicates the model explains none of the variability of the data.

The formula for R^2 is:

R^2 = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}

Where:

  • Y_i is the actual value.
  • \hat{Y}_i is the predicted value.
  • \bar{Y} is the mean of the actual values.

Let's calculate R^2:

  1. Total Sum of Squares (TSS): Measures the total variability in the dataset.

TSS = \sum (Y_i - \bar{Y})^2 = (-14)^2 + (-9)^2 + 1^2 + 6^2 + 16^2 = 196 + 81 + 1 + 36 + 256 = 570

  2. Residual Sum of Squares (RSS): Measures the unexplained variability (errors) from the model.

RSS = \sum (Y_i - \hat{Y}_i)^2 = 1^2 + (-1.5)^2 + 1^2 + (-1.5)^2 + 1^2 = 1 + 2.25 + 1 + 2.25 + 1 = 7.5

  3. R-squared:

R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{7.5}{570} = 1 - 0.01316 = 0.9868

Interpretation:

R^2 = 0.9868 means that 98.68% of the variance in test scores is explained by the hours studied. This indicates an excellent fit.
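
The same calculation in a minimal Python sketch (illustrative only, using the actual and predicted values from this example):

import numpy as np

Y = np.array([50, 55, 65, 70, 80], dtype=float)
Y_hat = np.array([49.0, 56.5, 64.0, 71.5, 79.0])

tss = np.sum((Y - Y.mean()) ** 2)  # total sum of squares: 570.0
rss = np.sum((Y - Y_hat) ** 2)     # residual sum of squares: 7.5
r_squared = 1 - rss / tss
print(round(r_squared, 4))         # 0.9868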

3. Mean Squared Error (MSE)

The Mean Squared Error (MSE) quantifies the average squared difference between the actual and predicted values. The formula is:

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

We already calculated the sum of squared residuals as 7.5, so the MSE is:

MSE = \frac{7.5}{5} = 1.5

Interpretation:

The MSE of 1.5 means that the average squared prediction error is 1.5. In real-world terms, this means the predictions are generally quite close to the actual values.

4. Adjusted R^2

Adjusted R^2 is a modified version of the R^2 metric that accounts for the number of predictors in the model. Unlike R^2, which can increase simply by adding more predictors (regardless of their relevance), Adjusted R^2 only increases if a new predictor improves the model fit. It penalizes the model for adding predictors that do not meaningfully contribute to explaining the variation in the dependent variable.

The formula for Adjusted R^2 is:

R^2_{\text{adj}} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}

where:

  • R^2 is the regular coefficient of determination,
  • n is the number of observations,
  • p is the number of predictors.

Adjusted R^2 provides a more reliable measure of model performance when comparing models with different numbers of predictors. A higher Adjusted R^2 value suggests that the model explains a larger proportion of the variance in the dependent variable, taking into account model complexity.

From the earlier calculation, R^2 = 0.9868, with n = 5 observations and p = 1 predictor.

Calculate Adjusted R^2:

R^2_{\text{adj}} = 1 - (1 - 0.9868) \frac{5 - 1}{5 - 1 - 1} = 1 - 0.0132 \times \frac{4}{3} = 1 - 0.0176 = 0.9824

Interpretation:

An Adjusted R^2 of 0.9824 means that about 98% of the variability in the dependent variable (test scores) is explained by the model, even after accounting for model complexity.
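
The same adjustment in a minimal Python sketch (illustrative only; n and p taken from this example):

n, p = 5, 1  # number of observations and number of predictors
r_squared = 0.9868
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))  # 0.9824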

5. Residual Standard Error (RSE)

Residual Standard Error (RSE) is a measure of the average amount by which the predicted values (from a regression model) differ from the actual observed values. It indicates the spread of the residuals, or errors, and provides insight into how well the regression model fits the data.

Mathematically, RSE is calculated as:

\text{RSE} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - p - 1}}

which, for simple linear regression with a single predictor (p = 1), reduces to:

\text{RSE} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}

where:

  • Y is the actual observed value,
  • \hat{Y} is the predicted value,
  • n is the number of observations.

A lower RSE indicates that the model has a better fit, with residuals (differences between observed and predicted values) closer to zero. RSE is expressed in the same units as the dependent variable, making it easier to interpret in context.

The residuals are: 1, -1.5, 1, -1.5, 1.

Step 1: Calculate the Sum of Squared Residuals

\sum (Y - \hat{Y})^2 = 1^2 + (-1.5)^2 + 1^2 + (-1.5)^2 + 1^2 = 1 + 2.25 + 1 + 2.25 + 1 = 7.5

Step 2: Calculate RSE

\text{RSE} = \sqrt{\frac{7.5}{5 - 2}} = \sqrt{\frac{7.5}{3}} = \sqrt{2.5} \approx 1.58

Interpretation:

In this example, the dependent variable is a test score and the RSE is 1.58, meaning that, on average, the predictions are about 1.58 points off from the actual scores.
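
A minimal Python sketch of the RSE calculation (illustrative only, using the residuals from this example):

import numpy as np

residuals = np.array([1.0, -1.5, 1.0, -1.5, 1.0])
n, p = len(residuals), 1  # 5 observations, 1 predictor

rse = np.sqrt(np.sum(residuals ** 2) / (n - p - 1))
print(round(rse, 2))  # 1.58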

6. Assumptions Validation (Linear Regression Assumptions)

To properly evaluate the model, it is essential to check the key assumptions underlying linear regression:

  • Linearity
  • Independence
  • Homoscedasticity
  • Normality of Residuals

Linearity

Definition: The relationship between the independent variable X (hours studied) and the dependent variable Y (test score) should be linear.

Numerical Example: Let's say we have the following data points:

Hours Studied (X)    Test Score (Y)
        1                  50
        2                  55
        3                  65
        4                  70
        5                  80

Scatter Plot

If we were to create a scatter plot of this data, we would plot the hours studied on the x-axis and the test scores on the y-axis.

  • The points would roughly form a straight line, suggesting that as study hours increase, test scores also increase linearly.

Residuals Calculation

To confirm linearity with residuals, we calculate the predicted scores using the model:

\hat{Y} = 41.5 + 7.5X

Now, calculate residuals (actual - predicted):


Hours Studied (X)    Actual Score (Y)    Predicted Score Ŷ    Residual (Y - Ŷ)
        1                   50                 49.0                 1.0
        2                   55                 56.5                -1.5
        3                   65                 64.0                 1.0
        4                   70                 71.5                -1.5
        5                   80                 79.0                 1.0


Interpretation: The small residuals indicate that the predicted values are close to the actual values, supporting the assumption of linearity.

Independence

Definition: The residuals should not display any pattern when plotted against the predicted values or any other variable.

Residuals Check

Using the residuals calculated above:

Residual (Y - Ŷ): 1, -1.5, 1, -1.5, 1


If we plot the residuals against the predicted values \hat{Y}:
  • The plot should show a random scatter without any clear pattern.

Interpretation: If the residuals appear randomly scattered, we can assume that they are independent of each other.
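
Beyond the visual check, the Durbin-Watson statistic is a common numerical test for autocorrelation in residuals: values near 2 indicate little autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. A minimal sketch (illustrative, not part of the original walkthrough), using the residuals from this example:

import numpy as np
from statsmodels.stats.stattools import durbin_watson

residuals = np.array([1.0, -1.5, 1.0, -1.5, 1.0])
dw = durbin_watson(residuals)
print(dw)  # compare against the benchmark value of 2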

Homoscedasticity

Definition: The variance of residuals should be constant across all levels of X.

Residuals Calculation

Using the same residuals:

Hours Studied (X)    Residual (Y - Ŷ)
        1                  1.0
        2                 -1.5
        3                  1.0
        4                 -1.5
        5                  1.0


If we analyze the absolute values of the residuals:
  • Absolute residuals: |1|, |-1.5|, |1|, |-1.5|, |1|, which are 1, 1.5, 1, 1.5, 1.

Interpretation: The residuals fluctuate between 1 and 1.5, indicating that their variance does not change significantly with the value of X. This supports the assumption of homoscedasticity.
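
For a more formal check, the Breusch-Pagan test from statsmodels can be applied to the model's residuals; a large p-value means there is no evidence against constant variance. A minimal sketch (illustrative only), refitting the example model:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

X = sm.add_constant(np.array([1, 2, 3, 4, 5], dtype=float))
Y = np.array([50, 55, 65, 70, 80], dtype=float)
model = sm.OLS(Y, X).fit()

# Returns (LM statistic, LM p-value, F statistic, F p-value)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)  # a large p-value is consistent with homoscedasticity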

Normality of Residuals

Definition: The residuals should be approximately normally distributed.

Residuals Check

Using the same residuals from above:

Hours Studied (X)    Residual (Y - Ŷ)
        1                  1.0
        2                 -1.5
        3                  1.0
        4                 -1.5
        5                  1.0

Summary Statistics of Residuals:

Mean of Residuals

The mean \mu is calculated by adding up all the residuals and dividing by the number of residuals.

\mu = \frac{1 + (-1.5) + 1 + (-1.5) + 1}{5} = \frac{0}{5} = 0

Variance of Residuals

The variance \sigma^2 is the average of the squared deviations from the mean. Since the mean is 0, we just square each residual and find the average.

\sigma^2 = \frac{(1-0)^2 + (-1.5-0)^2 + (1-0)^2 + (-1.5-0)^2 + (1-0)^2}{5} = \frac{1 + 2.25 + 1 + 2.25 + 1}{5} = \frac{7.5}{5} = 1.5

Standard Deviation of Residuals

The standard deviation \sigma is the square root of the variance.

\sigma = \sqrt{1.5} \approx 1.22

To check the normality of residuals, you can use either a Q-Q (Quantile-Quantile) plot or a histogram.

Q-Q Plot

A Q-Q plot compares the quantiles of your residuals with the quantiles of a standard normal distribution. If the residuals are normally distributed, they should fall approximately along the 45-degree reference line. Deviations from this line suggest non-normality.

  • Interpretation: Points that deviate from the line, especially in the tails, indicate potential departures from normality.

Histogram

A histogram provides a simple visualization of the distribution of residuals. By overlaying a normal distribution curve, you can visually inspect if the residuals have a bell-shaped, symmetric distribution around zero, which indicates normality.

  • Interpretation: A symmetric, bell-shaped histogram centered around zero suggests normality. Skewness or kurtosis in the distribution indicates departures from normality.
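
In addition to these visual checks, a formal normality test such as the Shapiro-Wilk test from SciPy can be used; a p-value above the chosen significance level means normality is not rejected. A minimal sketch (illustrative only, on the residuals from this example):

import numpy as np
from scipy import stats

residuals = np.array([1.0, -1.5, 1.0, -1.5, 1.0])
stat, p_value = stats.shapiro(residuals)
print(stat, p_value)  # p > 0.05: fail to reject normality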

9. Steps and Python code for Implementing Simple Linear Regression using OLS

  1. Create a dataset of hours studied vs test scores.
  2. Perform exploratory data analysis (EDA):
    • Generate descriptive statistics for the dataset.
    • Plot the distribution of hours studied.
    • Plot the distribution of test scores.
    • Create a scatter plot to visualize the relationship between hours studied and test scores.
    • Calculate and print the correlation between hours studied and test scores.
  3. Prepare data for OLS regression by adding a constant (intercept).
  4. Fit the OLS regression model.
  5. Print the summary of the regression model.
  6. Access and print the intercept and slope coefficients.
  7. Make predictions based on the fitted model.
  8. Calculate residuals (actual - predicted values).
  9. Calculate evaluation metrics: MSE, RMSE, R-squared, Adjusted R-squared, and RSE.
  10. Actual vs Predicted Visualization 
  11. Plot residuals vs predicted values to check linearity.
  12. Plot residuals over index to check for independence.
  13. Plot residuals vs hours studied to check for homoscedasticity.
  14. Generate a Q-Q plot to assess the normality of residuals.

Here’s a complete walkthrough of the code, including inferences for each step along with the expected output:

Step 1: Create a dataset

# Imports used throughout the walkthrough
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from sklearn.metrics import mean_squared_error

data = {
    'Hours_Studied': [1, 2, 3, 4, 5],
    'Test_Score': [50, 55, 65, 70, 80]
}

df = pd.DataFrame(data)

Output: This creates a DataFrame df that looks like this:

   Hours_Studied  Test_Score
0              1          50
1              2          55
2              3          65
3              4          70
4              5          80


Inference: A DataFrame df is created containing two columns: Hours_Studied and Test_Score. This dataset can be used to analyze the relationship between hours studied and corresponding test scores.


Step 2: EDA - Descriptive Statistics

print("Descriptive Statistics:")
print(df.describe())

Output:

Descriptive Statistics:
       Hours_Studied  Test_Score
count       5.000000    5.000000
mean        3.000000   64.000000
std         1.581139   11.937336
min         1.000000   50.000000
25%         2.000000   55.000000
50%         3.000000   65.000000
75%         4.000000   70.000000
max         5.000000   80.000000

Inference: The descriptive statistics show the central tendency and dispersion of the dataset. The mean test score is 64, with a standard deviation of approximately 11.94, indicating variability in test scores among students.

Step 2: EDA - Plot the distribution of Hours_Studied and Test_Score


plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(df['Hours_Studied'], kde=True, color='blue')
plt.title('Distribution of Hours Studied')

plt.subplot(1, 2, 2)
sns.histplot(df['Test_Score'], kde=True, color='green')
plt.title('Distribution of Test Scores')
plt.show()


Output: Two histograms displaying the distributions of Hours_Studied and Test_Score




Inference: The first histogram shows that the hours studied are evenly distributed between 1 and 5. The second histogram shows that the test scores spread from 50 to 80; taken together with the scatter plot in the next step, this suggests that more hours of study correspond to higher scores.

Step 2: EDA - Scatter plot to visualize the relationship


plt.figure(figsize=(8, 6))
sns.scatterplot(x='Hours_Studied', y='Test_Score', data=df, color='purple', s=100)
plt.title('Scatter Plot: Hours Studied vs Test Score')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.show()


Output: A scatter plot showing points for each student.




Inference: The scatter plot illustrates a clear positive linear relationship between Hours_Studied and Test_Score, indicating that as the number of hours studied increases, test scores tend to increase as well.

Step 2: EDA - Correlation analysis

correlation = df.corr()
print(f"Correlation between Hours Studied and Test Score: {correlation.loc['Hours_Studied', 'Test_Score']}")

Output:

Correlation between Hours Studied and Test Score: 0.993399267798783


Inference: The correlation coefficient of approximately 0.993 indicates a very strong positive relationship between hours studied and test scores, confirming the trend observed in the scatter plot.

Step 3: Prepare the data for OLS


X = df['Hours_Studied']
Y = df['Test_Score']
X_with_const = sm.add_constant(X)

Output: 

   const  Hours_Studied
0    1.0              1
1    1.0              2
2    1.0              3
3    1.0              4
4    1.0              5

Inference: The data is prepared for Ordinary Least Squares (OLS) regression by separating the independent variable (X) from the dependent variable (Y) and adding a constant to X to account for the intercept in the model.

Step 4: Fit the OLS model

model = sm.OLS(Y, X_with_const).fit()

Output: No immediate output, but the model is fitted to the data.

Inference: The OLS regression model is fitted to the data, establishing the relationship between hours studied and test scores.

Step 5: Get the summary of the regression

print(model.summary())

Output: The regression summary will display coefficients, R-squared values, F-statistic, p-values, etc.






Inference: The R-squared value of 0.987 indicates that 98.7% of the variability in test scores can be explained by hours studied. The coefficient for Hours_Studied is 7.5, which means for every additional hour studied, the test score increases by an average of 7.5 points. The low p-values (less than 0.05) indicate that the coefficients are statistically significant.


Step 6: Access and print the intercept and slope coefficients.

intercept = model.params[0]
slope = model.params[1]
print(f"Intercept (beta_0): {intercept}")
print(f"Slope (beta_1): {slope}")

Output: 

Intercept (beta_0): 41.5
Slope (beta_1): 7.5

Inference: The intercept (beta_0) is 41.5, suggesting that if no hours are studied, the expected test score would be about 41.5. The slope (beta_1) of 7.5 confirms that each additional hour of study increases the predicted test score by 7.5 points.

Step 7: Make predictions based on the fitted model.

predicted_scores = model.predict(X_with_const)

Output: No immediate output, but predicted_scores contains predicted values.

Inference: The model predicts test scores for each student based on the number of hours studied, allowing us to evaluate model performance.

Step 8: Calculate residuals (actual - predicted values).

We calculate the residuals (differences between actual and predicted scores).

residuals = Y - predicted_scores

Output:

0    1.0
1   -1.5
2    1.0
3   -1.5
4    1.0

Inference: Analyzing residuals helps assess the model's accuracy and identify any systematic errors.

Step 9: Evaluation Metrics

mse = mean_squared_error(Y, predicted_scores)

rmse = np.sqrt(mse)

r_squared = model.rsquared

adjusted_r_squared = model.rsquared_adj

rse = np.sqrt(sum(residuals**2) / (len(Y) - 2))


print(f"Mean Squared Error (MSE): {mse}")

print(f"Root Mean Squared Error (RMSE): {rmse}")

print(f"R-squared: {r_squared}")

print(f"Adjusted R-squared: {adjusted_r_squared}")

print(f"Residual Standard Error (RSE): {rse}")


Output

Mean Squared Error (MSE): 1.5

Root Mean Squared Error (RMSE): 1.2247

R-squared: 0.9868

Adjusted R-squared: 0.9825

Residual Standard Error (RSE): 1.5811

Inference: The MSE and RMSE capture the average squared prediction error and its square root. A low RMSE (about 1.22 points) suggests good predictive accuracy. The R-squared and Adjusted R-squared values affirm the model's strong explanatory power. The RSE (about 1.58) is larger than the RMSE because it divides by n - 2 rather than n, but it tells the same story: the average prediction error is small, indicating a reliable fit.

Step 10: Actual vs Predicted Visualization 


results_df = pd.DataFrame({ 'Hours_Studied': df['Hours_Studied'],

  'Actual_Test_Score': Y, 'Predicted_Test_Score': predicted_scores })


Output:

   Hours_Studied  Actual_Test_Score  Predicted_Test_Score
0              1                 50                  49.0
1              2                 55                  56.5
2              3                 65                  64.0
3              4                 70                  71.5
4              5                 80                  79.0


plt.figure(figsize=(10, 6))
plt.plot(results_df['Hours_Studied'], results_df['Actual_Test_Score'], marker='o', label='Actual Test Score', color='blue')
plt.plot(results_df['Hours_Studied'], results_df['Predicted_Test_Score'], marker='x', label='Predicted Test Score', color='red')
plt.title('Actual vs Predicted Test Scores')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.legend()
plt.show()




Inference: The predicted test scores closely align with the actual test scores for the given hours studied. This indicates that the regression model has captured the underlying relationship reasonably well.

Step 11: Plot Residuals vs Predicted Values (Linearity Check)

plt.figure(figsize=(10, 6))
plt.scatter(predicted_scores, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs Predicted Values')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()


Output: A scatter plot of residuals against predicted values.


Inference: The plot should show residuals randomly distributed around zero, suggesting that the linearity assumption is met. 

Step 12: Residual Plot (Independence Check)

plt.figure(figsize=(10, 6))
plt.plot(residuals)
plt.title('Residuals Plot')
plt.xlabel('Index')
plt.ylabel('Residuals')
plt.axhline(0, color='red', linestyle='--')
plt.show()

Output:



Inference: The residuals should fluctuate randomly around zero, indicating independence. Patterns might suggest model misspecification.


Step 13: Residuals vs Hours Studied (Homoscedasticity Check)


plt.figure(figsize=(10, 6))
plt.scatter(X, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs Hours Studied')
plt.xlabel('Hours Studied')
plt.ylabel('Residuals')
plt.show()

Output:


Inference: The residuals should not display a pattern or funnel shape, indicating homoscedasticity (constant variance of residuals). A pattern might suggest issues with the model.

Step 14: Normality of Residuals - Q-Q Plot

plt.figure(figsize=(10, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()

Output: A Q-Q plot displaying the distribution of residuals.




Inference: If the points fall along the reference line, it indicates that the residuals are normally distributed. Deviations from this line suggest non-normality, which could affect hypothesis testing in the regression model.

10. Limitations of Ordinary Least Squares (OLS)

While Ordinary Least Squares (OLS) is a popular and widely used method for linear regression, it has several limitations that make it less suitable for certain types of data or scenarios. Here are some of the key limitations:

1. Assumption of Linearity
  • Limitation: OLS assumes a linear relationship between the independent and dependent variables. However, many real-world relationships are not strictly linear.
  • Impact: If the true relationship is non-linear, OLS will provide a poor fit and inaccurate predictions.

2. Sensitivity to Outliers

  • Limitation: OLS is sensitive to outliers, as it minimizes the sum of squared residuals, which disproportionately weights larger errors.
  • Impact: A few outliers can heavily influence the regression line, leading to misleading results.

3. Assumption of Homoscedasticity

  • Limitation: OLS assumes that the variance of the errors (residuals) is constant across all levels of the independent variable (homoscedasticity).
  • Impact: If the errors exhibit heteroscedasticity (i.e., variance changes with different values of the independent variable), the standard errors of the coefficients may be incorrect, leading to unreliable hypothesis tests and confidence intervals.

4. Normality of Errors

  • Limitation: OLS assumes that the residuals are normally distributed, which is required for making valid inferences about the coefficients.
  • Impact: When the residuals deviate from normality, statistical tests (like t-tests for coefficients) may become invalid, affecting the reliability of the model.

In the next post, we will dive into multiple linear regression using the closed-form solution. This method is a powerful extension of linear regression that allows us to model the relationship between a dependent variable and multiple independent variables.

The closed-form solution provides a direct way to compute the coefficients for all variables by solving the normal equations, making it an efficient approach for fitting linear models.
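
As a brief preview (a standard result stated here for orientation, not derived in this post), the normal equations give the full coefficient vector in matrix form:

\boldsymbol{\beta} = (X^{T} X)^{-1} X^{T} Y

where X is the matrix of input features with a leading column of ones for the intercept, and Y is the vector of observed outputs.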

Stay tuned as we break down the mathematics behind the closed-form solution and explore its application in real-world scenarios.