Linear Regression Explained
Linear regression is a statistical method used to model and understand the relationship between variables:
- One or more independent variables (X)
- One dependent variable (Y)
Linear regression finds a line that best fits the dataset. This line helps predict the dependent variable (Y) based on the independent variable(s) (X).
Use Cases
- Predicting House Prices
- Stock Price Prediction
- Medical Cost Estimation
- Sales Forecasting
Key Points
- Y: the values you want to predict
- X: the values used to make the prediction
Mathematical Interpretation
For simple linear regression (with one independent variable), the equation is:
Y = wX + b
Where:
- w (weight or slope) represents the change in Y for each unit change in X.
- X is the independent variable.
- b (bias or intercept) is the value of Y when X is 0.
In more complex cases with multiple independent variables, this equation can be extended to:
Y = w1X1 + w2X2 + ... + wnXn + b
This is called multivariate linear regression.
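To make the multivariate form concrete, here is a minimal sketch (the feature values, weights, and bias below are made up purely for illustration) showing that the extended equation is just a dot product in NumPy:
import numpy as np
# Three samples, each with two features (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
w = np.array([200.0, 50.0])  # one weight per feature (assumed values)
b = 100.0                    # bias / intercept (assumed value)
# Y = w1*X1 + w2*X2 + b, computed for every sample at once
Y = X @ w + b
print(Y)  # [400. 525. 775.]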
Basic Example
Let’s start with a simple linear regression example using a dataset with one independent variable (X) and one dependent variable (Y).
Code Implementation
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
X = np.array([1, 2]) # Independent variable
Y = np.array([300, 500]) # Dependent variable (target values)
# Choose arbitrary values for weight (w) and bias (b)
w = 200 # Weight
b = 100 # Bias
# Calculate predicted Y values using the linear equation Y = wX + b
predicted_Y = w * X + b
# Print predicted values for clarity
for i in range(len(X)):
    print(f"Predicted Y for X={X[i]}: {predicted_Y[i]}")
Explanation
In this example, we are manually defining the weight (w) and bias (b) of the linear equation:
Y = wX + b
Predictions
Using the given values for w and b, we can compute the predicted values of Y:
- For X = 1: Y = (200 × 1) + 100 = 300
- For X = 2: Y = (200 × 2) + 100 = 500
These calculations yield predicted values of Y that perfectly match the actual values in the dataset.
Important Note
This basic example uses arbitrary values for w and b to illustrate the concept of linear regression. In real-world applications, we typically use a cost function (like mean squared error) to optimize these parameters based on the data, allowing the model to learn the best-fit line automatically.
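To see what such a cost function measures, the short sketch below (parameter values chosen only for illustration) computes the mean squared error on the toy dataset above, first with the perfect parameters w = 200 and b = 100, then with a deliberately worse weight:
import numpy as np
X = np.array([1, 2])
Y = np.array([300, 500])
def mse(w, b):
    # Mean of the squared differences between predictions and actual values
    predicted = w * X + b
    return np.mean((predicted - Y) ** 2)
print(mse(200, 100))  # 0.0    -> the line passes exactly through both points
print(mse(150, 100))  # 6250.0 -> a worse fit produces a larger error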
Visualization
To visualize how well our predicted values align with the actual data points, we can create a scatter plot and draw the regression line:
# Visualization
plt.scatter(X, Y, marker="X", c="g", label="Actual Values")
plt.plot(X, predicted_Y, label="Predicted Line", color='r')
plt.title('Linear Regression Example')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (Y)')
plt.legend()
plt.grid()
plt.show()
Interpretation of the Visualization
In the visualization:
- Green Markers (Actual Values): These represent the actual data points in our dataset.
- Red Line (Predicted Line): This line represents our model’s predictions based on the chosen w and b. It shows how well the model fits the actual data.
Scikit-learn Implementation
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
# Data
X = np.array([[1], [2]]) # Features should be 2D for sklearn
Y = np.array([300, 500]) # Target values
# Linear regression model
model = LinearRegression()
model.fit(X, Y)
# Predict values
predicted_Y = model.predict(X)
# Plotting
plt.scatter(X, Y, marker="X", c="g", label="Actual Value")
plt.plot(X, predicted_Y, label="Our prediction")
plt.legend()
plt.show()
Implementing Linear Regression from Scratch
Let’s implement simple linear regression from scratch using Python:
The Cost Function
To assess how well our model fits the data, we use the Mean Squared Error (MSE) cost function:
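MSE = (1/n) * Σ (y_pred_i - y_i)²
where n is the number of samples, y_pred_i is the model’s prediction for sample i, and y_i is the actual value. The smaller the MSE, the better the line fits the data.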
Gradient Descent
To optimize our model, we employ the gradient descent algorithm, which minimizes the cost function by iteratively adjusting the weights and bias. The update rules for the weights and bias are given by:
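w = w - α * dw
b = b - α * db
where α is the learning rate and the gradients are:
dw = (1/n) * Σ (y_pred_i - y_i) * x_i
db = (1/n) * Σ (y_pred_i - y_i)
Here the constant factor of 2 from differentiating the squared error is folded into the learning rate, matching the implementation below.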
Code Implementation
Let’s dive into the code for our Linear Regression model.
import numpy as np
class LinearRegression():
    def __init__(self, learning_rate, max_iter):
        self.lr = learning_rate    # step size for gradient descent
        self.max_iter = max_iter   # number of gradient descent iterations

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.random.rand(n_features)  # random initial weights
        self.bias = 0

        for _ in range(self.max_iter):
            # Current predictions with the present weights and bias
            y_pred = np.dot(X, self.weights) + self.bias
            # Gradients of the cost with respect to the weights and bias
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
            # Move the parameters a small step against the gradient
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Generate a synthetic regression dataset and split it
    X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the from-scratch model and report the test-set MSE
    reg = LinearRegression(learning_rate=0.01, max_iter=1000)
    reg.fit(X=X_train, y=y_train)
    y_pred = reg.predict(X_test)
    print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")
This implementation uses gradient descent to minimize the mean squared error and find the optimal weights and bias.
Using Scikit-learn for Linear Regression
Now, let’s implement the same linear regression using scikit-learn:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate a synthetic dataset and split it into train and test sets
X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model and evaluate it on the held-out test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")

# Plot the test points and the fitted regression line
plt.scatter(X_test, y_test, color='blue', label='Actual Values')
plt.plot(X_test, y_pred, color='red', label='Predicted Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Dataset')
plt.legend()
plt.show()
This scikit-learn implementation provides a more streamlined approach with built-in features like train-test splitting and model evaluation metrics.
Both implementations demonstrate how to create a linear regression model, fit it to data, make predictions, and visualize the results. The scikit-learn version offers additional functionality and is generally more efficient for real-world applications.
Conclusion
Linear regression is a powerful and widely used technique in statistics and machine learning. Its simplicity and interpretability make it an excellent starting point for predictive modeling and data analysis. While it has limitations, particularly with complex, non-linear relationships, it remains a fundamental tool in a data scientist’s toolkit.
By understanding the concepts behind linear regression and how to implement it, you’re well on your way to more advanced machine learning techniques. Remember, the key to mastering linear regression is practice — try applying it to various datasets and real-world problems to gain a deeper understanding of its capabilities and limitations.
As you continue your journey in data science and machine learning, you’ll find that the principles you’ve learned here form the foundation for many more advanced techniques. Keep exploring, and happy modeling!