Linear Regression Explained
Linear regression is a statistical method used to model and understand the relationship between variables:
- One or more independent variables (X)
- One dependent variable (Y)
Linear regression finds a line that best fits the dataset. This line helps predict the dependent variable (Y) based on the independent variable(s) (X).
Use Cases
- Predicting House Prices
- Stock Price Prediction
- Medical Cost Estimation
- Sales Forecasting
Key Points
- Y: the values you want to predict
- X: the values used to make the prediction
Mathematical Interpretation
For simple linear regression (with one independent variable), the equation is:
Y = wX + b
Where:
- w (weight or slope) represents the change in Y for each unit change in X.
- X is the independent variable.
- b (bias or intercept) is the value of Y when X is 0.
In more complex cases with multiple independent variables, this equation can be extended to:
Y = w1X1 + w2X2 + ... + wnXn + b
This is called multivariate linear regression.
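To make the multivariate form concrete, here is a minimal sketch (the feature values, weights, and bias below are made up purely for illustration) showing that the extended equation is just a dot product in NumPy:
import numpy as np
# Three samples, each with two features (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
w = np.array([200.0, 50.0])  # one weight per feature (assumed values)
b = 100.0                    # bias / intercept (assumed value)
# Y = w1*X1 + w2*X2 + b, computed for every sample at once
Y = X @ w + b
print(Y)  # [400. 525. 775.]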
Basic Example
Let’s start with a simple linear regression example using a dataset with one independent variable (X) and one dependent variable (Y).
Code Implementation
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
X = np.array([1, 2]) # Independent variable
Y = np.array([300, 500]) # Dependent variable (target values)
# Choose arbitrary values for weight (w) and bias (b)
w = 200 # Weight
b = 100 # Bias
# Calculate predicted Y values using the linear equation Y = wX + b
predicted_Y = w * X + b
# Print predicted values for clarity
for i in range(len(X)):
    print(f"Predicted Y for X={X[i]}: {predicted_Y[i]}")
Explanation
In this example, we are manually defining the weight (w) and bias (b) of the linear equation:
Y = wX + b
Predictions
Using the given values for w and b, we can compute the predicted values of Y:
- For X = 1: Y = (200 × 1) + 100 = 300
- For X = 2: Y = (200 × 2) + 100 = 500
These calculations yield predicted values of Y that perfectly match the actual values in the dataset.
Important Note
This basic example uses arbitrary values for w and b to illustrate the concept of linear regression. In real-world applications, we typically use a cost function (like mean squared error) to optimize these parameters based on the data, allowing the model to learn the best-fit line automatically.
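To see what such a cost function measures, the short sketch below (parameter values chosen only for illustration) computes the mean squared error on the toy dataset above, first with the perfect parameters w = 200 and b = 100, then with a deliberately worse weight:
import numpy as np
X = np.array([1, 2])
Y = np.array([300, 500])
def mse(w, b):
    # Mean of the squared differences between predictions and actual values
    predicted = w * X + b
    return np.mean((predicted - Y) ** 2)
print(mse(200, 100))  # 0.0    -> the line passes exactly through both points
print(mse(150, 100))  # 6250.0 -> a worse fit produces a larger error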
Visualization
To visualize how well our predicted values align with the actual data points, we can create a scatter plot and draw the regression line:
# Visualization
plt.scatter(X, Y, marker="X", c="g", label="Actual Values")
plt.plot(X, predicted_Y, label="Predicted Line", color='r')
plt.title('Linear Regression Example')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (Y)')
plt.legend()
plt.grid()
plt.show()
Interpretation of the Visualization
In the visualization:
- Green Markers (Actual Values): These represent the actual data points in our dataset.
- Red Line (Predicted Line): This line represents our model’s predictions based on the chosen w and b. It shows how well the model fits the actual data.
Scikit-learn Implementation
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
# Data
X = np.array([[1], [2]]) # Features should be 2D for sklearn
Y = np.array([300, 500]) # Target values
# Linear regression model
model = LinearRegression()
model.fit(X, Y)
# Predict values
predicted_Y = model.predict(X)
# Plotting
plt.scatter(X, Y, marker="X", c="g", label="Actual Value")
plt.plot(X, predicted_Y, label="Our prediction")
plt.legend()
plt.show()
Implementing Linear Regression from Scratch
Let’s implement simple linear regression from scratch using Python:
The Cost Function
To assess how well our model fits the data, we use the Mean Squared Error (MSE) cost function:
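MSE = (1/n) * Σ (y_pred_i - y_i)²
where n is the number of samples, y_pred_i is the model’s prediction for sample i, and y_i is the actual value. The smaller the MSE, the better the line fits the data.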
Gradient Descent
To optimize our model, we employ the gradient descent algorithm, which minimizes the cost function by iteratively adjusting the weights and bias. The update rules for the weights and bias are given by:
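w = w - α * dw
b = b - α * db
where α is the learning rate and the gradients are:
dw = (1/n) * Σ (y_pred_i - y_i) * x_i
db = (1/n) * Σ (y_pred_i - y_i)
Here the constant factor of 2 from differentiating the squared error is folded into the learning rate, matching the implementation below.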
Code Implementation
Let’s dive into the code for our Linear Regression model.
import numpy as np
class LinearRegression():
    def __init__(self, learning_rate, max_iter):
        self.lr = learning_rate    # step size for gradient descent
        self.max_iter = max_iter   # number of gradient descent iterations

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.random.rand(n_features)  # random initial weights
        self.bias = 0

        for _ in range(self.max_iter):
            # Current predictions with the present weights and bias
            y_pred = np.dot(X, self.weights) + self.bias
            # Gradients of the cost with respect to the weights and bias
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
            # Move the parameters a small step against the gradient
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Generate a synthetic regression dataset and split it
    X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the from-scratch model and report the test-set MSE
    reg = LinearRegression(learning_rate=0.01, max_iter=1000)
    reg.fit(X=X_train, y=y_train)
    y_pred = reg.predict(X_test)
    print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")
This implementation uses gradient descent to minimize the mean squared error and find the optimal weights and bias.
Using Scikit-learn for Linear Regression
Now, let’s implement the same linear regression using scikit-learn:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate a synthetic dataset and split it into train and test sets
X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model and evaluate it on the held-out test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")

# Plot the test points and the fitted regression line
plt.scatter(X_test, y_test, color='blue', label='Actual Values')
plt.plot(X_test, y_pred, color='red', label='Predicted Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Dataset')
plt.legend()
plt.show()
This scikit-learn implementation provides a more streamlined approach with built-in features like train-test splitting and model evaluation metrics.
Both implementations demonstrate how to create a linear regression model, fit it to data, make predictions, and visualize the results. The scikit-learn version offers additional functionality and is generally more efficient for real-world applications.
Conclusion
Linear regression is a powerful and widely used technique in statistics and machine learning. Its simplicity and interpretability make it an excellent starting point for predictive modeling and data analysis. While it has limitations, particularly with complex, non-linear relationships, it remains a fundamental tool in a data scientist’s toolkit.
By understanding the concepts behind linear regression and how to implement it, you’re well on your way to more advanced machine learning techniques. Remember, the key to mastering linear regression is practice — try applying it to various datasets and real-world problems to gain a deeper understanding of its capabilities and limitations.
As you continue your journey in data science and machine learning, you’ll find that the principles you’ve learned here form the foundation for many more advanced techniques. Keep exploring, and happy modeling!