Codecademy Live Linear Regression #1: Introduction To Simple Linear Regression

Linear Regression

  • One of the simplest and most widely used regression techniques in statistical modeling and machine learning
  • assumes a linear relationship between the independent variable(s) and the dependent variable
    • ideal for scenarios where the trend between predictors and outcomes follows a straight-line pattern
  • Model estimates the coefficients that minimize the difference between observed values and predicted values
  • sensitive to outliers and may struggle with non-linearity, necessitating data preprocessing or transformation
  • Example:
    • a retail company analyzes historical data on advertising budgets across different marketing channels, such as TV, online ads, and social media, to determine how much an increase in spending contributes to higher sales. By fitting a linear regression model, the company can estimate the expected revenue increase per dollar spent on advertising, allowing for more informed budget allocation. This helps in optimizing marketing expenditures and improving profitability based on data-driven insights.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
# Sample data: Advertising budget (X) and Sales revenue (Y)
data = {'Advertising Budget': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'Sales Revenue': [15, 25, 35, 45, 50, 60, 70, 85, 95, 105]}
 
df = pd.DataFrame(data)
 
# Splitting the data
X = df[['Advertising Budget']]
y = df['Sales Revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
 
# Prediction
y_pred = model.predict(X_test)
 
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X_test, y_pred, color='red', linewidth=2, label="Predicted Line")
plt.xlabel("Advertising Budget (in $1000)")
plt.ylabel("Sales Revenue (in $1000)")
plt.legend()
plt.show()
 
# Print model coefficients
print(f"Coefficient: {model.coef_[0]}, Intercept: {model.intercept_}")
print(f"Mean Squared Error: {mse}")

Polynomial Regression

  • extends the concept of Linear Regression by fitting a polynomial equation to the data, making it well-suited for capturing non-linear relationships
  • introduces higher-degree terms (quadratic, cubic, etc.) to model curves in data
    • allows for better predictions in scenarios where the relationship between independent and dependent variables is not strictly linear
    • as polynomial degrees increase, models may become more complex and prone to overfitting, requiring careful balance in choosing the polynomial order
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
 
# Generate sample data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([3, 6, 10, 15, 21, 28, 36, 45, 55, 66])  # Quadratic pattern
 
# Create polynomial model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
 
# Predictions
y_pred = poly_model.predict(X)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X, y_pred, color='red', linewidth=2, label="Polynomial Fit")
plt.xlabel("Distance (km)")
plt.ylabel("Delivery Time (min)")
plt.legend()
plt.show()
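The overfitting caution above can be sketched by refitting the same data (with a little added noise) at two polynomial degrees; degree 9 is an arbitrary high degree chosen just for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same quadratic-pattern data as above, with a little noise added
rng = np.random.default_rng(0)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([3, 6, 10, 15, 21, 28, 36, 45, 55, 66]) + rng.normal(0, 1, 10)

mses = {}
for degree in (2, 9):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, y)
    mses[degree] = mean_squared_error(y, model.predict(X))
    print(f"degree={degree}: training MSE={mses[degree]:.4f}")

# The degree-9 fit scores better on the training data but chases the noise,
# so it would generalize worse to new points
```

The shrinking training error at degree 9 is exactly why training MSE alone cannot guide the choice of polynomial order; a held-out set or cross-validation is needed.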

Ridge Regression

  • extension of multiple linear regression designed to handle multicollinearity
  • introduces a regularization term (L2 penalty)
    • shrinks the regression coefficients
    • prevents coefficients from becoming excessively large, reducing model variance
    • ensures more stable and generalizable predictions
      • especially when working with complex datasets containing many interdependent variables
  • Does not eliminate irrelevant predictors entirely
    • less effective for feature selection tasks
from sklearn.linear_model import Ridge
 
# Ridge Regression Model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
 
# Predictions
y_pred_ridge = ridge_model.predict(X_test)
 
# Evaluation
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f"Ridge Regression MSE: {mse_ridge}")
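The shrinkage behavior described above can be seen directly by fitting Ridge with increasing alpha on two nearly collinear predictors (the data below is synthetic, made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two nearly collinear predictors (synthetic data for illustration)
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

norms = {}
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms[alpha] = np.linalg.norm(coef)
    print(f"alpha={alpha}: coef={coef.round(3)}, norm={norms[alpha]:.3f}")

# Larger alpha -> smaller coefficient norm: the L2 penalty reins in
# the unstable coefficients that multicollinearity would otherwise inflate
```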

Lasso Regression

  • Lasso stands for “Least Absolute Shrinkage and Selection Operator”
  • powerful regression technique that
    • addresses multicollinearity
    • performs automatic feature selection
  • Applies an L1 penalty
    • forces some regression coefficients to be reduced to 0
    • removing less important predictors from the model
    • particularly useful for high-dimensional datasets where many independent variables may not significantly contribute to the prediction outcome
  • may sometimes eliminate variables that still contain minor predictive power
from sklearn.linear_model import Lasso
 
# Lasso Regression Model
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
 
# Predictions
y_pred_lasso = lasso_model.predict(X_test)
 
# Evaluation
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f"Lasso Regression MSE: {mse_lasso}")
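The automatic feature selection noted above can be demonstrated on synthetic data where only the first two of five predictors actually drive the target; with a suitable alpha (0.5 here, chosen for illustration), Lasso sets the irrelevant coefficients to exactly 0:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Five predictors, but only the first two actually drive y (synthetic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print("Coefficients:", lasso.coef_.round(3))

# Coefficients driven exactly to 0 correspond to eliminated features
n_zero = int(np.sum(lasso.coef_ == 0))
print("Features eliminated:", n_zero)
```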

ElasticNet Regression

  • combines the benefits of Ridge and Lasso Regression
  • useful when working with datasets that have both multicollinearity and a high number of features
  • introduces a hybrid penalty that balances L1 and L2 regularization
    • ensuring that some features are selected while others are shrunk rather than completely eliminated
    • better performance than Lasso when multiple correlated features exist
    • more balanced and stable approach to predictive modeling
  • requires fine-tuning of its regularization parameters to achieve optimal performance
from sklearn.linear_model import ElasticNet
 
# ElasticNet Model
elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_model.fit(X_train, y_train)
 
# Predictions
y_pred_elastic = elastic_model.predict(X_test)
 
# Evaluation
mse_elastic = mean_squared_error(y_test, y_pred_elastic)
 
print(f"ElasticNet Regression MSE: {mse_elastic}")
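The tuning requirement noted above is often handled with scikit-learn's built-in cross-validated variant, ElasticNetCV, which searches over alpha values automatically and over any l1_ratio candidates you supply (synthetic data and the candidate grid below are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data; the CV estimator picks alpha and l1_ratio itself
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=150)

# 5-fold CV over three candidate l1_ratio values; alphas chosen automatically
enet_cv = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5)
enet_cv.fit(X, y)
print(f"Best alpha: {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio: {enet_cv.l1_ratio_}")
```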

Logistic Regression

  • widely used statistical modeling technique for binary classification problems
    • the outcome variable is categorical
    • outcome represented as 0 or 1
  • estimates probabilities, which can then be thresholded to classify observations into categories
    • highly effective in cases where decision-making is based on probabilities
      • fraud detection
      • medical diagnosis
      • risk assessment
  • Assumes linear decision boundary
    • may limit its effectiveness for complex, non-linear classification problems
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
 
# Sample binary dataset
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # Pass (1) or Fail (0)
 
# Model training
log_reg = LogisticRegression()
log_reg.fit(X, y)
 
# Predictions
y_pred = log_reg.predict(X)
 
# Accuracy
accuracy = accuracy_score(y, y_pred)
print(f"Model Accuracy: {accuracy}")
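The probability-then-threshold idea above maps to predict_proba: predict() uses a 0.5 cutoff, but the cutoff can be moved, e.g. to be stricter before predicting "pass" (the 0.8 threshold here is an arbitrary example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

log_reg = LogisticRegression().fit(X, y)

# Column 1 of predict_proba is P(class = 1); predict() thresholds it at 0.5
proba = log_reg.predict_proba(X)[:, 1]
default_pred = (proba >= 0.5).astype(int)
strict_pred = (proba >= 0.8).astype(int)  # stricter cutoff for class 1
print("P(class 1):", proba.round(2))
print("Threshold 0.5:", default_pred)
print("Threshold 0.8:", strict_pred)
```

Raising the threshold trades recall for precision on the positive class, which is often the right trade in fraud detection or medical diagnosis.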

Multivariate Regression

  • extends multiple regression by predicting multiple dependent variables simultaneously
  • estimates multiple outcomes
    • uses a shared set of independent variables
    • useful for interconnected problems where multiple factors are influenced by the same predictors
  • particularly beneficial in scenarios where relationships between variables must be considered holistically rather than independently
  • complexity increases as the number of dependent variables grows
    • requiring careful interpretation and robust data management
from sklearn.multioutput import MultiOutputRegressor
 
# Sample data (Weather parameters)
X = np.array([[30, 10], [35, 15], [40, 20], [45, 25], [50, 30]])  # Temperature, Rainfall
y = np.array([[50, 5, 2], [55, 7, 3], [60, 9, 4], [70, 12, 5], [80, 15, 6]])  # Crop Yield, Soil Erosion, Organic Matter
 
# Model training
multi_reg = MultiOutputRegressor(LinearRegression())
multi_reg.fit(X, y)
 
# Predictions
y_pred = multi_reg.predict(X)
 
# Display results
print("Predicted Crop Yield, Soil Erosion, and Organic Matter:\n", y_pred)
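Under the hood, MultiOutputRegressor simply fits one clone of the base estimator per target column; the sketch below checks that its predictions match fitting a separate LinearRegression per outcome on the same weather data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

# Same weather data as above
X = np.array([[30, 10], [35, 15], [40, 20], [45, 25], [50, 30]])
y = np.array([[50, 5, 2], [55, 7, 3], [60, 9, 4], [70, 12, 5], [80, 15, 6]])

multi = MultiOutputRegressor(LinearRegression()).fit(X, y)
# Fit one plain LinearRegression per target column
per_target = [LinearRegression().fit(X, y[:, i]) for i in range(y.shape[1])]

pred_multi = multi.predict(X)
pred_separate = np.column_stack([m.predict(X) for m in per_target])
print("Wrapper and per-target fits agree:", np.allclose(pred_multi, pred_separate))
```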

Support Vector Regression (SVR)

  • advanced machine learning technique that applies the principle of Support Vector Machines (SVM) to regression tasks
  • seeks to fit the best possible regression line within a margin of tolerance (epsilon), penalizing only errors that fall outside that margin, instead of minimizing the squared error directly
  • using different kernel functions
    • can model both linear and non-linear patterns
    • versatile choice for high-dimensional datasets
  • computationally expensive and requires careful tuning of hyperparameters
from sklearn.svm import SVR
 
# Sample data
X = np.array([[20], [25], [30], [35], [40], [45], [50]])  # Temperature
y = np.array([5, 7, 10, 12, 15, 18, 20])  # Rainfall (mm)
 
# SVR Model (Using Radial Basis Function kernel)
svr_model = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr_model.fit(X, y)
 
# Predictions
y_pred_svr = svr_model.predict(X)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X, y_pred_svr, color='red', linewidth=2, label="SVR Prediction")
plt.xlabel("Temperature (°C)")
plt.ylabel("Rainfall (mm)")
plt.legend()
plt.show()
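The epsilon margin described above can be probed on the same data: widening the tube lets more points sit inside it with zero loss, so fewer points end up as support vectors (the two epsilon values below are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Same temperature/rainfall data as above
X = np.array([[20], [25], [30], [35], [40], [45], [50]], dtype=float)
y = np.array([5, 7, 10, 12, 15, 18, 20], dtype=float)

sv_counts = {}
for eps in (0.1, 2.0):
    svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=eps).fit(X, y)
    sv_counts[eps] = len(svr.support_)
    print(f"epsilon={eps}: {sv_counts[eps]} support vectors")

# A wider tube (larger epsilon) generally needs fewer support vectors,
# at the cost of a looser fit
```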

Decision Tree Regression

  • non-parametric regression method that predicts numerical outcomes by partitioning data into hierarchical decision rules
  • at each node in the tree, the model selects the feature that best splits the dataset to minimize error
    • leads to interpretable, rule-based decision-making
  • useful when dealing with mixed data types and missing values
    • can handle both categorical and numerical variables
  • prone to overfitting, which can lead to poor generalization on unseen data
    • techniques like pruning or ensemble methods are often deployed to mitigate overfitting
from sklearn.tree import DecisionTreeRegressor
 
# Sample data (Car age, mileage)
X = np.array([[1, 5000], [2, 15000], [3, 30000], [4, 45000], [5, 60000]])  # Age (years), Mileage
y = np.array([25000, 22000, 18000, 14000, 10000])  # Price ($)
 
# Model training
tree_model = DecisionTreeRegressor()
tree_model.fit(X, y)
 
# Predictions
y_pred_tree = tree_model.predict(X)
 
# Display results
print("Predicted Car Prices:\n", y_pred_tree)
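The overfitting mitigation mentioned above is most easily applied by capping tree depth via max_depth (cost-complexity pruning with ccp_alpha is another option); the sketch below contrasts an unrestricted tree with a depth-limited one on the same car data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same car data as above
X = np.array([[1, 5000], [2, 15000], [3, 30000], [4, 45000], [5, 60000]])
y = np.array([25000, 22000, 18000, 14000, 10000])

# Unrestricted tree grows until every leaf is pure (memorizes the data)
deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
# Capped tree trades training accuracy for better generalization
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

print("Unrestricted depth:", deep_tree.get_depth())
print("max_depth=2 depth: ", shallow_tree.get_depth())
```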

Random Forest Regression

  • ensemble learning method that improves upon Decision Tree Regression by combining multiple decision trees to generate more accurate and stable predictions
  • each tree in the forest is trained on a random subset of the data, and the final prediction is obtained by averaging the outputs of all individual trees
    • reduces overfitting and improves generalization
    • highly effective for large datasets with complex relationships
  • Random Forest models are less interpretable than single decision trees
    • the final prediction is derived from many trees rather than a single, transparent set of rules
from sklearn.ensemble import RandomForestRegressor
 
# Sample data (Stock market indicators)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])  # Days
y = np.array([100, 105, 110, 108, 115, 120, 125, 130, 128, 135])  # Stock Prices
 
# Model training
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)
 
# Predictions
y_pred_rf = rf_model.predict(X)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Prices")
plt.plot(X, y_pred_rf, color='red', linewidth=2, label="RF Prediction")
plt.xlabel("Days")
plt.ylabel("Stock Price ($)")
plt.legend()
plt.show()
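The averaging described above is directly inspectable: the fitted trees are stored in the estimators_ attribute, and the forest's prediction equals the mean of the individual trees' predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same stock data as above
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([100, 105, 110, 108, 115, 120, 125, 130, 128, 135])

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

forest_pred = rf.predict(X)
# Stack every individual tree's prediction and average them manually
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
manual_mean = tree_preds.mean(axis=0)
print("Forest equals mean of its trees:", np.allclose(forest_pred, manual_mean))
```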