Codecademy Live Linear Regression #1: Introduction To Simple Linear Regression

Linear Regression

  • One of the simplest and most widely used regression techniques in statistical modeling and machine learning
  • assumes a linear relationship between the independent variable(s) and the dependent variable
    • ideal for scenarios where the trend between predictors and outcomes follows a straight-line pattern
  • Model estimates the coefficients that minimize the difference between observed values and predicted values
  • sensitive to outliers and may struggle with non-linearity, necessitating data preprocessing or transformation
  • Example:
    • a retail company analyzes historical data on advertising budgets across different marketing channels, such as TV, online ads, and social media, to determine how much an increase in spending contributes to higher sales. By fitting a linear regression model, the company can estimate the expected revenue increase per dollar spent on advertising, allowing for more informed budget allocation. This helps in optimizing marketing expenditures and improving profitability based on data-driven insights.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
# Sample data: Advertising budget (X) and Sales revenue (Y)
data = {'Advertising Budget': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'Sales Revenue': [15, 25, 35, 45, 50, 60, 70, 85, 95, 105]}
 
df = pd.DataFrame(data)
 
# Splitting the data
X = df[['Advertising Budget']]
y = df['Sales Revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
 
# Prediction
y_pred = model.predict(X_test)
 
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X_test, y_pred, color='red', linewidth=2, label="Predicted Line")
plt.xlabel("Advertising Budget (in $1000)")
plt.ylabel("Sales Revenue (in $1000)")
plt.legend()
plt.show()
 
# Print model coefficients
print(f"Coefficient: {model.coef_[0]}, Intercept: {model.intercept_}")
print(f"Mean Squared Error: {mse}")

Polynomial Regression

  • extends the concept of Linear Regression by fitting a polynomial equation to the data, making it well-suited for capturing non-linear relationships
  • introduces higher-degree terms (quadratic, cubic, etc.) to model curves in data
    • allows for better predictions in scenarios where the relationship between independent and dependent variables is not strictly linear
    • as polynomial degrees increase, models may become more complex and prone to overfitting, requiring careful balance in choosing the polynomial order
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
 
# Generate sample data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([3, 6, 10, 15, 21, 28, 36, 45, 55, 66])  # Quadratic pattern
 
# Create polynomial model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
 
# Predictions
y_pred = poly_model.predict(X)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X, y_pred, color='red', linewidth=2, label="Polynomial Fit")
plt.xlabel("Distance (km)")
plt.ylabel("Delivery Time (min)")
plt.legend()
plt.show()
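The overfitting caution above can be sketched by refitting the same data (with a little added noise) at two polynomial degrees; degree 9 is an arbitrary high degree chosen just for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same quadratic-pattern data as above, with a little noise added
rng = np.random.default_rng(0)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([3, 6, 10, 15, 21, 28, 36, 45, 55, 66]) + rng.normal(0, 1, 10)

mses = {}
for degree in (2, 9):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, y)
    mses[degree] = mean_squared_error(y, model.predict(X))
    print(f"degree={degree}: training MSE={mses[degree]:.4f}")

# The degree-9 fit scores better on the training data but chases the noise,
# so it would generalize worse to new points
```

The shrinking training error at degree 9 is exactly why training MSE alone cannot guide the choice of polynomial order; a held-out set or cross-validation is needed.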

Ridge Regression

  • extension of multiple linear regression designed to handle multicollinearity
  • introduces a regularization term (L2 penalty)
    • shrinks the regression coefficients
    • prevents coefficients from becoming excessively large, reducing model variance
    • ensures more stable and generalizable predictions
      • especially when working with complex datasets containing many interdependent variables
  • Does not eliminate irrelevant predictors entirely
    • less effective for feature selection tasks
from sklearn.linear_model import Ridge
 
# Ridge Regression Model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
 
# Predictions
y_pred_ridge = ridge_model.predict(X_test)
 
# Evaluation
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f"Ridge Regression MSE: {mse_ridge}")
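The shrinkage behavior described above can be seen directly by fitting Ridge with increasing alpha on two nearly collinear predictors (the data below is synthetic, made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two nearly collinear predictors (synthetic data for illustration)
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

norms = {}
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms[alpha] = np.linalg.norm(coef)
    print(f"alpha={alpha}: coef={coef.round(3)}, norm={norms[alpha]:.3f}")

# Larger alpha -> smaller coefficient norm: the L2 penalty reins in
# the unstable coefficients that multicollinearity would otherwise inflate
```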

Lasso Regression

  • Lasso stands for “Least Absolute Shrinkage and Selection Operator”
  • powerful regression technique that
    • addresses multicollinearity
    • performs automatic feature selection
  • Applies an L1 penalty
    • forces some regression coefficients to be reduced to 0
    • removing less important predictors from the model
    • particularly useful for high-dimensional datasets where many independent variables may not significantly contribute to the prediction outcome
  • may sometimes eliminate variables that still contain minor predictive power
from sklearn.linear_model import Lasso
 
# Lasso Regression Model
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
 
# Predictions
y_pred_lasso = lasso_model.predict(X_test)
 
# Evaluation
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f"Lasso Regression MSE: {mse_lasso}")
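The automatic feature selection noted above can be demonstrated on synthetic data where only the first two of five predictors actually drive the target; with a suitable alpha (0.5 here, chosen for illustration), Lasso sets the irrelevant coefficients to exactly 0:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Five predictors, but only the first two actually drive y (synthetic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print("Coefficients:", lasso.coef_.round(3))

# Coefficients driven exactly to 0 correspond to eliminated features
n_zero = int(np.sum(lasso.coef_ == 0))
print("Features eliminated:", n_zero)
```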

ElasticNet Regression

  • combines the benefits of Ridge and Lasso Regression
  • useful when working with datasets that have both multicollinearity and a high number of features
  • introduces a hybrid penalty that balances L1 and L2 regularization
    • ensuring that some features are selected while others are shrunk rather than completely eliminated
    • better performance than Lasso when multiple correlated features exist
    • more balanced and stable approach to predictive modeling
  • requires fine-tuning of its regularization parameters to achieve optimal performance
from sklearn.linear_model import ElasticNet
 
# ElasticNet Model
elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_model.fit(X_train, y_train)
 
# Predictions
y_pred_elastic = elastic_model.predict(X_test)
 
# Evaluation
mse_elastic = mean_squared_error(y_test, y_pred_elastic)
 
print(f"ElasticNet Regression MSE: {mse_elastic}")
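The tuning requirement noted above is often handled with scikit-learn's built-in cross-validated variant, ElasticNetCV, which searches over alpha values automatically and over any l1_ratio candidates you supply (synthetic data and the candidate grid below are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data; the CV estimator picks alpha and l1_ratio itself
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=150)

# 5-fold CV over three candidate l1_ratio values; alphas chosen automatically
enet_cv = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5)
enet_cv.fit(X, y)
print(f"Best alpha: {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio: {enet_cv.l1_ratio_}")
```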

Logistic Regression

  • widely used statistical modeling technique for binary classification problems
    • the outcome variable is categorical
    • outcome represented as 0 or 1
  • estimates probabilities, which can then be thresholded to classify observations into categories
    • highly effective in cases where decision-making is based on probabilities
      • fraud detection
      • medical diagnosis
      • risk assessment
  • Assumes linear decision boundary
    • may limit its effectiveness for complex, non-linear classification problems
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
 
# Sample binary dataset
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # Pass (1) or Fail (0)
 
# Model training
log_reg = LogisticRegression()
log_reg.fit(X, y)
 
# Predictions
y_pred = log_reg.predict(X)
 
# Accuracy
accuracy = accuracy_score(y, y_pred)
print(f"Model Accuracy: {accuracy}")
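The probability-then-threshold idea above maps to predict_proba: predict() uses a 0.5 cutoff, but the cutoff can be moved, e.g. to be stricter before predicting "pass" (the 0.8 threshold here is an arbitrary example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

log_reg = LogisticRegression().fit(X, y)

# Column 1 of predict_proba is P(class = 1); predict() thresholds it at 0.5
proba = log_reg.predict_proba(X)[:, 1]
default_pred = (proba >= 0.5).astype(int)
strict_pred = (proba >= 0.8).astype(int)  # stricter cutoff for class 1
print("P(class 1):", proba.round(2))
print("Threshold 0.5:", default_pred)
print("Threshold 0.8:", strict_pred)
```

Raising the threshold trades recall for precision on the positive class, which is often the right trade in fraud detection or medical diagnosis.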

Multivariate Regression

  • extends multiple regression by predicting multiple dependent variables simultaneously
  • estimates multiple outcomes
    • uses a shared set of independent variables
    • useful for interconnected problems where multiple factors are influenced by the same predictors
  • particularly beneficial in scenarios where relationships between variables must be considered holistically rather than independently
  • complexity increases as the number of dependent variables grows
    • requiring careful interpretation and robust data management
from sklearn.multioutput import MultiOutputRegressor
 
# Sample data (Weather parameters)
X = np.array([[30, 10], [35, 15], [40, 20], [45, 25], [50, 30]])  # Temperature, Rainfall
y = np.array([[50, 5, 2], [55, 7, 3], [60, 9, 4], [70, 12, 5], [80, 15, 6]])  # Crop Yield, Soil Erosion, Organic Matter
 
# Model training
multi_reg = MultiOutputRegressor(LinearRegression())
multi_reg.fit(X, y)
 
# Predictions
y_pred = multi_reg.predict(X)
 
# Display results
print("Predicted Crop Yield, Soil Erosion, and Organic Matter:\n", y_pred)
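Under the hood, MultiOutputRegressor simply fits one clone of the base estimator per target column; the sketch below checks that its predictions match fitting a separate LinearRegression per outcome on the same weather data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

# Same weather data as above
X = np.array([[30, 10], [35, 15], [40, 20], [45, 25], [50, 30]])
y = np.array([[50, 5, 2], [55, 7, 3], [60, 9, 4], [70, 12, 5], [80, 15, 6]])

multi = MultiOutputRegressor(LinearRegression()).fit(X, y)
# Fit one plain LinearRegression per target column
per_target = [LinearRegression().fit(X, y[:, i]) for i in range(y.shape[1])]

pred_multi = multi.predict(X)
pred_separate = np.column_stack([m.predict(X) for m in per_target])
print("Wrapper and per-target fits agree:", np.allclose(pred_multi, pred_separate))
```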

Support Vector Regression (SVR)

  • advanced machine learning technique that applies the principle of Support Vector Machines (SVM) to regression tasks
  • seeks to fit the best possible regression line within a margin of tolerance (epsilon), penalizing only errors that fall outside that margin, instead of minimizing the squared error directly
  • using different kernel functions
    • can model both linear and non-linear patterns
    • versatile choice for high-dimensional datasets
  • computationally expensive and requires careful tuning of hyperparameters
from sklearn.svm import SVR
 
# Sample data
X = np.array([[20], [25], [30], [35], [40], [45], [50]])  # Temperature
y = np.array([5, 7, 10, 12, 15, 18, 20])  # Rainfall (mm)
 
# SVR Model (Using Radial Basis Function kernel)
svr_model = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr_model.fit(X, y)
 
# Predictions
y_pred_svr = svr_model.predict(X)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X, y_pred_svr, color='red', linewidth=2, label="SVR Prediction")
plt.xlabel("Temperature (°C)")
plt.ylabel("Rainfall (mm)")
plt.legend()
plt.show()
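The epsilon margin described above can be probed on the same data: widening the tube lets more points sit inside it with zero loss, so fewer points end up as support vectors (the two epsilon values below are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Same temperature/rainfall data as above
X = np.array([[20], [25], [30], [35], [40], [45], [50]], dtype=float)
y = np.array([5, 7, 10, 12, 15, 18, 20], dtype=float)

sv_counts = {}
for eps in (0.1, 2.0):
    svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=eps).fit(X, y)
    sv_counts[eps] = len(svr.support_)
    print(f"epsilon={eps}: {sv_counts[eps]} support vectors")

# A wider tube (larger epsilon) generally needs fewer support vectors,
# at the cost of a looser fit
```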

Decision Tree Regression

  • non-parametric regression method that predicts numerical outcomes by partitioning data into hierarchical decision rules
  • at each node in the tree, the model selects the feature that best splits the dataset to minimize error
    • leads to interpretable, rule-based decision-making
  • useful when dealing with mixed data types and missing values
    • can handle both categorical and numerical variables
  • prone to overfitting, which can lead to poor generalization on unseen data
    • techniques like pruning or ensemble methods are often deployed to mitigate overfitting
from sklearn.tree import DecisionTreeRegressor
 
# Sample data (Car age, mileage)
X = np.array([[1, 5000], [2, 15000], [3, 30000], [4, 45000], [5, 60000]])  # Age (years), Mileage
y = np.array([25000, 22000, 18000, 14000, 10000])  # Price ($)
 
# Model training
tree_model = DecisionTreeRegressor()
tree_model.fit(X, y)
 
# Predictions
y_pred_tree = tree_model.predict(X)
 
# Display results
print("Predicted Car Prices:\n", y_pred_tree)
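The overfitting mitigation mentioned above is most easily applied by capping tree depth via max_depth (cost-complexity pruning with ccp_alpha is another option); the sketch below contrasts an unrestricted tree with a depth-limited one on the same car data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same car data as above
X = np.array([[1, 5000], [2, 15000], [3, 30000], [4, 45000], [5, 60000]])
y = np.array([25000, 22000, 18000, 14000, 10000])

# Unrestricted tree grows until every leaf is pure (memorizes the data)
deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
# Capped tree trades training accuracy for better generalization
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

print("Unrestricted depth:", deep_tree.get_depth())
print("max_depth=2 depth: ", shallow_tree.get_depth())
```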

Random Forest Regression

  • ensemble learning method that improves upon Decision Tree Regression by combining multiple decision trees to generate more accurate and stable predictions
  • each tree in the forest is trained on a random subset of the data, and the final prediction is obtained by averaging the outputs of all individual trees
    • reduces overfitting and improves generalization
    • highly effective for large datasets with complex relationships
  • Random Forest models are less interpretable than single decision trees
    • the final prediction is derived from many trees rather than a single, transparent set of rules
from sklearn.ensemble import RandomForestRegressor
 
# Sample data (Stock market indicators)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])  # Days
y = np.array([100, 105, 110, 108, 115, 120, 125, 130, 128, 135])  # Stock Prices
 
# Model training
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)
 
# Predictions
y_pred_rf = rf_model.predict(X)
 
# Plot results
plt.scatter(X, y, color='blue', label="Actual Prices")
plt.plot(X, y_pred_rf, color='red', linewidth=2, label="RF Prediction")
plt.xlabel("Days")
plt.ylabel("Stock Price ($)")
plt.legend()
plt.show()
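The averaging described above is directly inspectable: the fitted trees are stored in the estimators_ attribute, and the forest's prediction equals the mean of the individual trees' predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same stock data as above
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([100, 105, 110, 108, 115, 120, 125, 130, 128, 135])

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

forest_pred = rf.predict(X)
# Stack every individual tree's prediction and average them manually
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
manual_mean = tree_preds.mean(axis=0)
print("Forest equals mean of its trees:", np.allclose(forest_pred, manual_mean))
```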