Logistic Regression

INFO

Widely used statistical method for binary classification problems, where the outcome variable is categorical and typically represents two classes

Developed by: David Cox (1958)
Core Principle: Estimates the probability that a given input belongs to a particular class using the logistic (sigmoid) function
Search Strategy:
- Trained using maximum likelihood estimation
- Adjusts coefficients to minimize the difference between predicted and actual values
- Outputs probabilities in the range 0 to 1 for decision-making

Workflow

Data Preparation
- Select relevant features and standardize inputs
Model Training
- Fit logistic regression using training data
- Optimize coefficients via maximum likelihood
Prediction & Evaluation
- Predict class labels on test data
- Evaluate using metrics:
  - Accuracy: proportion of correct predictions
  - Confusion Matrix: shows true positives, true negatives, false positives, false negatives
  - Classification Report: includes precision, recall, F1-score

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AffinityPropagation
from sklearn.preprocessing import StandardScaler
 
# Generate synthetic customer purchase data
np.random.seed(42)
data = np.array([
    [500, 20, 3], [550, 18, 2], [600, 22, 5],  # High spenders
    [100, 5, 0], [120, 4, 1], [130, 6, 0],     # Low spenders
    [300, 10, 2], [310, 12, 3], [320, 9, 1],   # Mid-range spenders
    [400, 15, 4], [420, 14, 3], [430, 16, 4]   # Upper-mid-range spenders
])
 
# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
 
# Apply Affinity Propagation
aff_prop = AffinityPropagation(random_state=42)
clusters = aff_prop.fit_predict(data_scaled)
 
# Convert results to a DataFrame
customer_df = pd.DataFrame(data, columns=['Annual Spend ($)', 'Purchases per Month', 'Returns'])
customer_df['Cluster'] = clusters
 
# Display the clustered data
import ace_tools as tools
tools.display_dataframe_to_user(name="Affinity Propagation Clustering Results", dataframe=customer_df)
 
# Plot clusters
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', edgecolors='k', s=100)
plt.xlabel('Annual Spend ($)')
plt.ylabel('Purchases per Month')
plt.title('Customer Clustering using Affinity Propagation')
plt.show()

Advantages

Interpretability
- Coefficients indicate how each independent variable influences the likelihood of an outcome
- Performs well when the relationship is linear in the log-odds space
Computationally efficient
- Requires less training time than more complex models

Disadvantages

Assumes linearity between independent and dependent variables
- May not always hold true
Struggles with high-dimensional data
- Unless combined with feature selection or regularization techniques like L1 (Lasso) and L2 (Ridge)
Less effective for complex, non-linear relationships
- More sophisticated models may yield better results

Jason's Notebook

Explorer

Logistic Regression

Workflow

Code Example

Advantages

Disadvantages

Graph View

Table of Contents

Backlinks