INFO
Widely used statistical method for binary classification problems, where the outcome variable is categorical and typically represents two classes
- Developed by: David Cox (1958)
- Core Principle: Estimates the probability that a given input belongs to a particular class using the logistic (sigmoid) function
- Search Strategy:
- Trained using maximum likelihood estimation
- Adjusts coefficients to minimize the difference between predicted and actual values
- Outputs probabilities in the range 0 to 1 for decision-making
Workflow
- Data Preparation
- Select relevant features and standardize inputs
- Model Training
- Fit logistic regression using training data
- Optimize coefficients via maximum likelihood
- Prediction & Evaluation
- Predict class labels on test data
- Evaluate using metrics:
- Accuracy: proportion of correct predictions
- Confusion Matrix: shows true positives, true negatives, false positives, false negatives
- Classification Report: includes precision, recall, F1-score
Code Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AffinityPropagation
from sklearn.preprocessing import StandardScaler
# Generate synthetic customer purchase data
np.random.seed(42)
data = np.array([
[500, 20, 3], [550, 18, 2], [600, 22, 5], # High spenders
[100, 5, 0], [120, 4, 1], [130, 6, 0], # Low spenders
[300, 10, 2], [310, 12, 3], [320, 9, 1], # Mid-range spenders
[400, 15, 4], [420, 14, 3], [430, 16, 4] # Upper-mid-range spenders
])
# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Apply Affinity Propagation
aff_prop = AffinityPropagation(random_state=42)
clusters = aff_prop.fit_predict(data_scaled)
# Convert results to a DataFrame
customer_df = pd.DataFrame(data, columns=['Annual Spend ($)', 'Purchases per Month', 'Returns'])
customer_df['Cluster'] = clusters
# Display the clustered data
import ace_tools as tools
tools.display_dataframe_to_user(name="Affinity Propagation Clustering Results", dataframe=customer_df)
# Plot clusters
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', edgecolors='k', s=100)
plt.xlabel('Annual Spend ($)')
plt.ylabel('Purchases per Month')
plt.title('Customer Clustering using Affinity Propagation')
plt.show()Advantages
- Interpretability
- Coefficients indicate how each independent variable influences the likelihood of an outcome
- Performs well when the relationship is linear in the log-odds space
- Computationally efficient
- Requires less training time than more complex models
Disadvantages
- Assumes linearity between independent and dependent variables
- May not always hold true
- Struggles with high-dimensional data
- Unless combined with feature selection or regularization techniques like L1 (Lasso) and L2 (Ridge)
- Less effective for complex, non-linear relationships
- More sophisticated models may yield better results