INFO
Ensemble learning technique used for classification and regression tasks
Constructs multiple decision trees during training and outputs the mode of individual tree predictions for classification (majority vote) or their mean for regression
- Developed by: Leo Breiman and Adele Cutler (2001)
- Core Principle: Combines predictions from multiple decision trees to improve generalization and reduce overfitting
- Randomization Strategy:
- Introduces randomness by selecting random subsets of features and samples for each tree
- Ensures diversity in decision boundaries
- Aggregates predictions via majority voting (classification) or averaging (regression)
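The aggregation step above can be sketched directly in NumPy. The prediction arrays below are hypothetical values, just to show the mechanics of majority voting and averaging:

```python
import numpy as np

# Each row: one tree's predicted class labels for 4 samples (hypothetical values)
tree_preds = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

# Classification: majority vote across trees (one column per sample)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, tree_preds)
print(majority)  # most frequent label per sample

# Regression: average the per-tree numeric predictions instead
tree_values = np.array([
    [2.0, 3.0],
    [4.0, 5.0],
])
average = tree_values.mean(axis=0)
print(average)
```

With three trees voting per sample, a class wins as soon as two trees agree; scikit-learn's `RandomForestClassifier` averages predicted class probabilities rather than raw votes, but the effect is the same idea.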
Workflow
- Model Construction
- Generate multiple decision trees using bootstrapped samples
- Randomly select features for each split
- Prediction & Evaluation
- Aggregate predictions from all trees
- Evaluate using metrics:
- Accuracy
- Confusion Matrix
- Classification Report (includes precision, recall, F1-score)
- Analyze feature importance to interpret model behavior
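The construction steps above map directly onto `RandomForestClassifier` parameters. A minimal sketch with hypothetical toy data (the parameter values shown are illustrative defaults, not tuned choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,       # number of trees in the ensemble
    bootstrap=True,         # each tree trains on a bootstrapped sample
    max_features='sqrt',    # random feature subset considered at each split
    random_state=42,
)

# Tiny synthetic data just to exercise fit/predict
X = np.random.RandomState(0).rand(20, 5)
y = np.random.RandomState(1).randint(0, 2, 20)
rf.fit(X, y)
print(len(rf.estimators_))  # one fitted tree per n_estimators
```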
Code Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Generate synthetic dataset
np.random.seed(42)
data = pd.DataFrame({
    'tenure': np.random.randint(1, 60, 500),
    'monthly_charges': np.random.uniform(20, 100, 500),
    'total_charges': np.random.uniform(100, 6000, 500),
    'customer_support_calls': np.random.randint(0, 10, 500),
    'contract_type': np.random.choice([0, 1], size=500),  # 0: Month-to-Month, 1: Long-Term
    'churn': np.random.choice([0, 1], size=500)  # 0: No churn, 1: Churn
})
# Split data into training and testing sets
X = data.drop(columns=['churn'])
y = data['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
# Predictions and evaluation
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
# Feature importance
feature_importances = pd.Series(rf_classifier.feature_importances_, index=X.columns).sort_values(ascending=False)
# Display results
print(f"Model Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
# Plot feature importance
plt.figure(figsize=(8,5))
sns.barplot(x=feature_importances, y=feature_importances.index)
plt.xlabel("Feature Importance Score")
plt.ylabel("Features")
plt.title("Feature Importance in Random Forest Model")
plt.show()
Advantages
- Handles large datasets with high-dimensional features
- Less prone to overfitting than single decision trees
- Supports both categorical and numerical data
- Offers interpretability through feature importance analysis
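The overfitting advantage can be seen by comparing a single unpruned tree against a forest on held-out data. A minimal sketch on synthetic data (dataset parameters are arbitrary illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree memorizes the training set; the forest averages away much of that variance
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print(f"Single tree  train={tree.score(X_train, y_train):.2f}  test={tree.score(X_test, y_test):.2f}")
print(f"Forest       train={forest.score(X_train, y_train):.2f}  test={forest.score(X_test, y_test):.2f}")
```

The single tree typically scores perfectly on training data while losing accuracy on the test split; the forest's train/test gap is usually smaller.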
Disadvantages
- Computationally intensive
- Requires significant processing power and memory
- Slower for real-time prediction
- May struggle with imbalanced datasets
- Requires techniques like class weighting or resampling to mitigate bias
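The class-weighting mitigation mentioned above is available directly via the `class_weight` parameter. A minimal sketch on a synthetic imbalanced dataset (the 90/10 split and other values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Imbalanced synthetic data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights samples inversely to class frequency,
# so minority-class errors cost more during tree growth
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)
minority_recall = recall_score(y_test, rf.predict(X_test))
print(f"Minority-class recall: {minority_recall:.2f}")
```

Resampling alternatives (e.g. oversampling the minority class before fitting) address the same bias from the data side rather than the loss side.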