INFO
A probabilistic clustering technique that models data as a mixture of multiple Gaussian distributions
- Developed by:
- Karl Pearson (1894, foundational mixture-model concept)
- Modern EM-based fitting formalized by Dempster, Laird, and Rubin (1977)
- Core Principle: Uses soft clustering, assigning each point a probability of belonging to each cluster rather than a single hard label
- Search Strategy:
- Expectation-Maximization (EM) algorithm
- Define parameters for each Gaussian component:
- Mean
- Covariance
- Weight
- Iteratively refine parameters by alternating two steps (a minimal one-iteration sketch follows this list):
- Expectation step: compute each component's responsibility (posterior probability) for every data point
- Maximization step: re-estimate means, covariances, and weights from those responsibilities
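A minimal from-scratch sketch of one EM iteration for a 1-D mixture; the function name em_step and all variable names are illustrative, not from any library, and a real implementation would add convergence checks and numerical safeguards:
import numpy as np

def em_step(x, mu, var, w):
    # x: (n,) data; mu, var, w: (k,) component means, variances, mixing weights
    # Expectation step: responsibility of each component for each point
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)  # (n, k), rows sum to 1
    # Maximization step: re-estimate parameters from the responsibilities
    nk = resp.sum(axis=0)                          # effective sample size per component
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    w = nk / len(x)
    return mu, var, w  # call repeatedly until the parameters stop changing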
Workflow
- Initialization
- Choose the number of Gaussian components (a BIC-based selection sketch follows this list)
- Initialize means, covariances, and weights
- EM Iteration
- Expectation step: compute the probability of each point belonging to each component
- Maximization step: update parameters to maximize the data likelihood
- Repeat both steps until the log-likelihood improvement falls below a tolerance
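If the component count is unknown, a common heuristic is to fit several candidate counts and keep the one with the lowest Bayesian Information Criterion; a minimal sketch using scikit-learn's bic method, where the 1-9 candidate range is an arbitrary illustrative choice:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# Fit one model per candidate count; lower BIC indicates a better fit/complexity trade-off
candidates = range(1, 10)
bics = [GaussianMixture(n_components=k, random_state=42).fit(X).bic(X) for k in candidates]
print("Selected components:", candidates[int(np.argmin(bics))])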
Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
# Generate synthetic data
np.random.seed(42)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
# Fit Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k', alpha=0.6)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', marker='X', s=200, label='Component means')
plt.title('Gaussian Mixture Model Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
Advantages
- Models elliptical and overlapping clusters
- Soft clustering exposes each point's degree of membership in every cluster (see the predict_proba sketch after this list)
- Handles clusters of varying size and density via per-component covariance matrices
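A short sketch of reading off the soft assignments via scikit-learn's predict_proba; the 0.9 confidence threshold is an arbitrary illustrative choice:
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
# Posterior probability of each component for every point; rows sum to 1
probs = gmm.predict_proba(X)
print(probs[:5].round(3))  # soft assignments for the first five points
# Points whose top probability falls below 0.9 lie in overlapping regions
print((probs.max(axis=1) < 0.9).sum(), "points assigned with < 90% confidence")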
Disadvantages
- Sensitive to parameter initialization
- May converge to local optima
- Mitigated by running EM from multiple initializations (see the n_init sketch below)
- Assumes data follows Gaussian mixture distribution
- Computationally expensive due to iterative EM steps
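In scikit-learn, the multiple-initializations mitigation corresponds to the n_init parameter, which reruns EM from several starting points and keeps the highest-likelihood fit; n_init=10 is an arbitrary illustrative value:
from sklearn.mixture import GaussianMixture

# Run EM from 10 different initializations and keep the best solution by likelihood
gmm = GaussianMixture(n_components=3, n_init=10, random_state=42)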