INFO

UMAP (Uniform Manifold Approximation and Projection) is a state-of-the-art dimensionality reduction technique for visualizing high-dimensional data while preserving its structure.

  • Developed by: Leland McInnes, John Healy, and James Melville (2018)
  • Core Principle: Uses topological data analysis and manifold learning to construct a high-dimensional graph and optimize a low-dimensional representation
  • Strategy:
    • Build a fuzzy topological graph from high-dimensional data
    • Optimize a low-dimensional graph to approximate the original structure
    • Preserves both global and local relationships

Workflow

  1. Graph Construction
    • Define the number of neighbors (n_neighbors) and the minimum distance (min_dist); see the sketch after this list
    • Build high-dimensional graph representing data topology
  2. Low-Dimensional Optimization
    • Embed data in 2D or 3D space
    • Optimize layout using stochastic gradient descent
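
The sketch below shows how these two steps map onto the umap-learn API; the data and the specific values for n_neighbors and min_dist are illustrative only, not recommendations.

import numpy as np
import umap

# Illustrative data: 200 points in 20 dimensions
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))

# Step 1: graph construction is controlled mainly by n_neighbors (size of the
# local neighborhood used to build the fuzzy high-dimensional graph) and
# min_dist (how tightly points may be packed together in the embedding)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)

# Step 2: fit the graph and optimize the 2D layout with stochastic gradient descent
embedding = reducer.fit_transform(X)  # array of shape (200, 2)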

Code Example

import numpy as np
import pandas as pd
import umap
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
 
# Generate synthetic customer data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=10, n_classes=4, n_informative=5, random_state=42)
 
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
 
# Apply UMAP
umap_reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_reducer.fit_transform(X_scaled)
 
# Convert results to DataFrame
df_umap = pd.DataFrame(X_umap, columns=['UMAP1', 'UMAP2'])
df_umap['Cluster'] = y
 
# Plot the UMAP results
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df_umap['UMAP1'], df_umap['UMAP2'], c=df_umap['Cluster'], cmap='viridis', alpha=0.7)
plt.colorbar(scatter, label="Cluster")
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.title('UMAP Projection of Customer Data')
plt.show()

Advantages

  • Preserves local and global structure in high-dimensional data
  • Computationally efficient and scalable
  • Useful as a preprocessing step for clustering (see the sketch after this list)
  • Supports both supervised and unsupervised learning
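
A rough sketch of the last two points, reusing the synthetic data from the code example above; the choice of KMeans and the cluster count of 4 are arbitrary and only illustrate clustering on the embedding rather than on the raw features.

import umap
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the code example above
X, y = make_classification(n_samples=500, n_features=10, n_classes=4, n_informative=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Unsupervised: reduce to 2D, then cluster in the embedded space instead of the raw feature space
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)
cluster_labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(embedding)

# Supervised: passing the labels lets UMAP use them to shape the embedding
supervised_embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled, y=y)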

Disadvantages

  • Highly sensitive to hyperparameters (see the sketch at the end of this section)
    • Number of neighbors
    • Minimum distance
  • Does not provide explicit feature importance or variance explained
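
One way to see the hyperparameter sensitivity is to embed the same data several times while varying n_neighbors and min_dist; the grid of values below is arbitrary and only meant to show how different the resulting layouts can look.

import umap
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, n_classes=4, n_informative=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Re-run UMAP over a small, arbitrary grid of settings and compare the layouts side by side
settings = [(5, 0.0), (15, 0.1), (50, 0.5)]
fig, axes = plt.subplots(1, len(settings), figsize=(15, 4))
for ax, (n_neighbors, min_dist) in zip(axes, settings):
    emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, random_state=42).fit_transform(X_scaled)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='viridis', s=5, alpha=0.7)
    ax.set_title(f'n_neighbors={n_neighbors}, min_dist={min_dist}')
plt.tight_layout()
plt.show()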