k-Nearest Neighbors (k-NN)

INFO

A simple but effective non-parametric machine learning technique used for both classification and regression
Operates on the principle of similarity: a new observation is assigned the majority class (classification) or the average value (regression) of its k nearest neighbors

Developed by: Fix and Hodges (1951)
Core Principle: Stores the entire training dataset and makes predictions by measuring distance to nearby points
Search Strategy:
- No explicit model is built
- Uses Euclidean, Manhattan, or Minkowski distance to find nearest neighbors
- Choice of k affects the bias-variance tradeoff
  - Small k → low bias, high variance: sensitive to local patterns and to noise
  - Large k → higher bias, lower variance: smoother decision boundaries, but minority-class cases risk being outvoted
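
The three distance metrics listed above are related: Minkowski distance of order p reduces to Manhattan distance at p = 1 and to Euclidean distance at p = 2. A minimal NumPy sketch follows; the helper name `minkowski_distance` is made up for illustration and is not a library function.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two 1-D points (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

In scikit-learn, `KNeighborsClassifier` exposes the same choice through its `metric` and `p` parameters.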

Workflow

Data Preparation
- Standardize features to ensure fair distance comparisons
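
To see why standardization matters, the sketch below measures how much of the squared Euclidean distance between two customers comes from income alone, before and after scaling. It reuses the Age and Income values from the sample dataset in this note; the helper `income_share` is hypothetical, and the scaler is fit on all rows only to keep the illustration short (in the actual workflow it should be fit on the training split).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Age, Income (values from the sample dataset in this note)
X = np.array([
    [25, 40000], [34, 60000], [45, 80000], [52, 75000], [23, 30000],
    [40, 72000], [60, 95000], [48, 88000], [33, 54000], [29, 50000],
], dtype=float)

def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by income (column 1)."""
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()

# Compare the first two customers on raw and on standardized features
raw_share = income_share(X[0], X[1])
X_scaled = StandardScaler().fit_transform(X)
scaled_share = income_share(X_scaled[0], X_scaled[1])

print(f"Income share of squared distance, raw:    {raw_share:.4f}")
print(f"Income share of squared distance, scaled: {scaled_share:.4f}")
```

On the raw features, income accounts for almost the entire distance, so age barely influences which neighbors are selected. After scaling, both features contribute on a comparable footing.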
Model Training
- Store training data
- No fitting required beyond data storage
Prediction & Evaluation
- Predict class based on majority vote among k-nearest neighbors
- Evaluate using metrics:
  - Accuracy
  - Classification Report (includes precision, recall, F1-score)
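
The prediction step above can be sketched from scratch in a few lines. This is a simplified illustration that assumes Euclidean distance and already-scaled features; the function `knn_predict` and the toy data are invented for the example, and a library implementation handles ties and indexing more carefully.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every stored training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy, already-scaled data invented for illustration
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(X_toy, y_toy, [0.15, 0.15], k=3)
pred_high = knn_predict(X_toy, y_toy, [0.95, 1.0], k=3)
print(pred_low, pred_high)
```

For regression, the vote would be replaced by the mean of the neighbors' target values.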

Code Example

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
 
# Sample dataset: Customer age, income, and spending category (0 = Low, 1 = High)
data = {
    'Age': [25, 34, 45, 52, 23, 40, 60, 48, 33, 29],
    'Income': [40000, 60000, 80000, 75000, 30000, 72000, 95000, 88000, 54000, 50000],
    'Spending_Category': [0, 1, 1, 1, 0, 1, 1, 1, 0, 0]  # Target variable
}
 
# Convert to DataFrame
df = pd.DataFrame(data)
 
# Splitting features and target
X = df[['Age', 'Income']]
y = df['Spending_Category']
 
# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
 
# Scaling features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
# Applying k-NN classifier
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train)
 
# Predictions
y_pred = knn.predict(X_test_scaled)
 
# Model evaluation
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
 
# Display results
print(f"Model Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", report)
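
The introduction notes that k-NN also handles regression by averaging the neighbors' values. A minimal sketch with scikit-learn's `KNeighborsRegressor` follows, using toy data invented for illustration. It also shows the effect of `weights="distance"`, which gives closer neighbors more influence on the average.

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy one-feature data invented for illustration: y = 2 * x
X_toy = [[1.0], [2.0], [3.0], [4.0]]
y_toy = [2.0, 4.0, 6.0, 8.0]
query = [[2.25]]  # its two nearest neighbors are x = 2.0 and x = 3.0

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X_toy, y_toy)
weighted = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X_toy, y_toy)

pred_uniform = uniform.predict(query)[0]    # plain mean of the two neighbor targets
pred_weighted = weighted.predict(query)[0]  # mean weighted by 1 / distance
print(pred_uniform, pred_weighted)
```

The uniform prediction sits midway between the two neighbor targets, while the distance-weighted prediction is pulled toward the closer neighbor at x = 2.0.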

Advantages

Simple and intuitive
Effective with smaller datasets
Interpretable
- Each prediction can be explained by inspecting the neighbors that produced it
No explicit training phase
Naturally adapts to non-linear decision boundaries

Disadvantages

Computationally expensive for large datasets
- Requires distance calculation to all training points
Suffers from the curse of dimensionality
- Distances become less discriminative as the number of features grows, so performance degrades
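
One way to see the effect is to compare the nearest and farthest distances from a query point as the number of features grows. The sketch below uses uniformly random points, an assumption made purely for illustration; the helper `nearest_to_farthest_ratio` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of nearest to farthest Euclidean distance from a random query point."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for n_features, ratio in ratios.items():
    print(f"{n_features:>4} features: nearest/farthest = {ratio:.3f}")
```

As the ratio approaches 1, the "nearest" neighbor is barely closer than the farthest point, so the neighborhood carries little information.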
Sensitive to noise and imbalanced data
- Outliers or skewed class distributions can distort predictions
- Mitigation strategies:
  - Optimize k
  - Use feature selection
  - Apply distance weighting
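
The three mitigation strategies can be combined in one cross-validated search. The sketch below is one possible setup, not a prescribed recipe: the data is synthetic (generated with `make_classification`), and the grid values and the choice of `SelectKBest` with `f_classif` for feature selection are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for illustration
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=3)),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "select__k": [3, 6],                       # feature selection
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],   # optimize k
    "knn__weights": ["uniform", "distance"],   # distance weighting
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

Keeping the scaler and selector inside the pipeline means they are refit on each training fold, so the validation folds do not leak into preprocessing.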
