ABSTRACT
A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. Unlike a standard Hash Table, it can return False Positives but never False Negatives. It is the ideal solution when memory is limited and a small margin of error is acceptable.
1. The Memory Problem
Imagine you need to store 1 million malicious URLs to protect a browser.
- Hash Table Approach: Storing 1 million strings (approx. 100 bytes each) at a 0.75 load factor would require over 130 MB of RAM.
- Bloom Filter Approach: By storing only bits instead of actual data, you can achieve the same protection using only a few megabytes.
The Trade-off: False Positives
In a security extension, the four possible outcomes of a check are:
- True Positive (TP): Malicious site blocked. (Success!)
- True Negative (TN): Safe site allowed. (Success!)
- False Negative (FN): Malicious site allowed. (Critical Failure!)
- False Positive (FP): Safe site blocked. (Minor Inconvenience).
A Bloom Filter is perfect here because it guarantees zero False Negatives. If the filter says a site is safe, it is definitely safe. If it says a site is malicious, it is probably malicious.
2. How it Works: The Mechanism
A Bloom Filter consists of:
- A Bit Array of size , initialized to all zeros (0)
- Similar to Hash Tables
- different Hash Functions, each mapping an input to one of the array indices.
Insertion Logic
To insert an element :
- Feed into all hash functions to get indices.
- Set the bits at those indices to 1.
Find Logic (Membership Test)
To check if exists:
- Compute the indices using the same hash functions.
- If ANY bit at those indices is 0: The element is definitely not in the set.
- If ALL bits are 1: The element might be in the set (or we encountered a False Positive due to bit overlap).
3. Mathematical Optimization
The probability of a False Positive () depends on (bits), (elements), and (hash functions):
To minimize errors when designing a filter, we use these optimal formulas:
- Optimal number of hash functions:
- Optimal array size:
4. Pseudocode Implementation
insert(x)
insert(x):
for each hash function h_i:
index = h_i(x) % m
bit_array[index] = truefind(x)
find(x):
for each hash function h_i:
index = h_i(x) % m
if bit_array[index] == false:
return false // DEFINITELY NOT PRESENT
return true // POSSIBLY PRESENT5. Summary Comparison
| Feature | Hash Tables | Bloom Filter |
|---|---|---|
| Storage | Stores actual keys | Stores only bits (0/1) |
| Memory Usage | High () | Very Low () |
| Search Time | Average | (Constant functions) |
| False Positives | No | Yes |
| False Negatives | No | No |