ABSTRACT

A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. Unlike a standard Hash Table, it can return False Positives but never False Negatives. It is the ideal solution when memory is limited and a small margin of error is acceptable.


1. The Memory Problem

Imagine you need to store 1 million malicious URLs to protect a browser.

  • Hash Table Approach: Storing 1 million strings (approx. 100 bytes each) at a 0.75 load factor would require over 130 MB of RAM.
  • Bloom Filter Approach: By storing only bits instead of actual data, you can achieve the same protection using only a few megabytes.

The Trade-off: False Positives

In a security extension, the four possible outcomes of a check are:

  1. True Positive (TP): Malicious site blocked. (Success!)
  2. True Negative (TN): Safe site allowed. (Success!)
  3. False Negative (FN): Malicious site allowed. (Critical Failure!)
  4. False Positive (FP): Safe site blocked. (Minor Inconvenience).

A Bloom Filter is perfect here because it guarantees zero False Negatives. If the filter says a site is safe, it is definitely safe. If it says a site is malicious, it is probably malicious.


2. How it Works: The Mechanism

A Bloom Filter consists of:

  • A Bit Array of size , initialized to all zeros (0)
  • different Hash Functions, each mapping an input to one of the array indices.

Insertion Logic

To insert an element :

  1. Feed into all hash functions to get indices.
  2. Set the bits at those indices to 1.

Find Logic (Membership Test)

To check if exists:

  1. Compute the indices using the same hash functions.
  2. If ANY bit at those indices is 0: The element is definitely not in the set.
  3. If ALL bits are 1: The element might be in the set (or we encountered a False Positive due to bit overlap).

3. Mathematical Optimization

The probability of a False Positive () depends on (bits), (elements), and (hash functions):

To minimize errors when designing a filter, we use these optimal formulas:

  • Optimal number of hash functions:
  • Optimal array size:

4. Pseudocode Implementation

insert(x)

insert(x): 
    for each hash function h_i:
        index = h_i(x) % m
        bit_array[index] = true

find(x)

find(x):  
    for each hash function h_i:
        index = h_i(x) % m
        if bit_array[index] == false:
            return false // DEFINITELY NOT PRESENT
    return true // POSSIBLY PRESENT

5. Summary Comparison

FeatureHash TablesBloom Filter
StorageStores actual keysStores only bits (0/1)
Memory UsageHigh ()Very Low ()
Search Time Average (Constant functions)
False PositivesNoYes
False NegativesNoNo