Suffix Arrays

ABSTRACT

In the post-genomic era, read mapping is a fundamental task: determining where millions of short DNA “reads” ($Q$) align to a massive reference genome ($D$). While Aho-Corasick is ideal for a fixed set of short motifs, the Suffix Array is the better choice when the long reference genome is the fixed database. It supports exact substring search in $O(k \log n)$ time for a read of length $k$ and a genome of length $n$, using binary search over a sorted index of all genomic suffixes.

1. Shifting the Paradigm: Database vs. Query

The Aho-Corasick Automaton was used to search for many small motifs in one query sequence. In genomics, the roles are reversed:

The Database ( $D$ ): The reference genome (e.g., 3 billion bases), which is static.
The Query ( $Q$ ): Millions of short reads (e.g., 100 bases each), which change with every patient or experiment.

To optimize this, we preprocess the genome rather than the reads.

2. What is a Suffix Array?

A Suffix Array is a sorted list of all suffixes of a string. However, storing the full string for every suffix would require $O(n^{2})$ space: about $n^{2}/2 \approx 4.5 \times 10^{18}$ characters, i.e. several exabytes, for a 3-billion-base human genome.

The Space-Efficient Solution

Instead of storing the strings, we store only the starting index of each suffix. Because the original genome $D$ is already in memory, we can compare any two suffixes by looking at the characters starting at their respective indices.

Example for $D$ = GCATCGC:

Suffix Index	Suffix String
2	`ATCGC`
6	`C`
1	`CATCGC`
4	`CGC`
5	`GC`
0	`GCATCGC`
3	`TCGC`

The Suffix Array ($SA$): [2, 6, 1, 4, 5, 0, 3]
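The index-only idea can be sketched in a few lines of Python. This is a minimal illustration, not the linear-time construction discussed later: the helper names `compare_suffixes` and `build_suffix_array` are hypothetical, and the sort shown here can take quadratic time or worse on repetitive input. The point is only that suffixes are compared in place, by index, without copying any suffix string.

```python
from functools import cmp_to_key


def compare_suffixes(text, i, j):
    """Compare text[i:] with text[j:] in place; return -1, 0, or 1."""
    n = len(text)
    while i < n and j < n:
        if text[i] != text[j]:
            return -1 if text[i] < text[j] else 1
        i += 1
        j += 1
    if i == j:
        return 0
    # One suffix ran out first: it is a prefix of the other and sorts first.
    return -1 if i > j else 1


def build_suffix_array(text):
    """Naive construction (illustrative only): sort start indices by suffix."""
    key = cmp_to_key(lambda i, j: compare_suffixes(text, i, j))
    return sorted(range(len(text)), key=key)


sa = build_suffix_array("GCATCGC")
print(sa)  # [2, 6, 1, 4, 5, 0, 3]
```

Only the integer array and the original string are kept in memory, which is what brings the space down from quadratic to linear.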

3. Searching for Reads

To find a read $w$ of length $k$, we perform a binary search on the Suffix Array. Because the suffixes are sorted lexicographically, all suffixes that begin with $w$ sit next to each other in one contiguous range.

Finding the Range

We perform two modified binary searches to find the “clump” of matches:

Left Bound: Find the first index $i$ in $SA$ where the suffix starts with $w$.
Right Bound: Find the last index $j$ in $SA$ where the suffix starts with $w$.

Every integer in $SA[i \ldots j]$ represents a starting position in the genome where the read $w$ matches perfectly.
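The two bound searches can be sketched as follows. This is an illustrative version under a few assumptions: the function name `find_range` is hypothetical, the suffix array is built with a naive sort to keep the block self-contained, and the result is returned as a half-open range `[left, right)` rather than the inclusive $i \ldots j$ used above, so an empty range means no match.

```python
def find_range(text, sa, w):
    """Return (left, right) such that sa[left:right] holds every match of w."""
    k = len(w)

    # Left bound: first suffix whose length-k prefix is >= w.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < w:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Right bound: first suffix whose length-k prefix is > w.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= w:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


genome = "GCATCGC"
# Naive construction, for illustration only.
sa = sorted(range(len(genome)), key=lambda i: genome[i:])

left, right = find_range(genome, sa, "GC")
print(sorted(sa[left:right]))  # [0, 5]
```

Each probe compares at most $k$ characters, because only the first $k$ characters of a suffix decide whether it sorts before, inside, or after the block of matches.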

4. Complexity & Performance

Construction: Modern algorithms (like SA-IS) can build the Suffix Array in $O (n)$ time and $O (n)$ space.
Search Time: For a single read of length $k$, the search takes $O(k \log n)$ time: the binary search makes $O(\log n)$ probes, and each probe compares at most $k$ characters.
Total Mapping Time: For $m$ reads, the complexity is $O(mk \log n)$.
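As a rough worked example, take the figures quoted earlier ($n = 3 \times 10^{9}$ bases, $k = 100$) and assume $m = 10^{6}$ reads purely for illustration:

$$
\log_2 n = \log_2\left(3 \times 10^{9}\right) \approx 31.5, \quad \text{so about } 32 \text{ probes per binary search}
$$

$$
k \log_2 n \approx 100 \times 32 = 3200 \text{ character comparisons per binary search, at most}
$$

$$
m k \log_2 n \approx 10^{6} \times 3200 = 3.2 \times 10^{9} \text{ character comparisons, at most}
$$

These are worst-case upper bounds that ignore constant factors, such as the two bound searches needed per read.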

TIP

Parallelization: Because each read is searched independently against a read-only index, the work splits cleanly across CPU cores or machines, which makes this approach well suited to the volume of modern sequencing data.
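The independence of the searches can be sketched with Python's standard `concurrent.futures` module. The helper names (`map_read`, `map_reads_parallel`) and the worker count are hypothetical, and the thread pool is used only to show the structure: in CPython, threads do not speed up CPU-bound work like this, so a real pipeline would more likely use separate processes or a compiled implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def map_read(text, sa, read):
    """Return the sorted genome positions where `read` occurs exactly."""
    k = len(read)

    def bound(upper):
        # upper=False: first prefix >= read; upper=True: first prefix > read.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + k]
            if prefix < read or (upper and prefix == read):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return sorted(sa[bound(False):bound(True)])


def map_reads_parallel(text, sa, reads, workers=4):
    """Map every read independently; the index is shared and never modified."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: map_read(text, sa, r), reads))


genome = "GCATCGC"
# Naive construction, for illustration only.
sa = sorted(range(len(genome)), key=lambda i: genome[i:])
print(map_reads_parallel(genome, sa, ["GC", "C", "TT"]))
# [[0, 5], [1, 4, 6], []]
```

No locking is needed because the genome and the suffix array are only ever read after construction.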

5. Summary Comparison

Feature	Aho-Corasick	Suffix Array
Preprocessed Input	The Motifs (Short)	The Genome (Long)
Data Structure	Trie + Failure/Dict Links	Sorted Integer Array
Search Logic	Finite State Automaton	Binary Search
Best For	Finding many patterns in one sequence	Mapping many reads to one genome

Jason's Notebook
