ABSTRACT
In the postgenomic era, read mapping is a fundamental task: determining where millions of short DNA “reads” () align to a massive reference genome (). While Aho-Corasick is ideal for a fixed set of short motifs, the Suffix Array is the superior choice when the long reference genome is the fixed database. It enables substring searches using binary search on a sorted index of all genomic suffixes.
1. Shifting the Paradigm: Database vs. Query
The Aho-Corasick Automaton was used to search for many small motifs in one query sequence. In genomics, the roles are reversed:
- The Database (): The reference genome (e.g., 3 billion bases), which is static.
- The Query (): Millions of short reads (e.g., 100 bases each), which change with every patient or experiment.
To optimize this, we preprocess the genome rather than the reads.
2. What is a Suffix Array?
A Suffix Array is a sorted list of all suffixes of a string. However, storing the full strings for every suffix would require space—roughly an exabyte for a human genome!
The Space-Efficient Solution
Instead of storing the strings, we store only the starting index of each suffix. Because the original genome is already in memory, we can compare any two suffixes by looking at the characters starting at their respective indices.
Example for = GCATCGC:
| Suffix Index | Suffix String |
|---|---|
| 2 | ATCGC |
| 6 | C |
| 1 | CATCGC |
| 4 | CGC |
| 5 | GC |
| 0 | GCATCGC |
| 3 | TCGC |
The Suffix Array (): [2, 6, 1, 4, 5, 0, 3]
3. Searching for Reads
To find a read of length , we perform a binary search on the Suffix Array. Because the suffixes are sorted alphabetically, all suffixes starting with the same sequence will be grouped together in a contiguous range.
Finding the Range
We perform two modified binary searches to find the “clump” of matches:
- Left Bound: Find the first index in where the suffix starts with .
- Right Bound: Find the last index in where the suffix starts with .
Every integer in represents a starting position in the genome where the read matches perfectly.
4. Complexity & Performance
- Construction: Modern algorithms (like SA-IS) can build the Suffix Array in time and space.
- Search Time: For a single read of length , the search takes time.
- Total Mapping Time: For reads, the complexity is .
TIP
Parallelization: Because each read search is independent, we can map millions of reads simultaneously across thousands of CPU cores, making this highly efficient for modern sequencing data.
5. Summary Comparison
| Feature | Aho-Corasick | Suffix Array |
|---|---|---|
| Preprocessed Input | The Motifs (Short) | The Genome (Long) |
| Data Structure | Trie + Failure/Dict Links | Sorted Integer Array |
| Search Logic | Finite State Automaton | Binary Search |
| Best For | Finding many patterns in one sequence | Mapping many reads to one genome |