ABSTRACT

In the postgenomic era, read mapping is a fundamental task: determining where millions of short DNA “reads” () align to a massive reference genome (). While Aho-Corasick is ideal for a fixed set of short motifs, the Suffix Array is the superior choice when the long reference genome is the fixed database. It enables substring searches using binary search on a sorted index of all genomic suffixes.


1. Shifting the Paradigm: Database vs. Query

The Aho-Corasick Automaton was used to search for many small motifs in one query sequence. In genomics, the roles are reversed:

  • The Database (): The reference genome (e.g., 3 billion bases), which is static.
  • The Query (): Millions of short reads (e.g., 100 bases each), which change with every patient or experiment.

To optimize this, we preprocess the genome rather than the reads.


2. What is a Suffix Array?

A Suffix Array is a sorted list of all suffixes of a string. However, storing the full strings for every suffix would require space—roughly an exabyte for a human genome!

The Space-Efficient Solution

Instead of storing the strings, we store only the starting index of each suffix. Because the original genome is already in memory, we can compare any two suffixes by looking at the characters starting at their respective indices.

Example for = GCATCGC:

Suffix IndexSuffix String
2ATCGC
6C
1CATCGC
4CGC
5GC
0GCATCGC
3TCGC

The Suffix Array (): [2, 6, 1, 4, 5, 0, 3]


3. Searching for Reads

To find a read of length , we perform a binary search on the Suffix Array. Because the suffixes are sorted alphabetically, all suffixes starting with the same sequence will be grouped together in a contiguous range.

Finding the Range

We perform two modified binary searches to find the “clump” of matches:

  1. Left Bound: Find the first index in where the suffix starts with .
  2. Right Bound: Find the last index in where the suffix starts with .

Every integer in represents a starting position in the genome where the read matches perfectly.


4. Complexity & Performance

  • Construction: Modern algorithms (like SA-IS) can build the Suffix Array in time and space.
  • Search Time: For a single read of length , the search takes time.
  • Total Mapping Time: For reads, the complexity is .

TIP

Parallelization: Because each read search is independent, we can map millions of reads simultaneously across thousands of CPU cores, making this highly efficient for modern sequencing data.


5. Summary Comparison

FeatureAho-CorasickSuffix Array
Preprocessed InputThe Motifs (Short)The Genome (Long)
Data StructureTrie + Failure/Dict LinksSorted Integer Array
Search LogicFinite State AutomatonBinary Search
Best ForFinding many patterns in one sequenceMapping many reads to one genome