1. Data vs. Information
To understand compression, we must distinguish between the physical representation and the conceptual meaning.
- Data: The raw representation of information (e.g., numbers, symbols, measurements).
- Information: The actual content or “message” extracted from that data.
Example
A grade of “100%” is data. The information derived is “this student succeeded in the course.”
2. Entropy: The Measure of Predictability
In Information Theory, Shannon entropy measures the unpredictability of information content.
- High Entropy (Uniform Distribution): When symbols appear with balanced frequencies (e.g., `ACGT`). This is "unpredictable" and carries more information per character.
- Low Entropy (Skewed Distribution): When one symbol dominates (e.g., `AAAA`). This is "predictable" and carries very little information.
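The contrast above can be computed directly. The sketch below implements Shannon entropy from character frequencies; the two sample strings are illustrative inputs, not from the original text.

```python
from collections import Counter
from math import log2

def shannon_entropy(text: str) -> float:
    """Average information content of the text, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(shannon_entropy("ACGTACGTACGTACGT"))  # balanced symbols: 2.0 bits/char
print(shannon_entropy("AAAAAAAAAAAAAAAT"))  # one symbol dominates: ~0.34 bits/char
```

A perfectly balanced 4-symbol alphabet hits the maximum of log2(4) = 2 bits per character; the skewed string falls far below it.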
The Golden Rule of Compression: the lower the entropy of the data, the further it can be compressed.
3. Optimizing Storage: Fixed vs. Variable Length
The goal of compression is to make the memory used as close to the information content as possible.
A. Fixed-Length Encoding (Naive)
By default, text files use 8 bits (1 byte) per character (ASCII).
- DNA Example: Since DNA has only 4 letters, we need only 2 bits per letter (log2(4) = 2).
- A → 00, C → 01, G → 10, T → 11.
- Result: A guaranteed 4-fold reduction in file size (from 8 bits to 2 bits per letter).
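The fixed-length scheme above can be sketched in a few lines, using the 2-bit mapping from the text; the input string is just an illustrative example.

```python
# Fixed-length 2-bit code for DNA, as given in the text.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {v: k for k, v in CODE.items()}

def encode(dna: str) -> str:
    """Map each base to its fixed 2-bit code."""
    return "".join(CODE[base] for base in dna)

def decode(bits: str) -> str:
    """Read the bit string back in fixed 2-bit chunks."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

bits = encode("GATTACA")
print(bits)  # 14 bits, versus 7 * 8 = 56 bits in ASCII
assert decode(bits) == "GATTACA"
```

Because every code word is exactly 2 bits, decoding never needs delimiters: the reader always consumes fixed-size chunks.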
B. Variable-Length Encoding (Smart)
We can do better by using our knowledge of character frequency. We assign shorter codes to the most frequent characters and longer codes to rare ones.
- Scenario: A message where A is very common and T is rare.
- Mapping: A → 0, C → 10, G → 110, T → 111.
- Trade-off: We lose efficiency on T, but because A appears so much more often, the average bits per character drops significantly.
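The average cost of the variable-length code is just a frequency-weighted sum of code lengths. The frequencies below are hypothetical, chosen to match the scenario where A is very common and T is rare.

```python
# Hypothetical symbol frequencies (not from the original text).
freqs = {"A": 0.70, "C": 0.15, "G": 0.10, "T": 0.05}
# Code lengths for the mapping A -> 0, C -> 10, G -> 110, T -> 111.
lengths = {"A": 1, "C": 2, "G": 3, "T": 3}

avg_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(avg_bits)  # ~1.45 bits/char, versus 2 bits/char with fixed-length codes
```

Even though T now costs 3 bits instead of 2, the dominant A costing only 1 bit pulls the average well below the fixed-length rate.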
4. The Overhead Cost
When using frequency-based encoding, we cannot assume the recipient knows our mapping. We must include a header at the start of the file that contains the frequency information or the coding tree.
- Large Files: The overhead is negligible compared to the massive savings.
- Small Files: The overhead might actually make the “compressed” file larger than the original.
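A rough back-of-the-envelope calculation makes the break-even point concrete. The header size and average code length below are assumptions for illustration, not values from the original text.

```python
# Assumed header cost for storing the code table (hypothetical).
HEADER_BYTES = 16

def compressed_size(n_chars: int, avg_bits: float = 1.45) -> float:
    """Total bytes: fixed header plus the variable-length payload."""
    return HEADER_BYTES + n_chars * avg_bits / 8

for n in (10, 1_000_000):
    original = n  # 1 byte per character in plain ASCII
    print(f"{n:>9} chars: {original} B original, "
          f"{compressed_size(n):.1f} B compressed")
```

For a 10-character message the header alone outweighs the original file, while for a million characters it vanishes into the savings.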
Key Takeaway
Entropy defines the theoretical limit of how much we can compress data. If a message has 1.6 bits of entropy per character, we can never compress it below 1.6 bits per character on average without losing information.