ABSTRACT
Encoding strings as integers is a method that treats an entire string of length as a single numerical value rather than a sequence of independent characters. This approach allows us to reach the theoretical minimum bit-length for encoding, overcoming the inefficiencies found in Fixed Length Character-By-Character Encoding For Strings (Fixed Length CBC).
Theoretical Minimum
When encoding -letter strings over an alphabet , the total number of possible strings is . To map each unique string to a unique binary sequence, we need:
This method is more space-efficient because it utilizes the “gaps” that character-by-character encoding leaves behind when the alphabet size is not a perfect power of 2.
The Encoding Procedure
Step 1: Alphabet Mapping
Establish an arbitrary ordering on the alphabet . Assign each character a value from to .
- Example: If , then .
Step 2: Base Conversion (String to Integer)
Treat the string as a number in base-. Convert the string into a single integer using the assigned values.
- A string in base- is calculated as:
Step 3: Binary Conversion
Convert the resulting base- integer into a base-2 (binary) string. Ensure the output is a fixed-width string of length by adding leading zeros if necessary.
Worked Example: 4-Letter Strings over
Parameters:
- Alphabet size
- String length
- Total possible strings:
- Required bits:
Encoding “BEAD”
-
Values:
-
Base-6 to Integer:
-
Integer to Binary (11-bit fixed width): Padding to 11 bits
Encoding “FFFF”
-
Values:
-
Base-6 to Integer:
-
Integer to Binary:
Comparison of Efficiency
Using the alphabet for a 4-letter string:
| Encoding Method | Calculation | Total Bits |
|---|---|---|
| Fixed Length CBC | 12 bits | |
| Strings as Integers | 11 bits |
Conclusion: By treating the string as a single integer, we save 1 bit per 4 characters in this specific alphabet. Over very long strings, this efficiency gain scales significantly.
Related Notes
- Lossless Encoding – The requirements for one-to-one mapping.
- Fixed Length Character-By-Character Encoding For Strings (Fixed Length CBC) – Comparison of simplicity vs. space efficiency.
- Variable Length Character-By-Character Encoding for Strings (Variable Length CBC) – Another approach to compression based on frequency.