ABSTRACT

Encoding strings as integers is a method that treats an entire string of length as a single numerical value rather than a sequence of independent characters. This approach allows us to reach the theoretical minimum bit-length for encoding, overcoming the inefficiencies found in Fixed Length Character-By-Character Encoding For Strings (Fixed Length CBC).


Theoretical Minimum

When encoding -letter strings over an alphabet , the total number of possible strings is . To map each unique string to a unique binary sequence, we need:

This method is more space-efficient because it utilizes the “gaps” that character-by-character encoding leaves behind when the alphabet size is not a perfect power of 2.


The Encoding Procedure

Step 1: Alphabet Mapping

Establish an arbitrary ordering on the alphabet . Assign each character a value from to .

  • Example: If , then .

Step 2: Base Conversion (String to Integer)

Treat the string as a number in base-. Convert the string into a single integer using the assigned values.

  • A string in base- is calculated as:

Step 3: Binary Conversion

Convert the resulting base- integer into a base-2 (binary) string. Ensure the output is a fixed-width string of length by adding leading zeros if necessary.


Worked Example: 4-Letter Strings over

Parameters:

  • Alphabet size
  • String length
  • Total possible strings:
  • Required bits:

Encoding “BEAD”

  1. Values:

  2. Base-6 to Integer:

  3. Integer to Binary (11-bit fixed width): Padding to 11 bits

Encoding “FFFF”

  1. Values:

  2. Base-6 to Integer:

  3. Integer to Binary:


Comparison of Efficiency

Using the alphabet for a 4-letter string:

Encoding MethodCalculationTotal Bits
Fixed Length CBC12 bits
Strings as Integers11 bits

Conclusion: By treating the string as a single integer, we save 1 bit per 4 characters in this specific alphabet. Over very long strings, this efficiency gain scales significantly.