14. Lexer Architecture: Jump Tables
Once the bounds-check overhead is eliminated via the sentinel byte, the remaining bottleneck in the lexer loop is character dispatch—determining which token processing logic to execute based on the current character.
The Cost of Branching
A standard approach to character dispatch utilizes large switch statements or chained if-else blocks:
switch (c) {
case 'a' ... 'z': return lex_identifier();
case '0' ... '9': return lex_number();
case '+': return lex_plus();
// ...
}While functional, this approach generates heavy control-flow dependencies. The CPU must evaluate branches sequentially or rely on branch predictors which can mispredict on highly variable input (like alternating whitespace, identifiers, and operators).
Table-Driven DFA Dispatch
A high-performance lexer models the transition function ($\delta$) of a Deterministic Finite Automaton using a Jump Table. A jump table uses the numeric (ASCII/UTF-8 byte) value of the character as a direct index into an array of function pointers, state IDs, or handler labels.
This provides $O(1)$ dispatch time with zero conditional branching:
// Direct array lookup
handler = jump_table[current_char];
handler();The Space vs. Speed Trade-off: Row Displacement
A pure 2D state-transition matrix (state_t transition_table[MAX_STATES][256]) provides instantaneous $O(1)$ lookups for any state and any character. However, practical programming languages require thousands of states. A full 2D matrix allocates space for every character in every state, leading to massive memory bloat (megabytes of data). This pushes the table out of the CPU’s fast L1/L2 caches, causing RAM latency to bottleneck the lexer.
To retain $O(1)$ speed while fitting the table in the CPU cache, the matrix is compressed using Row Displacement (a technique used by lexer generators like classic lex, flex, and ocamllex).
Row Displacement Compression
The 2D matrix is highly sparse—most states have a single “default” transition (like an error state) and very few specific character transitions. Row displacement packs these sparse rows into a single, dense 1D array:
- Overlapping: The rows (state vectors) are shifted by varying offsets. Non-conflicting transitions from different states are overlapped into the same empty spaces in the 1D array.
- Disambiguation: Each cell in the 1D array stores not just the target state, but also the input character (or source state) that placed it there.
- Lookup: To find the next state, the lexer looks at
array[offset[current_state] + input_char]. If the disambiguation tag matches, the transition is valid. Otherwise, the lexer falls back to the default transition for that state.
This achieves the extreme memory compactness of an associative list while executing with the raw speed of direct array indexing. For typical language designs, this keeps the lexer’s hot loop entirely within the L1 cache.