15. Lexer Architecture: Directed Acyclic Word Graphs

After the jump table routes the lexer to scan a continuous string of characters (an identifier), the lexer must determine if the string is a user-defined name or a reserved language keyword (e.g., if, class, return).

The Hashing Bottleneck

The conventional approach maps keywords using a Hash Table. While $O(1)$ in theory, computing the hash function for every scanned identifier introduces arithmetic overhead, and traversing the hash map structures risks cache misses.

To eliminate hashing overhead, the keyword dictionary is encoded into a finite state automaton.

Tries vs. DAWGs

A Trie (Prefix Tree) is a standard structure for string matching. It provides deterministic $O(L)$ lookups (where $L$ is string length) by sharing common prefixes. However, Tries fail to share common suffixes, resulting in redundant nodes and a bloated memory footprint.

A DAWG (Directed Acyclic Word Graph)—formally a Minimal Acyclic Finite State Automaton—compresses the dictionary by sharing both prefixes and suffixes. If multiple paths eventually lead to the same sequence of terminating characters, a DAWG merges those paths. The result is the mathematically minimal number of nodes required to represent the keyword set. Because the node count is minimized, the DAWG easily fits into the CPU cache, making the $O(L)$ traversal significantly faster than a Trie or Hash Table.

Incremental Construction

Building a massive Trie and then applying standard DFA minimization (like Hopcroft’s) requires excessive peak memory. Instead, the DAWG is constructed incrementally using the algorithm formalized by Daciuk et al.:

  1. Alphabetical Insertion: Words are inserted into the graph in strict alphabetical order. This guarantees that once a word is added, the algorithm will never need to backtrack into a completely separate branch.
  2. Common Prefix Identification: For each new word, the algorithm finds the longest common prefix shared with the previously inserted word.
  3. Suffix Minimization: Before adding the new suffix, the algorithm traverses the differing suffix of the previous word from the leaves upward. It checks if any of those nodes are identical to nodes already present in the minimized graph.
  4. Merging: Nodes are deemed identical if they share the same terminal state and have identical outgoing edges pointing to identical child nodes. Identical nodes are merged by re-pointing edges to the pre-existing node, deleting the duplicate.
  5. Addition: The unique suffix of the new word is then appended to the common prefix.

By minimizing incrementally during insertion, the memory footprint remains extremely low. The final structure provides an efficient, hash-free, cache-friendly keyword resolver.


Next Module: Parser