16. Parser: Synthetic Indentation Tokens
Before the parser can evaluate control flow, it must understand structural boundaries. For languages that use significant whitespace (the off-side rule) instead of braces ({}), encoding whitespace logic directly into the parser breaks its context-free predictability.
Context-Sensitive Bottlenecks
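Parsing indentation directly in the syntax phase forces the grammar to become context-sensitive: whether a line opens, continues, or closes a block depends on how its leading whitespace compares with that of earlier lines. The parser would have to maintain state regarding column numbers and tab widths. A high-performance predictive LL(1) parser, which selects each production from a single token of lookahead, has no way to carry that state.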
The Synthetic Token Architecture
To preserve a pure, context-free parser, the burden of whitespace is shifted entirely to the Lexical Analyzer. The lexer maintains a stack of indentation depths (integers).
- INDENT: When a logical line begins with deeper indentation than the depth on top of the stack, the lexer pushes the new depth and emits a synthetic INDENT token.
- DEDENT: When a logical line begins with less indentation, the lexer pops depths off the stack until the top matches the line's depth, emitting a synthetic DEDENT token for every pop.
- NEWLINE: The end of each logical line is emitted as a NEWLINE token, which acts as a statement terminator.
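As a rough illustration of the three rules above, here is a minimal C++ sketch of an indentation-tracking lexer. The names Kind, Token, and tokenize are hypothetical and chosen only for this example. It assumes indentation is counted in leading spaces (no tabs), skips blank lines, rejects a dedent that does not land on an earlier depth, and closes any open blocks at end of input. Those last choices are assumptions borrowed from common practice, not requirements stated in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; a real lexer has many more.
enum class Kind { Indent, Dedent, Newline, Text };

struct Token {
    Kind kind;
    std::string text;  // only meaningful for Kind::Text
};

// Assumption: indentation is measured in leading spaces only (no tabs).
std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> out;
    std::vector<int> depths{0};  // indentation stack; column 0 is the base level
    std::istringstream in(source);
    std::string line;

    while (std::getline(in, line)) {
        std::size_t first = line.find_first_not_of(' ');
        if (first == std::string::npos) continue;  // blank line: no structure
        int depth = static_cast<int>(first);

        assert(!depths.empty());  // the base level is never popped
        if (depth > depths.back()) {
            // Deeper than the top of the stack: open one new block.
            depths.push_back(depth);
            out.push_back({Kind::Indent, ""});
        } else {
            // Shallower: close blocks until the top matches this line.
            while (depth < depths.back()) {
                depths.pop_back();
                out.push_back({Kind::Dedent, ""});
            }
            if (depth != depths.back())
                throw std::runtime_error("dedent does not match any outer level");
        }

        out.push_back({Kind::Text, line.substr(first)});
        out.push_back({Kind::Newline, ""});  // statement terminator
    }

    // Assumption: blocks still open at end of input are closed implicitly.
    while (depths.size() > 1) {
        depths.pop_back();
        out.push_back({Kind::Dedent, ""});
    }
    return out;
}
```

Calling tokenize on a two-level snippet such as "if x:" followed by an indented "y" yields one INDENT before "y" and one DEDENT once the indentation returns to column 0, so the parser never sees a column number.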
// Conceptual Lexer Stack
Stack<int> indentations;
indentations.push(0);

By the time the token stream reaches the Syntax Analyzer, the whitespace has been completely abstracted away. The parser simply treats INDENT and DEDENT exactly as a C compiler treats { and }.
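To show what this looks like from the parser's side, the following C++ sketch consumes an already-tokenized stream with one token of lookahead. It is an illustration under assumptions, not a definitive design: the Kind, Token, Node, and Parser names are hypothetical, Kind::End is a sentinel invented for this example, and the toy grammar (a statement is a line of text optionally followed by an indented block) stands in for a real language grammar.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token kinds; End is a sentinel returned past the last token.
enum class Kind { Indent, Dedent, Newline, Text, End };

struct Token {
    Kind kind;
    std::string text;
};

// A statement and the statements nested in its block, if any.
struct Node {
    std::string text;
    std::vector<Node> children;
};

class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : toks_(std::move(tokens)) {}

    // program := block End
    std::vector<Node> parseProgram() {
        std::vector<Node> body = parseBlock();
        if (peek() != Kind::End) throw std::runtime_error("trailing tokens");
        return body;
    }

private:
    // block := statement*
    std::vector<Node> parseBlock() {
        std::vector<Node> stmts;
        while (peek() == Kind::Text) stmts.push_back(parseStatement());
        return stmts;
    }

    // statement := Text Newline [ Indent block Dedent ]
    Node parseStatement() {
        Node node{expect(Kind::Text).text, {}};
        expect(Kind::Newline);
        if (peek() == Kind::Indent) {
            expect(Kind::Indent);   // plays the role of '{'
            node.children = parseBlock();
            expect(Kind::Dedent);   // plays the role of '}'
        }
        return node;
    }

    // One token of lookahead is enough to choose every production.
    Kind peek() const {
        return pos_ < toks_.size() ? toks_[pos_].kind : Kind::End;
    }

    const Token& expect(Kind k) {
        assert(k != Kind::End);  // End is a sentinel, never a stored token
        if (peek() != k) throw std::runtime_error("unexpected token");
        return toks_[pos_++];
    }

    std::vector<Token> toks_;
    std::size_t pos_ = 0;
};
```

Nothing in Parser mentions columns, spaces, or tabs. Replacing Kind::Indent and Kind::Dedent with brace tokens would leave parseStatement unchanged in shape, which is the point of moving the whitespace logic into the lexer.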