16. Parser: Synthetic Indentation Tokens

Before the parser can recognize control-flow constructs, it must know where blocks begin and end. In languages that use significant whitespace (the off-side rule) instead of braces ({}), handling indentation directly in the parser would cost it the predictability of a context-free grammar.

Context-Sensitive Bottlenecks

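Parsing indentation directly in the syntax phase forces the grammar to become context-sensitive: whether a line opens, continues, or closes a block depends on how its leading whitespace compares with that of earlier lines. The parser would have to maintain state about column numbers and tab widths. This conflicts with the strict LL(1) requirement of a high-performance predictive parser, which selects each production from a single token of lookahead.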

The Synthetic Token Architecture

To preserve a pure, context-free parser, the burden of whitespace is shifted entirely to the Lexical Analyzer. The lexer maintains a stack of indentation depths (integers).

  1. INDENT: When a logical line begins with a deeper indentation than the top of the stack, the lexer pushes the new depth and emits a synthetic INDENT token.
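  2. DEDENT: When a logical line begins with less indentation, the lexer pops values off the stack until the top matches the new depth, emitting a synthetic DEDENT token for every pop. If no entry on the stack matches, the indentation is inconsistent and the lexer should report an error.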
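  3. NEWLINE: The line break that ends each logical line is emitted as a NEWLINE token, acting as a statement terminator.

// Conceptual lexer stack (C++-style; the push belongs in the lexer's initialization)
std::stack<int> indentations;  // requires <stack>; top() is the current depth
indentations.push(0);          // depth 0 is the top level and is never popped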
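
The three rules above can be sketched as a small C++ function. This is a minimal illustration under stated assumptions, not a production lexer: the names Tok, Token, and tokenize are hypothetical, indentation is spaces only, lines end with '\n', and each non-blank line becomes a single opaque Line token instead of being lexed further. Emitting the remaining DEDENT tokens at end of input is also an assumption of this sketch, added so that every INDENT has a matching DEDENT.

```cpp
#include <cstddef>
#include <stack>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds; only the structural ones matter here.
enum class Tok { Indent, Dedent, Newline, Line };

struct Token {
    Tok kind;
    std::string text;  // payload for Line tokens, empty otherwise
};

// Converts source text into tokens with synthetic INDENT/DEDENT markers.
// Assumes spaces-only indentation and '\n' line endings.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::stack<int> indentations;
    indentations.push(0);  // depth 0 is the top level

    std::size_t pos = 0;
    while (pos < src.size()) {
        std::size_t eol = src.find('\n', pos);
        if (eol == std::string::npos) eol = src.size();
        const std::string line = src.substr(pos, eol - pos);
        pos = eol + 1;

        const std::size_t first = line.find_first_not_of(' ');
        if (first == std::string::npos) continue;  // blank line: no tokens
        const int depth = static_cast<int>(first);

        if (depth > indentations.top()) {
            // Rule 1: deeper than the current block, so open a new one.
            indentations.push(depth);
            out.push_back({Tok::Indent, ""});
        } else {
            // Rule 2: pop until the depths match, one DEDENT per pop.
            while (depth < indentations.top()) {
                indentations.pop();
                out.push_back({Tok::Dedent, ""});
            }
            if (depth != indentations.top())
                throw std::runtime_error("inconsistent dedent");
        }

        out.push_back({Tok::Line, line.substr(first)});
        out.push_back({Tok::Newline, ""});  // Rule 3: statement terminator
    }

    // Close any blocks still open at end of input.
    while (indentations.size() > 1) {
        indentations.pop();
        out.push_back({Tok::Dedent, ""});
    }
    return out;
}
```

Calling tokenize on a header line followed by two lines indented by four spaces and then an unindented line yields the header's Line and Newline, one Indent, the two body lines with their Newlines, one Dedent, and finally the last line. The stack never holds more entries than the current nesting depth plus one.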

By the time the token stream reaches the Syntax Analyzer, the whitespace has been completely abstracted away. The parser simply treats INDENT and DEDENT exactly as a C compiler treats { and }.
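
To make the brace analogy concrete, here is a hedged sketch of a recursive-descent parser that consumes such a token stream using one token of lookahead. The Kind, Token, Node, and Parser names are hypothetical, and the two-rule grammar in the comments is an assumption chosen for illustration. Note that the parser never inspects columns or whitespace.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token and tree types for illustration.
enum class Kind { Indent, Dedent, Newline, Stmt };

struct Token {
    Kind kind;
    std::string text;
};

struct Node {
    std::string text;        // the statement itself
    std::vector<Node> body;  // nested block, empty if none
};

class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : toks_(std::move(tokens)) {}

    std::vector<Node> parseProgram() {
        std::vector<Node> nodes = parseBlock();
        if (pos_ != toks_.size()) throw std::runtime_error("unexpected token");
        return nodes;
    }

private:
    // Single-token lookahead: the only decision input the parser uses.
    bool at(Kind k) const {
        return pos_ < toks_.size() && toks_[pos_].kind == k;
    }

    Token expect(Kind k) {
        if (!at(k)) throw std::runtime_error("unexpected token");
        return toks_[pos_++];
    }

    // block := statement*
    std::vector<Node> parseBlock() {
        std::vector<Node> nodes;
        while (at(Kind::Stmt)) nodes.push_back(parseStatement());
        return nodes;
    }

    // statement := Stmt Newline [ Indent block Dedent ]
    Node parseStatement() {
        Node n{expect(Kind::Stmt).text, {}};
        expect(Kind::Newline);
        if (at(Kind::Indent)) {    // plays the role of '{'
            ++pos_;
            n.body = parseBlock();
            expect(Kind::Dedent);  // plays the role of '}'
        }
        return n;
    }

    std::vector<Token> toks_;
    std::size_t pos_ = 0;
};
```

Replacing Kind::Indent and Kind::Dedent with brace tokens would leave the parsing logic unchanged, which is the point of pushing the indentation stack into the lexer.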


Next Module: Recursive Descent & Panic Mode