17. Abstract Syntax TreeData-Oriented AST

17. Abstract Syntax Tree: Data-Oriented Design

Traditional compilers model the Abstract Syntax Tree (AST) using the Object-Oriented “Visitor Pattern.” Nodes are represented as complex, heap-allocated classes containing 64-bit pointers to their children (Node* left, Node* right). While pedagogically simple, this scatters data randomly across the heap, destroying CPU cache locality and causing severe performance bottlenecks during semantic analysis.

To achieve significant performance gains—such as Zig’s reported 20% reduction in AST generation time and 40% reduction in IR generation time—this architecture abandons the OOP approach in favor of a Data-Oriented Design (DoD) AST, mapping closely to the design pioneered by Andrew Kelley for the Zig compiler.

Struct of Arrays (SoA)

Instead of an array of objects, the AST is implemented as a flat Struct of Arrays. The data is heavily packed and completely decoupled from memory pointers.

// Conceptual Zig/DoD Layout
const Tree = struct {
    tags: []NodeTag,         // Array of enums (e.g., BinaryOp, Return)
    data: []NodeData,        // Array of fixed-size data (two u32 integers)
    extras: []u32,           // Variable-length payloads
};

If a compiler pass only needs to count the number of return statements, the CPU only loads the tags array into the L1 cache. It never fetches the data array, drastically reducing memory bandwidth usage.

Eliminating 64-bit Pointers

In a Data-Oriented AST, there are zero 64-bit memory pointers.

When a binary addition node needs to reference its left and right operands, it simply stores two 32-bit integers in its NodeData slot. These integers are indices pointing to other rows in the flat arrays.

This optimization halves the memory footprint of node references (reducing an 8-byte pointer to a 4-byte index). Because the AST is just a collection of flat arrays, it can also be entirely managed by a single Arena Allocator and trivially serialized to disk without worrying about pointer serialization.

The extras Array for Variable-Length Nodes

A binary operation requires exactly two child indices, fitting perfectly into the two u32 slots of NodeData. However, a function call might have 5 or 10 arguments.

To handle nodes with variable-length children without resorting to heap-allocated arrays on the node itself, the architecture uses an extras array.

When a node requires more than two 32-bit slots, its NodeData instead stores:

  1. An index pointing into the extras array.
  2. The length of the payload.

The 5 function arguments are stored contiguously in the extras array, preserving perfect cache locality.

Immutable ASTs and Side-Tables

A core consequence of DoD is that AST nodes are packed as tightly as possible. You cannot mutate an AST node during Semantic Analysis to add a pointer to a resolved Symbol or a computed Type, because there is no room in the packed struct to hold it.

Instead, the Semantic Analyzer creates Side-Tables. If the parser emits an identifier at index 42, the Semantic Analyzer will compute its symbol and store a pointer to that symbol in a parallel array at resolved_symbols[42].

This keeps the base AST immutably small while allowing subsequent compiler passes to overlay as much semantic data as they need.


Next Module: Semantic Analysis