10. Context-Free Grammars
A Context-Free Grammar (CFG) is a formal grammar that defines a Context-Free Language (Type 2). CFGs are used in compiler design because they define the recursive syntactical structure of programming languages, which Regular Grammars cannot handle (like matching parentheses or nested if-else blocks).
They are processed by a Pushdown Automaton (PDA) and form the theoretical basis for the Syntax Analysis (Parsing) phase of a compiler.
10.1 Formal Definition of CFG
A Context-Free Grammar is formally defined as a 4-tuple: G = (V, Σ, R, S)
- V (Variables/Non-terminals): A finite set of variables (e.g.,
S, A, B) used to derive strings. - Σ (Terminals): A finite set of terminal symbols (the actual alphabet/characters of the language).
- R (Production Rules): A finite set of rules mapping variables to strings of variables and terminals. In a CFG, the left side of every rule must be exactly one single variable (e.g.,
A → α). - S (Start Symbol): The designated starting variable ($S \in V$).
10.2 Simplifying Context-Free Grammars
CFGs often contain redundant rules, unreachable symbols, or unnecessary complexities. Simplifying a CFG reduces its size while preserving the exact language it generates.
There are three types of redundant productions to remove. They must be removed in this exact order (removing null productions first can create new unit productions, so unit removal must come after):
- Null (ε) Productions
- Unit Productions
- Useless Productions
Step 1: Removal of Null (ε) Productions
A production of the form A → ε is a null production.
To remove them: Find all “nullable” variables (variables that can derive ε). For every rule containing a nullable variable on the right-hand side, add new rules simulating what happens if that variable were replaced by ε. Finally, remove the original A → ε rules.
Step 2: Removal of Unit Productions
A production of the form A → B (where both are single variables) is a unit production.
To remove them: If A → B and B → x | y, you bypass B and directly add A → x | y to the grammar. Once all paths are resolved, delete the unit productions.
Step 3: Removal of Useless Productions
A variable or production is useless if it can never take part in the derivation of a valid terminal string. This is a two-step process that must be done in order:
- Identify and remove non-generating variables: Variables that never eventually resolve to purely terminal strings.
- Identify and remove unreachable variables: After removing non-generating variables, check which variables can no longer be reached from the start symbol
S, and remove them.
(Note: Checking reachability before removing non-generating variables can leave useless symbols in the grammar, as a symbol might look reachable until a non-generating intermediate symbol is deleted).
10.3 Chomsky Normal Form (CNF)
Converting a simplified CFG to a standard format called Chomsky Normal Form (CNF) is a prerequisite for many advanced parsing algorithms (like the CYK algorithm) and proves mathematical properties about tree depths.
A CFG is in CNF if every production rule strictly matches one of these two forms:
A → BC(A variable generating exactly two variables)A → a(A variable generating exactly one terminal)
(Exception: The start symbol can generate ε (S → ε), provided S doesn’t appear on the right side of any rule).
Advantages of CNF
- A derivation of a string of length $n$ will always take exactly $2n - 1$ steps.
- Parse trees for CNF grammars are strictly binary trees. Parsing a CNF grammar using the CYK algorithm takes $O(n^3)$ time.
How to Convert CFG to CNF
- Eliminate the Start Symbol from RHS: If the start symbol
Sappears on the right side of any rule, create a new start symbolS0 → S. - Ensure CFG is Simplified: Remove Null, Unit, and Useless productions (as detailed in 10.2).
- Replace mixed terminals: For rules mixing terminals and variables (like
A → aB), introduce a new variable for the terminal (e.g.,X → a, changing the rule toA → XB). - Break down long rules: For rules with more than two variables (like
A → BCD), cascade them into groups of two (e.g.,A → BXandX → CD).
10.4 Ambiguity in Context-Free Grammars
A CFG is ambiguous if there exists at least one string in its language with multiple valid interpretations. Any of the following are equivalent evidence of ambiguity for a given string:
- It has more than one Leftmost Derivation (LMD).
- It has more than one Rightmost Derivation (RMD).
- It has more than one distinct Parse Tree.
Ambiguity is a major problem in compiler design. For example, consider the “Dangling Else” problem:
S → if E then S | if E then S else S | OTHERIf we parse the string if E1 then if E2 then S1 else S2, it is ambiguous because the else S2 could legally attach to the first if E1 or the second if E2.
Compilers handle ambiguity by rewriting the grammar to be unambiguous, or by applying external rules like operator precedence and associativity.
10.5 Pumping Lemma for CFG
Just as there is a Pumping Lemma for Regular Languages, there is a Pumping Lemma for Context-Free Languages. It is a mathematical negativity test used to prove that a specific language is not context-free.
If a language requires unbounded matching across three or more synchronized components (e.g., $L = {a^n b^n c^n | n \ge 0}$), a CFG (which uses a single stack) cannot track it. (Note: This “three synchronized components” idea is a helpful heuristic for quickly identifying non-context-free languages, but it is not a proof. A formal proof requires using the Pumping Lemma directly—picking a string and demonstrating that every possible way of pumping it violates the required count).
Next: Explore how Pushdown Automata process these languages here.