9. Regular Expressions, Grammar, and Languages
To work with formal languages and string patterns, it is essential to understand regular expressions, regular grammar, and regular languages. They provide the core mathematical rules for defining the strict syntax of programming languages.
9.1 Regular Expressions
Regular expressions are symbolic notations used to define search patterns in strings. They describe regular languages and are heavily relied upon to search for matching patterns within large datasets.
A regular expression represents a regular language if it follows these rules:
- ε (epsilon) is a regular expression representing the language
{ε}(the empty string). - Any symbol a from the input alphabet Σ is a regular expression, representing the language
{a}. - Union (a + b) is a regular expression if a and b are regular expressions, representing the language
{a, b}. - Concatenation (ab) is a regular expression if a and b are regular expressions.
- Kleene star (a)* is a regular expression, meaning zero or more occurrences of ‘a’.
9.2 Regular Grammar
A regular grammar is a formal grammar that generates regular languages. It consists of:
- Terminals: Symbols that form strings (e.g., a, b).
- Non-terminals: Variables used to derive strings (e.g., S, A).
- Production Rules: Rules for transforming non-terminals into terminals or other non-terminals.
- Start Symbol: The non-terminal from which derivations begin.
Types of Regular Grammar
- Right-linear Grammar: All production rules are of the form
A → aBorA → a. Example:S → aS | bS | ε - Left-linear Grammar: All production rules are of the form
A → BaorA → a.
9.3 Regular Languages & Closure Properties
Regular languages are the class of languages that can be represented using finite automata, regular expressions, or regular grammar. Because their structures are rigidly bounded, they can be recognized by an automaton in $O(n)$ time.
Regular languages are closed under several operations:
- Union: If $L_1$ and $L_2$ are regular, $L_1 \cup L_2$ is regular.
- Intersection: If $L_1$ and $L_2$ are regular, $L_1 \cap L_2$ is regular.
- Concatenation: If $L_1$ and $L_2$ are regular, $L_1.L_2$ is regular.
- Kleene Closure: If $L_1$ is regular, $L_1^*$ is regular.
- Complement: If $L(G)$ is regular, its complement $L’(G)$ is regular.
9.4 Designing Finite Automata from Regular Expressions
We can systematically translate any regular expression pattern directly into a state transition diagram (FA) using structural induction.
Example 1: Even number of a’s
Regular Expression: (b|ab*ab*)*
We can construct a finite automaton for this pattern:
- For zero
a’s, it remains inq0(the final state). - For one
a, it transitions fromq0toq1(string is rejected). - For a second
a, it transitions fromq1back toq0(accepted).
Example 2: String with ‘ab’ as a substring
Regular Expression: (a|b)*ab(a|b)*
The automaton will:
- Remain in
q0when readingbs. - Move to
q1after reading ana. - Move to
q2(Accept) if abfollows thea.
Example 3: String with count of ‘a’ divisible by 3
Regular Expression: (aaa)*
(Note: For an automaton where the number of a’s is 3n+1, we would simply make q1 the final state instead of q0.)
9.5 Arden’s Theorem
Arden’s Theorem is used to find the regular expression equivalent to a Deterministic Finite Automaton (DFA) or Nondeterministic Finite Automaton (NFA). It helps solve systems of equations with regular expressions.
The Theorem Statement
Let P, Q, and R be regular expressions over an alphabet Σ. If P does not contain ε (the empty string), the equation R = Q + RP has a unique solution given by:
R = QP*
Here:
- Q: The regular expression independent of
P. - P: The regular expression associated with
R.
This theorem effectively states that if a state can be reached by a sequence without a loop (Q) followed by zero or more loops (P*), the solution is exactly QP*.
9.6 Identifying Regular Languages (Pumping Lemma)
Not all languages are regular. The Pumping Lemma is a mathematical proof technique used to determine if a language is not regular. It is a negativity test (if a language fails it, it’s definitely not regular; but passing it doesn’t guarantee regularity).
For quick exams and analysis, follow these general rules:
- Finite Sets: Every finite set represents a regular language. (e.g., all strings of length = 2).
- Bounded Parameters: Expressions with bounded constants are regular. (e.g.,
L = {a^n b^n | n <= 1000}). - Arithmetic Progression: A pattern that forms an Arithmetic Progression (linear power) is usually regular, while Geometric Progressions (non-linear powers) are not. (Note: This is a common heuristic/shortcut rule of thumb for quickly answering exam questions, not a formal mathematical theorem. Always rely on the Pumping Lemma for formal proofs).
- Infinite Comparisons: If unbounded storage is required for counting and comparison (e.g.,
L = {a^n b^n | n >= 1}), it is not regular because an FA has no memory stack.
9.7 Star Height of Regular Expressions
The Star Height indicates the structural complexity of a regular expression, specifically referring to the maximum nesting depth of Kleene stars (*).
- h(ε) = 0 and h(a) = 0
- h(E ∪ F) = max(h(E), h(F))
- h(E) = h(E) + 1*
Example: The star height of (a* b*)* is 2. The minimum star height required to describe a regular language indicates its complexity.
9.8 Kleene’s Theorem
Kleene’s Theorem states a fundamental equivalence in automata theory:
For any Regular Expression r representing Language L(r), there exists a Finite Automaton that accepts the exact same language, and vice-versa.
Kleene’s theorem provides systematic methods to combine smaller automata into complex ones via operations like Union, Concatenation, and Kleene Closure (Star), linked by intermediate null (ε) transitions.
9.9 Mealy and Moore Machines
Finite state machines can also act as Transducers that produce output based on the input. There are two primary types:
Moore Machines
- Output Dependence: Output depends only on the present state.
- Definition: A 6-tuple
(Q, q0, Σ, O, δ, λ), whereOis the finite set of output symbols (Output Alphabet). The output functionλmapsQ → O. - Output Length: The length of the output is always greater than the input by 1 (because an output is produced for the initial state before any input is read).
Mealy Machines
- Output Dependence: Output depends on both the present state and current input symbol.
- Definition: A 6-tuple
(Q, q0, Σ, O, δ, λ'), whereOis the finite set of output symbols (Output Alphabet). The output functionλ'mapsQ × Σ → O. - Output Length: The length of the output is exactly equal to the length of the input.
(Note: Every Moore machine can be converted into an equivalent Mealy machine, and vice-versa, though the number of states may increase when converting Mealy to Moore).