9. Regular Expressions, Grammar, and Languages

To work with formal languages and string patterns, it is essential to understand regular expressions, regular grammar, and regular languages. They provide the core mathematical rules for defining the strict syntax of programming languages.

9.1 Regular Expressions

Regular expressions are symbolic notations used to define search patterns in strings. They describe regular languages and are heavily relied upon to search for matching patterns within large datasets.

A regular expression represents a regular language if it follows these rules:

ε (epsilon) is a regular expression representing the language {ε} (the empty string).
Any symbol a from the input alphabet Σ is a regular expression, representing the language {a}.
Union (a + b) is a regular expression if a and b are regular expressions, representing the language {a, b}.
Concatenation (ab) is a regular expression if a and b are regular expressions.
Kleene star (a)* is a regular expression, meaning zero or more occurrences of ‘a’.

9.2 Regular Grammar

A regular grammar is a formal grammar that generates regular languages. It consists of:

Terminals: Symbols that form strings (e.g., a, b).
Non-terminals: Variables used to derive strings (e.g., S, A).
Production Rules: Rules for transforming non-terminals into terminals or other non-terminals.
Start Symbol: The non-terminal from which derivations begin.

Types of Regular Grammar

Right-linear Grammar: All production rules are of the form A → aB or A → a. Example: S → aS | bS | ε
Left-linear Grammar: All production rules are of the form A → Ba or A → a.

9.3 Regular Languages & Closure Properties

Regular languages are the class of languages that can be represented using finite automata, regular expressions, or regular grammar. Because their structures are rigidly bounded, they can be recognized by an automaton in $O(n)$ time.

Regular languages are closed under several operations:

Union: If $L_1$ and $L_2$ are regular, $L_1 \cup L_2$ is regular.
Intersection: If $L_1$ and $L_2$ are regular, $L_1 \cap L_2$ is regular.
Concatenation: If $L_1$ and $L_2$ are regular, $L_1.L_2$ is regular.
Kleene Closure: If $L_1$ is regular, $L_1^*$ is regular.
Complement: If $L(G)$ is regular, its complement $L’(G)$ is regular.

9.4 Designing Finite Automata from Regular Expressions

We can systematically translate any regular expression pattern directly into a state transition diagram (FA) using structural induction.

Example 1: Even number of a’s

Regular Expression: (b|ab*ab*)*

We can construct a finite automaton for this pattern:

For zero a’s, it remains in q0 (the final state).
For one a, it transitions from q0 to q1 (string is rejected).
For a second a, it transitions from q1 back to q0 (accepted).

Example 2: String with ‘ab’ as a substring

Regular Expression: (a|b)*ab(a|b)*

The automaton will:

Remain in q0 when reading bs.
Move to q1 after reading an a.
Move to q2 (Accept) if a b follows the a.

Example 3: String with count of ‘a’ divisible by 3

Regular Expression: (aaa)*

(Note: For an automaton where the number of a’s is 3n+1, we would simply make q1 the final state instead of q0.)

9.5 Arden’s Theorem

Arden’s Theorem is used to find the regular expression equivalent to a Deterministic Finite Automaton (DFA) or Nondeterministic Finite Automaton (NFA). It helps solve systems of equations with regular expressions.

The Theorem Statement

Let P, Q, and R be regular expressions over an alphabet Σ. If P does not contain ε (the empty string), the equation R = Q + RP has a unique solution given by:

R = QP*

Here:

Q: The regular expression independent of P.
P: The regular expression associated with R.

This theorem effectively states that if a state can be reached by a sequence without a loop (Q) followed by zero or more loops (P*), the solution is exactly QP*.

9.6 Identifying Regular Languages (Pumping Lemma)

Not all languages are regular. The Pumping Lemma is a mathematical proof technique used to determine if a language is not regular. It is a negativity test (if a language fails it, it’s definitely not regular; but passing it doesn’t guarantee regularity).

For quick exams and analysis, follow these general rules:

Finite Sets: Every finite set represents a regular language. (e.g., all strings of length = 2).
Bounded Parameters: Expressions with bounded constants are regular. (e.g., L = {a^n b^n | n <= 1000}).
Arithmetic Progression: A pattern that forms an Arithmetic Progression (linear power) is usually regular, while Geometric Progressions (non-linear powers) are not. (Note: This is a common heuristic/shortcut rule of thumb for quickly answering exam questions, not a formal mathematical theorem. Always rely on the Pumping Lemma for formal proofs).
Infinite Comparisons: If unbounded storage is required for counting and comparison (e.g., L = {a^n b^n | n >= 1}), it is not regular because an FA has no memory stack.

9.7 Star Height of Regular Expressions

The Star Height indicates the structural complexity of a regular expression, specifically referring to the maximum nesting depth of Kleene stars (*).

h(ε) = 0 and h(a) = 0
h(E ∪ F) = max(h(E), h(F))
h(E) = h(E) + 1*

Example: The star height of (a* b*)* is 2. The minimum star height required to describe a regular language indicates its complexity.

9.8 Kleene’s Theorem

Kleene’s Theorem states a fundamental equivalence in automata theory: For any Regular Expression r representing Language L(r), there exists a Finite Automaton that accepts the exact same language, and vice-versa.

Kleene’s theorem provides systematic methods to combine smaller automata into complex ones via operations like Union, Concatenation, and Kleene Closure (Star), linked by intermediate null (ε) transitions.

9.9 Mealy and Moore Machines

Finite state machines can also act as Transducers that produce output based on the input. There are two primary types:

Moore Machines

Output Dependence: Output depends only on the present state.
Definition: A 6-tuple (Q, q0, Σ, O, δ, λ), where O is the finite set of output symbols (Output Alphabet). The output function λ maps Q → O.
Output Length: The length of the output is always greater than the input by 1 (because an output is produced for the initial state before any input is read).

Mealy Machines

Output Dependence: Output depends on both the present state and current input symbol.
Definition: A 6-tuple (Q, q0, Σ, O, δ, λ'), where O is the finite set of output symbols (Output Alphabet). The output function λ' maps Q × Σ → O.
Output Length: The length of the output is exactly equal to the length of the input.

(Note: Every Moore machine can be converted into an equivalent Mealy machine, and vice-versa, though the number of states may increase when converting Mealy to Moore).

Next: Learn about Context-Free Grammars here.

8. Finite Automata 10. Context-Free Grammars