Chomsky Normal Form


A Context-Free Grammar (CFG) is in Chomsky Normal Form (CNF) if every rule is of the form

    \begin{equation*} A \rightarrow BC \end{equation*}


    \begin{equation*} A \rightarrow a \end{equation*}


    \begin{equation*} S \rightarrow \epsilon \end{equation*}

where A \in V, B, C \in V \setminus \{S\}, a \in \Sigma and S is the start variable. Note that the start variable cannot be present in the RHS of any rule.

CFG to a CFG in CNF

Consider the CFG to generate strings with balanced paranthesis

    \begin{equation*} S \rightarrow (S)S|[S]S|\epsilon \end{equation*}

Every CFG can be converted to a CFG in CNF by following the steps below.

1. No start variable on RHS of a rule

This can be done by adding a new start variable S'.

    \begin{align*} S' &\rightarrow S \\ S &\rightarrow (S)S|[S]S|\epsilon \\ \end{align*}

2. Null Variables

a. Determine which variables are nullable

A variable is nullable if it can generate \epsilon. The nullable variables for the above grammar are

    \begin{equation*} S, S' \end{equation*}

b. Whenever A \rightarrow xBy and B is nullable, add rule A \rightarrow xy

In essence, replace every instance of a nullable variable with \epsilon and consider all possiblities.

    \begin{align*} S' &\rightarrow S|\epsilon \\ S &\rightarrow (S)S|()S|(S)|()|[S]S|[]S|[S]|[]|\epsilon \\ \end{align*}

c. Remove all \epsilon rules except for S \rightarrow \epsilon (if present)

Now that all nullable variables generating \epsilon is accounted for, these rules can be removed.

    \begin{align*} S' &\rightarrow S|\epsilon \\ S &\rightarrow (S)S|()S|(S)|()|[S]S|[]S|[S]|[] \\ \end{align*}

3. Eliminate Unit Rules

Unit rules are of the type A \rightarrow B where both A and B are variables. Replacing B with B‘s RHS expansion will eliminate this unit rule.

    \begin{align*} S' &\rightarrow (S)S|()S|(S)|()|[S]S|[]S|[S]|[]|\epsilon \\ S &\rightarrow (S)S|()S|(S)|()|[S]S|[]S|[S]|[] \\ \end{align*}

4. Ensure that every RHS of length at least 2 contains only variables

This step involves creating new rules of the type A \rightarrow a where A is a variable and a is a terminal. Define the following new rules

    \begin{align*} T_( &\rightarrow ( \\ T_) &\rightarrow ) \\ T_[ &\rightarrow [ \\ T_] &\rightarrow ] \\ \end{align*}

The grammar is then modified to

    \begin{align*} S' &\rightarrow T_(ST_)S|T_(T_)S|T_(ST_)|T_(T_)|T_[ST_]S|T_[T_]S|T_[ST_]|T_[T_]|\epsilon \\ S &\rightarrow T_(ST_)S|T_(T_)S|T_(ST_)|T_(T_)|T_[ST_]S|T_[T_]S|T_[ST_]|T_[T_] \\ \end{align*}

5. Breakup RHS of length greater than 2

RHS of length greater than 2 can broken down into sequential chunks of 2. For example S \rightarrow T_(ST_)S can be written as

    \begin{align*} S &\rightarrow BS \\ B &\rightarrow AT_) \\ A &\rightarrow T_(S \\ \end{align*}

The final grammar in CNF is as follows

    \begin{align*} S' &\rightarrow BS|CS|AT_)|T_(T_)|ES|FS|DT_]|T_[T_]|\epsilon \\ S &\rightarrow BS|CS|AT_)|T_(T_)|ES|FS|DT_]|T_[T_] \\ A &\rightarrow T_(S \\ B &\rightarrow AT_) \\ C &\rightarrow T_(T_) \\ D &\rightarrow T_[S \\ E &\rightarrow DT_] \\ F &\rightarrow T_[T_] \\ T_( &\rightarrow ( \\ T_) &\rightarrow ) \\ T_[ &\rightarrow [ \\ T_] &\rightarrow ] \\ \end{align*}

Leave a Reply

Your email address will not be published. Required fields are marked *