Notes on Formal Languages, Grammars, and Backus-Naur Form (BNF)
An ALPHABET (or a VOCABULARY) is a finite set of symbols.
A STRING/SENTENCE is a finite sequence of symbols taken from a vocabulary.
A LANGUAGE is a (possibly infinite) set of strings.
A GRAMMAR G is a 4-tuple G = <V, A, P, S> where
V = a finite set of non-terminal symbols
A = a finite set of terminal symbols
P = a finite set of rules (called production rules)
S = a distinguished non-terminal symbol called the start symbol
A production rule is of the form: L ::= R, where L and R are finite
sequences of symbols (L non-empty) and the symbol "::=" means
"can be rewritten as."
A production rule L ::= R is CONTEXT-FREE if L is a single non-terminal.
A grammar G is context-free if all its production rules are context-free.
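As an illustration of the 4-tuple definition, here is a minimal Python
sketch. The class name Grammar, its field names, and the two small
grammars are hypothetical, chosen only for this example; a rule is stored
as a pair (L, R) of symbol tuples.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    nonterminals: frozenset  # V
    terminals: frozenset     # A
    rules: tuple             # P: pairs (L, R), each side a tuple of symbols
    start: str               # S

def is_context_free(g: Grammar) -> bool:
    """True if the left side of every rule is a single non-terminal."""
    return all(len(lhs) == 1 and lhs[0] in g.nonterminals
               for lhs, _ in g.rules)

# Hypothetical grammar: every left side is one non-terminal.
cf = Grammar(frozenset({"S"}), frozenset({"a", "b"}),
             ((("S",), ("a", "S", "b")), (("S",), ("a", "b"))), "S")

# Hypothetical grammar: the second rule has a two-symbol left side.
not_cf = Grammar(frozenset({"S", "X"}), frozenset({"a"}),
                 ((("S",), ("X", "a")), (("X", "a"), ("a", "a"))), "S")

print(is_context_free(cf), is_context_free(not_cf))  # True False
```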
A DERIVATION using grammar G is a finite sequence of sentences,
D(G) = S0, S1, S2, ... , Sn, such that:
1. S0 is the start symbol of G
2. Each Si is obtained from S(i-1) by application of a rule from G.
   I.e., S(i-1) is a sentence of the form aLb,
   Si is a sentence of the form aRb, and
   G contains the rule L ::= R,
   where a and b are (possibly empty) sequences of symbols.
We say that D(G) is a derivation of Sn.
D(G) is a leftmost (or canonical) derivation if each application
of a rule is performed on the leftmost non-terminal in S(i-1) to
produce Si.
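The definition of a derivation can be checked mechanically. The sketch
below assumes every symbol is a single character, so a sentence is a
Python string and a rule is a pair (L, R) of strings with L non-empty.
The function names and the tiny grammar S ::= aSb, S ::= ab are
hypothetical.

```python
def one_step(rules, before, after):
    """True if before = a+L+b and after = a+R+b for some rule (L, R)."""
    for lhs, rhs in rules:
        pos = before.find(lhs)
        while pos != -1:
            if before[:pos] + rhs + before[pos + len(lhs):] == after:
                return True
            pos = before.find(lhs, pos + 1)
    return False

def is_derivation(rules, start_symbol, steps):
    """True if steps begins with the start symbol and each sentence
    follows from the previous one by a single rule application."""
    return (len(steps) > 0 and steps[0] == start_symbol
            and all(one_step(rules, p, q) for p, q in zip(steps, steps[1:])))

rules = [("S", "aSb"), ("S", "ab")]
print(is_derivation(rules, "S", ["S", "aSb", "aabb"]))  # True
print(is_derivation(rules, "S", ["S", "aabb"]))         # False: no single rule does this
```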
Notation: To indicate that sentence Sj can be derived from
sentence Si (in 0 or more steps),
                 *
   we write  Si ===> Sj
                 G
If Sn consists of only terminal symbols, then Sn is in
the language generated by G.
The language generated by G is denoted by L(G);
                   *
   L(G) = { w | S ===> w, w consists of terminal symbols only }
                   G
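For a small grammar, the members of L(G) up to a given length can be
listed by searching over all derivations. This is only a sketch under
stated assumptions: symbols are single characters, and no rule shortens
a sentence (len(R) >= len(L)), which is what allows longer sentences to
be discarded. The function name and the grammar S ::= aSb, S ::= ab are
hypothetical.

```python
from collections import deque

def language_up_to(rules, nonterminals, start, max_len):
    """Terminal-only strings of length <= max_len derivable from start."""
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        form = queue.popleft()
        if not any(sym in nonterminals for sym in form):
            found.add(form)          # only terminals left: form is in L(G)
            continue
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:         # try the rule at every position
                nxt = form[:pos] + rhs + form[pos + len(lhs):]
                if len(nxt) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                pos = form.find(lhs, pos + 1)
    return found

rules = [("S", "aSb"), ("S", "ab")]
print(sorted(language_up_to(rules, {"S"}, "S", 6), key=len))
# ['ab', 'aabb', 'aaabbb']
```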
The Language of Balanced Parentheses
Examples: (()()) ()((())) ((()())(()))()
The following grammar defines the language whose sentences are
strings of balanced parentheses.
Terminals: { (, ) }
Non-terminals: {S, X} (S = start symbol)
Production Rules:
S ::= X
X ::= ()
X ::= (X)
X ::= XX
Derive  ( ( ( ) ( ) ) ( ( ) ) ) ( )
        | | |_| |_| | | |_| | | |_|
        | |_________| |_____| |
        |_____________________|
S
X
X X
X ( )
( X ) ( )
( X X ) ( )
( X ( X ) ) ( )
( X ( ( ) ) ) ( )
( ( X ) ( ( ) ) ) ( )
( ( X X ) ( ( ) ) ) ( )
( ( ( ) X ) ( ( ) ) ) ( )
( ( ( ) ( ) ) ( ( ) ) ) ( )
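Membership in this language can also be tested without building a
derivation. The sketch below uses a nesting-depth counter, which is a
different technique from applying the production rules, and the function
name is hypothetical. Because the grammar has no rule producing the
empty string, the empty string is rejected.

```python
def is_balanced(s: str) -> bool:
    """True if s is a non-empty string of balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ")" with no matching "(" before it
                return False
        else:
            return False         # not a terminal of the grammar
    return depth == 0 and len(s) > 0

for example in ["(()())", "()((()))", "((()())(()))()", "(()", ")("]:
    print(example, is_balanced(example))
```

The first three strings are the examples given above and are accepted;
the last two are rejected.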
Example: Integers without leading zeros
Examples: 712, +44, -8787, 900801
Terminals: {+, -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Non-terminals: {S, X, D, Y, Z}
Grammar:
S ::= 0
S ::= X
S ::= +X
S ::= -X
Y ::= Z
Y ::= ZY
X ::= D
X ::= DY
D ::= 1
.
.
.
D ::= 9
Z ::= 0
Z ::= D
S is the start symbol.
Derivation
S
+X
+DY
+4Y
+4Z
+4D
+44
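The derivation of +44 always rewrites the leftmost non-terminal, so it
can be replayed from just the list of rules chosen. A minimal sketch,
assuming single-character symbols; the function name is hypothetical.

```python
NONTERMINALS = set("SXDYZ")

def leftmost_derivation(start, choices):
    """Apply each rule (L, R) in choices to the leftmost non-terminal
    and return the list of sentences produced along the way."""
    forms = [start]
    for lhs, rhs in choices:
        current = forms[-1]
        # Position of the leftmost non-terminal in the current sentence.
        pos = next(i for i, sym in enumerate(current) if sym in NONTERMINALS)
        if current[pos] != lhs:
            raise ValueError(f"leftmost non-terminal is {current[pos]}, not {lhs}")
        forms.append(current[:pos] + rhs + current[pos + 1:])
    return forms

steps = leftmost_derivation(
    "S",
    [("S", "+X"), ("X", "DY"), ("D", "4"), ("Y", "Z"), ("Z", "D"), ("D", "4")],
)
print(steps)  # ['S', '+X', '+DY', '+4Y', '+4Z', '+4D', '+44']
```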
Derivation Trees
A DERIVATION TREE for a sentence Sn in L(G), where G is a
context-free grammar, is a tree in which:
1. Each node in the tree is a symbol from G
2. The root is the start symbol for G
3. If node L has children R1, R2, ... , Rk (in left-to-right order),
   then G contains the rule L ::= R where R = R1 R2 ... Rk
4. If node L is a leaf, then L is a terminal symbol or the
   symbol for the empty string.
5. If the leaves of the tree are L1, L2, ... , Lm
   in left-to-right order, then Sn = L1 L2 ... Lm
A context-free grammar G is UNAMBIGUOUS if for every sentence w
in L(G), there is a unique derivation tree for w (hence a
unique canonical derivation).
A given language will in general be generated by many grammars;
some may be ambiguous and some may not. An INHERENTLY AMBIGUOUS
language is one for which there does not exist an unambiguous
grammar.
Canonical Derivation        Derivation Tree

        S                         S
        +X                       / \
        +DY                     +   X
        +4Y                        / \
        +4Z                       D   Y
        +4D                       |   |
        +44                       4   Z
                                      |
                                      D
                                      |
                                      4
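The tree for +44 can be written down as nested data to check the
properties in the definition. In this sketch a node is a pair
(symbol, children) and a leaf has an empty child list; the
representation and the function names are assumptions made for
illustration. Reading the leaves left to right gives the sentence, and
listing the rules in preorder gives the rules of the canonical
derivation in order.

```python
tree = ("S", [("+", []),
              ("X", [("D", [("4", [])]),
                     ("Y", [("Z", [("D", [("4", [])])])])])])

def frontier(node):
    """Concatenate the leaves of the tree in left-to-right order."""
    symbol, children = node
    if not children:
        return symbol
    return "".join(frontier(child) for child in children)

def rules_used(node):
    """List the rule (L, R) at each interior node, in preorder."""
    symbol, children = node
    if not children:
        return []
    used = [(symbol, "".join(child[0] for child in children))]
    for child in children:
        used.extend(rules_used(child))
    return used

print(frontier(tree))    # +44
print(rules_used(tree))
# [('S', '+X'), ('X', 'DY'), ('D', '4'), ('Y', 'Z'), ('Z', 'D'), ('D', '4')]
```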
Backus-Naur Form (BNF)
This is just some convenient notation for describing a context-free
grammar more succinctly.
Non-terminal symbols are recognized by being enclosed in angle-brackets.
Terminal symbols are recognized by NOT being enclosed in angle-brackets.
The "|" symbol is used on the right hand side of rules to indicate
alternative choices.
There is no special declaration of the start symbol; it is usually obvious.
Example: G = <V, A, P, S> where
   V = {X}
   A = {a, b, c}
   P = {X ::= a,
        X ::= b,
        X ::= c}
   S = X
In BNF, we can specify G by the one rule:
<X> ::= a | b | c
Integers without leading zeros using BNF
<S> ::= <X> | +<X> | -<X> | 0
<X> ::= <D> | <D><Y>
<Y> ::= <Z> | <Z><Y>
<D> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<Z> ::= 0 | <D>
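To see that BNF is only shorthand, the five lines above can be expanded
back into individual production rules. This is a minimal sketch with a
hypothetical function name. It assumes one BNF rule per line, that every
terminal is a single character, and that "|" never appears as a terminal.

```python
import re

def parse_bnf(text):
    """Expand BNF lines into a list of rules (L, R), one per alternative."""
    rules = []
    for line in text.strip().splitlines():
        lhs, rhs = line.split("::=")
        for alternative in rhs.split("|"):
            # A symbol is either <name> or a single non-space character.
            symbols = re.findall(r"<[^>]+>|\S", alternative)
            rules.append((lhs.strip(), tuple(symbols)))
    return rules

BNF = """
<S> ::= <X> | +<X> | -<X> | 0
<X> ::= <D> | <D><Y>
<Y> ::= <Z> | <Z><Y>
<D> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<Z> ::= 0 | <D>
"""

rules = parse_bnf(BNF)
print(len(rules))   # 19
print(rules[1])     # ('<S>', ('+', '<X>'))
```

The 19 rules are the same ones listed individually in the earlier
grammar for integers without leading zeros.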