CSI311: Notes on Grammars

Notes on Formal Languages, Grammars, and Backus-Naur Form (BNF)

An ALPHABET (or a VOCABULARY) is a finite set of symbols.

A STRING/SENTENCE is a finite sequence of symbols taken from a vocabulary.

A LANGUAGE is a (possibly infinite) set of strings.

A GRAMMAR G is a 4-tuple G = <V, A, P, S> where

    V = a finite set of non-terminal symbols
    A = a finite set of terminal symbols
    P = a finite set of rules (called production rules)
    S = a distinguished non-terminal symbol called the start symbol

A production rule is of the form:  L ::= R,  where L and R are finite
sequences of symbols (L must be non-empty) and the symbol "::=" means
"can be rewritten as."

A production rule  L ::= R  is CONTEXT-FREE if L is a single non-terminal.

A grammar G is context-free if all its production rules are context-free.
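
As an illustration only, the 4-tuple and the context-free test can be
sketched in Python. The class name Grammar, the function name
is_context_free, and the example grammars G1 and G2 are assumptions
made for this sketch; they do not come from the notes.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    nonterminals: frozenset  # V
    terminals: frozenset     # A
    rules: tuple             # P: pairs (L, R), each a tuple of symbols
    start: str               # S

def is_context_free(g: Grammar) -> bool:
    """True if the left side of every rule is a single non-terminal."""
    return all(len(lhs) == 1 and lhs[0] in g.nonterminals
               for lhs, _ in g.rules)

# Hypothetical example grammars (not taken from the notes).
G1 = Grammar(frozenset({"S"}), frozenset({"a", "b"}),
             ((("S",), ("a", "S", "b")), (("S",), ("a", "b"))), "S")
# G2 adds a rule whose left side has two symbols, so it is not context-free.
G2 = G1._replace(rules=G1.rules + ((("a", "S"), ("S", "a")),))

print(is_context_free(G1))  # True
print(is_context_free(G2))  # False
```

Rules are stored as pairs of symbol sequences so that the same
structure can also hold rules that are not context-free.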

A DERIVATION using grammar G is a finite sequence of sentences,
D(G) = S0, S1, S2, ... , Sn, such that:

  1.  S0 is the start symbol of G
  2.  Each Si is obtained from S(i-1) by application of a rule from G.
      I.e., S(i-1) is a sentence of the form  aLb,
            Si     is a sentence of the form  aRb, and
            G contains the rule  L ::= R,
      where a and b are (possibly empty) sequences of symbols.

We say that D(G) is a derivation of Sn.

D(G) is a leftmost (or canonical) derivation if each application
of a rule is performed on the leftmost non-terminal in S(i-1) to
produce Si.
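
A minimal sketch of a single rewriting step and of a leftmost-derivation
check, assuming a context-free grammar whose symbols are single
characters. The function names and the tiny grammar (S ::= AB, A ::= a,
B ::= b) are hypothetical and chosen only for illustration.

```python
def step(form: str, lhs: str, rhs: str, pos: int) -> str:
    """Rewrite aLb into aRb, where L = lhs occurs at index pos of form."""
    if form[pos:pos + len(lhs)] != lhs:
        raise ValueError("lhs does not occur at the given position")
    return form[:pos] + rhs + form[pos + len(lhs):]

def leftmost_nonterminal(form: str, nonterminals: set):
    """Index of the leftmost non-terminal in form, or None if there is none."""
    for i, sym in enumerate(form):
        if sym in nonterminals:
            return i
    return None

def is_leftmost_derivation(forms, rules, nonterminals, start) -> bool:
    """Check that forms is a derivation in which every step rewrites
    the leftmost non-terminal of the previous sentence."""
    if not forms or forms[0] != start:
        return False
    for prev, cur in zip(forms, forms[1:]):
        pos = leftmost_nonterminal(prev, nonterminals)
        if pos is None:
            return False
        if not any(lhs == prev[pos] and step(prev, lhs, rhs, pos) == cur
                   for lhs, rhs in rules):
            return False
    return True

# Hypothetical grammar: S ::= AB, A ::= a, B ::= b
RULES = [("S", "AB"), ("A", "a"), ("B", "b")]
NONTERMINALS = {"S", "A", "B"}

print(is_leftmost_derivation(["S", "AB", "aB", "ab"], RULES, NONTERMINALS, "S"))  # True
print(is_leftmost_derivation(["S", "AB", "Ab", "ab"], RULES, NONTERMINALS, "S"))  # False
```

The second sequence is still a valid derivation of ab; it is rejected
only because its second step rewrites B while A is further left.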

Notation: To indicate that sentence Sj can be derived from
sentence Si (in 0 or more steps),

                   *
we write      Si  ===>  Sj
                   G

If Sn consists of only terminal symbols, then Sn is in
the language generated by G.

The language generated by G is denoted by L(G):

                 *
L(G) = { w | S  ===>  w, w consists of terminal symbols only }
                 G
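
Since L(G) may be infinite, a program can only list a finite part of
it. The sketch below enumerates the members of L(G) up to a length
bound by a breadth-first search over leftmost derivations. It assumes a
context-free grammar with single-character symbols in which no right
side is empty, so a sentence never gets shorter and longer ones can be
discarded safely. The function name and the example grammar
(S ::= aSb, S ::= ab) are hypothetical.

```python
from collections import deque

def language_up_to(rules, nonterminals, start, max_len):
    """Return every terminal string of length <= max_len derivable from start.

    Assumes context-free rules (lhs is one character) with non-empty
    right sides, so pruning sentences longer than max_len loses nothing.
    """
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        form = queue.popleft()
        pos = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if pos is None:
            words.add(form)          # terminals only: form is in L(G)
            continue
        for lhs, rhs in rules:
            if lhs == form[pos]:
                new = form[:pos] + rhs + form[pos + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return words

# Hypothetical grammar: S ::= aSb | ab
print(sorted(language_up_to([("S", "aSb"), ("S", "ab")], {"S"}, "S", 6)))
# ['aaabbb', 'aabb', 'ab']
```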



      The Language of Balanced Parentheses

Examples:  (()())     ()((()))     ((()())(()))()


The following grammar defines the language whose sentences are
non-empty strings of balanced parentheses.

Terminals:  { (,  ) }

Non-terminals:  {S, X}     (S = start symbol)

Production Rules:

S ::= X      X ::= ()      X ::= (X)         X ::= XX


Derive       (  (  (  )  (  )  )  (  (  )  )  )  (  )

             |  |  |__|  |__|  |  |  |__|  |  |  |__|
             |  |______________|  |________|  |
             |________________________________|



                S
                X
               X X
              X ( )
            ( X ) ( )
           ( X X ) ( )
         ( X ( X ) ) ( )
        ( X ( ( ) ) ) ( )
      ( ( X ) ( ( ) ) ) ( )
     ( ( X X ) ( ( ) ) ) ( ) 
    ( ( ( ) X ) ( ( ) ) ) ( )
   ( ( ( ) ( ) ) ( ( ) ) ) ( )
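
Membership in this language can also be tested without the grammar.
The sketch below is a simple depth counter, not a grammar-driven
parser. It assumes that the sentences of the grammar above are exactly
the non-empty balanced strings. The function name is_balanced is
hypothetical.

```python
def is_balanced(s: str) -> bool:
    """True if s is a non-empty string of balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ")" with no matching "(" before it
                return False
        else:
            return False         # only "(" and ")" are terminals
    return depth == 0 and s != ""

for example in ["(()())", "()((()))", "((()())(()))()", "(()", ")("]:
    print(example, is_balanced(example))
```

The first three strings are the examples listed at the start of this
section and are accepted; the last two are rejected.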





Example:          Integers without leading zeros


Examples:  712,  +44,  -8787,  900801

Terminals:  {+, -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Non-terminals:  {S, X, D, Y, Z}

Grammar:

  S  ::=  0
  S  ::=  X
  S  ::=  +X
  S  ::=  -X

  Y  ::=  Z 
  Y  ::=  ZY
  X  ::=  D 
  X  ::=  DY

  D  ::=  1
      .
      .
      .
  D  ::=  9

  Z  ::=  0
  Z  ::=  D



S is the start symbol.

                          Derivation

                              S
                             +X
                             +DY
                             +4Y
                             +4Z
                             +4D
                             +44
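
This language is simple enough to be described by a regular expression
as well. The pattern below is an assumed equivalent of the grammar
above, written for illustration: either the single digit 0, or an
optional sign followed by a digit 1-9 and any further digits. The names
INTEGER and is_integer are hypothetical.

```python
import re

# "0" alone, or an optional sign, one digit 1-9, then any digits 0-9.
INTEGER = re.compile(r"0|[+-]?[1-9][0-9]*")

def is_integer(s: str) -> bool:
    """True if the whole string s is an integer without leading zeros."""
    return INTEGER.fullmatch(s) is not None

for example in ["712", "+44", "-8787", "900801", "0", "007", "+0"]:
    print(example, is_integer(example))
```

Note that +0 and -0 are rejected, as in the grammar: a sign may only be
followed by X, and X must begin with a digit from 1 to 9.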



Derivation Trees


A DERIVATION TREE for a sentence w in L(G), where G is a
context-free grammar, is a tree in which:

 1. Each node in the tree is a symbol from G
 2. The root is the start symbol of G
 3. If node L has children R1, R2, ... , Rk (in left-to-right
    order), then G contains the rule  L ::= R  where R = R1 R2 ... Rk
 4. If node L is a leaf, then L is a terminal symbol or the
    symbol for the empty string
 5. If the leaves of the tree are L1, L2, ... , Lm (in left-to-right
    order), then  w = L1 L2 ... Lm

A context-free grammar G is UNAMBIGUOUS if for every sentence w
in L(G), there is a unique derivation tree for w (hence a
unique canonical derivation). Otherwise G is AMBIGUOUS.
    

A given language will in general be generated by many grammars;
some may be ambiguous and some may not. An INHERENTLY AMBIGUOUS
language is one for which there does not exist an unambiguous
grammar.
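
As an illustration, the balanced-parentheses grammar given earlier
(S ::= X, X ::= (), X ::= (X), X ::= XX) is ambiguous: the sketch below
counts derivation trees and finds two for the sentence ()()(). The
function count_trees is hypothetical and is written for that one
grammar only. Because S ::= X is the only rule for S, counting trees
rooted at X is enough.

```python
from functools import lru_cache

def count_trees(w: str) -> int:
    """Number of derivation trees for w under X ::= () | (X) | XX."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # Number of ways X can derive the substring w[i:j].
        total = 0
        if w[i:j] == "()":                                # X ::= ()
            total += 1
        if j - i > 2 and w[i] == "(" and w[j - 1] == ")":
            total += count(i + 1, j - 1)                  # X ::= (X)
        for k in range(i + 1, j):                         # X ::= XX
            total += count(i, k) * count(k, j)
        return total

    return count(0, len(w)) if w else 0

print(count_trees("(())"))    # 1
print(count_trees("()()()"))  # 2
```

The two trees for ()()() differ in whether the rule X ::= XX first
splits the sentence as () followed by ()(), or as ()() followed by ().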



 Canonical Derivation                  Derivation Tree       
                                                     
           S                                 S            
          +X                               /  \         
          +DY                             +    X          
          +4Y                                 /  \
          +4Z                                D    Y       
          +4D                                |    |     
          +44                                4    Z      
                                                  |    
                                                  D    
                                                  |   
                                                  4
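

The correspondence between the two columns above can be sketched in
code. The tree is written as nested (symbol, children) pairs, with
terminals as plain strings; this representation and the function names
are assumptions made for the sketch.

```python
# The derivation tree for +44 shown above.
TREE = ("S", ["+", ("X", [("D", ["4"]),
                          ("Y", [("Z", [("D", ["4"])])])])])

def label(node) -> str:
    """Symbol at a node: a terminal string or the head of a pair."""
    return node if isinstance(node, str) else node[0]

def leaves(node) -> str:
    """Concatenate the leaves of the tree in left-to-right order."""
    if isinstance(node, str):
        return node
    return "".join(leaves(child) for child in node[1])

def leftmost_derivation(tree) -> list:
    """Read the canonical derivation off the tree by repeatedly
    replacing the leftmost non-terminal node with its children."""
    form = [tree]
    steps = ["".join(label(n) for n in form)]
    while True:
        pos = next((i for i, n in enumerate(form)
                    if not isinstance(n, str)), None)
        if pos is None:
            return steps
        form[pos:pos + 1] = form[pos][1]
        steps.append("".join(label(n) for n in form))

print(leaves(TREE))               # +44
print(leftmost_derivation(TREE))
# ['S', '+X', '+DY', '+4Y', '+4Z', '+4D', '+44']
```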



Backus-Naur Form  (BNF)

BNF is a convenient notation for describing a context-free
grammar more succinctly.


Non-terminal symbols are enclosed in angle brackets, e.g. <X>.

Terminal symbols are written without angle brackets.

The "|" symbol is used on the right hand side of rules to indicate
alternative choices.

There is no special declaration of the start symbol; by convention it is
usually the non-terminal defined by the first rule.


Example:   G = <V, v, P, S>  (v is the set of terminal symbols)  where

  V =  {X}          v =  {A, B, C}          S =  X

  P =  {X ::= A,
        X ::= B,
        X ::= C}


In BNF, we can specify G by the one rule:

          <X> ::=  A | B | C



Integers without leading zeros using BNF


   <S>  ::=  <X> | +<X> | -<X> | 0

   <X>  ::=  <D> | <D><Y>

   <Y>  ::=  <Z> | <Z><Y>

   <D>  ::=  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

   <Z>  ::=  0 | <D>
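

To show that BNF is only shorthand for a set of production rules, the
sketch below expands BNF text into one rule per alternative. It is a
minimal reader written for illustration, not a full BNF parser: it
assumes one rule per line, single-character terminals, and that "|",
"<" and ">" never occur as terminals. The name parse_bnf is
hypothetical.

```python
import re

def parse_bnf(text: str) -> dict:
    """Map each non-terminal to its list of alternatives.

    Each alternative is a list of symbols: "<Name>" for a non-terminal,
    a single character for a terminal.
    """
    rules = {}
    for line in text.splitlines():
        if "::=" not in line:
            continue                      # skip blank lines
        lhs, rhs = line.split("::=", 1)
        alternatives = [re.findall(r"<[^>]+>|[^\s<>]", alt)
                        for alt in rhs.split("|")]
        rules.setdefault(lhs.strip(), []).extend(alternatives)
    return rules

INTEGER_BNF = """
   <S>  ::=  <X> | +<X> | -<X> | 0
   <X>  ::=  <D> | <D><Y>
   <Y>  ::=  <Z> | <Z><Y>
   <D>  ::=  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
   <Z>  ::=  0 | <D>
"""

rules = parse_bnf(INTEGER_BNF)
print(rules["<S>"])   # [['<X>'], ['+', '<X>'], ['-', '<X>'], ['0']]
print(sum(len(alts) for alts in rules.values()))   # 19
```

The five BNF rules expand to 19 production rules, the same set as in
the earlier listing of this grammar once the elided rules for D are
written out.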