Arnab Chakraborty's compiler tutorial: page 6 of 11

www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

My Guestbook

Home >

... tutorials >

... ... Compiler

Techniques to write a grammar

When designing a language, one starts with a vague idea about how the programs in that language should look. It requires some effort to turn this vague idea into a concrete grammar. This is precisely what we did in the last page for the AdvLang language. Though it was a simple language the techniques that we used in the last page are actually quite general. In this page we list five important techniques to write down the grammar of a language:

In the last page we have already used the first three. The last two techniques are new. These are the basic techniques. In a later page we shall mention some advanced techniques.

Simple sequencing

To say that something should follow something else in a program just write them in that order in the production rule. For instance, in AdvLang we want the description of a location to follow the name of the location, and the list of exits to follow the description. This sequence is specified in Rule 3:


(Rule 3)  locnSpec : tok_LOCN tok_IDENT 
                nameSpec descrSpec exitList

(By the way, we have presented the rule in two lines for aesthetic reasons only. It has no other significance. You could as well write the rule in one long line.)

If we change the order of the things in the right hand side like this


(Rule 3')  locnSpec : tok_LOCN tok_IDENT 
                                 descrSpec nameSpec exitList

then the description would be required to come before the name.

As another example, consider the C language. In a C function variables must be declared before any other statement. So the rule defining a function body in C would look like


funcBody : declList statementList

Grouping

A good quality for any computer program is modularity : breaking a large piece of code into smaller, manageable chunks of codes in a logical way. A similar idea applies to writing a grammar. Consider rules 3,4 and 5 of the AdvLang grammar:


(Rule 3)  locnSpec : tok_LOCN tok_IDENT 
            nameSpec descrSpec exitList 
(Rule 4)  nameSpec : tok_NAME tok_STRING 
(Rule 5)  descrSpec : tok_DESCR tok_STRING

These could be combined into a single l..o..n..g rule:


(Rule 3+4+5) locnSpec : tok_LOCN tok_IDENT 
                                   tok_NAME tok_STRING 
                                     tok_DESCR tok_STRING 
                                         exitList

Such a long rule is difficult to read, and makes the logic of the rule obscure. So we prefer grouping some of the symbols (i.e., terminals or non-terminals) into new non-terminals, and split long rules into multiple shorter rules, like rules 3,4 and 5. In fact, even then rule 3 is rather long. We could further split it up into two shorter rules by combining tok_LOCN and tok_IDENT into a new non terminal called, say, locnHeader. Then rule 3 would break down into the following two rules:


(Rule 3a)  locnSpec : locnHeader nameSpec 
                                 descrSpec exitList
(Rule 3b) locnHeader : tok_LOCN tok_IDENT

While designing grammars, use new non-terminals liberally. They keep the grammar neat. In a sense they are like functions in C-like languages. Breaking up a long function into shorter functions with intelligent names is often a good idea.

Lists

Lists are definitely the most important part of the grammar of any language. Here are a few examples from standard languages:

Example: In C, a declaration is like
int i,j,k,l;
This contains a list of identifiers. This is a comma-separated list. Here the elements are the identifiers. Most C functions start with a list of such declarations:
int i,j; float f, g; int y;
This list has three elements, each of which is a declaration of one type.

Example: When we define a function in C we use a list of parameters:
void fun(int i, int j, float y) { ... }
The above parameter list has three elements, each element is the declaration of a single parameter. This is a comma-separated list.

To write down the BNF of a list you need at least two production rules.

Example: Let us write down the BNF for a list of numbers, e.g.,
1 2 4 45 3 2
The production rules are:
list : NUM list : list NUM
Here NUM denotes a single number. The first rule says that the list must have at least one NUM. The second rule says that given a list you can extend it by adding another NUM at its end. If we want to allow empty lists, then we can write
list : list : list NUM
Another way to achieve the same thing would be to use the following four rules:
list : list : list1 list1 : NUM list1 : list1 NUM
Here we have introduced a new non-terminal list1, which denotes a list with at least one element. The last two production rules define it as before. Now, a possibly empty list is either empty (first rule) or is a list with at least one element (second rule). In this example, the four-rule version does not have any extra advantage over its two-rule counterpart. But the next example shows why the four-rule version is sometimes more useful.

Example: A comma-separated list of numbers like
1, 2, 4, 65, 3
can be dealt with similarly. If we do not allow the list to be empty then we need two production rules:
list : NUM list : list COMMA NUM
If we allow empty lists:
list : list : list1 list1 : NUM list1 : list1 COMMA NUM
It is not possible to have a two-rule grammar for this case.

Lists often occur with various extra features: terminated by semicolons, enclosed inside parantheses, and so on. All these may be added to the basic structure of list shown in the last two examples.

Example: To make a list of numbers enclosed inside brackets like
[1 2 3 4 5]
we can use the rules:
enclList : LBRACKET list RBRACKET list : NUM list : list NUM
The last two rules are as before. The first rule adds the two brackets around the list.

Each element in the above lists is a number, which is denoted by the token NUM. It is also possible to have lists whose elements are non-terminals. In fact, the AdvLang grammar itself furnishes two examples of this: locnSpecList is a list of the non-terminal locnSpec, and exitList is a list of the non-terminal exit. To write the grammar of a list with non-terminal elements you have to do two things:

The order does not matter to bison. You may define the list first, and then define the typical element, or the other way around.

Example: There is a language called Matlab that allows the user to specify a matrix as follows. If the matrix is

1 2 3

4 5 6

then in Matlab we write
[1, 2, 3; 4, 5, 6]
To write the grammar of a matrix specification we first identify the lists. There are two types of lists in use here. First, each row is a comma-separated list of numbers; second, the matrix is a semicolon-separated list of rows. The grammar for a row is
row : NUM row : row COMMA NUM
Now, a matrix is a semi-colon separated list of rows:
matrix : row matrix : matrix SEMICOLON row
Notice, however, that this grammar does not check if the different rows in a matrix are of the same length or not. It is difficult to check this inside the grammar, and so such checks are left to the semantic analysis phase of the compiler. We shall discuss this later.

Nesting

Nesting means having some structure inside another structure of the same type. For example, an if-block in C can contain another if-block, which in its turn can contain another, and so on.

Example: In Lisp we work with lists inside parantheses like this:
(a b (c d))
Here a, b and (c d) are three elements of the list. The last element is itself another list.
To write down the grammar for nesting we have to first identify the non-terminal that is to be nestable. In the Lisp example we can call this non-terminal parList (for paranthesised list). Next we have to write down its grammar ignoring nesting. If we ignore nesting then the grammar for the Lisp example becomes
parList : ( list ) list : /*empty*/ list : list element element : LETTER
To allow nesting we have to simply let parList be considered as an element:
element : parList

A slightly more involved example is the nesting of if-statements in C. To keep the grammar short we shall not allow any "else" part.

Example: In the following C code snippet we have an if-statement nested inside another. Our aim is to write a grammar that allows such nestings to arbitrary level.
if(a>2) { a = 3; if(b>3) { b = 8; c = 6; } d = 7; }
We start by identifying the nestable non-terminal, which we call ifStmt. Ignoring nesting for the time being we have the grammar
ifStmt : head body head : if ( cond ) cond : VAR > NUM body : { stmtList } stmtList : /*empty*/ stmtList : stmtList stmt stmt : VAR = NUM
Of course, we have simplified things somewhat than in true C. For example, we have made the braces around the body compulsory. Also, we are allowing only the simplest assignment statements and conditions. But these modifications do not alter the main principle that we are illustrating. Now, to introduce nesting, we shall just allow ifStmt to be a special case of stmt:
stmt : ifStmt

Options and alternatives

Many languages allow the user to drop certain specifications. Any dropped specification is automatically supplied by the compiler with default values during semantic analysis.

Example: In Matlab a matrix is specified like
[1, 2, 3; 4, 5, 6]
However, the commas are optional and may be dropped as in
[1 2 3; 4 5 6]
The grammar for this is just like what was shown before, except that the token COMMA is now replaced by a non-terminal optComma which is defined by the rules:
optComma : optComma : COMMA
This non-terminal denotes an optional comma, which may be absent (first rule) or present (second rule). The entire grammar for matrix now becomes
row : NUM row : row optComma NUM optComma : optComma : COMMA matrix : row matrix : matrix SEMICOLON row

Optional things are special cases of a more general concept called alternatives.

Example: A typical C declaration list is of the form
int i, j;
It starts with the name of a data type followed by a list of variable names (identifiers.) We have more than one alternatives for the data type: it may be int or float, or char or double (or more complicated things like structures etc, but to keep the example simple we shall ignore those possibilities.) When a non-terminal has many alternative definitions we use one production rule for each definition. For example, the non-terminal dataType has four rules:
dataType : INT dataType : FLOAT dataType : CHAR dataType : DOUBLE
Here INT, FLOAT, CHAR and DOUBLE denote, respectively, the tokens corresponding to int, float, char and double. Now we can have the rule
declaration : dataType idList SEMICOLON
where idList is a comma-separated list of identifiers:
idList : ID idList : idList COMMA ID

Exercises

Exercise 6.1: Again consider a list of numbers enclosed inside brackets like
[1 2 3 4 5]
Is the following grammar appropriate for this?
list : LBRACKET NUM RBRACKET list : LBRACKET list NUM RBRACKET

[Click here for solution.]

Exercise 6.2: We have already seen that Matlab writes matrices as a semicolon-separated list of rows, and each row is a comma-separated list of numbers. We have also seen that the commas are optional. Actually, the semicolons are also optional: you may use newlines in place of semicolons. Thus, both the following mean the same thing in Matlab:
[1 2 3 ; 4 5 6] [1 2 3 4 5 6]
Write down a grammar for Matlab matrices that allows this. Assume that the token NL denotes a newline.

[Click here for solution.]

Exercise 6.3: Matlab also allows the empty matrix denoted by
[]
as well as a matrix with empty rows:
[;;]
This denotes a matrix with three empty rows. Update your grammar from the last exercise to allow this.

[Click here for solution.]

Exercise 6.4: Pascal is a language very much like C. Consider the following statement list in C:
{ temp = a; a = b; b = temp; }
The corresponding Pascal code is
begin temp := a; a := b; b := temp end
Notice that there is no semicolon after the last statement (b := temp). Both the code snippets give a list of statements (enclosed inside curly braces for C, and inside begin and end for Pascal). Based on these two lists answer the following questions:

Is the C list a semicolon-separated list of statements?
Is the Pascal list a semicolon-separated list of statements?
Write the grammar that will accept a list of statements in C. Here you can use the non-terminal stmt to denote the body of a statement without the semicolon.
Do the same for Pascal.
Suppose someone types a semicolon after the last statement in a Pascal list. Will your Pascal grammar accept this? (Actual Pascal grammar does.)

[Click here for solution.]

Exercise 6.5: Why is the following grammar not correct for comma-separated lists of numbers (empty list allowed)?
list : list : list COMMA NUM
Also discuss
list : list : NUM list : list COMMA NUM

Exercise 6.6: The LISP language allows the user to write arbitrary lists where each element is either a number or another list, e.g.,
(1 2 (2 3 (0) 3 ( ) ) 34).
Each list is enclosed in parantheses. Empty lists are allowed and are denoted by ( ). Write down the grammar for this in BNF.

Exercise 6.7: Certain dialects of C (e.g., Turbo C) allow function definitions like the following:
void fun(int i, int j=2, float f=0.7, float g) { ... }
Each element in the parameter list(parmList) is a parameter specification (parmSpec). Now there are two alternative form of parmSpec. The first alternative is like
int i
where a data type is followed by a variable name. The other alternative is like
int j=2
where we specify a default value. Write down the grammar of parmList.

Exercise 6.8: Write the grammar of a semicolon-terminated list of numbers. The list must have at least one number.

Exercise 6.9: What would the second BNF rule be if we choose to append the new locnSpec to the head of the old locnSpecList?

Exercise 6.10: Write down the two BNF rules defining exitList. Remember that some locations may not have any exit at all, so the game effectively terminates if the player accidentally enters such a location (through a one-way entrance.)

hit counter