|
www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com |
|
|
Techniques to write a grammar
When designing a language, one starts with a vague idea about how the
programs in that language should look. It requires some effort to turn
this vague idea into a concrete grammar. This is precisely what we did in
the last page for the AdvLang language. Though it was a simple
language the techniques that we used in the last page are actually quite
general. In this page we list five important
techniques to write down the grammar of a language:
Simple sequencing
Grouping
Lists
Nesting
Options and alternatives
In the last page we have already used the first three. The last two
techniques are new. These are the basic techniques.
In a later page we shall mention some advanced techniques.
Simple sequencing
To say that something should follow something else in a program just write
them in that order in the production rule. For instance, in AdvLang
we want the description of a location to follow the name of the location,
and the list of exits to follow the description. This sequence is
specified in Rule 3:
(Rule 3) locnSpec : tok_LOCN tok_IDENT
nameSpec descrSpec exitList
(By the way, we have presented the rule in two lines for aesthetic
reasons only. It has no other significance. You could as well write the rule in
one long line.)
If we change the order of the things in the right hand side like this
(Rule 3') locnSpec : tok_LOCN tok_IDENT
descrSpec nameSpec exitList
then the description would be required to come before the name.
As another example, consider the C language. In a C function variables
must be declared before any other statement. So the rule defining a
function body in C would look like
funcBody : declList statementList
Grouping
A good quality for any computer program is modularity :
breaking a large piece of code into smaller, manageable chunks of codes in
a logical way. A similar idea applies to writing a grammar. Consider rules
3,4 and 5
of the AdvLang grammar:
(Rule 3) locnSpec : tok_LOCN tok_IDENT
nameSpec descrSpec exitList
(Rule 4) nameSpec : tok_NAME tok_STRING
(Rule 5) descrSpec : tok_DESCR tok_STRING
These could be combined into a
single l..o..n..g rule:
(Rule 3+4+5) locnSpec : tok_LOCN tok_IDENT
tok_NAME tok_STRING
tok_DESCR tok_STRING
exitList
Such a long rule is difficult to read, and makes the logic of the rule
obscure. So we prefer grouping some of the symbols (i.e., terminals
or non-terminals) into new non-terminals, and split long rules into
multiple shorter rules, like rules 3,4 and 5. In fact, even then rule 3 is
rather long. We could further split it up into two shorter rules by
combining tok_LOCN and tok_IDENT into a new non terminal
called, say, locnHeader. Then rule 3 would break down into the
following two rules:
(Rule 3a) locnSpec : locnHeader nameSpec
descrSpec exitList
(Rule 3b) locnHeader : tok_LOCN tok_IDENT
While designing grammars, use new non-terminals liberally. They keep the
grammar neat. In a sense they are like functions in C-like languages.
Breaking up a long function into shorter functions with intelligent names
is often a good idea.
Lists
Lists are definitely the most important part of the grammar of any
language. Here are a few examples from standard languages:
| Example:
In C, a declaration is like
int i,j,k,l;
This contains a list of identifiers. This is a comma-separated list.
Here the elements are the identifiers.
Most C functions start with a list of such
declarations:
int i,j;
float f, g;
int y;
This list has three elements, each of which is a declaration
of one type.
|
| Example:
When we define a function in C we use a list of parameters:
void fun(int i, int j, float y) { ... }
The above parameter list has three elements, each element is the
declaration of a single parameter. This is a comma-separated list.
|
To write down the BNF of a list you need at least two production rules.
| Example:
Let us write down the BNF for a list of numbers, e.g.,
1 2 4 45 3 2
The production rules are:
list : NUM
list : list NUM
Here NUM denotes a single number. The first rule says that the list must
have at least one NUM. The second rule says that given a list you
can extend it by adding another NUM at its end.
If we want to allow empty lists, then we can write
list :
list : list NUM
Another way to achieve the same thing would be to use the
following four rules:
list :
list : list1
list1 : NUM
list1 : list1 NUM
Here we have introduced a new non-terminal list1, which denotes a list
with at least one element. The last two production rules define it as
before. Now, a possibly empty list is either empty (first rule) or is a list with at
least one element (second rule). In this example, the four-rule version
does not have any extra advantage over its two-rule counterpart. But the
next example shows why the four-rule version is sometimes more useful.
|
| Example:
A comma-separated list of numbers like
1, 2, 4, 65, 3
can be dealt with similarly. If we do not allow the list to be empty then
we need two production rules:
list : NUM
list : list COMMA NUM
If we allow empty lists:
list :
list : list1
list1 : NUM
list1 : list1 COMMA NUM
It is not possible to have a two-rule grammar for this case.
|
Lists often occur with various extra features: terminated by semicolons,
enclosed inside parantheses, and so on. All these may be added to the
basic structure of list shown in the last two examples.
| Example:
To make a list of numbers enclosed inside brackets like
[1 2 3 4 5]
we can use the rules:
enclList : LBRACKET list RBRACKET
list : NUM
list : list NUM
The last two rules are as before. The first rule adds the two brackets
around the list.
|
Each element in the above lists is a number, which is denoted by the token
NUM. It is also possible to have lists whose elements are
non-terminals. In fact, the AdvLang grammar itself furnishes two examples
of this: locnSpecList is a list of the non-terminal locnSpec, and exitList
is a list of the non-terminal exit. To write the grammar of a
list with
non-terminal elements you have to do two things:
write down the rule(s) defining a typical
element, and
define the list in terms of the non-terminal.
The
order does not matter to bison. You may define the list first, and
then define the typical element, or the other way around.
| Example:
There is a language called Matlab that allows the user to specify a matrix
as follows. If the matrix is
then in Matlab we write
[1, 2, 3; 4, 5, 6]
To write the grammar of a matrix specification we first identify the
lists. There are two types of lists in use here. First, each row is a
comma-separated list of numbers; second, the matrix is a
semicolon-separated list of rows. The grammar for a row is
row : NUM
row : row COMMA NUM
Now, a matrix is a semi-colon separated list of rows:
matrix : row
matrix : matrix SEMICOLON row
Notice, however, that this grammar does not check if the different rows in
a matrix are of the same length or not. It is difficult to check this
inside the grammar, and so such checks are left to the semantic analysis
phase of the compiler. We shall discuss this later.
|
Nesting
Nesting means having some structure inside another structure of the
same type. For example, an if-block in C can contain another
if-block, which in its turn can contain another, and so on.
| Example:
In Lisp we
work with lists inside parantheses like this:
(a b (c d))
Here a , b and (c d) are three
elements of the list. The last element is itself another
list.
To write down the grammar for nesting we have to first identify
the non-terminal that is to be nestable. In the Lisp example we
can call this non-terminal parList (for paranthesised
list). Next we have to write down its grammar ignoring
nesting. If we ignore nesting then the grammar for the Lisp
example becomes
parList : ( list )
list : /*empty*/
list : list element
element : LETTER
To allow nesting we have to simply let parList be
considered as an element:
element : parList
|
A slightly more involved example is the nesting of
if -statements in C. To keep the grammar short we
shall not allow any "else" part.
| Example:
In the following C code snippet we have an
if -statement nested inside another. Our aim is to
write a grammar that allows such nestings to arbitrary level.
if(a>2) {
a = 3;
if(b>3) {
b = 8;
c = 6;
}
d = 7;
}
We start by identifying the nestable non-terminal, which we call
ifStmt. Ignoring nesting for the time being we have the
grammar
ifStmt : head body
head : if ( cond )
cond : VAR > NUM
body : { stmtList }
stmtList : /*empty*/
stmtList : stmtList stmt
stmt : VAR = NUM
Of course, we have simplified things somewhat than in true C. For
example, we have made the braces around the body
compulsory. Also, we are allowing only the simplest assignment
statements and conditions. But
these modifications do not alter the main principle that we are
illustrating. Now, to introduce nesting, we shall just
allow ifStmt to be a special case of stmt:
stmt : ifStmt
|
Options and alternatives
Many languages allow the user to drop certain specifications. Any dropped
specification is automatically supplied by the compiler with default
values during semantic analysis.
| Example:
In Matlab a matrix is specified like
[1, 2, 3; 4, 5, 6]
However, the commas are optional and may be dropped as in
[1 2 3; 4 5 6]
The grammar for this is just like what was shown before, except that the
token COMMA is now replaced by a non-terminal optComma
which is defined by the rules:
optComma :
optComma : COMMA
This non-terminal denotes an optional comma, which may be absent (first
rule) or present (second rule). The entire grammar for matrix now becomes
row : NUM
row : row optComma NUM
optComma :
optComma : COMMA
matrix : row
matrix : matrix SEMICOLON row
|
Optional things are special cases of a more general concept called
alternatives.
| Example:
A typical C declaration list is of the form
int i, j;
It starts with the name of a data type followed by a list of variable
names (identifiers.) We have more than one alternatives for the data type:
it may be int or float, or char or double (or more complicated things like
structures etc, but to keep the example simple we shall ignore those
possibilities.) When a non-terminal has many alternative definitions we
use one production rule for each definition. For example, the non-terminal
dataType has four rules:
dataType : INT
dataType : FLOAT
dataType : CHAR
dataType : DOUBLE
Here INT, FLOAT, CHAR and DOUBLE denote,
respectively, the tokens
corresponding to int, float, char and
double. Now we can have the rule
declaration : dataType idList SEMICOLON
where idList is a comma-separated list of identifiers:
idList : ID
idList : idList COMMA ID
|
Exercises
| Exercise 6.1:
Again consider a list of numbers enclosed inside brackets like
[1 2 3 4 5]
Is the following grammar appropriate for this?
list : LBRACKET NUM RBRACKET
list : LBRACKET list NUM RBRACKET
|
| [Click here for solution.] |
| Exercise 6.2:
We have already seen that Matlab writes matrices as a semicolon-separated
list of rows, and each row is a comma-separated list of numbers. We have
also seen that the commas are optional. Actually, the semicolons are also
optional: you may use newlines in place of semicolons. Thus, both the
following mean the same thing in Matlab:
[1 2 3 ; 4 5 6]
[1 2 3
4 5 6]
Write down a grammar for Matlab matrices that allows this. Assume that the
token NL denotes a newline.
|
| [Click here for solution.] |
| Exercise 6.3:
Matlab also allows the empty matrix denoted by
[]
as well as a matrix with empty rows:
[;;]
This denotes a matrix with three empty rows. Update your grammar from the
last exercise to allow this.
|
| [Click here for solution.] |
| Exercise 6.4:
Pascal is a language very much like C. Consider the following statement
list in C:
{
temp = a;
a = b;
b = temp;
}
The corresponding Pascal code is
begin
temp := a;
a := b;
b := temp
end
Notice that there is no semicolon after the last statement (b :=
temp). Both the code snippets give a list of statements (enclosed
inside curly braces for C, and inside begin and end for
Pascal). Based on these two lists answer the following
questions:
Is the C list a semicolon-separated list of statements?
Is the Pascal list a semicolon-separated list of statements?
Write the grammar that will accept a list of statements in C. Here
you can use the non-terminal stmt to denote the body of a
statement without the semicolon.
Do the same for Pascal.
Suppose someone types a semicolon after the last statement in a
Pascal list. Will your Pascal grammar accept this? (Actual Pascal grammar
does.)
|
| [Click here for solution.] |
| Exercise 6.5:
Why is the following grammar not correct for comma-separated lists of
numbers (empty list allowed)?
list :
list : list COMMA NUM
Also discuss
list :
list : NUM
list : list COMMA NUM
|
| Exercise 6.6:
The LISP language allows the user to write arbitrary lists where each
element is either a number or another list, e.g.,
(1 2 (2 3 (0) 3 ( ) ) 34).
Each list is enclosed in parantheses. Empty lists are allowed and are
denoted by ( ). Write down the grammar for this in BNF.
|
| Exercise 6.7:
Certain dialects of C (e.g., Turbo C) allow function definitions
like the following:
void fun(int i, int j=2, float f=0.7, float g) { ... }
Each element in the parameter list(parmList) is a parameter specification
(parmSpec). Now there are two alternative form of parmSpec. The
first alternative is like
int i
where a data type is followed by a variable name. The other alternative is
like
int j=2
where we specify a default value. Write down the grammar of parmList.
|
| Exercise 6.8:
Write the grammar of a semicolon-terminated list of numbers. The list must
have at least one number.
|
| Exercise 6.9:
What would the second BNF rule be
if we choose to append the new locnSpec to the head
of the old locnSpecList?
|
| Exercise 6.10:
Write down the two BNF rules defining exitList.
Remember that some locations may not have any exit at all, so the game
effectively terminates if the player accidentally enters such a
location (through a one-way entrance.)
|
|