Derivation of a program
For the purpose of syntax analysis a program is just a sequence of
tokens. A grammar of a language is a way to
specify which sequences are valid programs. There are two aspects to this:
first, every correct program must fit the grammar, and second, no
incorrect program should fit it.
Example:
Consider the language of comma-separated lists of numbers (the empty list is allowed).
Someone has proposed the following grammar for this language:
list :
list : NUM
list : list COMMA NUM
Any comma-separated list fits this grammar. However, the grammar is still not
correct, because it also accepts the following as a valid list!
, 3
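To see why, number the three rules 1, 2 and 3 from top to bottom. The empty
rule 1 lets the initial list disappear, so the following derivation produces
a list beginning with a comma:
list --> list COMMA NUM (rule 3)
     --> COMMA NUM (rule 1)
Here the NUM is the token for the number 3.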
Given a grammar and a program (as a sequence of tokens) it is important to
be able to
check whether the program fits the grammar. This is like
hand-compiling the program. There are two standard ways to do this: parse
trees and rightmost derivations.
Parse trees
Consider the following grammar for comma-separated lists of numbers:
list : NUM
list : list COMMA NUM
We want to check if the list
2, 3, 4
fits this grammar or not. After lexical analysis this program becomes
NUM COMMA NUM COMMA NUM
This sequence of terminals has the parse tree shown below.
[Figure: Parse tree]
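In text form, the same tree may be sketched as follows:
            list
           /  |  \
       list COMMA NUM
      /  |  \
  list COMMA NUM
    |
   NUM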
Let us explain its meaning. A parse tree shows the original program (as a sequence of
tokens) at the
bottom and the starting non-terminal at the top. The production rules are
applied step by step to move up the tree. The above tree, for instance, is
constructed in three steps:
First, we recognise the leftmost NUM as a list (applying the first
production rule), producing diagram (a) shown below. Then we apply the
second production rule to arrive at (b). Finally we apply the second rule
again to reach (c), which is the final parse tree.
[Figure: Parse tree construction steps (a), (b) and (c)]
Parse trees are sometimes difficult to draw by hand. Another way to
represent the same information is by using derivations. Here we start from
the starting non-terminal (list, in this case) and keep on expanding the
rightmost non-terminal at each step until we arrive at the program.
list --> list COMMA NUM (rule 2)
--> list COMMA NUM COMMA NUM (rule 2)
--> NUM COMMA NUM COMMA NUM (rule 1)
Of course, there is nothing profound about expanding the rightmost
non-terminal. You could just as well expand any other non-terminal. However,
if you always expand the rightmost (or always the leftmost) non-terminal
then you will get exactly one derivation for each parse tree. The importance
of this will become clear if we work with a program and grammar that admit
more than one parse tree. Such a grammar is called an ambiguous grammar.
The next example presents an ambiguous grammar for lists.
Example:
Again consider lists of numbers. This time we shall use a different
grammar.
list : NUM
list : list COMMA list
The first rule says that the smallest list consists of a single
number. The second rule says that a longer list may be made by adding one
list after another with a comma between them. This is a correct grammar in the sense that all lists
(and nothing else) fit into it. But it is ambiguous. To see this let us
parse the following list:
1, 2, 3
Each of the following is a valid parse tree.
[Figure: Two different parse trees for the same list]
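Equivalently, the ambiguity shows up in derivations: the same token sequence
now has two different rightmost derivations, one for each parse tree. The
one corresponding to the left tree is
list --> list COMMA list            (rule 2)
     --> list COMMA NUM             (rule 1)
     --> list COMMA list COMMA NUM  (rule 2)
     --> list COMMA NUM COMMA NUM   (rule 1)
     --> NUM COMMA NUM COMMA NUM    (rule 1)
while the right tree corresponds to
list --> list COMMA list            (rule 2)
     --> list COMMA list COMMA list (rule 2)
     --> list COMMA list COMMA NUM  (rule 1)
     --> list COMMA NUM COMMA NUM   (rule 1)
     --> NUM COMMA NUM COMMA NUM    (rule 1)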
Ambiguous grammars are bad. To see why, notice that parsing a given
program is basically grouping tokens into non-terminals. The parse tree
presents the order in which the grouping is to be done. For instance, the
left parse tree above implies the grouping:
(((1) , (2)), (3))
while the right parse tree corresponds to
((1) , ((2), (3)))
The order of grouping controls the meaning of a program. This is true even
in human languages like English. Consider the sentence
I saw a girl with a telescope
This can have two different meanings depending on groupings. If I write
(I saw a girl) with a telescope
then this means that "I saw a girl, and that remarkable feat was achieved
by means of a telescope." If, on the other hand, I write
I saw (a girl with a telescope)
then the meaning is: "There was a girl with a telescope, and I saw her."
The fact that the same sentence can be interpreted in two different ways
means that English grammar is ambiguous. However, this ambiguity is not a
problem in our everyday lives, because the meaning is usually clear from
the context. For a computer language, on the other hand, an ambiguous
grammar can cause a lot of problems, and so we shall always avoid ambiguous
grammars.
The compiler-writing software that we shall use has built-in checks
against ambiguous grammars.
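For instance, if the ambiguous list grammar above is written out as a Bison
input file along the following lines (a rough sketch; the %token line and
the %% separator are the usual Bison boilerplate), then running Bison on it
produces a warning about a shift/reduce conflict (the exact wording depends
on the Bison version):
%token NUM COMMA
%%
list : NUM
     | list COMMA list
     ;
The conflict arises exactly where the parser, having seen list COMMA list
with another COMMA coming up, cannot decide whether to group what it has
read already or to keep reading.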
Top-down and bottom-up
When we express a parse tree as a (rightmost) derivation, we are writing
the tree in a top-down manner. We start with the non-terminal at the top
of the tree, and then gradually work our way down to the leaf nodes
containing the tokens. Most compilers proceed in the opposite
direction: bottom-up.
Example:
Consider again the derivation that we saw earlier.
list --> list COMMA NUM (rule 2)
--> list COMMA NUM COMMA NUM (rule 2)
--> NUM COMMA NUM COMMA NUM (rule 1)
Here is its bottom-up version:
NUM COMMA NUM COMMA NUM
list COMMA NUM COMMA NUM   (reduce by rule 1)
list COMMA NUM             (reduce by rule 2)
list                       (reduce by rule 2)
These are precisely the steps of the rightmost derivation, read in reverse
order.
This is the order in which compilers parse a program. It is
important to be able to generate this bottom-up sequence by "hand parsing"
directly, rather than by first finding the rightmost derivation and then
reversing it. Imagine the program presented to you as a sequence of
tokens. You are reading the tokens one by one from left to
right. After reading each token you have to check if it is possible
to apply some production rule to the tokens read so far.
To make this idea into an algorithm we introduce two steps called
shift and reduce. In a shift step we read the next
token:
• NUM COMMA NUM   -->   NUM • COMMA NUM
Here the • marks how much we have already read. We shall call it the
"dot". Everything to its left has been read, and everything to its right is
unread. In a shift move we read the next token after the dot, i.e., the dot
moves one step to the right. During parsing there may be
both tokens and non-terminals to the left of the dot. However, only
tokens can follow the dot. Thus, the following is allowed:
list COMMA • NUM
but not
NUM COMMA • list
Here the list after the • is not allowed.
The other move is called a reduce move. Here we apply a production
rule to combine some tokens and/or non-terminals into a single
non-terminal, like this:
NUM • COMMA NUM   --1-->   list • COMMA NUM
Here we have reduced the leftmost NUM to list by rule 1. The
number of the production rule used is written on the arrow. A reduce step
does not change the position of the dot. Also, there
is another constraint. The new non-terminal introduced in a reduce step
must occur immediately to the left of the dot. The following, for
example, is not a valid reduce step:
NUM COMMA • NUM   --1-->   list COMMA • NUM     (This is wrong!)
The trouble is that there is a COMMA between the newly created
list and the dot.
Example:
If we apply shift and reduce steps to the program
NUM COMMA NUM
we get the following:
• NUM COMMA NUM
NUM • COMMA NUM      (shift)
list • COMMA NUM     (reduce by rule 1)
list COMMA • NUM     (shift)
list COMMA NUM •     (shift)
list •               (reduce by rule 2)
How do we know when to apply a shift and when a reduce? And which rule to
reduce by in the latter case? There are various algorithms to decide
this. Bison, for example, uses an algorithm called LALR(1). We shall go
into these details later. For now let us notice the following.
A shift move is always possible until the dot reaches the end (i.e., the
entire program has been read). After that no more shifts are possible.
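As a further illustration, here is the shift-reduce parse of the longer
program NUM COMMA NUM COMMA NUM that we parsed earlier (rules 1 and 2 as
before). Notice that the reduce steps occur in exactly the order of the
bottom-up sequence shown above.
• NUM COMMA NUM COMMA NUM
NUM • COMMA NUM COMMA NUM    (shift)
list • COMMA NUM COMMA NUM   (reduce by rule 1)
list COMMA • NUM COMMA NUM   (shift)
list COMMA NUM • COMMA NUM   (shift)
list • COMMA NUM             (reduce by rule 2)
list COMMA • NUM             (shift)
list COMMA NUM •             (shift)
list •                       (reduce by rule 2)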