Derivation of a program
For the purpose of syntax analysis a program is just a sequence of
tokens. A grammar of a language is a way to
specify which sequences are valid programs. There are two aspects to this:
first, every correct program must fit the grammar, and second, no
incorrect program should fit it.
Example:
Consider the language of comma-separated lists of numbers (the empty list is allowed).
Someone has proposed the following grammar for this language:
list :
list : NUM
list : list COMMA NUM
Any comma-separated list fits this grammar. However, the grammar is still not
correct, because it also accepts the following as a valid list!
, 3
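To see why, number the three rules 1, 2 and 3 from top to bottom. The empty
rule 1 lets the initial list disappear, so the following derivation produces
a list beginning with a comma:
list --> list COMMA NUM (rule 3)
     --> COMMA NUM (rule 1)
Here the NUM is the token for the number 3.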
Given a grammar and a program (as a sequence of tokens) it is important to
be able to
check whether the program fits the grammar. This is like
hand-compiling the program. There are two standard ways to do this: parse
trees and rightmost derivations.
Parse trees
Consider the following grammar for comma-separated lists of numbers:
list : NUM
list : list COMMA NUM
We want to check if the list
2, 3, 4
fits this grammar or not. After lexical analysis this program becomes
NUM COMMA NUM COMMA NUM
This sequence of terminals has the parse tree shown below.
[Figure: Parse tree]
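In text form, the same tree may be sketched as follows:
            list
           /  |  \
       list COMMA NUM
      /  |  \
  list COMMA NUM
    |
   NUM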
Let us explain its meaning. A parse tree shows the original program (as a sequence of
tokens) at the
bottom and the starting non-terminal at the top. The production rules are
applied step by step to move up the tree. The above tree, for instance, is
constructed in three steps:
First, we recognise the leftmost NUM as a list (applying the first
production rule), producing diagram (a) shown below. Then we apply the
second production rule to arrive at (b). Finally we apply the second rule
again to reach (c), which is the final parse tree.
[Figure: Parse tree construction steps (a), (b) and (c)]
Parse trees are sometimes difficult to draw by hand. Another way to
represent the same information is by using derivations. Here we start from
the starting non-terminal (list, in this case) and keep on expanding the
rightmost non-terminal at each step until we arrive at the program.
list --> list COMMA NUM (rule 2)
--> list COMMA NUM COMMA NUM (rule 2)
--> NUM COMMA NUM COMMA NUM (rule 1)
Of course, there is nothing profound about expanding the rightmost
non-terminal. You could just as well expand any other non-terminal. However,
if you always expand the rightmost (or always the leftmost) non-terminal
then you will get exactly one derivation for each parse tree. The importance
of this will become clear if we work with a program and grammar that admit
more than one parse tree. Such a grammar is called an ambiguous grammar.
The next example presents an ambiguous grammar for lists.
Example:
Again consider lists of numbers. This time we shall use a different
grammar.
list : NUM
list : list COMMA list
The first rule says that the smallest list consists of a single
number. The second rule says that a longer list may be made by adding one
list after another with a comma between them. This is a correct grammar in the sense that all lists
(and nothing else) fit into it. But it is ambiguous. To see this let us
parse the following list:
1, 2, 3
Each of the following is a valid parse tree.
[Figure: Two different parse trees for the same list]
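Equivalently, the ambiguity shows up in derivations: the same token sequence
now has two different rightmost derivations, one for each parse tree. The
one corresponding to the left tree is
list --> list COMMA list            (rule 2)
     --> list COMMA NUM             (rule 1)
     --> list COMMA list COMMA NUM  (rule 2)
     --> list COMMA NUM COMMA NUM   (rule 1)
     --> NUM COMMA NUM COMMA NUM    (rule 1)
while the right tree corresponds to
list --> list COMMA list            (rule 2)
     --> list COMMA list COMMA list (rule 2)
     --> list COMMA list COMMA NUM  (rule 1)
     --> list COMMA NUM COMMA NUM   (rule 1)
     --> NUM COMMA NUM COMMA NUM    (rule 1)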
Ambiguous grammars are bad. To see why, notice that parsing a given
program is basically grouping tokens into non-terminals. The parse tree
presents the order in which the grouping is to be done. For instance, the
left parse tree above implies the grouping:
(((1) , (2)), (3))
while the right parse tree corresponds to
((1) , ((2), (3)))
The order of grouping controls the meaning of a program. This is true even
in human languages like English. Consider the sentence
I saw a girl with a telescope
This can have two different meanings depending on groupings. If I write
(I saw a girl) with a telescope
then this means that "I saw a girl, and that remarkable feat was achieved
by means of a telescope." If, on the other hand, I write
I saw (a girl with a telescope)
then the meaning is: "There was a girl with a telescope, and I saw her."
The fact that the same sentence can be interpreted in two different ways
means that English grammar is ambiguous. However, this ambiguity is not a
problem in our everyday lives, because the meaning is usually clear from
the context. For a computer language, on the other hand, an ambiguous
grammar can cause a lot of problems, and so we shall always avoid ambiguous
grammars.
The compiler-writing software that we shall use has built-in checks
against ambiguous grammars.
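For instance, if the ambiguous list grammar above is written out as a Bison
input file along the following lines (a rough sketch; the %token line and
the %% separator are the usual Bison boilerplate), then running Bison on it
produces a warning about a shift/reduce conflict (the exact wording depends
on the Bison version):
%token NUM COMMA
%%
list : NUM
     | list COMMA list
     ;
The conflict arises exactly where the parser, having seen list COMMA list
with another COMMA coming up, cannot decide whether to group what it has
read already or to keep reading.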
Top-down and bottom-up
When we express a parse tree as a (rightmost) derivation, we are writing
the tree in a top-down manner. We start with the non-terminal at the top
of the tree, and then gradually work our way down to the leaf nodes
containing the tokens. Most compilers proceed in the opposite
direction: bottom-up.
Example:
Consider again the derivation that we saw earlier.
list --> list COMMA NUM (rule 2)
--> list COMMA NUM COMMA NUM (rule 2)
--> NUM COMMA NUM COMMA NUM (rule 1)
Here is its bottom-up version:
NUM COMMA NUM COMMA NUM
list COMMA NUM COMMA NUM   (reduce by rule 1)
list COMMA NUM             (reduce by rule 2)
list                       (reduce by rule 2)
These are precisely the steps of the rightmost derivation, read in reverse
order.
This is the order in which compilers parse a program. It is
important to be able to generate this bottom-up sequence by "hand parsing"
directly, rather than by first finding the rightmost derivation and then
reversing it. Imagine the program presented to you as a sequence of
tokens. You are reading the tokens one by one from left to
right. After reading each token you have to check if it is possible
to apply some production rule to the tokens read so far.
To make this idea into an algorithm we introduce two steps called
shift and reduce. In a shift step we read the next
token:
• NUM COMMA NUM   -->   NUM • COMMA NUM
Here the • marks how much we have already read. We shall call it the
"dot". Everything to its left has been read, and everything to its right is
unread. In a shift move we read the next token after the dot, i.e., the dot
moves one step to the right. During parsing there may be
both tokens and non-terminals to the left of the dot. However, only
tokens can follow the dot. Thus, the following is allowed:
list COMMA • NUM
but not
NUM COMMA • list
Here the list after the • is not allowed.
The other move is called a reduce move. Here we apply a production
rule to combine some tokens and/or non-terminals into a single
non-terminal, like this:
NUM • COMMA NUM   --1-->   list • COMMA NUM
Here we have reduced the leftmost NUM to list by rule 1. The
number of the production rule used is written on the arrow. A reduce step
does not change the position of the dot. Also, there
is another constraint. The new non-terminal introduced in a reduce step
must occur immediately to the left of the dot. The following, for
example, is not a valid reduce step:
NUM COMMA • NUM   --1-->   list COMMA • NUM     (This is wrong!)
The trouble is that there is a COMMA between the newly created
list and the dot.
Example:
If we apply shift and reduce steps to the program
NUM COMMA NUM
we get the following:
• NUM COMMA NUM
NUM • COMMA NUM      (shift)
list • COMMA NUM     (reduce by rule 1)
list COMMA • NUM     (shift)
list COMMA NUM •     (shift)
list •               (reduce by rule 2)
How do we know when to apply a shift and when a reduce? And which rule to
reduce by in the latter case? There are various algorithms to decide
this. Bison, for example, uses an algorithm called LALR(1). We shall go
into these details later. For now let us notice the following.
A shift move is always possible until the dot reaches the end (i.e., the
entire program has been read). After that no more shifts are possible.
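As a further illustration, here is the shift-reduce parse of the longer
program NUM COMMA NUM COMMA NUM that we parsed earlier (rules 1 and 2 as
before). Notice that the reduce steps occur in exactly the order of the
bottom-up sequence shown above.
• NUM COMMA NUM COMMA NUM
NUM • COMMA NUM COMMA NUM    (shift)
list • COMMA NUM COMMA NUM   (reduce by rule 1)
list COMMA • NUM COMMA NUM   (shift)
list COMMA NUM • COMMA NUM   (shift)
list • COMMA NUM             (reduce by rule 2)
list COMMA • NUM             (shift)
list COMMA NUM •             (shift)
list •                       (reduce by rule 2)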