Arnab Chakraborty's advanced compiler tutorial: page 6 of 3

www.angelfire.com/dragon/letstry `cwave04 at yahoo dot com`
Home > ... tutorials > ... ... Compiler (advanced)

Is ambiguity the only enemy?

All our conflicts so far originated from ambiguities (either just in the grammar, or present intrinsically in the language itself). However, it is possible to have conflicts even when the grammar is free of ambiguity! Before we see an example of this, a little bit of theory may help.

Bison is a tool that takes a grammar as input and produces a compiler as output. If the grammar is ambiguous then it produces an error message. So we can easily imagine that bison has some inbuilt check against ambiguity. Now it is a remarkable fact of mathematical logic that it is impossible to write a program that will always sucessfully detect ambiguity in a grammar! I am not kidding! Nobody can ever write such a program even if allowed any amount of RAM and hard drive! If you are new to such "impossibilities" then this might very well strike you as odd. But there is a celebrated theorem called Goedel's Incompleteness Theorem that may be used to prove such "impossibility" claims.

Let us clearly understand what this impossibility means for bison. Whatever check bison may keep against ambiguity cannot be fool-proof. So one of two things must happen:

either
the check is a bit too weak, in which case some ambiguous grammars will sneak in.
or
the check is is a bit too stringent, in which case some unambiguous grammars will be left out (along with all the ambiguous ones).

The first case will have terrible consequence (you may have one parse tree in mind, while bison ends up producing another). The second case is not too bad hoping that not too many useful unambiguous grammars are left out. It is somewhat like choosing the security level of a firewall: better be too stringent than too lenient. The situation is is depited in the diagram below.

The criterion that bison uses is not very complicated. It always looks at the next terminal before making its decision about shifting or reducing. If it is not clear which one to do even after looking at the next terminal, then bison complains about a conflict. Bison will never look ahead more than one terminal. So grammars where the conflict could be resolved by loking at, say, the next two terminals are flagged by bison as bad.

The next example illustrates this situation.

Example: This is actually a continuation of the last example where our attempt to create a C-like language fell flat because functions "clashed" with arrays. We suggested use of different delimiters for functions and arrays (say, square brackets for arrays, and round brackets for functions, just as in C).
But some smart guy comes up with a different suggestion: what about if we do not allow functions to have parameters? Then a function will look like
abc()
while an array with the same name will look like
abc(2)
Accordingly we change the grammar to the following.
stmt : lhs EQ rhs lhs : var | arrayElt var : ID arrayElt : arrayName LPAR index RPAR arrayName : ID index : expr expr : NUM | var | arrayElt | func rhs : var | arrayElt | func func : funcName LPAR RPAR funcName : ID
Unfortunately, bison still makes the same complaint, though the grammar is no more ambiguous. To see why, consider the program
x = abc()
When the compiler reads the abc (which is an ID) it cannot decide if it is a arrayName or a funcName just by knowing that next terminal is a LPAR. The conflict could be resolved if only bison were less lazy, and had looked ahead at the terminal after the LPAR (the RPAR there would have then clearly showed that we are in a function call). But alas, bisons have too short necks to look ahead beyond the very next terminal, and are too obstinate to proceed further before making up their minds!

This die-hard habit of bison's of never looking ahead beyond more than the next 1 terminal has earned its favourite grammars a special name: LALR(1). Here the first LA stands for Look Ahead, and the 1 inside the parantheses tells us the range of looking ahead. If you are curious about the LR part it has a less exciting genesis: the L means bison-generated compilers read a program Left-to-right (just like any ordinary mortal). The R means bison generates the Rightmost derivation (i.e., the newly generated non-terminal after each reduce step is immediately to the left of the dot).

"Clashing" of different parts of a grammar sometimes happens because of using two different non-terminals for the same purspose.

Example: Some books have authors, while some have editors (like anthologies). Is this idea properly captured in the grammar below?
authorOrEditor : author | editor author : NAME editor : NAME
The answer is "No", because when the compiler meets a NAME it has no way of knowing if it is an author or an editor. This will give rise to a reduce-reduce conflict.
We should use a grammar like the following:
authorOrEditor : AUTH author | EDTR editor author : NAME editor : NAME
or,better still,
authorOrEditor : AUTH NAME | EDTR NAME

Using different non-terminals for the same purpose often occurs when more than one people work on the same grammar. Suppose that you have written a long grammar with lots of non-terminals. I want to add some extra features that needs a list of floating point numbers somewhere. So I create a non-terminal called fpList. Now you already had a non-terminal called intList (list of integers) in your gammar. Also an integer may be considered as a special case of a floating point number. Then my fpList may clash with your intList, because a list of integers can be construed as either an intList or a fpList. I said "may" because the clash will be averted if the two lists occur in completely different scenarios (e.g., an intList must always be inside round brackets, while a fpList can never have round brackets around itself).

Exercise 6.2: We want a language with statements like
john : author mary : editor
Discuss the following grammars to achieve this:

authorOrEditor : authorName COLON AUTHOR | editorName COLON EDITOR authorName : NAME editorName : NAME

authorOrEditor : NAME COLON AUTHOR | NAME COLON EDITOR

creator : NAME COLON role role : AUTHOR | EDITOR

Exercise 6.3: Solve the same exercise once again, but this time without the colons.

Exercise 6.4: We know how to make a comma-separated list of items. Which of the following variant(s) is(are) intrinsically ambiguous? Why? Give an unambiguous grammar for each of the rest.

A comma-separated list where an item is optional

A comma-separated list where a comma is optional

A comma-separated list where both an item and a comma is optional