Arnab Chakraborty's advanced compiler tutorial: page 4 of 3

www.angelfire.com/dragon/letstry `cwave04 at yahoo dot com`
Home > ... tutorials > ... ... Compiler (advanced)

What do the states mean?

The states encode some summary information about the current scenario, like what we have read so far, what reductions have been done etc. The states do not contain the entire history, just the barest minimum to continue the parsing. To debug a parser it is important to know what each state means. The .output file gives brief hints about the meaning of each state. These constitute the second part in the description of each state. Let us take a look at state 2 from test.output once again.

Part of test.output

state 2
list -> list . N (rule 2)	2
$ go to state 4 N shift, and go to state 3

Explanation:

2:

This means we have just now finished reading a list and expect to see an N. Thus the reading head is just after a list. We do not know if indeed the next symbol is going to be an N, but that is what the parser is expecting to see.

Bison does not include the second parts for states 0 and the last two states. This is because their interpretations are trivial:

The first state expects to see a valid program (goes without saying!)
The last two states are reached when we have finished reading the program (and are facing the $). So they do not expect to see anything. They know that they are standing bang in front of the $.

Ambiguous grammars

We are learning debugging a grammar. But all our grammars so far are lacking one vital ingerdient for debugging: they do not have any bug! Let us rectify this inconveneint state of affairs by deliberately introducing one.

Example: We shall work with the following grammar for lists of numbers (denoted by the terminal N).
list : N list : list list
The idea behind the grammar is simple. A single number is a valid list. If we append one list after another then we get a new list.
However, this grammar is ambiguous because the program
N N N
has two possible parse trees shown below.

Let us see how bison deals with this grammar. First, we feed the following file into bison:

amb.y

%{ %} %token N %% list : N | list list %%

Bison prints the following cryptic message:
amb.y contains 1 shift/reduce conflict.
So something has gone wrong. But what? To find out we have to use the commandline option -v:
bison -v amb.y
and then look at the generated amb.output file.
The file starts with a summary of the problem:
State 3 contains 1 shift/reduce conflict.
We still do not know what a "shift/reduce conflict" is, but we can guess that bison is unable to decide whether to shift or to reduce in state 3. Here is the description of this state:

Part of test.output

state 3

list -> list . list (rule 2) list -> list list . (rule 2)
2

N shift, and go to state 1 N [reduce using rule 2 (list)] $default reduce using rule 2 (list) list go to state 3
3

Explanation:

2:

This part now contains two dotted rules. This means state 3 is reached when either we have finished reading a list
and expecting another; or when we have finished reading two lists. The fact that there are two dotted rules in this state is not a danger signal by itself.

3:

It is in the thrid part that we see the trouble. If the next symbol is an N, then bison gets confused: whether to shift or reduce by rule 2.

Well, that is as much as we can expect from bison by way of diagnosis. Now we have to track the cause behind this error using some trial-and-error and educated guessing.

Remember that ambiguity is a flaw of the grammar. It means that there is at least one program that is syntactically correct according to this grammar, but that still has more than one parse tree. Ambiguity is not a flaw of an individual program. It is quite possible that some program written according to an ambiguous grammar has exactly one parse tree.

This may sound confusing. Let us compare debugging a grammar with debugging a C software. Suppose that you have written a C software, and sold it. A user sends you the feedback that for some input your software generates a segmentaton fault at line number 234. However, the user does not mention what particular input has caused the fault.

So you know that your C software is not free of bugs, since at least one input has caused it to trip. It would definitely been of help to know that troublesome input, but all that you know is the nature of the problem (segmentaton fault) and its location in the code (line number 234).

How do you proceed your C software based on this? There are basically two techniques:

Try out some random inputs hoping to reproduce the error. This is a simple method that has a rather low chance of success if the software is complex.
Look carefully at line number 234, and mentally visualise scenarios that might cause the code at that line to break. This requires some debugging experience, but is more effective than the naive method. Indeed, this is how most progrmmers debug their programs.

Well, the situtation is very similar for debugging a grammar. We shall first try the naive approach, which will be quite effective here since our grammar is very simple.

But first we shall construct a table from amb.output, just as we did in the last page.

	`N`	$	`list`
⁰	s1		2
¹	r1	r1	r1
²	s1	4	3
³	s1/r2	r2	3
⁴		5
⁵	acc	acc

Notice how we have represented the shift-reduce conflict.

Example: Consider the program
N N N
We shall use the table to hand parse this program.
⁰
⁰ N ¹
⁰ list ²
⁰ list ² N ¹
⁰ list ² list ³
Well, we are now at state 3, and the next symbol is an N, so we have hit the shift-reduce conflict. So now we have two ways to proceed, each should take us to a valid but different parse tree. Let us try shifting first.
⁰ list ² list ³ N ¹
⁰ list ² list ³ list ³
⁰ list ² list ³
⁰ list ²
⁰ list ⁴
⁰ list ⁵
and finally we accept the program. It should not be difficult to see that this parsing corresponds to the parse tree (a) shown earlier.

Exercise 4.1: Check that resolving the shift-reduce conflict in favour of reducing produces parse tree (b).