www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

Debugging a grammar

Recall from the elementary tutorial that parsing basically consists of grouping terminals into non-terminals, and the non-terminals into still higher level non-terminals until we reach the topmost non-terminal. We have also mentioned how this hierarchical grouping may be visually presented as a parse tree. If it is possible to group the terminals in more than one way to arrive at the topmost non-terminal, then we have ambiguity. In this conection we had mentioned that the English language is ambuguous since it allows sentences like
I saw a girl with a telescope,
that allows two diffrerent possible groupings leading to two different meanings:
I saw (a girl with a telescope),
and
(I saw a girl) with a telescope.
Finally we had remarked that we shall always steer clear of ambuguous grammars. Indeed bison has built-in checks to detect ambiguity in the grammar file given to it.

Unfortunately, the world is far from perfect. Though we would love to shun ambuguous grammars by all means, we often end up writing such grammars by mistake! Bison never fails to detect the ambiguity, and prints some diagnostic messages trying to tell you the reason behind the trouble, it needs some expertise to interpret the messages, and to remove the ambiguity. This expertise is essential for writing a grammar on one's own, just as debugging ability is essential for software development.

Here we shall learn some useful techniques for debugging a grammar when bison complains about it.

Making bison verbose

Bisons, as any zoologist will confirm, are not particularly talkative beasts. The bison software also is rather terse in its output. However, it is possible to make it more verbose by using the commandline option -v as follows:

bison -v test.y

where test.y is the name of your grammar file. This causes bison to output an extra file called test.output (it just appends .output after your file root). This file contains valuable debugging information (though in a somewhat cryptic way). We shall now learn how o read this file.

Example: We shall work with the following grammar for lists of numbers (denoted by the terminal N).

list : N
list : list N 

The idea behind the grammar should be already familiar. A single number is a valid list. If we append a number after an existing list then we get a new list.

Let us write the grammar in a file called test.y:
test.y


%{
%}

%token N

%%

list : N
     | list N

%%

Now issue the command

bison -v test.y

Open the geenrated test.output file with your favourite editor.

The file starts with a detailed description of the grammar and the terminals and non-terminals used in it. Each rule is given a reference number, and each terminal is also given a reference number.

Next comes the most important part: the list of states. We shall learn in a moment what a "state" is in this context.

A quick glance through test.output shows that there are 6 states in all (more complicated grammars will have more states). Each state has a typical structure. Consider state 2 as an example
Part of test.output

state 2
1
    list  ->  list . N   (rule 2)
2
    $   	go to state 4
    N   	shift, and go to state 3
3

Explanation:

1:
This is the state number. The numbers start from 0.

2:
This is a grammar rule with a dot somewhere in the r.h.s. In this example there is only one such rule. It is possible to have a longer list. This part could be absent also.

3:
The thrird part contains a list of "what to do next" actions. Each action starts with a terminal/non-terminal, followed by what to do if we find that terminal. We shall explain the actions soon. The dollar symbol ($) is a special terminal denoting the end-of-file character.

The second part is optional. It is missing for states 0, 4 and 5, in our example.

We shall start with the third part. This part tells the parser how to make the parse tree.

Example: Consider the program
N N N
We already know that the parse tree is

Let us see how we arrive at this parse tree using the third parts of the states.

We shall read the terminals one by one, shifting or reducing at each step. At each step we shall be in a certain state. The inituial state is state 0.
0
Now look at state 0 in test.y. It says if the next symbol is the terminal N then shift and go to state 1. The next symbol for us is indeed an N, so we shift:
0 N 1
So now we have to look up state 1 in test.output. Here it says always reduce by rule 1. The symbol $default means "always", as can be guessed). So we throw away the N (and the state 1 after it) from our list and write list instead:
0 list
So it is as if we have got list at state 0. So we need to consult state 0 again. There it said that if the next symbol is a list then go to state 2. So we have
0 list 2
State 2 says that if the next symbol is N then shift and go to state 3:
0 list 2 N 3
State 3 asks us to always reduce the list and N into a list. This means throwing away the list and N (along with the states sanwitched between and after them) and write list instead:
0 list
So we back to the old situation: getting a list at state 0. We already know the prescription: go to state 2.
0 list 2
Now we are again in a familiar situation where the same old steps repeat:
0 list 2 N 3
0 list 2
We have consumed the enture program. So the next terminal is the end-of-file charcater or $. It is meaningless to read this (since we cannot make the reading head cross the end-of-file). So state 2 simply advises us to go to 4, i.e., change the 2 to 4:
0 list 4
Remember that the next terminal is still $. So state 4 bids us to go to state 5:
0 list 5
and state 5 tells us to accept the program as a valid one.

Well, it was a long example. But the process of looking up the states in the test.output file to decide the next step can be simplified considerably if we represent the states in a tabular form. The table will have one row for each state, and one column for each symbol (terminal/non-terminal/end-of-file). The table for our test.output is shown below.
N $ list
0 s1 2
1r1r1r1
2s34
3r2r2
45
5accacc
Each cell in this table has one of 5 types:
  1. s followd by a state number: this means shoft and goto this state.
  2. r followed by a rule number: this means reduce by this rule.
  3. just a state number: this means go to this state.
  4. acc: this means accept the program as a valid one.
  5. empty: if you ever come to such a cell, then your program has some syntax error.

Example: Let us see how the parser behaves when we feed a wrong program to it. Well, in our super-simple grammar there is exactly one wrong program: the empty file! This means we face the $ right at state 0. The table shows that the corresponding cell is empty, indicating a syntax error.

It is possible to show that the other empty cells are unreachable (even by wrong programs).

Exercise 3.1: Generate the .output file for our AdvLang language using bison. Then construct the table from it. Use the table to hand parse a syntactically correct AdvLang program and then a syntactically wrong one.


PrevNext
© Arnab Chakraborty (2007)