Arnab Chakraborty's compiler tutorial: page 5 of 11

www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

My Guestbook

Home >

... tutorials >

... ... Compiler

Syntax analysis

The output of lexical analysis, as we have already seen, is a sequence of tokens. Now, the token sequence of a valid AdvLang program must conform to some structural rules, or syntax, as they are called.

Here is an example. A typical exit from a location looks like


north flag

meaning "if you go north from the current location you will go to the location called flag". In terms of tokens, this is


tok_DIRN tok_IDENT

Here the tok_IDENT must come after the tok_DIRN. It is a mistake to write


flag north

So we have the syntax rule:

An exit consists of a TOK_DIRN and a TOK_IDENT. The TOK_IDENT must come after the TOK_DIRN.

We specify this rule as


exit : tok_DIRN tok_IDENT

This is called a grammar rule or production rule written in the Backus-Naur Form (BNF). Read it as

An exit is made of a tok_DIRN followed by a tok_IDENT.

The colon (:) means "is made of." A production rule defines the structure of some part of a program. Here we are defining the structure of an exit. Such parts are called non-terminals. Thus exit is a non-terminal. It is written to the left of the colon. So we call it the left hand side of the rule.

How to ride a `bison`

Syntax analysis is more complex than lexical analysis. However, just as we used the free utility program flex to write a lexical analyser, we can use another free utility called bison to generate the syntax analyser.

The bison software, like flex, is a free utility program available from www.gnu.org. It is an improved version of a similar utility program called yacc. yacc or bison is part of many unix-like systems and also of Mac OS X.

Recall that we had put our token-lexeme table in a file called adv3.lex (after solving the exercise in the last page) and then had fed the file to flex. Similarly, here we shall specify the syntax rules inside a file called adv1.y and invoke bison on it as


bison adv1.y

The file adv1.y is called the bison input file, and it must have extension .y.

So far we have discussed only one grammar rule. The adv1.y file containing only this rule looks as follows.

adv1.y:

%{

#include <stdio.h>

%}

%token tok_DIRN tok_IDENT

%%

exit : tok_DIRN tok_IDENT {printf("Found an exit!\n");}

%%


int yyerror(char *errMessage) {
   printf("Trouble: %s\n",errMessage);
}

main() {
  yyparse();
}

Let us explain what this file means. A bison input file has 4 parts. The parts are delimited by special symbols: %{, %} and %%. All these special symbols must start from the first column.

#include

printf

stdio.h

%token

Bison

#define

compile.h

[Thanks again, Danny, for reminding me to add this line!]

printf

yyerror(char *errMessage)

AdvLang

main

yyparse()

flex

yylex()

bison

yyparse()

yylex()

yyerror(char *errMessage)

At this stage (after solving the exercise in the last page) you should have a flex input file adv3.lex, and a bison input file adv1.y. Now give the following commands to the computer. The comments in red tell you the results of the commands.


flex adv3.lex
   [lex.yy.c is created]
bison -d -o compile.c adv1.y
   [compile.c and compile.h are created]
cc -o compile compile.c lex.yy.c
   [compile is created]

(If you are using Windows, then you would get a file called compile.exe instead of compile.) To try out our compiler let us create a sample AdvLang file test1.al containing a single exit:

test1.al:

north flag

Compile it as


compile < test1.al

You should get the output


Found an exit!

Now modify the file test1.al so that it not a valid exit anymore (for example, change the "north" to "northwest", which is not a direction that AdvLang understands). Then the output will be


Trouble: Syntax error.

Syntax rules are all in terms of the tokens only, and not the lexemes. Thus the rule that "a location cannot have two different north exits" is not a syntax rule, since it involves the lexeme north. Such rules are called semantic rules and will be dealt with during semantic analysis.

More rules

We have already defined the grammar for one part of an AdvLang program, namely exits. The same technique can be applied to some other parts as well. For example, we specify the name of a location as


NAME "House"

In terms of tokens this is


tok_NAME tok_STRING

So we have the rule

A name speification is a made of a tok_NAME followed by a tok_STRING.

Using BNF we write


nameSpec : tok_NAME tok_STRING

Here we have introduced a new non-terminal called nameSpec. Similarly, for the desctiptions we have the rule


descrSpec : tok_DESCR tok_STRING

The specification of the starting location is made of the token tok_START followed by tok_IDENT. In BNF we write this as:


startSpec : tok_START tok_IDENT

Making the list of exits

Expressing lists in BNF is slightly tricky. Be prepared to go through the following explanation more than once. Once learned, this will be a valuable tool in your arsenal.

Each list needs at least two BNF rules to define it. The first rule tells us what the shortest possible list looks like. The next rule tells us how to create a new list from an old list by the minimum possible addition. The shortest possible exitList is a list with no exits! (For instance, there was no exit from the Marsh in the sample game). So the first rule is:


exitList :

There is nothing to the right of the colon. The second rule extends an existing exitList by adding one more exit to it:


exitList : exitList exit

Notice that by virtue of this rule we can keep on adding exit-s to get as long an exitList as we want.

Trying it out

Change the adv1.y file to adv2.y as follows.

adv2.y:

<First and second parts as before>
%%
exitList :  {printf("Starting with an empty list!\n");}
exitList : exitList exit {printf("Adding an exit!\n");}
exit : tok_DIRN tok_IDENT {printf("Found an exit!\n");}
%%
<Fourth part as before>

Apply bison to it as before. Now change test1.al to the file test2.al containing a list of exits:

test2.al:

north flag
south house

Then apply your compiler on this file: (the red lines are the computer outputs)


compile < test2.al 
Starting with an empty list!
Found an exit!
Adding an exit!
Found an exit!
Adding an exit!

Combining rules

We have already defined different parts of a location (e.g., nameSpec, descrSpec, exitList). Now we shall combine these into a single rule for specifying a location.:


locnSpec : tok_LOCN tok_IDENT 
                      nameSpec descrSpec exitList

This means: A location specification starts with the token tok_LOCN followed by a tok_IDENT. Then comes the name specification followed by the location description which is followed by a list of exits.

After adding this rule and the rules for nameSpec and descrSpec, the file adv2.y becomes adv3.y that looks like the following.

adv3.y:

<First part as before>
%token tok_NAME tok_DIRN tok_IDENT tok_DESCR tok_LOCN tok_STRING
%%
locnSpec : tok_LOCN tok_IDENT nameSpec descrSpec exitList
            {printf("Completed a location!\n");}
nameSpec : tok_NAME tok_STRING
            {printf("Name specification!\n");}
descrSpec : tok_DESCR tok_STRING
            {printf("Description specification!\n");}
exitList :  {printf("Starting with an empty list!\n");}
exitList : exitList exit {printf("Adding an exit!\n");}
exit : tok_DIRN tok_IDENT {printf("Found an exit!\n");}
%%
<Fourth part as before>

To test this we shall use test3.al:

test3.al:

LOCATION house
NAME "Your house"
DESCRIPTION "This is your house."
east forest

Create the compiler using bison as before. Apply the compiler on test3.al as:


compile < test3.al
Name specification
Description specification
Starting with an empty list!
Found an exit!
Adding an exit!
Completed a location!

The order in which the computer prints the above lines, depends on the order in which the production rules are applied. The "Completed a location!" message comes at the very end. This is because the compiler must first recognise all the different parts before it can recognise the location. The order in which production rules are used is of great importance, and will be discussed at more length later.

Making the list of locations

An AdvLang contains a list of location specifications. Making the list of locnSpec is much like making exitList. Each list needs at least two BNF rules to define it. The first rule tells us what the shortest possible list looks like. In case of locnSpecList, the shortest list consists of a single element (locnSpec.) We write this in BNF as


locnSpecList : locnSpec

The next rule tells us how to create a new list from an old list by the minimum possible addition. This may be done by appending one more locnSpec to an existing locnSpecList.


locnSpecList : locnSpecList locnSpec

Wrapping it up

An AdvLang program consists of a list of locations followed by the specification of the starting location. So we have the rule:


program : locnSpecList startSpec

So finally we have 10 production rules that define all the non-terminals. These production rules together form the grammar of AdvLang:


(Rule 1)  program : locnSpecList startSpec 
(Rule 2)  startSpec : tok_START tok_IDENT
(Rule 3)  locnSpec : tok_LOCN tok_IDENT 
            nameSpec descrSpec exitList 
(Rule 4)  nameSpec : tok_NAME tok_STRING 
(Rule 5)  descrSpec : tok_DESCR tok_STRING 
(Rule 6)  exit : tok_DIRN tok_IDENT 
(Rule 7)  locnSpecList : locnSpec 
(Rule 8)  locnSpecList : locnSpecList locnSpec 
(Rule 9)  exitList : 
(Rule 10) exitList: exitList exit

hit counter