Arnab Chakraborty's compiler tutorial: page 9 of 11

www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com


Semantic analysis

An AdvLang program (like a program in any other language) consists of a sequence of lexemes. Lexical analysis reduces this information to a sequence of tokens. For instance, the two programs below are quite different.

LOCATION cave
NAME "A cave"
DESCRIPTION "A granite cave"
north hill
START_AT cave

LOCATION castle
NAME "A diamond castle"
DESCRIPTION "A diamond castle"
east lake
START_AT cave

But both have the same token sequence:


tok_LOCN tok_IDENT
tok_NAME tok_STRING
tok_DESCR tok_STRING
tok_DIRN tok_IDENT
tok_START tok_IDENT

This reduced information is enough for the syntax analysis phase to check the syntactic correctness of the program. The reduction of lexemes to tokens is helpful because the number of possible lexemes is huge, while the number of possible tokens is small.

Now, our final aim is not merely to check the syntax of an AdvLang program, but to produce the final low-level output code in HTML and JavaScript. For this the reduced information contained in the sequence of tokens is not enough. So when we replace lexemes by tokens during lexical analysis we also need to keep track of the actual lexemes. In this page we shall learn how to do this.

Since lexical analysis reduces the data by replacing the lexemes by tokens, each token with more than one lexeme in it involves information loss. To prevent this loss, bison associates a C variable with each occurrence of a token to store additional information. Thus, each occurrence of tok_DIRN has with it an integer, which stores the actual direction coded as, say, NORTH=0, SOUTH=1 etc. Such C variables are called attributes. Here is a list of all AdvLang tokens with more than one lexeme in them. We also show their attributes. Note that the attribute for each token contains enough information to recover the original lexeme.

	Token	Attribute
	`tok_DIRN`	int whichDirn;
	`tok_IDENT`	char *str;
	`tok_STRING`	char *str;
	`tok_DESCR`	char *str;

For tokens with finitely many lexemes (e.g. tok_DIRN) it is usual to have an integer attribute. For tokens with infinitely many lexemes (e.g. tok_IDENT) one usually has a char* attribute to store the entire lexeme.
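To see how an attribute restores what the token alone has lost, here is a minimal C sketch. It is not the code that bison generates. The token code values and the names Attr, TokenWithAttr, sameToken and sameLexeme are made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token codes; bison assigns its own numbers. */
enum { tok_DIRN = 258, tok_LOCN, tok_IDENT, tok_STRING };

/* One attribute slot shared by all tokens, as in the table above. */
typedef union {
    int   whichDirn;  /* for tok_DIRN: 0 = north, 1 = south, ... */
    char *str;        /* for tok_IDENT and tok_STRING: the lexeme itself */
} Attr;

/* A token together with its attribute (illustrative type only). */
typedef struct {
    int  token;
    Attr attr;
} TokenWithAttr;

/* The syntax analyser only cares whether the token codes agree. */
int sameToken(TokenWithAttr a, TokenWithAttr b) {
    return a.token == b.token;
}

/* Two occurrences came from the same lexeme only if the attributes agree too. */
int sameLexeme(TokenWithAttr a, TokenWithAttr b) {
    if (a.token != b.token) return 0;
    switch (a.token) {
    case tok_DIRN:
        return a.attr.whichDirn == b.attr.whichDirn;
    case tok_IDENT:
    case tok_STRING:
        assert(a.attr.str != NULL && b.attr.str != NULL);
        return strcmp(a.attr.str, b.attr.str) == 0;
    default:
        return 1;  /* keyword tokens have just one lexeme */
    }
}
```

With this sketch the identifiers cave and castle give the same token (tok_IDENT), but sameLexeme tells them apart because their str attributes differ.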

Telling bison about the attributes

We need to tell bison about the token-attribute table above. We do this in the second part of the bison input file, as shown below.

adv4.y:


%{
    First part
    as before
%}
%union{
  int whichDirn;
  char *str;
}

%token <whichDirn> tok_DIRN
%token tok_LOCN
%token tok_DESCR
%token tok_START
%token tok_NAME
%token <str> tok_IDENT
%token <str> tok_STRING

%%
    Third part
    contains all the
    rules
%%
    Fourth part
    as before

The new additions are the %union block and the <whichDirn> and <str> tags on the %token lines. Note that the word %union must start from the first column. Also, the entire %union construct must precede the %token lines. The resulting bison file is in adv4.y, which contains all the grammar rules.

Next we need to add some instructions to the flex input file so that flex remembers to fill in the attributes when it replaces the lexemes by tokens. This requires communication between bison and flex, because the attributes are defined in the bison file and yet it is flex that puts values in them. This bison-flex communication is achieved via a C variable called yylval. It is defined by bison, and again declared by flex as extern. Both the definition and the declaration are done automatically, and we do not need to do them explicitly. We just use yylval as a ready-made variable. It has the type given by the %union that we specified in the bison input file. In our case it is:

union {
  int whichDirn;
  char *str;
}

Inside the adv3.lex file change the line

north { return tok_DIRN; }

to

north { yylval.whichDirn = 0; return tok_DIRN; }
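The pattern "fill in the attribute, then return the token" can be tried out in plain C without running flex or bison. The following is only a hand-written sketch. The names ToyYYSTYPE, toy_yylval and toyLexDirection are stand-ins invented here, the token code 258 is hypothetical, and the codes for east and west are assumptions (the text fixes only NORTH=0 and SOUTH=1).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token code; bison assigns the real value. */
enum { tok_DIRN = 258 };

/* NORTH and SOUTH follow the text; EAST and WEST are assumed values. */
enum { NORTH = 0, SOUTH = 1, EAST = 2, WEST = 3 };

/* Stand-in for the type that bison builds from our %union. */
typedef union {
    int   whichDirn;
    char *str;
} ToyYYSTYPE;

/* Stand-in for the ready-made variable yylval. */
ToyYYSTYPE toy_yylval;

/* Stand-in for the scanner's direction rules: store the attribute first,
   then return the token. Returns 0 if the lexeme is not a direction. */
int toyLexDirection(const char *lexeme) {
    static const char *names[] = { "north", "south", "east", "west" };
    assert(lexeme != NULL);
    for (int code = NORTH; code <= WEST; code++) {
        if (strcmp(lexeme, names[code]) == 0) {
            toy_yylval.whichDirn = code;  /* same job as yylval.whichDirn = 0; */
            return tok_DIRN;
        }
    }
    return 0;
}
```

After toyLexDirection("south") returns tok_DIRN, the caller finds 1 in toy_yylval.whichDirn, just as the parser finds the direction in yylval after the real scanner returns.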

The other directions are handled similarly. There is another ready-made variable that flex defines for us. It is called yytext. It is of type char*, and stores the most recently encountered lexeme. We are going to need it now. Change the command

{ return tok_IDENT; }

to

{ yylval.str = strdup(yytext);
       return tok_IDENT; }

[Thanks to Danny for correcting a mistake here.] The resulting flex input file is in adv4.lex. Here the C library function strdup is used to duplicate a string. Given a string argument, it allocates fresh memory large enough to hold the string (including its terminating null character), and then copies the string to that memory. This function is declared in the header file string.h, which we have included at the beginning of adv4.lex.
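Why copy the lexeme instead of just storing the pointer yytext? The scanner reuses the memory behind yytext for the next lexeme, so a saved pointer would soon show a different string. The sketch below imitates this in plain C. The names toyBuffer, toy_yytext, toyMatch and toyStrdup are invented for this illustration, and toyStrdup only spells out what strdup is described as doing above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the scanner's internal buffer that yytext points into. */
char toyBuffer[64];
char *toy_yytext = toyBuffer;

/* Pretend the scanner has just matched `lexeme`: the buffer is overwritten. */
void toyMatch(const char *lexeme) {
    assert(strlen(lexeme) < sizeof toyBuffer);
    strcpy(toyBuffer, lexeme);
}

/* What strdup does, spelled out: allocate length + 1 bytes, then copy.
   Returns NULL if the allocation fails. */
char *toyStrdup(const char *s) {
    size_t n = strlen(s) + 1;   /* + 1 for the terminating '\0' */
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}
```

If we call toyMatch("cave"), save both toy_yytext and toyStrdup(toy_yytext), and then call toyMatch("hill"), the saved pointer now reads hill while the duplicate still reads cave. The duplicate belongs to us, so it should eventually be released with free.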
