www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com
Semantic analysis
An AdvLang program (like programs in any other language)
consists of a sequence of lexemes. Lexical analysis
reduces this information into a sequence of tokens.
For instance, the two programs in the boxes below
are quite different.
|
LOCATION cave
NAME "A cave"
DESCRIPTION "A granite cave"
north hill
START_AT cave
|
LOCATION castle
NAME "A diamond castle"
DESCRIPTION "A diamond castle"
east lake
START_AT cave
|
But both have the same token sequence:
tok_LOCN tok_IDENT
tok_NAME tok_STRING
tok_DESCR tok_STRING
tok_DIRN tok_IDENT
tok_START tok_IDENT
This reduced information is enough for the
syntax analysis phase to check the syntactic correctness of the
program. The reduction of lexemes to tokens
is helpful because the number of possible lexemes is huge,
while the number of possible tokens is small.
Now, our final aim
is not merely to check the syntax of an AdvLang program,
but to produce the final low-level output code in
html and javascript. For this the reduced information contained
in the sequence of tokens is not enough.
So when we replace lexemes by tokens during lexical analysis
we also need to keep track of the actual lexemes. In this page we
shall learn how to do this.
Since lexical
analysis reduces the data by replacing the lexemes by tokens,
each token that stands for more than one lexeme involves information loss.
To prevent this loss, bison associates a C variable
with each occurrence
of a token to store additional information. Thus, each occurrence of
tok_DIRN carries an integer that stores the actual direction,
coded as, say, NORTH=0, SOUTH=1, etc. Such C variables are called
attributes.
Here is a list of all AdvLang tokens with more than one
lexeme in them. We also show their attributes. Note that
the attribute for each token contains enough information to
recover the original lexeme.
| Token | Attribute |
| tok_DIRN | int whichDirn; |
| tok_IDENT | char *str; |
| tok_STRING | char *str; |
For tokens with finitely many lexemes (e.g., tok_DIRN) it is usual
to have an integer attribute. For tokens with infinitely many
lexemes (e.g., tok_IDENT) one usually has a char*
attribute to store the entire lexeme.
Telling bison about the attributes
We need to tell bison about the token-attribute table above.
We do this in the second part of the bison input file, as shown
in the box below.
adv4.y:
%{
First part
as before
%}
%union{
int whichDirn;
char *str;
}
%token <whichDirn> tok_DIRN
%token tok_LOCN
%token tok_DESCR
%token tok_START
%token tok_NAME
%token <str> tok_IDENT
%token <str> tok_STRING
%%
Third part
contains all the
rules
%%
Fourth part
as before
|
The new additions are the %union block and the <whichDirn> and <str>
tags on the %token lines. Note that the word
%union
must start from the first column. Also, the entire
%union construct must precede the %token lines. The
resulting bison file is in adv4.y,
which contains all the grammar rules.
Next we need to add some instructions to the flex input file
so that flex remembers to fill in the attributes when it
replaces the lexemes by tokens. This requires communication
between bison and flex because the attributes are
defined in the bison file and yet it is flex
that puts values in them. This bison-flex communication
is achieved via a C variable called yylval. It is defined
in the parser that bison generates, and declared as
extern in the header file that bison can write out (with its -d option),
which a flex file normally includes to get the token names.
So we do not need to define or declare it
explicitly. We just use yylval as a ready-made variable.
It has the type given by the %union that we specified in
the bison input file.
In our case it is:
union {
int whichDirn;
char *str;
}
Inside the adv3.lex file change the line
north { return tok_DIRN; }
to
north { yylval.whichDirn = 0; return tok_DIRN; }
The other directions are handled similarly. There is another ready-made
variable that flex defines for us. It is called
yytext. It is of type char*, and stores the most
recently encountered lexeme. We are going to need it now. Change the command
{ return tok_IDENT; }
to
{ yylval.str = strdup(yytext);
return tok_IDENT; }
[Thanks to Danny for correcting a mistake here.]
The resulting flex input file is in adv4.lex. Here the C library function
strdup is used to duplicate a string. Given a string argument,
it allocates fresh memory large enough to hold the string and its
terminating null character, and then
copies the string to that memory. This function is declared in the
header file string.h that we have included at the beginning of adv4.lex.