|
www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com |
|
|
Lexical analysis
Lexemes and tokens
Let's take a closer look at the high-level code in toy.al.
It consists of a sequence of "words" separated by whitespaces, right?
The technical term for a "word" is lexeme. Thus, LOCATION
is a lexeme, and lake is another. Now, like
most modern computer languages, we shall make AdvLang a free-format
language. This means that the amount of whitespace between two lexemes is
unimportant, so long as there is at least one whitespace to keep them
separate. For example, the C program
int i; i = 2;
mean the same thing as
int i;
i = 2 ;
In both these cases we have the lexemes: int, i,
= and ;.
But the following program (where there is no space to separate
int from i) is different:
inti; i = 2;
Here inti will be treated as a single lexeme (causing an error).
From the compiler's perspective, a program
is just a list of lexemes. Here is a list of all the lexemes used
in our example, toy.al.
LOCATION, NAME, DESCRIPTION, north, south, east, west, START_AT
|
house, obelisk, marsh, treasure, flag, forest
|
"Your house", "Obelisk", "Marsh", "Treasure", "Flag", "Forest"
"You are standing\nin front of your house.\nPaths lead towards east and west.",
"A big obelisk is\nstanding before you. You can either go east or west or south."
"You are slowly swallowed by the mud.\nWhat a terrible end!",
"A chest full of treasure is lying at your feet!\nExits are towards north, east and west.",
"A flag is fluttering at a crossing.\nPaths go in all four directions,
except west."
"A north-south road goes through a thick forest.\nA trail goes towards east."
|
Notice that we have grouped these into three coloured blocks.
The lexemes in
the last block are really not "words", they are strings,
ie,
sequences of characters delimited by double quotes. But since the
strings behave as single entities, we consider
them as lexemes.
The lexemes in the first block are called keywords, since
their meanings remain the same in every AdvLang program.
And these are all the keywords that AdvLang has,
and an AdvLang user has no choice about them. Among these,
the keywords north, south, east and west share one common thing,
namely they all specify directions.
Unlike the keywords, the lexemes in the second and third blocks
are all chosen by the user. We have already mentioned that the lexemes
in the third block are called strings. We shall
call the lexemes in
the second block as identifiers, since they
identify the locations.
Based on the above description we can group all possible
AdvLang lexemes into a number
of bunches. Each bunch is called a token or terminal.
Here we are talking about all possible lexemes.
Some lexeme(s) may not appear in some AdvLang programs.
| Name of token | Lexemes in the token |
| tok_DIRN | north, south, east, west |
| tok_LOCN | LOCATION |
| tok_DESCR | DESCRIPTION |
| tok_START | START_AT |
| tok_NAME | NAME |
| tok_IDENT | any single word |
| tok_STRING | any sequence of characters
inside double quotes |
Observe that the keywords (except north, south etc) form tokens
on their own.
From lexeme to token
The first phase in compilation is to replace each lexeme by the token
it belongs to. This is called lexical analysis. We shall learn
later how to instruct the computer to
achieve this. For the time being let's take a look at what lexical
analysis
does to toy.al. Fir simplicity we shall show only part of the
file here:
| Original program | After lexical analysis |
|
LOCATION house
NAME "Your house"
DESCRIPTION "You are standing...west."
east flag
west forest
.
.
.
START_AT house
|
tok_LOCN tok_IDENT
tok_NAME tok_STRING
tok_DESCR tok_STRING
tok_DIRN tok_IDENT
tok_DIRN tok_IDENT
.
.
.
tok_START tok_IDENT
|
Notice that at the end of lexical analysis we have a file consisting
of only the tokens.
Using flex/lex
Now it is time to learn how to perform lexical analysis using
a computer. If you are well-versed in C or some similar
programming language, then you may try to write your own code
to do the lexical analysis. But there is a much simpler method
that most people use. They use a free utility program called
flex that writes the C program for them! We shall learn
now how to use it.
Flex is available from a number of sources.
If you are using any unix-like system, then most likely you
already have it. It is also available by default in Mac OS X.
In any case, you can download it for free from www.gnu.org.
|
To use flex we need to tell it what the possible lexemes
are, what the tokens are, and which lexeme belongs to which token.
In other words, we need to feed the token-lexeme table into
flex. This is done by writing the table in a particular way
in a file with extension .lex.
Open your favourite text editor and
create a file called adv1.lex as shown left.
adv1.lex:
%{
#include "compile.h"
%}
%%
LOCATION {return tok_LOCN;}
north {return tok_DIRN;}
south {return tok_DIRN;}
%%
yywrap() {
return 1;
}
|
Make sure that the %%'s start from the first column.
Also, the lexemes LOCATION, north
and south
must start from the first column.
We shall learn later the details of these. But here is a quick
explanation
before proceeding further. In the first part (between the %{
and %} we can have any C-like #includes.
Here we have included a header file called "compile.h". We shall
create this file in a moment. It will just hold the token definitions.
The part of the file between the two
%%'s is where we put the token-lexeme table.
Here we have implemented only three lexemes
LOCATION, north
and south.
Each line between the two %%'s starts with a lexeme.
Then, inside the {} we return the
corresponding token. Ignore the yywrap() function for the time
being. It is automatically called when the end of file is
reached. Returning 1 means "everything is OK".
This file is the input to flex. Based on the token-lexeme table in
it, flex will produce a C program file called lex.yy.c.
Run flex with the following command.
flex adv1.lex
flex will automatically generate the file lex.yy.c.
Among other things, this file contains a
machine-generated C functon called yylex(). Each time
this function
is called, it looks for the next occurence of any of the specified
lexemes. If it finds one it returns the corresponding token.
compile.c:
#include <stdio.h>
#include "compile.h"
main() {
int tok;
while(1) {
tok = yylex();
switch(tok) {
case tok_LOCN:
printf("tok_LOCN");
break;
case tok_DIRN:
printf("tok_DIRN");
break;
}
}
}
|
To test it we shall write a small C program compile.c and a
header file compile.h.
Let us explain the header file first. Its contents are shown in the box to the left. It merely defines the tokens as integers. The numbers
10 and 11 are chosen arbitrarily. You could as well make them 100 and 200 (but not 0, which has a special use).
compile.h:
#define tok_LOCN 10
#define tok_DIRN 11
|
The program compile.c is shown in the box to the right.
Notice that compile.h
is #included in it. The purpose of main() is
obvious. Compile compile.c to produce an executible called
compile, say. You may use the command
cc -o compile compile.c lex.yy.c
[Thanks to Danny for correcting a mistake in the line
above!]
Now comes the fun part. Run compile on toy.al
as follows.
compile < toy.al
The entire toy.al will be printed on the screen, but all
occurences of LOCATION will now be replaced by
tok_LOCN and all occurences of north
and south by tok_DIRN.
Implementing any token with finitely many lexemes in it
is similar. But tokens with inifinitely many lexemes in them
are slightly trickier. Let us, for instance, consider the token
tok_IDENT. Since there are inifinitely many possible
location names that the AdvLang programmer can choose
from, it is impossible to list them exhaustively in
adv1.lex. We need some compact way of expressing that
inifinitely long list in a finite amount of space.
Flex uses regular expressions to do this. This is not a
tutorial on regular expressions, so we shall explain only
the particular regular expressions that we shall use. For
a general introduction to regular expressions see
here.
Recall that tok_IDENT, like any other token, is
a bunch of lexemes. Any lexeme that is made of (a positive
number of)
uppercase or lowercase
letters from a to z is a member of this bunch. The only exceptions
are the keywords: LOCATION, north etc. The set of all uppercase and
lowercase letters from a to z is written as the regular expression
[a-zA-Z]
The set of all lexemes made by writing a positive number of these
letters is denoted by the regular expression
[a-zA-Z]+
Note that this set is infinite. Some members of this set are
cave, rtjskgj, asErtf, GHLO, NAME
In particular, all the AdvLang keywords are also in this set.
To arrive at the set tok_IDENT we need to eliminate
the keywords from the regular expression [a-zA-Z]+ . The flex
input file to do this is shown in the box to the left.
adv2.lex:
%{
#include "compile.h"
%}
%%
LOCATION {return tok_LOCN;}
north {return tok_DIRN;}
south {return tok_DIRN;}
[a-zA-Z]+ {return tok_IDENT;}
%%
yywrap() {
return 1;
}
|
Note that it is essentially the same adv1.lex
as before, except that
we have added a line starting with [a-zA-Z]+ after the
lines involving the keywords. Don't forget to start [a-zA-Z]+
from the first column.
When flex looks for lexemes in an AdvLang program,
it does so in the order of the lexemes
given in adv2.lex. For instance, when it comes across
a lexeme, flex first checks if it is LOCATION. If it is,
flex
at once returns the token tok_LOCN, without trying to
match any further. If the lexeme is not LOCATION
then flex sees if it matches north, and so on. If nothing
matches the lexeme, then flex just prints it on the screen. Thus, the
only way flex can return tok_IDENT, is when a lexeme
is not LOCATION, north or south,
but belongs to the set [a-zA-Z]+ . Please take some time to understand
this point. This tells us why the keyword lexemes should be listed
before the regular expression lexemes.
| Exercise 4.1: A good exercise to consolidate your understanding at this point
would be to add relevant lines between the two %%'s
to account for all the possible lexemes. The case for
strings is slightly tricky owing to the presence of double quotes. Here is
a hint: the double quote character is represented as "\"" and any positive
number of non-double-quote characters is denoted by
[^"]+ The ^ inside the square brackets means "not".
|
| [Click here for solution.] |
|