Arnab Chakraborty's compiler tutorial: page 4 of 11

www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

My Guestbook

Home >

... tutorials >

... ... Compiler

Lexical analysis

Lexemes and tokens

Let's take a closer look at the high-level code in toy.al. It consists of a sequence of "words" separated by whitespaces, right? The technical term for a "word" is lexeme. Thus, LOCATION is a lexeme, and lake is another. Now, like most modern computer languages, we shall make AdvLang a free-format language. This means that the amount of whitespace between two lexemes is unimportant, so long as there is at least one whitespace to keep them separate. For example, the C program


  int i; i = 2;

mean the same thing as


  int       i; 
  i = 2        ;

In both these cases we have the lexemes: int, i, = and ;. But the following program (where there is no space to separate int from i) is different:


  inti; i = 2;

Here inti will be treated as a single lexeme (causing an error).

From the compiler's perspective, a program is just a list of lexemes. Here is a list of all the lexemes used in our example, toy.al.

LOCATION, NAME, DESCRIPTION, north, south, east, west, START_AT

house, obelisk, marsh, treasure, flag, forest

"Your house", "Obelisk", "Marsh", "Treasure", "Flag", "Forest"
"You are standing\nin front of your house.\nPaths lead towards east and west.",
"A big obelisk is\nstanding before you. You can either go east or west or south."
"You are slowly swallowed by the mud.\nWhat a terrible end!",
"A chest full of treasure is lying at your feet!\nExits are towards north, east and west.",
"A flag is fluttering at a crossing.\nPaths go in all four directions, except west."
"A north-south road goes through a thick forest.\nA trail goes towards east."

Notice that we have grouped these into three coloured blocks. The lexemes in the last block are really not "words", they are strings, ie, sequences of characters delimited by double quotes. But since the strings behave as single entities, we consider them as lexemes. The lexemes in the first block are called keywords, since their meanings remain the same in every AdvLang program. And these are all the keywords that AdvLang has, and an AdvLang user has no choice about them. Among these, the keywords north, south, east and west share one common thing, namely they all specify directions.

Unlike the keywords, the lexemes in the second and third blocks are all chosen by the user. We have already mentioned that the lexemes in the third block are called strings. We shall call the lexemes in the second block as identifiers, since they identify the locations.

Based on the above description we can group all possible AdvLang lexemes into a number of bunches. Each bunch is called a token or terminal. Here we are talking about all possible lexemes. Some lexeme(s) may not appear in some AdvLang programs.

	Name of token	Lexemes in the token
	tok_DIRN	north, south, east, west
	tok_LOCN	LOCATION
	tok_DESCR	DESCRIPTION
	tok_START	START_AT
	tok_NAME	NAME
	tok_IDENT	any single word
	tok_STRING	any sequence of characters inside double quotes

Observe that the keywords (except north, south etc) form tokens on their own.

From lexeme to token

The first phase in compilation is to replace each lexeme by the token it belongs to. This is called lexical analysis. We shall learn later how to instruct the computer to achieve this. For the time being let's take a look at what lexical analysis does to toy.al. Fir simplicity we shall show only part of the file here:

Original program

After lexical analysis


LOCATION house
NAME "Your house"
DESCRIPTION "You are standing...west."
east flag
west forest
  .
  .
  .
START_AT house


tok_LOCN tok_IDENT
tok_NAME tok_STRING
tok_DESCR tok_STRING
tok_DIRN tok_IDENT
tok_DIRN tok_IDENT
  .
  .
  .
tok_START tok_IDENT

Notice that at the end of lexical analysis we have a file consisting of only the tokens.

Using flex/lex

Now it is time to learn how to perform lexical analysis using a computer. If you are well-versed in C or some similar programming language, then you may try to write your own code to do the lexical analysis. But there is a much simpler method that most people use. They use a free utility program called flex that writes the C program for them! We shall learn now how to use it.

Flex is available from a number of sources. If you are using any unix-like system, then most likely you already have it. It is also available by default in Mac OS X. In any case, you can download it for free from www.gnu.org.

To use flex we need to tell it what the possible lexemes are, what the tokens are, and which lexeme belongs to which token. In other words, we need to feed the token-lexeme table into flex. This is done by writing the table in a particular way in a file with extension .lex.

Open your favourite text editor and create a file called adv1.lex as shown left.

adv1.lex:

%{
#include "compile.h"
%}

%%
LOCATION   {return tok_LOCN;}
north      {return tok_DIRN;}
south      {return tok_DIRN;}
%%
yywrap() {
  return 1;
}

Make sure that the %%'s start from the first column. Also, the lexemes LOCATION, north and south must start from the first column.

We shall learn later the details of these. But here is a quick explanation before proceeding further. In the first part (between the %{ and %} we can have any C-like #includes. Here we have included a header file called "compile.h". We shall create this file in a moment. It will just hold the token definitions. The part of the file between the two %%'s is where we put the token-lexeme table. Here we have implemented only three lexemes LOCATION, north and south. Each line between the two %%'s starts with a lexeme. Then, inside the {} we return the corresponding token. Ignore the yywrap() function for the time being. It is automatically called when the end of file is reached. Returning 1 means "everything is OK".

This file is the input to flex. Based on the token-lexeme table in it, flex will produce a C program file called lex.yy.c. Run flex with the following command.


flex adv1.lex

flex will automatically generate the file lex.yy.c. Among other things, this file contains a machine-generated C functon called yylex(). Each time this function is called, it looks for the next occurence of any of the specified lexemes. If it finds one it returns the corresponding token.

compile.c:

#include <stdio.h>


#include "compile.h"

main() {
  int tok;

  while(1) {
    tok = yylex();
    switch(tok) {
      case tok_LOCN:
        printf("tok_LOCN");
        break;
      case tok_DIRN:
        printf("tok_DIRN");
        break;
    }
  }
}

To test it we shall write a small C program compile.c and a header file compile.h. Let us explain the header file first. Its contents are shown in the box to the left. It merely defines the tokens as integers. The numbers 10 and 11 are chosen arbitrarily. You could as well make them 100 and 200 (but not 0, which has a special use).

compile.h:


#define tok_LOCN 10
#define tok_DIRN 11

The program compile.c is shown in the box to the right. Notice that compile.h is #included in it. The purpose of main() is obvious. Compile compile.c to produce an executible called compile, say.

You may use the command


cc -o compile compile.c lex.yy.c

[Thanks to Danny for correcting a mistake in the line above!]

Now comes the fun part. Run compile on toy.al as follows.


compile < toy.al

The entire toy.al will be printed on the screen, but all occurences of LOCATION will now be replaced by tok_LOCN and all occurences of north and south by tok_DIRN.

Implementing any token with finitely many lexemes in it is similar. But tokens with inifinitely many lexemes in them are slightly trickier. Let us, for instance, consider the token tok_IDENT. Since there are inifinitely many possible location names that the AdvLang programmer can choose from, it is impossible to list them exhaustively in adv1.lex. We need some compact way of expressing that inifinitely long list in a finite amount of space.

Flex uses regular expressions to do this. This is not a tutorial on regular expressions, so we shall explain only the particular regular expressions that we shall use. For a general introduction to regular expressions see here.

Recall that tok_IDENT, like any other token, is a bunch of lexemes. Any lexeme that is made of (a positive number of) uppercase or lowercase letters from a to z is a member of this bunch. The only exceptions are the keywords: LOCATION, north etc. The set of all uppercase and lowercase letters from a to z is written as the regular expression

[a-zA-Z]

The set of all lexemes made by writing a positive number of these letters is denoted by the regular expression

[a-zA-Z]+

Note that this set is infinite. Some members of this set are

cave, rtjskgj, asErtf, GHLO, NAME

In particular, all the AdvLang keywords are also in this set. To arrive at the set tok_IDENT we need to eliminate the keywords from the regular expression [a-zA-Z]+ . The flex input file to do this is shown in the box to the left.

adv2.lex:

%{
#include "compile.h"
%}
%%
LOCATION   {return tok_LOCN;}
north      {return tok_DIRN;}
south      {return tok_DIRN;}
[a-zA-Z]+  {return tok_IDENT;}
%%
yywrap() {
  return 1;
}

Note that it is essentially the same adv1.lex as before, except that we have added a line starting with [a-zA-Z]+ after the lines involving the keywords. Don't forget to start [a-zA-Z]+ from the first column.

When flex looks for lexemes in an AdvLang program, it does so in the order of the lexemes given in adv2.lex. For instance, when it comes across a lexeme, flex first checks if it is LOCATION. If it is, flex at once returns the token tok_LOCN, without trying to match any further. If the lexeme is not LOCATION then flex sees if it matches north, and so on. If nothing matches the lexeme, then flex just prints it on the screen. Thus, the only way flex can return tok_IDENT, is when a lexeme is not LOCATION, north or south, but belongs to the set [a-zA-Z]+ . Please take some time to understand this point. This tells us why the keyword lexemes should be listed before the regular expression lexemes.

Exercise 4.1: A good exercise to consolidate your understanding at this point would be to add relevant lines between the two %%'s to account for all the possible lexemes. The case for strings is slightly tricky owing to the presence of double quotes. Here is a hint: the double quote character is represented as "\"" and any positive number of non-double-quote characters is denoted by [^"]+ The ^ inside the square brackets means "not".

[Click here for solution.]

hit counter