Arnab Chakraborty's DFA tutorial: page 3 of 11

www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

My Guestbook

Home >

... tutorials >

... ... DFA

Why use finite automata?

This is the third page of our tutorial, and it may seem somewhat late now to ask this question. Indeed, this is the question that a student may (and should) justly demand an answer to even before starting to read the tutorial! But the only reason that we have strained the reader's patience so far is: had we listed the possible uses of finite automata earlier, it would most possibly have made him/her think that finite automata are very difficult things to understand. Indeed, there does not seem to be another belief that is so wrong, yet so popular, as that "brilliant concepts are difficult concepts". Well, we for one do not believe in this motto, and hope to find the reader on our side once he/she completes reading the tutorial.

An internal state machine is a machine with a memory. We go on a bit further and say, that any machine with a memory is an internal state machine. The finiteness condition in the definition of finite automata is the natural restriction imposed by the fact that computers cannot handle infinite sets. (However, there are some important variants of finite automata that sort of allow infinite sets. We shall learn about some such variants in a later page.) Viewed in this light, the above question reduces to "Why use machines with memories?"

To the modern reader the answer should be more than obvious. What is a computer but a machine with a memory? Almost every application that runs on a computer is some form of finite automata or other. Here is a list that makes no pretence of being comprehensive:

Well, if we go on like this the list will extend up to infinity. So we shall pause here, and show one very important class of application namely language recognition. This application will show how finite automata are used in editors and in the first phase of compilers.

A concrete application : Lexical analysis

Consider a huge C-code with lots of embedded comments. As the reader should know already, C-comments are delimited by a starting /* and a trailing */, and comments cannot be nested. When a compiler reads the code it strips all the comments off the code. How is this stripping done? To understand this we shall write a that portion of the compiler which strips comments. Our compiler will read the program file character by character. Since the program may be very huge, it makes no attempt to store huge chunks of the code at a time. Keeping this restriction in mind we know that our compiler (which will presently emerge as a finite automaton) has input set consisting of all the possible characters that may be present in C code. Now let us delve into the mind of the compiler.

When it starts reading the file it is outside any comment. It goes on reading character after character from the file (and tossing the characters in the output file) until it meets the first /. Then it knows that t has encountered a possible start of comment. To make sure it reads the next character. If it is a * then the compiler is certain that it is inside a comment. In this case the compiler keeps on reading the file, but now throws away the characters in the garbage bin. If it was not a * then of course the compiler understands that it is outside any comment and keeps on copying the characters read into the output file. In our imaginary voyage through the mind of the compiler, let us again imagine ourselves in the situation where the compiler is inside a comment. If it happens to meet a * it knows that it may be a possible end of comment. So after trashing the * it reads the next character. If it is a / then the comment has indeed ended and the compiler is again in a region that is outside any comment.

Let us take one more look at the above confusing description paying particular attention to the bold phrases. There are four of them:

The behaviour of the compiler depends which of the four states ie is currently in. Well, there is no hiding after we have already used the word "state". Yes, these are the 4 states of the finite automaton. What are the inputs? We have already said that the input set consists of all possible characters that may be found in a C-file. But that's a pretty large set. One further look at the lat paragraph will, however, convince the reader that all that the automaton really has to know is whether the character is a * or / or something else. But is that all? Not quite! Our automaton needs some way to know when the end of file has been reached. You surely cannot leave the machine to imitate poor Oliver, "always asking for more"! So we need another special input symbol which we shall call EOF (End Of File). Thus we shall have four inputs:

*

/

How many outputs shall we have? Well, we shall need 7 of them:

Without further ado we give the two tables corresponding to the output controlling and the state controlling functions:

Inputs

States

	`*`	`/`	EOF	<anything else>
outside any comment	PC	S	D	PC
possible start of comment	T	PSC	PSD	PSC
inside comment	S	T	E	T
possible end of comment	T	T	D	T

[Thanks to Eimantas Vaiciunas for helping me correct a mistake in this table.]

The state controlling function is equally obvious.

Inputs

States

	`*`	`/`	EOF	<anything else>
outside any comment	outside	start	outside	outside
possible start of comment	inside	outside	outside	outside
inside comment	end	inside	inside	inside
possible end of comment	inside	outside	inside	inside

Here, in the second table, "start" is an abbreviation for "possible start of comment", "inside" is an abbreviation for "inside comment", and so on. The reader is requested to try his/her hand at implementing the above finite automaton in some standard computer language like C or ForTran.

In the next page we shall continue with a more sophisticated version of the same example.

hit counter