www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

Some alternatives

Consider the AdvLang language once again. When one writes a program in the AdvLang language one essentially expresses one's idea about an adventure game. The AdvLang compiler turns this idea into a full-fledged implementation. Similarly, the C program

for(i=0;i<10;i++) 
  printf("Hello");

expresses the programmer's intention to print "Hello" 10 times. It is the job of the compiler to produce suitable machine language instructions to fulfil this intention.

This is true of any language. A language is a way to let a user express his/her ideas simply, and a compiler is software that turns those ideas into a concrete implementation. But designing a language and its compiler is not the only means to this end. On this page we shall discuss three alternatives to compiler construction. All three techniques produce a "language" (some way to express ideas) and a "compiler" (some automated way to implement those ideas) without the hassle of writing an elaborate flex-bison-based compiler. The alternatives are all less powerful than what we have already learned, but their simplicity may more than offset their lack of power for small-scale applications.

Writing C functions and macros

This is a very powerful alternative to building a compiler. In any compiler the final output is actually produced by the actions attached to the grammar rules. The grammar simply coordinates when these actions are triggered. So why not write these actions as C functions/macros and simply call them directly? For instance, an AdvLang program describing a single location may be written as:

#include <game.h>

int main(void)
{
  /* flag and forest are Locations created in the same way (not shown) */
  Location house;
  house = newLocation("Your house",
                      "You are standing\nin front of your house.\n"
                      "Paths lead towards east and west.");
  addExit(house,EAST,flag);
  addExit(house,WEST,forest);

  startAt(house);
  return 0;
}

Here Location is a pointer to some suitable structure, and newLocation and addExit are functions that you write once and for all (and ship via game.h). The end-user need not know how these work, and may not even know much C.

The advantage is obvious: you do not need to write any lexical analyser or parser (you are just borrowing those from the user's C compiler). You do have to write the structure for a location and the functions addExit etc., but you had to write them anyway as actions in a bison file. There is a further advantage for a user who knows C. Suppose that we want to make an adventure game that is a maze of 100 locations arranged in a 10 by 10 grid. Such a thing can now be created easily using a for-loop in C.
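The 10 by 10 maze might be built along the following lines. This is only a sketch: the structure behind Location, the direction constants and the bodies of newLocation and addExit are stand-ins invented here for whatever game.h would really provide, and SIDE and buildMaze are hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for what game.h might provide. */
enum { NORTH, SOUTH, EAST, WEST, NDIRS };

typedef struct LocationRec {
    char name[32];
    const char *description;
    struct LocationRec *exits[NDIRS];   /* NULL means "no exit that way" */
} *Location;

Location newLocation(const char *name, const char *description)
{
    int d;
    Location loc = malloc(sizeof *loc);

    if (loc == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    snprintf(loc->name, sizeof loc->name, "%s", name);
    loc->description = description;
    for (d = 0; d < NDIRS; d++)
        loc->exits[d] = NULL;
    return loc;
}

void addExit(Location from, int dir, Location to)
{
    from->exits[dir] = to;
}

#define SIDE 10

/* Create SIDE*SIDE rooms, then connect each room to its neighbours. */
void buildMaze(Location grid[SIDE][SIDE])
{
    int r, c;
    char name[32];

    for (r = 0; r < SIDE; r++)
        for (c = 0; c < SIDE; c++) {
            snprintf(name, sizeof name, "Room (%d,%d)", r, c);
            grid[r][c] = newLocation(name, "You are somewhere in the maze.");
        }

    for (r = 0; r < SIDE; r++)
        for (c = 0; c < SIDE; c++) {
            if (r > 0)        addExit(grid[r][c], NORTH, grid[r - 1][c]);
            if (r < SIDE - 1) addExit(grid[r][c], SOUTH, grid[r + 1][c]);
            if (c > 0)        addExit(grid[r][c], WEST,  grid[r][c - 1]);
            if (c < SIDE - 1) addExit(grid[r][c], EAST,  grid[r][c + 1]);
        }
}
```

After buildMaze(grid) the game could be started from any room, say with startAt(grid[0][0]). Writing out the same hundred locations by hand in AdvLang would take several hundred lines.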

The disadvantage is that such a "language" is very difficult to debug or maintain. Imagine someone declaring a location variable but forgetting to create the location with newLocation before it is mentioned elsewhere in the game. The C compiler need not generate any error message for this omission (at best it may warn about an uninitialised variable). Instead there will be some nasty stuff (anything from a segmentation fault or access violation to wrong messages or a system crash) at run time.

In short, such a "language" is not structured. And without structure it is impossible to have automated error checking.

LaTeX (which may easily win a competition for being the world's worst-designed language) is just one such "unstructured" beast hacked out of macros written in TeX (which by itself is a brilliant low-level language).

Having said all this, I must hasten to add that for a small-scale language this approach is indeed pretty cool. In fact, I would strongly suggest that you always start out like this when you want to design a new language. Once you are happy with the "unstructured" language you can later put it in a flex-bison wrapper (using your "unstructured" functions as actions).

OpenGL is a popular example of an "unstructured language". It is basically a bunch of functions.

XML

The main job of a parser is to build the parse tree of a program. In a sense, the parse tree contains all the information needed by the compiler. The design of the grammar, the efforts to avoid shift-reduce conflicts, all revolve around the central theme: the parse tree.

But for many small-scale problems the end-user may be ready to provide the parse tree directly, obviating the need for the compiler to construct it. This is precisely the reason why XML is so popular, though many users may not be thinking of parse trees when they write an XML file. Consider the AdvLang program

LOCATION house
NAME "Your house"
DESCRIPTION "You are standing\nin front of your house.\nPaths lead towards east and west."
east flag
west forest

START_AT house

Isn't the same information conveyed in the following XML snippet?

<ADVENTURE start="house">
  <LOCATION id="house">
    <NAME>Your house</NAME>
    <DESCRIPTION>You are standing\nin front of your house.\nPaths
      lead towards east and west.</DESCRIPTION>
    <EXIT dir="east">flag</EXIT>
    <EXIT dir="west">forest</EXIT>
  </LOCATION>
</ADVENTURE>

Well, you may not quite like the appearance of those tags. They do not look clean. But the advantage is that now you do not need to design a compiler to parse it. Just use a standard XML processor (say XSL in a standard browser) to work with the tree directly. Any XML processor already checks the nesting of the tags, so that provides a primitive syntax check. If you want some additional restrictions then an extra DTD may help.

The advantage of the approach is that such a language may be created very easily on the fly. One can add extra features to the language equally easily.

The disadvantage is that XSL provides only limited programming functionality. In particular, its cross-referencing scheme is notoriously awkward. So this approach is not a good choice for languages where some variable is used many times throughout a program. But a language based on only "local information" is an ideal candidate for XMLifying!

X3D, the XML-based successor of VRML, is a "language" that is built like this using XML.

Embedding one language inside another

This is a nice technique that is gaining currency rapidly. The idea originates from the need to add extra features to some existing language. For example, Perl, Python and Ruby allow multi-line strings. Thus in Python you can write

print '''Usage notes for my program:

Please type
superprog [options] filename
where [options] is one or more of
  a : abort
  b : burst
  c : cancel
  d : die
  e : explode (or E : expire)'''

Unfortunately C does not allow such a thing. Can we design a language that will be C plus this feature? Clearly, rewriting the C compiler is out of the question. It is in such a scenario that the embedding trick is useful.

All that we need to do is this: write a program that takes a multiline string and, at every newline in it, inserts the escape \n followed by a backslash just before the newline. (A backslash alone would merely splice the lines together, so the line breaks would be lost from the string.) Now make your program read a C program, copying everything until it comes to the first occurrence of '''. It replaces that ''' by a double quote, and from then on it keeps treating newlines in this way until it meets the closing ''', which it again replaces by a double quote. Now compile the resulting C program. You can consider this simple program as the front end of your compiler, the resulting C program as the intermediate representation and the actual C compiler as the back end. You may even pack the front and back ends in a single shell script (that will delete the intermediate C file after the compilation is over). The front end may be written in any language. I find Perl or Ruby a very good choice. You can also do it using flex.
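Here is a minimal sketch of such a front end, written in C only to stay with the language of this page (Perl or Ruby would be shorter). The function name expandTripleQuotes and its buffer-based interface are inventions for this illustration. Besides the backslash before each newline, it emits the escape \n so that the line breaks survive in the compiled string, and it escapes any double quote found inside the ''' section. It does not try to recognise a ''' that happens to sit inside a comment or an ordinary C string.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the C source in src to dst, turning each '''...''' section into an
 * ordinary C string literal.  Returns the length of the output, or
 * (size_t)-1 if dst (of capacity cap) is too small.
 */
size_t expandTripleQuotes(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    int inside = 0;                 /* are we between ''' and '''? */

    if (cap == 0)
        return (size_t)-1;

    while (*src != '\0') {
        char one[2];
        const char *piece;

        if (strncmp(src, "'''", 3) == 0) {
            inside = !inside;       /* opening or closing delimiter */
            piece = "\"";
            src += 3;
        } else if (inside && *src == '\n') {
            piece = "\\n\\\n";      /* the escape \n, then backslash-newline */
            src++;
        } else if (inside && *src == '"') {
            piece = "\\\"";         /* a quote inside must be escaped */
            src++;
        } else {
            one[0] = *src++;        /* everything else is copied as it is */
            one[1] = '\0';
            piece = one;
        }

        if (n + strlen(piece) >= cap)
            return (size_t)-1;
        strcpy(dst + n, piece);
        n += strlen(piece);
    }
    dst[n] = '\0';
    return n;
}
```

A driver would read the whole input file into memory, call expandTripleQuotes on it, write the result to a temporary .c file and hand that file to the C compiler.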

Many standard "languages" use this technique already:
  1. C preprocessor: Though we may be used to considering the C preprocessing directives like #define, #if, #elif etc. as part of C, in fact these constitute a language of their own embedded inside C. The C preprocessor (which is the compiler for that embedded language) merely scans through the C program to operate on the lines starting with #, treating everything else basically as meaningless characters (except for substituting the macros).
  2. JSP: There is much noise about the JavaServer Pages (JSP) technology, which is little more than our simple technique applied to embed ordinary multiline text inside Java servlets. Indeed, I once wrote a little 10-line Perl script to create my own JSP.
  3. Bison: Yes, even bison itself uses the same trick in its grammar file, which is just a grammar embedded in a C program. The %{, %} and %%'s all keep the embedded grammar separate from the C part.
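To make the first item concrete, here is a small sketch in which the two languages are marked off by comments. The macro and function names are made up for the illustration.

```c
/* --- embedded language: consumed by the preprocessor --- */
#define GREETING "Hello"
#define TIMES 3

#if TIMES > 5
#define SEPARATOR "\n"
#else
#define SEPARATOR ", "
#endif

/* --- host language: after preprocessing, the C compiler proper
       only sees   return "Hello" ", " "Hello";   and   return 3;   --- */
const char *twoGreetings(void)
{
    return GREETING SEPARATOR GREETING;
}

int howManyTimes(void)
{
    return TIMES;
}
```

Running only the preprocessor on this file (with gcc, the -E option does that) shows the intermediate C program in which every # line has disappeared and every macro has been substituted.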
All the examples so far are ones where the technique is used sequentially. First the embedded compiler runs to process the embedded parts, and then the main compiler runs. Sometimes the two compilers run independently or in parallel. Then the original file is like two interlaced programs. Each compiler knows how to isolate its own part from the rest. Here are some real examples:
  1. Java and Javadoc: The inline documentation facility of Java is basically a very simple scripting language (much like our AdvLang) embedded inside Java. The embedding is marked by a special commenting scheme to keep the regular javac from getting confused. By running the javac compiler we can produce the .class file as usual. By running the javadoc compiler we can produce the HTML documentation files.
  2. PostScript and DSC: PostScript is a language for producing printable documents containing text and graphics. However, in its original incarnation it was meant for the printer, and did not allow easy navigation in an onscreen viewer. The navigation features were added to it by embedding a script-like language called DSC (Document Structuring Conventions), written as specially formatted comments, in a regular PostScript file. A PostScript file without this embedded part will print all right, but cannot be navigated using a viewer (e.g., cannot jump to a page, go to the previous page etc.).

© Arnab Chakraborty (2007)