Toy - A Small Compiler for the Amiga To most people, compilers are black boxes. They read an input text file containing source and magically write out an object file containing equivalent machine instructions. In actuality, there is nothing magical about a compiler. A compiler doesn't even have to be very complex or very large. Since programmers are quite demanding, however, the compiler writer (being a demanding programmer), is not likely to be satisfied with a compiler that is less than perfect. Thus, the typical compiler soons ceases to be simple and small. Such minor details as complex data types, good code, good error recovery, etc., can add thousands of lines to a compiler. Such an embellished compiler is likely not fully understood by its author, let alone anyone else. So, when people have requested the source to my Draco compiler, so they can learn from it and add some stuff they want, I have been somewhat reluctant. To be honest, it is a mess inside. This situation led me to write a new compiler, one designed to be simple and understandable, but nevertheless a real compiler for the Amiga. The result of this is the Toy compiler, which I will describe in this series of articles. It is quite simple in operation, and uses the same techniques that the Draco compiler itself uses. It is written in Draco, which is freely available, so people can tinker with it without having to buy a commercial compiler. In spite of its simplicity, it is a true compiler. It goes from a source file to an executable object file containing MC68000 instructions. It does not use an interpreter, or threaded code, or any other such technique for execution - it emits straightforward (and not overly efficient) instructions to accomplish the steps specified by the input source program statements. In this first article of the series, I will describe the language that is being compiled, giving some example programs written in Toy. I will also outline the general structure of compilers, and the Toy compiler in particular. Techniques used in commercial and experimental compilers will be mentioned, as will many of the problems that compiler writers must contend with, and the decisions they must make. The Toy Language One of the first decisions to be made when designing a compiler is that of deciding what language is to be compiled. To keep the compiler simple, we want a simple language. This pretty well rules out a complete version of an existing language. Most Amiga programmers would prefer a C - like language, since that is what they use. I don't feel that is a good choice for a number of reasons. C is normally used with a preprocessor (either a separate program or built-in to the compiler), and having one would increase the complexity of the project. C's syntax is ambiguous (e.g. "fred();" is either a declaration or a call, depending on where you see it), thus complicating any error recovery. C's inside-out declarations are a pain to handle (it took me a couple of tries in the C compiler I did at work). Lastly, C's automatic declarations and unusual ideas about scope (anything declared as 'extern' is global, regardless of where you declare it) add unneeded hazards. I didn't see any gain in inventing a whole new language, however, so I went with what I am most familiar with - Draco. The Toy language is a proper subset of Draco. All of the Toy programs I have written compile and run just fine under the Draco compiler. For those not familiar with Draco, I will describe the Toy language here, using a simple grammar and some examples. A Toy program consists of any number of procedures, which may or may not return a result. Each procedure can have named, typed parameters. Variables global to the file and local to procedures exist, and have similar types. Executable statements include the assignment statement, the 'if' statement, the 'while' statement, and some simple input/output statements. Operators usable in expressions include '+', '-', '*' and '/'. Available in conditions are the standard comparison operators. In the following grammar, the '::=' symbol means that the term on the left can be any of the '|' - separated sequences on the right. Items in apostrophes must appear exactly as given. Items in angle brackets ('<' and '>') must be replaced by one of their right-hand-sides in order to make an acceptable program. A following '*' means that the term can appear any number of times (including none). ::= * * ::= ';' ::= ::= 'int' | 'bool' ::= | ',' ::= Any sequence of letters, digits and underscores, starting with a letter or an underscore ::= 'proc' '(' ')' ':' * 'corp' ';' ::= | ::= | ';' ::= 'int' ::= | 'void' ::= | ::= | ';' ::= | | | | | ::= 'if' 'then' * 'fi' ::= 'elif' 'then' ::= | 'else' ::= 'while' 'do' 'od' ::= 'readln' '(' ')' ::= 'write' '(' ')' ::= | ',' ::= | ::= ':=' ::= ::= | ::= '=' | '~=' | '<' | '<=' | '>' | '>=' ::= | ::= '+' | '-' ::= | ::= '*' | '/' ::= | '-' | '+' ::= | | | '(' ')' | 'true' | 'false' ::= '(' ')' ::= | ',' ::= Any sequence of decimal digits Many parts of the Toy compiler's parser will follow this grammar quite closely. Note in particular that the definition of eventually cycles around to mentioning on the right hand side. This recursive definition is echoed in the recursive descent parser used in Toy. The compiler places many limitations on acceptable programs which are not described by the grammar. Types of grammars exist which can contain all of the semantic information needed, but building a correct one of those is more work than building a compiler, so I have not done so. Those interested in such things should read the Algol68 report. (I'm being sarcastic here - the report is, in my opinion, unreadable without a great deal of study.) There are two main types of limita- tions. The first type are due to the specific implementation of the compiler. For example, each 'if' statement is limited, by an array in the parsing code, to 10 'elif' parts. This could be handled by using a dynamically allocated linked list, but is not a problem in a toy compiler. The other type of limitation imposed by the compiler is that of semantic correctness - the compiler will reject things that don't make sense, even if they are correct according to the above grammar. E.g. bool flag; flag := 6; is flagged as an error, even though it fits the grammar just fine. This is because Toy is a strongly typed language. The compiler also checks such things as parameter counts and types on procedure calls. The alert reader will already have noticed that Toy isn't a very impressive language. You cannot write the world's best Amiga program using it. It is pretty difficult to write ANYTHING useful in it. It has only two data types - 'int' and 'bool'. It has no arrays, no structures, no pointers, no named constants, and a pretty wimpy I/O system. Adding any of these things would have significantly increased the size and complexity of the Toy compiler, making it less suitable as a tutorial compiler. Interested programmers can add many features to Toy themselves. I will be giving hints on how to go about this as I discuss the various portions of the compiler. In the following discussions of the Toy language and compiler, I will assume that the reader is familiar with at least one programming language, and will be able to understand the general outline of what the various programs are doing without me explaining what all of the various language constructs mean. A detailed understanding is not necessary to get the general outline of how the compiler works. Sample Programs A tradition I've held on to for my compilers is that the first program I try to write is one that prints out "Hello there world!". Here is the Toy version: proc main()void: write("Hello there world!\n"); corp; As the grammar indicates, a Toy program can have a series of procedure bodies. This program has one procedure, called 'main', which has no parameters (there is nothing between the parentheses after the procedure name), and returns no result - the special result-word 'void' means that no result is returned. 'write', in this case, causes the given string to be written out. The pair of characters '\n' represent the newline character, and indicate that the output line should be ended at that point. 'corp' is simply 'proc' spelled backwards - program constructs in Toy are usually ended by a backwards word which matches a beginning word for that construct. This style is strongly disliked by several vocal people. I first thought that way, but changed my mind after using it for a while. It makes it quite clear what construct is being closed, something which C's braces or Pascal's BEGIN - END pairs fail to do. My second program is traditionally a recursive solution to the "Towers of Hanoi" problem. In the unlikely event that you haven't heard of it, I'll describe it briefly. The problem set-up consists of three vertical pegs, on which are placed several round disks. The disks all start on the first peg, and must be moved to the third peg, with the second peg available for intermediate operations. The disks are of varying sizes, and start out in order, with the largest on the bottom, and the smallest on the top. They must end up the same way on the third peg. The only valid move is to move a single disk from one peg to another. At no time may a disk rest on another which is smaller. The story says that there are some priests on a mountaintop who started doing the problem when the world was created, and when they are done, the world will end. It turns out that the traditional 64 disks would take so long to move according to the rules, that we need have no fear of the world ending. The standard solution is a recursive one consisting of three steps, dealing with N disks: 1) Recursively using this method, move the top N - 1 disks from the starting peg to the spare peg. 2) Move the N'th disk from its starting peg to its target peg. 3) Recursively using this method, move the top N - 1 disks from the spare peg to the target peg. There is a special case - if N = 0, don't do anything. Those not familiar with recursion may want to try this out using stacks of small pieces of paper. Remember that the meanings of 'start peg', 'spare peg' and 'target peg' change as you recurse down. Listing 1 shows the Toy implementation of this solution to the puzzle. I don't recommend that you try it with 64 disks - your Amiga will turn to dust long before it is done! Four or five disks makes a nice test. This program is already showing some weaknesses in Toy - a separate routine 'printPeg' is used to print out the name of a peg. Proper languages would allow string variables, which could be printed out directly. Procedure 'hanoi' implements the above solution. Moving a disk consists of printing out a message saying where to move it. Procedure 'main' (all Toy programs must have a procedure called 'main' - this is where execution starts) asks for and reads in the count of disks to be used, then does the top level call to 'hanoi'. Listing 2 shows 'shoot.t', a program which actually manages to implement a little game in Toy. It uses the cursor and colour control of the Amiga's console device to provide very primitive graphics and animation. The program draws a simple scene consisting of the ground, a "gun" (a square box sitting on the ground) and a "target" (a different- coloured diamond shape, also sitting on the ground). Shells are shot from the gun at a forty-five degree angle at a speed specified by the player. Their course is shown on the screen and the game is over when one hits the target dead on. The game maintains a score which is the number of shells shot, and maintains a message area for things like error comments. "shoot" is sufficiently complicated that I added some comments. For the non-C and non-Draco programmers out there, comments in Toy (not mentioned at all in the above grammar!) consist of any text between an opening /* and a closing */. The program starts off by declaring a bunch of global variables. I have capitalized their names to make them look like constants, but the Toy compiler takes no notice of that and would be quite happy to let me change their values. They are setup once in the main program, then not changed throughout the game. Following some global variables, the code starts. 'random' is the first Toy procedure we've seen that returns a result. The result type of the procedure is specified as 'int', by the 'int' after the parameters in the procedure header. The actual value returned is given by the expression before the 'corp'. This point also is not mentioned by the grammar. The 7 procedures after 'random' send escape sequences to the console window to accomplish various actions, such as moving the cursor, specifying boldface or normal type, and changing the fore- ground colour of the text. Consult the Amiga technical manuals for details on these escape sequences. Next are several routines which use these special control routines to do higher-level game-related actions, such as drawing the ground, gun and target. 'hitTarget' is one of the largest procedures in the program - it plots the course of the thrown shell, drawing it as an asterisk, and checking to see if it goes off the edge of the screen, hits the target, or just hits the ground. A 'BOOOM' is left wherever the shell hits the ground. Note the final 'if' statement in the routine - in a more complete language, the condition would have used a logical 'and' operation, rather than an extra 'if' statement. The final routine, 'main' is, as usual, where execution starts. Since Toy has no way to pre-declare procedures, 'main' must be the last procedure in the source file, since every other procedure is either called by 'main' or by something that 'main' calls, etc. Note that this restriction means that mutually recursive procedures cannot be used in Toy. The program requests the screen size so that it knows how to do the cursor addressing for correct positioning. The size should be 23 for a normal non-interlaced screen, and 48 for an interlaced one. The console device can tell the program how big the window is, but since Toy can only read integer values, it would not be possible to understand the string that the console device returns. General Outline of the Toy Compiler The Toy compiler is completely stand-alone. It does not need any other programs or libraries to do its job. In particular, it does not make use of any compiler-construction tools, such as lexer generators or parser generators. The parser generator Yacc (Yet Another Compiler- Compiler) has been discussed by Eric Giguere in his series of articles in "Transactor for the Amiga". Yacc and its relatives are often used on UNIX systems to generate parsers of various kinds. Along with it, UNIX offers 'Lex', a program to generate lexical scanners. These, and other similar tools, can simplify some aspects of compiler writing, but I prefer the flexibility of custom code, so have not used any for either Draco or Toy. Like any other program which reads input, writes output, parses command lines and prints error messages, a compiler requires an interface to the operating system. In the case of a compiler, the interface is made more complex by the strict structure required of the output machine-code file. In the case of AmigaDOS, the structure of an executable file is described in the "AmigaDOS Technical Reference Manual". Note that there are two variants of an object file. The first variant, normally produced by compilers and assemblers, is an unlinked one. This form is not directly executable by the operating system since it makes references to support and library routines which are provided by the compiler libraries and/or operating system. Such a file is linked with similar files containing libraries and interfaces to the operating system to produce a final, fully linked object file. This latter type of file IS directly executable by the operating system. I chose to keep Toy as simple and complete as possible, so it directly produces the final linked version. This means that the user does not need access to a linker (e.g. either ALink or BLink) in order to compile and run Toy programs. This implies that the Toy compiler itself must supply any support routines that are needed. For Toy, these consist of the ability to read and write integers and to write strings. Note that the two Toy source files dealing with the system interface and generation of the run-time routines are the only ones which are system dependent. The remainder of the compiler does not care where it is running, or where its output is going - it is only concerned with the input source and the generated MC68000 instructions. The system interface supplies input data in a raw form - as a sequence of ASCII characters. The first step in compilation is to break this sequence up into larger, more meaningful pieces. These pieces are such things as identifiers, reserved words such as "proc", "if", etc., numbers, punctuation marks, string constants, etc. This process is called "lexical analysis", "lexical scanning" or "tokenization". The output from this phase of a compiler is a sequence of "tokens" which represent the input program at a higher level than that of a sequence of characters. The tokenizer is also responsible for skipping over comments, blank lines, spaces, etc. Many compilers do not actually keep the tokenized form of the program around - they simply call a routine to get the next token when it is needed, and then discard it when they go on to the one after that. Toy uses this technique. There are not many error messages associated with tokenization - about the only thing that can be complained about are characters not in the set of characters that the language uses. For a C compiler, the sequence of events described here is somewhat different. First, the input text is very roughly tokenized - just enough for the preprocessor to spot its directives. The resulting "preprocessor tokens" are handled by the preprocessor to do macro replacement and conditional compilation. The output from the preprocessor is another text stream. The compiler proper must then tokenize this stream yet again for its own use. It is almost possible to do away with the double tokenization, but the strict definition of the C language does not really allow this. I tried it for the ANSI C compiler I wrote at work, and it is that area of the compiler that most needs redoing. The sequence of tokens generated by the tokenizer must be parsed according to the grammar for the language. In most compilers, the parser also issues syntax-related error messages, attempts error recovery, and often does semantic (meaning) checking as well. These activities are all done by the Toy parsing routines. Parsing the input consists of identifying the grammar rules which fit the input, and interpreting that input according to those rules. E.g. the only grammar rule that the sequence 'if a < b then write("Hello\n"); fi;' fits is that for an 'if' statement. The condition is the subsequence 'a < b', which itself breaks down through the expression grammar into . There are no , and the body has only one , which is a . At this point the compiler writer has another option. Should an internal data structure be built which describes the parsed program? Such an internal form can make certain kinds of optimizations easier. For example, common sub-expression analysis, which attempts to avoid recalculating an expression whose value was just recently computed, usually operates by scanning such an internal structure (usually in a tree-form) for similar sub-trees. Also, such a representation can be used to isolate the target CPU dependencies and the input language dependencies into separate sections of the compiler. Newer compiler systems are able to use this technique to provide a single "back-end" (optimization and code generation) to several different "front-ends" (tokenization and parsing), thus drastically reducing the work needed to create an entire family of compilers. Lattice C appears to use a technique like this - the "quad" files are a quad (four element) representation of the parsed program. This theoretically allows them to use the same "front-end" for the Amiga compiler as, say, for the MSDOS compiler. The Toy compiler, which is designed to be small and easy to understand, does not do this. Instead, the parsing code makes direct calls to code-generation routines, which generate the code for the constructs being parsed. This, unfortunately, prevents a clean separation between the parsing and the code generation. As you would expect, the Toy compiler has little in the way of optimization. What it does have is there mainly for illustrative purposes and is simply some checks for special cases which can easily be done better. The final stage of compilation is that of code generation. This stage typically does a lot of bit fiddling to construct the target machine instructions. This is where the compiler writer curses the CPU designers for not making the machine any where near as orthogonal as they claim to have. In a global sense, the MC68000 is reasonably orthogonal - the form and coding of addressing modes is consistent, and the positioning of most instruction fields is consistent. There are a number of nuisances, however. A minor, but easily spotted, example is the encoding of the operand length. This is sometimes in the form of a two-bit field, but other times in the interpretation of a three-bit field. The MOVE instruction uses a two-bit field, but uses a different encoding than all of the other uses. The EOR instruction appears to follow the pattern of the ADD, SUB, AND, etc. instructions, but it is actually missing many of the modes - the encodings for those are used for other, totally different instructions. One of the winners on the MC68000 has to be the shift/rotate instruction. There are two forms of it, one operating on registers, and one operating on memory operands. The memory operand form is distinguished by a size field of binary 11. The register form has a bit which specifies whether another field is an explicit shift count or is the number of a D-reg which contains the shift count. Another field, in different places in the two forms, says what kind of shift is being done - logical, arithmetic, rotate, or rotate with extend. The code generation stage of a compiler thus has to contend with whatever idiosyncrasies the target CPU has. It also has to contend with the limitations of the CPU. These include such things as the number of registers available, the special-purpose use of some registers (the dreaded Intel 8086 family is notorious for this), and the size of various instruction fields (e.g. a 16 bit signed displacement on some modes, but an 8 bit signed displacement on others). Again, there are several approaches to code generation. Many of the newer compilers are table driven - they have a table of the instructions and forms that the target CPU has, along with some kind of encoded description of what they do. The compiler then attempts to match the requirements of the structure it is trying to generate code for against the table of instructions, to find one that matches. For any but the most simple structures, there will be no match, so the compiler will resort to heuristics like loading operands into registers, then trying again. For the simpler cases, there might be more than one match, so the compiler tries to decide which is "cheaper". Once more, the Toy compiler doesn't do any of this - it has simple, special purpose code for each thing it tries do. The form of these routines will be discussed in full in the article on Toy's code generation. The Final Example As a final example of what the Toy compiler does, I will show the stages of compilation for a small program. This will be referenced by the later articles in order to illustrate details of the compilation process. Here is the initial program source: int N, Sum; proc main()void: Sum := 0; readln(N); while N ~= 0 do Sum := Sum + N; readln(N); od; write("The sum is: ", Sum, "\n"); corp; The tokenizer reduces the program to the following sequence of tokens: 'int' ',' ';' 'proc' '(' ')' 'void' ':' ':=' ';' 'readln' '(' ')' ';' 'while' '~=' 'do' ':=' '+' ';' 'readln' '(' ')' ';' 'od' ';' 'write' '(' ',' ',' ')' ';' 'corp' ';' Note that the tokenizer has totally discarded all of the pretty spacing and indentation. There are a very large number of intermediate parsings that could be shown. One somewhere in the middle is: ';' 'proc' '(' ')' ':' ';' ';' 'while' 'do' ';' ';' 'od' ';' ';' 'corp' ';' Note that two 's are needed to allow a semicolon after the last statement in a (before the 'od' and 'corp'). The assembler code generated for this program is: 0000: link.w fp,#0 0004: movem.l ,-(sp) 0008: clr.w d7 000a: move.w d7,Sum 0010: jsr toy_readLine 0016: jsr toy_readInt 001c: move.w d0,d7 001e: move.w d7,N LOOP: 0024: move.w N,d7 002a: clr.w d6 002c: cmp.w d6,d7 002e: sne d7 0030: andi.w #1,d7 0034: tst.w d7 0036: beq LOOPEXIT 003a: move.w Sum,d7 0040: move.w N,d6 0046: add.w d6,d7 0048: move.w d7,Sum 004e: jsr toy_readLine 0054: jsr toy_readInt 005a: move.w d0,d7 005c: move.w d7,N 0062: bra LOOP LOOPEXIT: 0064: jsr 16(pc) 0068: "The sum is: " 0076: jsr toy_printString 007c: move.w Sum,d7 0082: move.w d7,-(sp) 0084: jsr toy_printInt 008a: jsr 4(pc) 008e: "\n" 0090: jsr toy_printString 0096: movem.l (sp)+, 009a: unlk fp 009c: rts Finally, the AmigaDOS object module hunks produced are: hunk_code: 0000: 4e56 0000 48e7 7f00 4247 33c7 0000 0116 *NV..H...BG3.....* 0010: 4eb9 0000 0000 4eb9 0000 0000 3e00 33c7 *N.....N.....>.3.* 0020: 0000 0114 3e39 0000 0114 4246 be46 56c7 *....>9....BF.FV.* 0030: 0247 0001 4a47 6700 002c 3e39 0000 0116 *.G..JGg..,>9....* 0040: 3c39 0000 0114 de46 33c7 0000 0116 4eb9 *<9.....F3.....N.* 0050: 0000 0000 4eb9 0000 0000 3e00 33c7 0000 *....N.....>.3...* 0060: 0114 60c0 4eba 0010 5468 6520 7375 6d20 *..`.N...The sum * 0070: 6973 3a20 0000 4eb9 0000 0000 3e39 0000 *is: ..N.....>9..* 0080: 0116 3f07 4eb9 0000 0000 4eba 0004 0a00 *..?.N.....N.....* 0090: 4eb9 0000 0000 4cdf 00fe 4e5e 4e75 0000 *N.....L...N^Nu..* hunk_reloc32: hunk 3: 134 hunk 2: 146 120 hunk 5: 86 24 hunk 4: 80 18 hunk 1: 126 94 74 66 60 38 32 12 hunk_end The actual executable object file created contains the Toy startup and run-time library before these hunks, but they are fixed for all Toy programs, so they are not shown here. About the only part of the original program that is still recognizable to a human reader is the string "The sum is: ". Next Month The next article in this series will examine the data structures used by the Toy compiler and will present the tokenizer and the symbol table handler in detail.