Toy - A Small Compiler for the Amiga Part 4 - Code Generation In many ways, the most interesting part of a compiler is the code generator. The previous parts of the Toy compiler that we have seen have all been mostly independent of the nitty-gritty stuff of actually generating the desired machine code. If we are generous in our definition, code generation can also include all of the optimizations, which, to many, are the only really interesting parts of a compiler. Unfortunately, Toy is a very simple compiler, and doesn't really have any optimization. The couple of examples shown are just that: examples. I will start out first with a simple discussion of some of the alternatives in code generation. Next I will briefly discuss the code in Toy's code generator (which is very straightforward). This is followed by a run-through of generating code for a simple example, and by a short discussion of some optimizations. Alternatives As with most parts of the compiler, there are several ways to approach the task of code generation. In a desire to avoid a lot of hand-written code, and to keep the code generator as independent of the target CPU as possible, many modern compilers use a table driven approach. This requires that the parsed program be represented in some sort of tree form. The table entries contain a coded representation of a subtree, along with the assembler instructions to be generated, with appropriate substitutions from the subtree. As an example, suppose a compiler had an expression node that looked like this: add-32 / \ / \ DReg(4) const(1) This represents the desire to do a 32 bit add of a value in a data register and a constant. The value is in register 4, and the constant to be added is '1'. An MC68000 code generator might have table entries somewhat like this: add-32 A, const(1..8) :: "addq.l #%1,A%2" 2 add-32 D, const(1..8) :: "addq.l #%1,D%2" 2 add-32 A, const() :: "adda.l #%1,A%2" 6 add-32 D, const() :: "add.l #%1,D%2" 6 add-32 A, const(-32768..32767) :: "lea %1(A%2),A%2" 4 where the fields are: node type to match operand #1 (here a register of type A or D) operand #2 (here a numeric constant, perhaps restricted) assembler string to generate size in bytes The operand fields are numbered, which allows their values to be substituted into the assembler string. Our sample node above would thus match both the second and the fourth entries. The compiler could correctly generate either one, but since the second is four bytes shorter, it would prefer that one. In reality, a compiler targeting the 68000 family would likely encode the useability of many different addressing modes in the same table entry, and would also encode the operand sizes. Large tables are usually left, however, and the compiler spends a lot of time figuring out which entries to use. Note that the above table entries contained a string for the emitted assembler code. This implies that such a compiler would produce assembler source which would have to be assembled after the compilation phase. Aztec C for the Amiga operates in this fashion, as do many C compilers based on the original pcc (portable C compiler). Different table entries could be used to build machine code directly. A distinct advantage of table-driven code generation is that, given a sufficiently general representation, only the table need be changed to generate code for a new CPU; the table matcher and associated code stays the same. One of the critical resources in today's CPU's is the register set. Good code requires good use of the available registers. As many variables as possible should be in registers (unless, of course, doing so results in poorer code!). The reason for this is quite simple - registers can be addressed directly, without an explicit address, thus saving instruction bytes. They are also located in the CPU chip itself, so no memory cycles are needed to load or store them. Compilers need some scheme whereby registers are matched up with values and variables that are being used. One of the best schemes is "register colouring", which "paints" each value with the colour (number) of the register that it will be in. A rule here, of course, is that no two values can have the same colour at the same time. This type of scheme ignores any "register" specification that the programmer gave, since it can usually do a better job anyway. Another common scheme is to have a register allocator which allocates registers to values as needed, and then does "spilling" to the stack when the registers run out and more are needed. The latter is what is done in Draco, and partially in Toy (although without spilling). There are of course many important details that I am ignoring (such as register colouring taking a fairly long time as it traverses the graph structures built to represent the program). Closely related to register allocation is storage allocation. Where do all of the variables go? For a 68000 family CPU, the local variables and parameters are normally in registers or on the stack. Global variables are in locations fixed when the program is loaded, and are referenced using full 32 bit absolute addresses (the so-called "large" model for the 68000) or using 16 bit offsets from a base register (the "small" model). Both of the commercial C compilers allow you to choose which of these models you want to use. The first allows as much global variable space as you want, but is bulkier and slower. The second is quicker/faster but limits you to 64K worth of global variables, and ties up a critical address register. Both Toy and Draco allow only the large model. Modifying Toy to use the small model for global variables should not be difficult - the interested programmers are encouraged to try it for themselves. Code Generation in Toy Toy's code generator is shown in listing 1, file "codeGen.d". It is a fairly long file, but has no complex routines in it. In the description which follows, I will assume some familiarly with MC68000 assembler language, but not with the detailed format of machine code. When generating code, it is useful to classify the target instructions so that a common interface can be used. The 68000 doesn't fit perfectly into such a classification, but it is still useful. For the purposes of Toy, I have split the instructions up as follows: opSpecial - those that don't fit any other form opSingle - operations on a single operand. The operand can be any size (byte, word, long) and is addressed by any addressing mode which yields modifiable data opEA - these are very similar to 'opSingle', except that no explicit size is given, only an effective address opRegister - these contain an effective address and a register number - they have two operands opModed - these contain a register, an effective address, and an op-mode, which specifies size and direction of operation opMove - these are the MOVE instructions. They have two effective addresses opBranch - these are PC-relative branches, which may be conditional or unconditional In general (with many exceptions) an MC68000 instruction can be split up as follows: 4 bit primary op-code 3 bit register number 3 bit op-mode 6 bit effective address Many individual instructions treat some of the fields as additional op- code bits, and many use only the bottom two bits of the op-mode as a size indicator. The many constants defined at the beginning of "codeGen.d" are straight out of the MC68000 reference manuals. They include such things as the range of D-regs that Toy will use, names for the stack pointer and frame pointer (more on that later), names for the various addressing modes that the CPU has, names for the operand sizes that are available, and names for the opcodes that Toy will use. Procedure 'genWord' is used for all emitted code. It adds a 16 bit word to the generated instruction stream. If the fixed-size buffer (here 10000 bytes) overflows, it aborts the compilation. The various 'opXXX' routines use 'genWord' to emit specific classes of instructions. They build the instruction word by shifting the various fields into their proper places. This level of indirection may seem pointless, but this classification of instructions and the use of a common generation point is very useful when adding a peep-hole optimizer. Next come all of the routines we have seen called from the parser. I won't describe them in detail, since they are all straightforward once the overall context of code generation is known. Using registers is simple, but where EXACTLY are the various variables and parameters going to be? Global variables will be stored in a contiguous chunk of memory allocated by AmigaDOS when the Toy program is loaded. In AmigaDOS terms, they will be in a BSS hunk. For simplicity, each variable occupies 2 bytes, thus, we need only add 2 to a global counter for each global variable declared. This gives us the offset of the next global variable in the one chunk (the first being at offset 0). The offset is stored in the variable's 'se_value' field. When a function with parameters is called, we will set up the stack like this: +----------------+ | parameter #1 | +----------------+ | parameter #2 | +----------------+ | | | parameter #n | +----------------+ | return address | +----------------+ | previous FP | FP ----->+----------------+ | local | | variables | +----------------+ | saved | | registers | SP ----->+----------------+ What does all of this mean, and how does it happen? 'FP' stands for "frame pointer", and is a register that we will use to contain the address of the above "stack frame" for this procedure call. Note that it points directly at the local variables, so they can be accessed via an offset from the frame pointer. Also, there are a fixed number of bytes (8) between where the frame pointer points and the parameters, thus they too can be accessed via the frame pointer (A6). Recall that the procedure parser called 'allocateVariable' for each parameter as they were encountered. It would then be easy to assign offsets 0, 2, 4, 6, etc. to the parameters. From the above diagram, however, that would be incorrect. The last parameter is at offset 8, the next at offset 8 + 2, the next 8 + 2 * 2, etc. and the first is at offset 8 + N * 2, where N is the number of parameters, which is not known until all of the parameters are counted. This is the reason for the 'fixLocals' call that is made by 'procEndLocals' - it adds "ParameterSize + 8" to the 'se_value' field of all parameters and local variables, which gives them their correct frame pointer offset. Assuming an empty stack, the sequence of events done to create a stack frame is (recall that the MC68000 stack grows downwards - pushing something decreases the address in the stack pointer): push 1st parameter push 2nd parameter ... push nth parameter do a JSR to the procedure, which pushes the return address do a LINK #M,FP instruction, which pushes the current value of FP, copies SP into FP, then adds M (a negative number, to allocate the space for the local variables) to SP save registers by pushing them onto the stack The operation of 'allocateVariable', 'procStartParameters', 'procEndParameters', 'procEndLocals' and 'codeInit' should now be fairly clear. 'codeTerm' generates code to reverse the actions done by the entry code: restore registers by popping them from the stack do an UNLK FP to remove the space for the local variables and pop the previous value of FP pop the return address into a temporary (A0) increment SP to remove the parameters return, by jumping indirect through A0 Here we see our first "optimization". If there are no parameters to this procedure, the last three steps above can be shortened into just emitting a simple "RTS" instruction. The MC68010/20/30/40 all have an "RTD" (return and deallocate) instruction which is able to do the return and parameter popping in one instruction. This could be generated in response to an appropriate compiler flag. The calls to 'codeStartProc' and 'codeEndProc' are to the system interface. They ready it for a new function whose name is given, and then request that it save away (or write directly) the code that was generated for it. Variable 'NextRegister' is used to implement a small stack of registers for expression evaluation. It starts out at D_REG_FIRST (7), which is the highest numbered data register. It can go as low as D_REG_LAST (2) for usable registers. Allocating a register consists of decrementing 'NextRegister'. If it is below D_REG_LAST and a register is needed, our small stack has overflowed, and the compiler aborts. The standard fix for this situation is to "spill" registers onto the stack. E.g. if NextRegister = D_REG_LAST - 1 (1) when a register is needed, move the contents of D7 to the stack and reuse D7. It is then necessary to make sure that all needed register values are actually in the registers when an instruction using them is generated. Draco uses a routine called 'needRegs' for this purpose. Example of Code Generation As an example, we will return to the small program fragment given in the first article in this series: int N, Sum; proc main()void: Sum := 0; readln(N); while N ~= 0 do Sum := Sum + N; readln(N); od; write("The sum is: ", Sum, "\n"); corp; The calls to the code generator for this file go like this: Call effect/notes -------------------------------------------------------------- setGlobalSize(276) space used by run-time-system (ignored in the rest of this) allocateVariable() se_value := 0; globalSize := 2 allocateVariable() se_value := 2; globalSize := 4 procStartParameters() parSize := 0 procEndParameters() localSize := 8 procEndLocals() fixLocals(8); localSize := 0 codeInit(
) LINK #0,FP MOVEM.L ,-(SP) NextRegister := 7 pushConstant(0) CLR.W D7 NextRegister := 6 popVariable() NextRegister := 7 MOVE.W D7,global_at_0 callSpecial(rc_readLine) JSR _toy_readLine callSpecial(rc_readInt) JSR _toy_readInt MOVE.W D0,D7 NextRegister := 6 popVariable() NextRegister := 7 MOVE.W D7,_global_at_2 LOOP_POSITION := getPos() LOOP_POSITION := 0x24 pushVariable() MOVE.W _global_at_0,D7 NextRegister := 6 pushConstant(0) CLR.W D6 NextRegister := 5 compareValues(<~=>) NextRegister := 6 CMP.W D6,D7 SNE D7 ANDI.W #1,D7 BRANCH_POSITION := branchForward(true) NextRegister := 7 TST.W D7 BEQ.W ??? BRANCH_POSITION := 0x38 pushVariable() MOVE.W _global_at_2,D7 NextRegister := 6 pushVariable() MOVE.W _global_at_0,D6 NextRegister := 5 binaryOp(<+>) NextRegister := 6 ADD.W D6,D7 popVariable() NextRegister := 7 MOVE.W D7,_global_at_2 callSpecial(rc_readLine) JSR _toy_readLine callSpecial(rc_readInt) JSR _toy_readInt MOVE.W D0,D7 NextRegister := 6 popVariable() NextRegister := 7 MOVE.W D7,_global_at_0 branchTo(LOOP_POSITION) BRA.S offset 0x24 patchBranch(BRANCH_POSITION) branch at 0x36 to 0x64 emitString("The sum is: ") JSR offset 0x76 "The sum is :" callSpecial(rc_printString) JSR _toy_printString pushVariable() MOVE.W _global_at_2,D7 NextRegister := 6 callSpecial(rc_printInt) NextRegister := 7 MOVE.W D7,-(SP) JSR _toy_printInt emitString("\n") JSR offset 0x90 "\n" callSpecial(rc_printString) JSR _toy_printString codeTerm(
) MOVEM.L (SP)+, UNLK FP RTS Four very minor optimizations have occurred here. Two are using a CLR instruction to put a 0 into a register. The third is using the RTS for procedure exit instead of the longer sequence. The last is using a short backwards branch in the "while" loop. Note the use of a JSR around a string constant - it places the address of the string constant onto the stack, which is precisely where it is needed when calling _toy_printString. This trick saves having a separate buffer somewhere for string constants. Unfortunately, if confuses the heck out of disassemblers and debuggers! Optimizations We have seen examples of the "optimizations" in the Toy compiler. They are just special cases where it is obvious that a better code sequence exists. Others could be added (such as using a MOVEQ in 'pushConstant' for values from -128 to +127). Doing this sort of thing makes the code better, but it doesn't fix any of the big problems. The most glaring problem in Toy is the code for a comparison. This was the simplest way to allow both boolean variables and comparisons to work the same in conditional branches. Generating a value in a register is not necessary unless the value is being assigned to something. Where Toy currently generates (from the above sequence): CMP.W D6,D7 SNE D7 ANDI.W #1,D7 TST.W D7 BEQ.W label we would like to see something like: CMP.W D6,D7 BEQ label Notice that the condition (NE on the Scc, EQ on the Bcc) has been reversed. This is because, on a "while" statement (or on an "if" statement for that matter), we branch around the body or alternative if the condition is NOT met. Since there can be only one conditional branch being generated at a time, we could store the condition when 'compareValues' is called, and generate the proper conditional branch when 'branchForward' is called. We would also need special code in 'popVariable' to generate the 'bool' value before trying to store it to a variable. (If this isn't clear, consider what would happen if this optimization were done and we had a statement like "flag := a < b;".) The next most obvious problem is that all operations are done only in registers. The 68000 is quite capable of adding a value in a variable directly into a register, for instance. In Draco, this sort of thing is handled in two ways. The first has the compiler carrying around explicit descriptions of values, so it can generate a memory operation when appropriate. The second is fixes in the peepholer. MOVE instructions on the 68000 can have two effective addresses, so they can move a value from one variable (or a constant) directly to another, without using a register at all. Also available are immediate-mode instructions which can operate on a memory operand and a constant. Applying these to the code for the above sample program, we could get: link.w fp,#0 movem.l ,-(sp) clr.w Sum jsr _toy_readLine jsr _toy_readInt move.w d0,N LOOP: tst.w N beq LOOPEXIT move.w Sum,d7 add.w N,d7 move.w d7,Sum jsr _toy_readLine jsr _toy_readInt move.w d0,N bra LOOP LOOPEXIT: jsr 16(pc) "The sum is: " jsr _toy_printString move.w Sum,-(sp) jsr _toy_printInt jsr 4(pc) "\n" jsr _toy_printString movem.l (sp)+, unlk fp rts This, with minor differences (the LINK/UNLK are not there since there are no local variables, the TST of N is done with a MOVE to D7, the strings are at the end of the code and hence referenced with a PEA, and only registers D7 and D6 are saved/restored) is the code that Draco generates. So, we must consider all of these as "easy" optimizations. A "middle" optimization is a good peephole optimizer. This technique is so named because it views the generated code through a small "peephole", and simply replaces a short sequence of instructions with another that is equivalent, but shorter/faster. A good peepholer can make up for a poor code generator. I prefer not to overload the peepholer, however, since it is called upon for every instruction generated. For example, many people consider changing multiply/divide by a power of two into a shift (signed divide is a bit different) to be a peepholer's job. I consider it to be a special case which should be done earlier. The peepholer is typically the place where the CPU's "funny" instructions or modes can be used, or where unfortunate code sequences, which occur because of the design of the code generator, can be fixed up. An example of the latter for Draco are the "opModed" instructions, which take two operands. For complex expressions, the operations are always done from memory to register, but in some simple cases, the register to memory variants will work. E.g. in register int i; int j; j := j + i; the default code generated is something like move.w j,d7 add.w reg_of_i,d7 move.w d7,j but the peepholer can change it to add.w reg_of_i,j Other peepholes, involving the 68000's auto-increment addressing mode, can yield considerable improvements. A frustrating aspect of many optimizations is that they affect each other. E.g. in the above example, the difference is that in the optimized case, the value of 'j' is not in a register at the end. If the very next code used 'j', the first sequence could be better. Other "medium" optimizations are register- and condition-tracking. Here the compiler keeps track of what will be in the scratch registers and in the CPU's condition codes throughout the code generation. If a value is required that is already available, it is just used directly, rather than being re-fetched or re-computed. "Hard" optimizations include things like register colouring, automatically assigning variables to registers, code motion (e.g. moving a piece of code out of a loop since it always gets the same result), strength reduction (e.g. changing a bunch of array indexing operations into pointer increments and dereferences), dead code elimination (removing code that cannot be executed), etc. Applying these sorts of things, a good compiler could turn our example into: move.w d2,-(sp) ; only low 16 bits used clr.w d2 bra ENTRY LOOP: add.w d0,d2 ENTRY: jsr _toy_readLine jsr _toy_readInt tst.w d0 bne LOOP ; runs faster this way pea STRING1 jsr _toy_printString move.w d2,-(sp) jsr _toy_printInt pea STRING2 jsr _toy_printString move.w (sp)+,d2 rts STRING1: ; can pack them tighter "The sum is: " ; down here STRING2: "\n" Conclusion