Toy - A Small Compiler for the Amiga
		      Part 4 - Code Generation


	In many ways, the most interesting part of a compiler is the code
    generator. The previous parts of the Toy compiler that we have seen
    have all been mostly independent of the nitty-gritty stuff of actually
    generating the desired machine code. If we are generous in our
    definition, code generation can also include all of the optimizations,
    which, to many, are the only really interesting parts of a compiler.
    Unfortunately, Toy is a very simple compiler, and doesn't really have
    any optimization. The couple of examples shown are just that: examples.

	I will start out first with a simple discussion of some of the
    alternatives in code generation. Next I will briefly discuss the code
    in Toy's code generator (which is very straightforward). This is
    followed by a run-through of generating code for a simple example, and
    by a short discussion of some optimizations.


    Alternatives

	As with most parts of the compiler, there are several ways to
    approach the task of code generation. In a desire to avoid a lot of
    hand-written code, and to keep the code generator as independent of the
    target CPU as possible, many modern compilers use a table driven
    approach. This requires that the parsed program be represented in some
    sort of tree form. The table entries contain a coded representation of
    a subtree, along with the assembler instructions to be generated, with
    appropriate substitutions from the subtree. As an example, suppose a
    compiler had an expression node that looked like this:

				add-32
				/    \
			       /      \
			   DReg(4)  const(1)

    This represents the desire to do a 32 bit add of a value in a data
    register and a constant. The value is in register 4, and the constant
    to be added is '1'. An MC68000 code generator might have table entries
    somewhat like this:

	    add-32 A, const(1..8) :: "addq.l #%1,A%2" 2
	    add-32 D, const(1..8) :: "addq.l #%1,D%2" 2
	    add-32 A, const() :: "adda.l #%1,A%2" 6
	    add-32 D, const() :: "add.l #%1,D%2" 6
	    add-32 A, const(-32768..32767) :: "lea %1(A%2),A%2" 4

    where the fields are:

	node type to match
	operand #1 (here a register of type A or D)
	operand #2 (here a numeric constant, perhaps restricted)
	assembler string to generate
	size in bytes

    The operand fields are numbered, which allows their values to be
    substituted into the assembler string. Our sample node above would thus
    match both the second and the fourth entries. The compiler could
    correctly generate either one, but since the second is four bytes
    shorter, it would prefer that one. In reality, a compiler targeting
    the 68000 family would likely encode the useability of many different
    addressing modes in the same table entry, and would also encode the
    operand sizes. Large tables are usually left, however, and the compiler
    spends a lot of time figuring out which entries to use.

	Note that the above table entries contained a string for the
    emitted assembler code. This implies that such a compiler would produce
    assembler source which would have to be assembled after the compilation
    phase. Aztec C for the Amiga operates in this fashion, as do many C
    compilers based on the original pcc (portable C compiler). Different
    table entries could be used to build machine code directly. A distinct
    advantage of table-driven code generation is that, given a sufficiently
    general representation, only the table need be changed to generate code
    for a new CPU; the table matcher and associated code stays the same.

	One of the critical resources in today's CPU's is the register set.
    Good code requires good use of the available registers. As many
    variables as possible should be in registers (unless, of course, doing
    so results in poorer code!). The reason for this is quite simple -
    registers can be addressed directly, without an explicit address, thus
    saving instruction bytes. They are also located in the CPU chip itself,
    so no memory cycles are needed to load or store them. Compilers need
    some scheme whereby registers are matched up with values and variables
    that are being used. One of the best schemes is "register colouring",
    which "paints" each value with the colour (number) of the register that
    it will be in. A rule here, of course, is that no two values can have
    the same colour at the same time. This type of scheme ignores any
    "register" specification that the programmer gave, since it can usually
    do a better job anyway. Another common scheme is to have a register
    allocator which allocates registers to values as needed, and then does
    "spilling" to the stack when the registers run out and more are needed.
    The latter is what is done in Draco, and partially in Toy (although
    without spilling). There are of course many important details that I am
    ignoring (such as register colouring taking a fairly long time as it
    traverses the graph structures built to represent the program).

	Closely related to register allocation is storage allocation. Where
    do all of the variables go? For a 68000 family CPU, the local variables
    and parameters are normally in registers or on the stack. Global
    variables are in locations fixed when the program is loaded, and are
    referenced using full 32 bit absolute addresses (the so-called "large"
    model for the 68000) or using 16 bit offsets from a base register (the
    "small" model). Both of the commercial C compilers allow you to choose
    which of these models you want to use. The first allows as much global
    variable space as you want, but is bulkier and slower. The second is
    quicker/faster but limits you to 64K worth of global variables, and
    ties up a critical address register. Both Toy and Draco allow only the
    large model. Modifying Toy to use the small model for global variables
    should not be difficult - the interested programmers are encouraged to
    try it for themselves.


    Code Generation in Toy

	Toy's code generator is shown in listing 1, file "codeGen.d". It is
    a fairly long file, but has no complex routines in it. In the
    description which follows, I will assume some familiarly with MC68000
    assembler language, but not with the detailed format of machine code.
    When generating code, it is useful to classify the target instructions
    so that a common interface can be used. The 68000 doesn't fit perfectly
    into such a classification, but it is still useful. For the purposes of
    Toy, I have split the instructions up as follows:

	opSpecial - those that don't fit any other form
	
	opSingle - operations on a single operand. The operand can be
	    any size (byte, word, long) and is addressed by any addressing
	    mode which yields modifiable data

	opEA - these are very similar to 'opSingle', except that no
	    explicit size is given, only an effective address

	opRegister - these contain an effective address and a register
	    number - they have two operands

	opModed - these contain a register, an effective address, and an
	    op-mode, which specifies size and direction of operation

	opMove - these are the MOVE instructions. They have two effective
	    addresses

	opBranch - these are PC-relative branches, which may be conditional
	    or unconditional

    In general (with many exceptions) an MC68000 instruction can be split
    up as follows:

	4 bit primary op-code
	3 bit register number
	3 bit op-mode
	6 bit effective address

    Many individual instructions treat some of the fields as additional op-
    code bits, and many use only the bottom two bits of the op-mode as a
    size indicator.

	The many constants defined at the beginning of "codeGen.d" are
    straight out of the MC68000 reference manuals. They include such things
    as the range of D-regs that Toy will use, names for the stack pointer
    and frame pointer (more on that later), names for the various
    addressing modes that the CPU has, names for the operand sizes that are
    available, and names for the opcodes that Toy will use.

	Procedure 'genWord' is used for all emitted code. It adds a 16 bit
    word to the generated instruction stream. If the fixed-size buffer
    (here 10000 bytes) overflows, it aborts the compilation. The various
    'opXXX' routines use 'genWord' to emit specific classes of
    instructions. They build the instruction word by shifting the various
    fields into their proper places. This level of indirection may seem
    pointless, but this classification of instructions and the use of a
    common generation point is very useful when adding a peep-hole
    optimizer.

	Next come all of the routines we have seen called from the parser.
    I won't describe them in detail, since they are all straightforward
    once the overall context of code generation is known. Using registers
    is simple, but where EXACTLY are the various variables and parameters
    going to be? Global variables will be stored in a contiguous chunk of
    memory allocated by AmigaDOS when the Toy program is loaded. In
    AmigaDOS terms, they will be in a BSS hunk. For simplicity, each
    variable occupies 2 bytes, thus, we need only add 2 to a global counter
    for each global variable declared. This gives us the offset of the next
    global variable in the one chunk (the first being at offset 0). The
    offset is stored in the variable's 'se_value' field. When a function
    with parameters is called, we will set up the stack like this:

		    +----------------+
		    |  parameter #1  |
		    +----------------+
		    |  parameter #2  |
		    +----------------+
		    |		     |
		
		    |  parameter #n  |
		    +----------------+
		    | return address |
		    +----------------+
		    | previous FP    |
	   FP ----->+----------------+
		    |	  local      |
		    |	variables    |
		    +----------------+
		    |	  saved      |
		    |	registers    |
	   SP ----->+----------------+

    What does all of this mean, and how does it happen? 'FP' stands for
    "frame pointer", and is a register that we will use to contain the
    address of the above "stack frame" for this procedure call. Note that
    it points directly at the local variables, so they can be accessed via
    an offset from the frame pointer. Also, there are a fixed number of
    bytes (8) between where the frame pointer points and the parameters,
    thus they too can be accessed via the frame pointer (A6).

	Recall that the procedure parser called 'allocateVariable' for each
    parameter as they were encountered. It would then be easy to assign
    offsets 0, 2, 4, 6, etc. to the parameters. From the above diagram,
    however, that would be incorrect. The last parameter is at offset 8,
    the next at offset 8 + 2, the next 8 + 2 * 2, etc. and the first is at
    offset 8 + N * 2, where N is the number of parameters, which is not
    known until all of the parameters are counted. This is the reason for
    the 'fixLocals' call that is made by 'procEndLocals' - it adds
    "ParameterSize + 8" to the 'se_value' field of all parameters and
    local variables, which gives them their correct frame pointer offset.

	Assuming an empty stack, the sequence of events done to create a
    stack frame is (recall that the MC68000 stack grows downwards - pushing
    something decreases the address in the stack pointer):

	push 1st parameter
	push 2nd parameter
	...
	push nth parameter
	do a JSR to the procedure, which pushes the return address
	do a LINK #M,FP instruction, which pushes the current value of FP,
	    copies SP into FP, then adds M (a negative number, to allocate
	    the space for the local variables) to SP
	save registers by pushing them onto the stack

    The operation of 'allocateVariable', 'procStartParameters',
    'procEndParameters', 'procEndLocals' and 'codeInit' should now be
    fairly clear. 'codeTerm' generates code to reverse the actions done by
    the entry code:

	restore registers by popping them from the stack
	do an UNLK FP to remove the space for the local variables and pop
	    the previous value of FP
	pop the return address into a temporary (A0)
	increment SP to remove the parameters
	return, by jumping indirect through A0

    Here we see our first "optimization". If there are no parameters to
    this procedure, the last three steps above can be shortened into just
    emitting a simple "RTS" instruction. The MC68010/20/30/40 all have
    an "RTD" (return and deallocate) instruction which is able to do the
    return and parameter popping in one instruction. This could be
    generated in response to an appropriate compiler flag. The calls to
    'codeStartProc' and 'codeEndProc' are to the system interface. They
    ready it for a new function whose name is given, and then request that
    it save away (or write directly) the code that was generated for it.

	Variable 'NextRegister' is used to implement a small stack of
    registers for expression evaluation. It starts out at D_REG_FIRST (7),
    which is the highest numbered data register. It can go as low as
    D_REG_LAST (2) for usable registers. Allocating a register consists of
    decrementing 'NextRegister'. If it is below D_REG_LAST and a register
    is needed, our small stack has overflowed, and the compiler aborts. The
    standard fix for this situation is to "spill" registers onto the stack.
    E.g. if NextRegister = D_REG_LAST - 1 (1) when a register is needed,
    move the contents of D7 to the stack and reuse D7. It is then necessary
    to make sure that all needed register values are actually in the
    registers when an instruction using them is generated. Draco uses a
    routine called 'needRegs' for this purpose.


    Example of Code Generation

	As an example, we will return to the small program fragment given
    in the first article in this series:

	int N, Sum;

	proc main()void:

	    Sum := 0;
	    readln(N);
	    while N ~= 0 do
		Sum := Sum + N;
		readln(N);
	    od;
	    write("The sum is: ", Sum, "\n");
	corp;

    The calls to the code generator for this file go like this:

	    Call			    effect/notes
	--------------------------------------------------------------

	setGlobalSize(276)		space used by run-time-system
					(ignored in the rest of this)

	allocateVariable(<N>)		se_value := 0; globalSize := 2

	allocateVariable(<Sum>) 	se_value := 2; globalSize := 4

	procStartParameters()		parSize := 0

	procEndParameters()		localSize := 8

	procEndLocals() 		fixLocals(8); localSize := 0

	codeInit(<main>)		LINK	#0,FP
					MOVEM.L <D2-D7>,-(SP)
					NextRegister := 7

	pushConstant(0) 		CLR.W	D7
					NextRegister := 6

	popVariable(<Sum>)		NextRegister := 7
					MOVE.W	D7,global_at_0

	callSpecial(rc_readLine)	JSR	_toy_readLine

	callSpecial(rc_readInt) 	JSR	_toy_readInt
					MOVE.W	D0,D7
					NextRegister := 6

	popVariable(<N>)		NextRegister := 7
					MOVE.W	D7,_global_at_2

	LOOP_POSITION := getPos()	LOOP_POSITION := 0x24

	pushVariable(<N>)		MOVE.W	_global_at_0,D7
					NextRegister := 6

	pushConstant(0) 		CLR.W	D6
					NextRegister := 5

	compareValues(<~=>)		NextRegister := 6
					CMP.W	D6,D7
					SNE	D7
					ANDI.W	#1,D7

	BRANCH_POSITION := branchForward(true)
					NextRegister := 7
					TST.W	D7
					BEQ.W	???
					BRANCH_POSITION := 0x38

	pushVariable(<Sum>)		MOVE.W	_global_at_2,D7
					NextRegister := 6

	pushVariable(<N>)		MOVE.W	_global_at_0,D6
					NextRegister := 5

	binaryOp(<+>)			NextRegister := 6
					ADD.W	D6,D7

	popVariable(<Sum>)		NextRegister := 7
					MOVE.W	D7,_global_at_2

	callSpecial(rc_readLine)	JSR	_toy_readLine

	callSpecial(rc_readInt) 	JSR	_toy_readInt
					MOVE.W	D0,D7
					NextRegister := 6

	popVariable(<N>)		NextRegister := 7
					MOVE.W	D7,_global_at_0

	branchTo(LOOP_POSITION) 	BRA.S	offset 0x24

	patchBranch(BRANCH_POSITION)	branch at 0x36 to 0x64

	emitString("The sum is: ")	JSR	offset 0x76
					"The sum is :"

	callSpecial(rc_printString)	JSR	_toy_printString

	pushVariable(<Sum>)		MOVE.W	_global_at_2,D7
					NextRegister := 6

	callSpecial(rc_printInt)	NextRegister := 7
					MOVE.W	D7,-(SP)
					JSR	_toy_printInt

	emitString("\n")		JSR	offset 0x90
					"\n"

	callSpecial(rc_printString)	JSR	_toy_printString

	codeTerm(<main>)		MOVEM.L (SP)+,<D2-D7>
					UNLK	FP
					RTS

    Four very minor optimizations have occurred here. Two are using a CLR
    instruction to put a 0 into a register. The third is using the RTS for
    procedure exit instead of the longer sequence. The last is using a
    short backwards branch in the "while" loop. Note the use of a JSR
    around a string constant - it places the address of the string constant
    onto the stack, which is precisely where it is needed when calling
    _toy_printString. This trick saves having a separate buffer somewhere
    for string constants. Unfortunately, if confuses the heck out of
    disassemblers and debuggers!


    Optimizations

	We have seen examples of the "optimizations" in the Toy compiler.
    They are just special cases where it is obvious that a better code
    sequence exists. Others could be added (such as using a MOVEQ in
    'pushConstant' for values from -128 to +127). Doing this sort of thing
    makes the code better, but it doesn't fix any of the big problems. The
    most glaring problem in Toy is the code for a comparison. This was the
    simplest way to allow both boolean variables and comparisons to work
    the same in conditional branches. Generating a value in a register is
    not necessary unless the value is being assigned to something. Where
    Toy currently generates (from the above sequence):

	    CMP.W   D6,D7
	    SNE     D7
	    ANDI.W  #1,D7
	    TST.W   D7
	    BEQ.W   label

    we would like to see something like:

	    CMP.W   D6,D7
	    BEQ     label

    Notice that the condition (NE on the Scc, EQ on the Bcc) has been
    reversed. This is because, on a "while" statement (or on an "if"
    statement for that matter), we branch around the body or alternative if
    the condition is NOT met. Since there can be only one conditional
    branch being generated at a time, we could store the condition when
    'compareValues' is called, and generate the proper conditional branch
    when 'branchForward' is called. We would also need special code in
    'popVariable' to generate the 'bool' value before trying to store it to
    a variable. (If this isn't clear, consider what would happen if this
    optimization were done and we had a statement like "flag := a < b;".)

	The next most obvious problem is that all operations are done only
    in registers. The 68000 is quite capable of adding a value in a
    variable directly into a register, for instance. In Draco, this sort of
    thing is handled in two ways. The first has the compiler carrying
    around explicit descriptions of values, so it can generate a memory
    operation when appropriate. The second is fixes in the peepholer. MOVE
    instructions on the 68000 can have two effective addresses, so they can
    move a value from one variable (or a constant) directly to another,
    without using a register at all. Also available are immediate-mode
    instructions which can operate on a memory operand and a constant.
    Applying these to the code for the above sample program, we could get:

	    link.w    fp,#0
	    movem.l   <d2,d3,d4,d5,d6,d7>,-(sp)
	    clr.w     Sum
	    jsr       _toy_readLine
	    jsr       _toy_readInt
	    move.w    d0,N
	LOOP:
	    tst.w     N
	    beq       LOOPEXIT
	    move.w    Sum,d7
	    add.w     N,d7
	    move.w    d7,Sum
	    jsr       _toy_readLine
	    jsr       _toy_readInt
	    move.w    d0,N
	    bra       LOOP
	LOOPEXIT:
	    jsr       16(pc)
	    "The sum is: "
	    jsr       _toy_printString
	    move.w    Sum,-(sp)
	    jsr       _toy_printInt
	    jsr       4(pc)
	    "\n"
	    jsr       _toy_printString
	    movem.l   (sp)+,<d7,d6,d5,d4,d3,d2>
	    unlk      fp
	    rts
	
    This, with minor differences (the LINK/UNLK are not there since there
    are no local variables, the TST of N is done with a MOVE to D7, the
    strings are at the end of the code and hence referenced with a PEA, and
    only registers D7 and D6 are saved/restored) is the code that Draco
    generates. So, we must consider all of these as "easy" optimizations.

	A "middle" optimization is a good peephole optimizer. This
    technique is so named because it views the generated code through a
    small "peephole", and simply replaces a short sequence of instructions
    with another that is equivalent, but shorter/faster. A good peepholer
    can make up for a poor code generator. I prefer not to overload the
    peepholer, however, since it is called upon for every instruction
    generated. For example, many people consider changing multiply/divide
    by a power of two into a shift (signed divide is a bit different) to be
    a peepholer's job. I consider it to be a special case which should be
    done earlier. The peepholer is typically the place where the CPU's
    "funny" instructions or modes can be used, or where unfortunate code
    sequences, which occur because of the design of the code generator, can
    be fixed up.

	An example of the latter for Draco are the "opModed" instructions,
    which take two operands. For complex expressions, the operations are
    always done from memory to register, but in some simple cases, the
    register to memory variants will work. E.g. in

	    register int i;
	    int j;

	    j := j + i;

    the default code generated is something like

	    move.w	j,d7
	    add.w	reg_of_i,d7
	    move.w	d7,j

    but the peepholer can change it to

	    add.w	reg_of_i,j

    Other peepholes, involving the 68000's auto-increment addressing mode,
    can yield considerable improvements. A frustrating aspect of many
    optimizations is that they affect each other. E.g. in the above
    example, the difference is that in the optimized case, the value of 'j'
    is not in a register at the end. If the very next code used 'j', the
    first sequence could be better.

	Other "medium" optimizations are register- and condition-tracking.
    Here the compiler keeps track of what will be in the scratch registers
    and in the CPU's condition codes throughout the code generation. If a
    value is required that is already available, it is just used directly,
    rather than being re-fetched or re-computed.

	"Hard" optimizations include things like register colouring,
    automatically assigning variables to registers, code motion (e.g.
    moving a piece of code out of a loop since it always gets the same
    result), strength reduction (e.g. changing a bunch of array indexing
    operations into pointer increments and dereferences), dead code
    elimination (removing code that cannot be executed), etc. Applying
    these sorts of things, a good compiler could turn our example into:

	    move.w    d2,-(sp)			; only low 16 bits used
	    clr.w     d2
	    bra       ENTRY
	LOOP:
	    add.w     d0,d2
	ENTRY:
	    jsr       _toy_readLine
	    jsr       _toy_readInt
	    tst.w     d0
	    bne       LOOP			; runs faster this way
	    pea       STRING1
	    jsr       _toy_printString
	    move.w    d2,-(sp)
	    jsr       _toy_printInt
	    pea       STRING2
	    jsr       _toy_printString
	    move.w    (sp)+,d2
	    rts
	STRING1:				; can pack them tighter
	    "The sum is: "			; down here
	STRING2:
	    "\n"


    Conclusion