Toy - A Small Compiler for the Amiga Part 3 - Parsing This third article in the series on the Toy compiler describes the parsing routines. These are divided up into three modules. The first deals with procedure headers, along with the entry and exit code used, and with variable declarations. The second deals with expressions, and the third deals with statements. As has been mentioned previously, the Toy compiler does not build an internal structure to repreresent a parsed program. Thus, the parsing routines call on the code generation routines directly. I have tried to keep the parser relatively indepen- dent of the target processor, but since I haven't tried to port Toy to a different CPU, I don't really know how successful I have been. Perhaps an ambitious MS-DOS programmer will see these articles and build an 8086 version! The parsing technique used is that of "recursive descent". This means, quite literally, that the parser is recursive, and it descends into the structure of the program. The file handling code calls the "top" levels of the parser, which handle things like global variables and procedures. They in turn decend into lower levels, like statements and expressions. Places in which the grammar loops (a right-hand-side refers back, either directly or indirectly, to its own left-hand-side) correspond to a loop or recursion in the parsing routines. The reader may wish to quickly glance back at the previous article, to see the interfaces defined for the symbol table and tokenizer code. Declarations and Procedures Listing 1 shows file "decl.d". It contains only two functions, 'declareVariables' and 'defineProc'. The first is used to define a group of variables, either global to the program, or local to a procedure. The second is used to define a procedure, and to call the code generation routines at the right places to cause the procedure's entry and exit code to be built. This second is one of the larger procedures in the compiler. For a more complex language, both of these would be split up into more pieces. In particular, there would be a function to parse an arbitrary type and return some kind of structure defining it. Also, most languages treat two descriptions of the same type as equivalent, so there would be facilities for either merging multiple descriptions into one, or of comparing two descriptions. To illustrate this, consider the following Draco style declarations: [10] * char fred; [10] * char joe; Here, both 'fred' and 'joe' are "the same type", which is an array of 10 pointers to character, thus we can assign one to the other, or compare the two. In order to allow this, the compiler has to either use the same structure (typically an array node, which has an element-type which is a pointer node, which in turn has a subtype which is a built- in type node) for both, or to provide a routine which traverses these structures and decides they are equivalent. Avoiding this whole area is the main reason I kept Toy down to just 'int' and 'bool' as types. 'declareVariables' is given one parameter, which is the kind of variables to be declared. This is either 'sk_localVariable' for variables local to a function, or 'sk_globalVariable' for variables global to the entire program. This kind is stored in the symbol table entries created for the variables, and is also used by the code generation routines to assign storage to the variables. The structure of 'declareVariables' is that of two nested 'while' loops. The inner one loops through a declaration of a number of variables of the type, and the outer one loops past several of the inner sets. E.g. in int a, b, c, d; bool fred, joe, sam, harry; bool alice, mary, sue, evangeline; there would be three traversals of the outer loop, each consisting of four traversals of the inner loop. When we enter the function, the current token will be the keyword for the type, which is either 'int' or 'bool'. We thus arrange for the outer loop to continue while that situation is true. This means that the body of that loop must consume all legal tokens up to and including the semicolon that separates the declaration sets. We remember the type in local variable 'typ' ('type' is a reserved word in Draco, so I can't use that as the variable name), then move past the token. At that point we expect to see a list of undeclared identifiers, separated by commas, followed by a semicolon. We arrange the inner loop so that it always starts looking at the new identifier. 'getToken' is called instead of 'getSimpleToken', since we need the symbol pointer in order to define the symbol. If the identifier is already defined, we complain about it and go on. Otherwise, we give it the appropriate kind and type, then call 'allocateVariable' in the code generation portion of the compiler to let it figure out where the variable will be. It will use the symbol's 'se_value' field to store any needed information. After skipping the newly defined identifier token, we run head on into one of the uglier parts of most compilers - that of error recovery. The Toy grammar expects a comma or a semicolon after the identifier. What happens if there isn't one, but there is, say, a '<' token instead? Well, we have to put out an error message, and then try to continue meaningfully. Good error recovery is a careful balance between skipping too much input (which could result here in some variables not getting declared, which would cause extra error messages when those variables were used) and skipping too little input (which could result in extra error messages here, and perhaps the compiler getting into an infinite loop if it ends up in the same place again). I won't try to justify the recovery that Toy does - it is a result of my experience with tinkering with compilers. For correct input, the inner loop would have skipped a comma, and would now be looking at the next identifier, or would have found a semicolon and done nothing about it, thus leaving it for the outer loop. The outer loop expects the semicolon between declaration sets, and thus has a very similar error-recovery situation. Here the recovery is dependent on the kind of the variable - if we are inside a procedure, we should stop at anything that looks like an executeable statement, but if we are not, we should stop at anything that looks like a new declaration or a procedure definition. Declaration processing is one area that is more complicated in a C- type language. In Draco, even the most complicated type is handled by the general type parser, and variable declaration is very much as is shown here. In C, however, the type is a combination of the basic type given at the beginning, and the modifiers given around the identifier ('*' before it for pointer, parentheses and bounds lists after it for function types and arrays). Because surrounding parentheses are allowed, the identifier is in fact not seen until the bottom of a recursive routine! Parts of the type structure are built bottom-up, and other parts are built top-down. In my ANSI-C compiler, I found that good error recovery required some finely tuned code with arbitrary token lookahead. 'defineProc' is responsible for all of the work of parsing a procedure definition. This includes handling its parameter list, its result type, any local declarations, its body, and any work needed to help the code generation routines produce correct entry and exit code. When it is called (by the system interface code), the current token will be the 'proc' token. The first step then, is to skip that token. After the 'proc' should be an undeclared identifier - the name of the new procedure. Here we see another aspect of error recovery - that of preventing the compiler from crashing when given invalid input. Later portions of 'defineProc' modify the procedure's symbol table entry. If no identifier was given, we don't have a symbol table entry. Thus, either those places must all check for this case, or we should build a fake one to use. The latter option is used here. After the procedure name should be a left parenthesis. Next are the procedure's parameters. These are handled very similarly to the declaration of variables. The error recovery is slightly different yet again, and the symbol kind used here is 'sk_parameter'. We also restrict the parameters to be of type 'int', so that the symbol table entry for the procedure need contain only the number of parameters. In general, it would contain something like a linked list of parameters and their associated types. After any parameter (there may be none) should be the closing right parenthesis. It is easy to miss the very important calls to the code generator here. 'procStartParameters' is called before declaring any parameters; 'procEndParameters' is called after the parameters but before any local variables; and 'procEndLocals' is called after the local variables. 'declareVariable' is called for each parameter and local variable. This interface should be adequate for any target CPU, and could even allow the code generator to decide to keep some variables in registers. In Toy, procedure 'main' is special - it is called by the startup code to start the user's program. Thus, it has no parameters, and also returns no result. C's 'main' is more complex - it is passed a description of the command-line arguments to the program, and returns a status result. Draco handles this with library calls, so presumeably an expanded Toy would do the same. After giving the procedure symbol its proper kind and parameter type, we look for a procedure result type. That is also put into its symbol table entry. We then expect to see a colon, marking the end of the procedure header. After getting the local variables, we can call upon the code generation routines to get ready to generate code for the procedure's body. It is on this call (to 'codeInit') that the procedure's entry code would be produced. This entry code must save any registers used by the procedure, allocate stack space for local variables, and make the local variables and parameters accessible to the body code. The part of defining a procedure that seems most important to a programmer, that of handling the body of the procedure, is surprisingly simple - it just calls on the statement parser to handle a list of semicolon separated statements. Error recovery here consists of cont- inually calling 'parseStatementList' until a 'corp' or the end of the input file is found. By convention, 'parseStatementList' returns the type of the last statement or expression in the list. Error checking here is making sure that the type returned matches what is expected based on the result type of the procedure. 'procResult' is yet another special entry in the code generator which does whatever is necessary to make the last used expression the result of a current procedure. 'codeTerm' then does the procedure exit code and any other cleanup needed. 'symbolPurge' is then called to remove all local definitions from the symbol table. This will include the local variables and parameters of the procedure, but not the procedure itself. The procedure ends with a 'corp' and an optional semicolon. Expressions Listing 2 shows "expression.d", the Toy expression parser. In this file we will start with the last function and work backwards to the front. 'parseExpression' is the external interface to the expression parser. It is used by the statement parser in several places. It does nothing other than call 'parseComparison', which is the top level of the recursive descent expression parser. This level of indirection allows for the addition of more levels of expressions, such as boolean operators, which would have lower precedence than the comparison operators. 'parseComparison' is typical of the expression handling routines. It first calls the next lower level of the expression parsing chain, in this case 'parseAddSub'. It then examines the next input token. While that token is a comparison operator, it loops, parsing and generating code for a comparison. When an appropriate operator is found, the previously obtained sub-expression, which is to be the left-hand operand for the operator, is checked for an appropriate type. The operator token is then skipped (skipping the operator after the type test makes an error message be associated with the left-hand operand, rather than with the operator). Next, the lower level routine, 'parseAddSub', is then called to get the right-hand operand. That operand is also checked for appropriateness. For a more complex language, the check would have to compare the types of the left and right-hand operands for comparison compatibility. When all is well, the code generator is called on to generate a comparison ('compareValues'). It is passed a saved copy of the operator token, so that it knows which type of comparison to use. Note that there are no other calls on the code generator - the operands are set up by the lower level parts of the expression parser. The result of the function, like all of these functions, is the type of the expression. If a comparison was done, the type is 'bool', otherwise the type is that of the subexpression that would have been the left-hand side if a comparison was present. The next routine (moving backwards), 'parseAddSub' is entirely analagous. Beyond is 'parseMulDiv', again quite similar. Note that these last two share the same code generation routine. Next comes 'parseUnary', which is a little different - it uses an 'if' instead of a 'while', and calls itself recursively if it finds a unary operator. At this point it is useful to go through an example parse, to see how all of the cases fit together. We will parse the expression 1 + 2 * -+3 - +++++4 < 5 / (6 + 7) When 'parseExpression' is called, the current token will be the '1'. The chain of calls will immediately go down through 'parseComparison', 'parseAddSub', 'parseMulDiv', and 'parseUnary', landing in 'parseUnit', which will handle the number and return. The current token will then be the '+', which will allow the return to go back up to 'parsePlusMinus', which will enter its 'while' loop. It will check the type of what is now its left-hand operand, which is 'int', therefore OK. Skipping over the '+', it will dive back down the chain again. 'parseUnit' will again handle the number (2), and return back up, this time to 'parseMulDiv'. 'parseMulDiv' will move past the '*' operator, and call 'parseUnary' for its right-hand operand. This time, 'parseUnary' sees one of the operators it is looking for (the unary '-'), so it goes into its larger alternative. The operator is saved and skipped, and 'parseUnary' is called recursively. This time it sees the unary '+ ', which is handled the same way. The next call to it sees no unary operator, so 'parseUnit' is called to handle the number 3. We now return to the lower of two recursive 'parseUnary' calls and check the operand type. All is well, but nothing is done, since the operator was a unary '+', which is a do-nothing in Toy. So, we return to the upper of the recursive 'parseUnary' calls. This time the operator was unary '-', and we call the code generation routine 'unaryOp', giving it the '-' token. This is the first call to the code generator, other than calls done by 'parseUnit', even though we have processed 7 expression tokens. Other operations are remembered in the local states of the parsing routines. The recursion in 'parseUnary' is done so that the code generation is done in the proper order. For unary operators, the execution is done right-to-left, whereas the operators are seen left-to-right. Using recursion allows us to hold an arbitrary number of them pending. The return from the uppermost 'parseUnary' call goes back to the second call of 'parseUnary' inside 'parseMulDiv'. After a type check, it will finally call 'binaryOp' to do the multiply of 2 and -3. It's 'while' loop will now exit, and we return to 'parseAddSub', which will be in the same situation - it will call 'binaryOp' to do the add. The 'while' loop is again satisfied by the '-', so we go down for another operand. Note what is happening here - the binary operators are being evaluated left-to-right, as they should. 'parseComparison' does the same thing, even though, e.g. "a < b ~= c", is not valid. This is done for error recovery purposes. Continuing with our parse, 'parseUnary' will recurse its way through the string of unary '+'s, doing nothing with them. After the usual handling of the '4', we return back to 'parseAddSub' which can complete the handling of the subtraction. We finally return to 'parse- Comparison', which sees a comparison operator and dives back down for its right-hand operand. The difference here is the parenthesized sub- expression. We get down to 'parseUnit' as usual, but here something special happens. It sees the '(', skips it and then calls 'parse- Expression', which is the top of the chain. This call is the reason for the "recursive" in the name "recursive descent". Parsing of the subexpression "6 + 7" is standard until we come to the ')'. It matches none of the operator tests, so the calls return all the way out of 'parseExpression', and we end up back in 'parseUnit'. Now we check for and skip the ')', and return, thus treating the parenthesized sub- expression just like any other bottom-level unit. Some observations about the "recursive descent" parsing method: - no temporary storage is needed - everything is kept in local variables in the parsing routines - there is no limit to the size or complexity of the expression parsed, other than the size of the execution stack for the compiler - operator precedence or priority is determined by the position of the handler for that operator in the chain of descending routines. Here the precedence goes unary >> '*'/'/' >> '+'/'-' >> comparison - new operators are added by augmenting an existing routine if they are at a previously used precedence, or by inserting a new routine in the call chain - although Toy doesn't have any, postfix operators are added much like prefix (unary) operators We have seen most of 'parseUnit' except for a couple of details. The default in its if-else chain is where the compiler ends up if it gets confused. This case should consume at least one token, unless we are sure that we won't get back here without somebody else consuming one. Handling of identifiers is a bit special. When we see one at the beginning of a statement or expression, we don't know if it is being used as a value, or we have an assignment statement or a procedure call. Thus, we must skip the identifier and make that decision based on the next token. Even then, a procedure call could be a void procedure, which is a statement, or a valued procedure, which is part of an expression. Thus, the compiler treats statements and expressions much the same, except that statements must be of type 'void'. 'parseUnit' uses a 'pushXXX' code generation call to tell the code generator about new bottom-level operands. The model is that of a stack machine - push an operand, push another operand, perform the operation. It is entirely up to the code generator as to just how it implements this, however. 'checkUndef' is a utility routine which complains if the identifier is not defined, and sets its type to 'rt_error'. 'parseProcCall' should by now be fairly straightforward. It does some checking to avoid excess error messages, and just calls 'pushParameter' for each parameter to the routine being called. Note that the parameters would already be "stacked", but this call is telling the code generator that they are to be used as parameters. Finally, 'callProc' is called to generate the actual call (which would be a 'jsr' or 'bsr' for a 68000 CPU). A useful feature of many compilers is that of constant expression evaluation, also called constant folding. Here, any operator for which both operands are constants is performed directly by the compiler, instead of generating code to do the operation. It is of course done iteratively/recursively so that e.g., the above expression would end up as just "-9 < 0" or "true". This evaluation could be added to Toy as part of the parser, or as part of the code generator. In Draco, it is done during parsing - each routine of the chain returns not only the type of the expression, but also a description of where it is, which can be "a constant whose value is XXX". The parsing routines thus check directly for two constant operands, and do the operation if needed. Doing it that way unfortunately blurs the separation between the parser and the code generator. A better way might be to insert a layer between the parser and the code generator which is just responsible for constant folding. Statements Listing 3 shows "statement.d", the statement parser for Toy. After going through the expression parsing, most of it should be relatively straightforward. Once again, it is better to start at the end of the file and work back to the front. Recall that a procedure body is done by calling 'parseStatementList', which returns the type of the last statement/expression in the semi-colon separated list. A semicolon after anything means that that thing must have been a statement. Toy, like Draco, will not accept an expression as a statement. 'parseStatementList' is a simple loop which calls 'parseStatement' repeatedly, until it encounters a token which indicates the end of a statement list. The condition here is a negative one - "loop until you find one of these", whereas the loops in the expression parser were positive ones - "loop while you see one of these". This choice is made for error recovery reasons. 'parseStatement' simply calls one of the handlers of specific statement types. Recall that assignment statements are handled by calling 'parseAssignmentStatement' from inside the bottom level of the expression parser. An assignment statement in Toy consists of a variable name, ':=', and an expression. 'parseAssignmentStatement' calls 'checkAssign' to verify that the given symbol is a variable that can be assigned to. 'popVariable' tells the code generator to store the latest expression value into the indicated variable. The error checking done here simply insures that the expression has the same type as the variable. No error message is given if either is 'rt_error', which indicates that there has been an error previously, and so we don't want an extra one now. In a more complex language, assignment compatibility would be considerably more complex. A 'write' statement in Toy consists of 'write', '(', a list of expressions or string constants, and ')'. This is in fact the only place in the compiler where the value of a string constant is actually used. 'isStatement' is a utility routine which returns 'true' if the current token can start a statement. 'parseWriteStatement' loops through strings or expressions, calling special code generation routines to arrange for them to be printed. Here we see the first use of the type 'RelocCode_t', to tell 'callSpecial' which of the special run-time routines is to be called. 'parseReadlnStatement' is very similar, except that reads can only be done into assignable variables. The 'checkUndef', needed here, is not needed in 'parseAssignmentStatement' since the check was already done in the expression parser. The semantics is that a new input line is read ("callSpecial(rc_readLine)"), and then the required number of integers are taken from that line ("callSpecial(rc_readInt)"). 'parseWhileStatement' is the first routine we have seen which needs to generate branches in the output code. The code generator interface to do this consists of four routines. 'getPos' returns a value which can be used as the target of a later backwards branch. 'branchTo' generates an unconditional backwards branch to a previously determined target. 'branchForward' generates a conditional or unconditional branch to some forward location. It returns an identifier that can be used to fix up the branch when the target is known. 'patchBranch' patches such a branch to branch to the current code position. This simple interface is adequate for Toy, but may not allow for the generation of good code. The positions and identifiers used in Toy are actually just offsets in the code for the current procedure. Given the above, 'parseWhileStatement' is quite simple - skip past the 'while', get a statement list for the condition, skip the 'do', get a statement list for the body, and skip the 'od'. An unconditional branch is generated to go from the end of the body to the beginning of the condition statement list, and a conditional branch, which branches on a 'false' condition, is made from the end of the condition to just past the looping branch. If one wanted to cripple the 'while' statement in Toy to requiring just a simple expression as the condition (thus disallowing center and end-exit loops), just replace the first call to 'parseStatementList' with a call to 'parseExpression'. 'parseIfStatement' is more complicated. It allows the use of the "elif" form in "if" statements. This means that it must keep track of a number of places where the ends of the alternatives have to all branch to after the entire "if statement". In Draco, these are all chained together in the code buffer (which also helps the forward-branch- shortening code to work), but I wanted a cleaner interface in Toy, so a simple fixed-size array is used. This limits the Toy compiler to handling 'ELIF_MAX' "elif"s in a single "if" statement. The generated code for an "if" statement with "elif"s looks something like this: +--------------+--------------+------+ ^ ^ ^ | | | | v -> C -> A ->+ +-> C -> A ->+ +-> C -> A ->+ A -> next | ^ | ^ | ^ v | v | v | +----------+ +----------+ +----------+ where the 'C's are conditions, the 'A's are alternatives, and 'next' is the statement that comes after the "if" statement. Each condition either drops into its alternative (the condition is 'true'), or branches to the next condition. The last condition branches to the "else" part if there is one. Each alternative branches to the next statement. Array 'offsets' contains the positions of each of the forward branches from the ends of alteratives to the next statement. The other branches are maintained by simple state variables, since there is only one of them active at a time. The loop in 'parseIfStatement' loops over all of the conditional tests and alternatives. The first is an "if", and the remainder (if any) are "elif"s. The loop is a bottom-exit loop since we want to stop looping when we have fully processed an alternative, and the next token is not "elif". The loop first skips the "if"/"elif". On all except the first time around, it then handles the termination of the previous alternative. This involves generating an unconditional forward branch (to the next statement), and patching the previous condition's branch to come "here". Next, it gets the expression to be used as the condition for this alternative. This call to 'parseExpression' could be replaced with a call to 'parseStatementList', but that could lead to some very ugly Toy code. Next comes a check for the "then" token, followed by a call to 'parseStatementList' for the body of the alternative. Finally we have the loop condition, along with some error checking. After the loop, we allow for a single "else" part, which is very similar to the previous alternatives, except that it doesn't have a condition. After any "else", all that is left is patching all of the end-of-alternative branches to come "here". The interested reader might want to trace through the parsing of the example given in the first article, and reproduced here: int N, Sum; proc main()void: Sum := 0; readln(N); while N ~= 0 do Sum := Sum + N; readln(N); od; write("The sum is: ", Sum, "\n"); corp; What sequence of calls to the code generator is produced? Miscellaneous Not listed here, but included on disk, is file "parseUtil.d", which includes some utilities used during parsing. 'checkUndef' complains about an undeclared identifier, and sets it up as 'sk_undefined', with type 'rt_error'. 'checkAssign' complains if an identifier isn't a local or global variable or a parameter. 'isStatement' returns 'true' if the current token can appear after a semicolon, i.e. is one that can start a new statement or is a separating token in a compound statement like "if" or "while". 'findStatement' skips tokens until it finds such a token. It is used in error recovery. 'findClose', another error recovery routine, skips tokens until it finds a right parenthesis, a comma, or something that can start a statement. Adding New Language Constructs Since it is the parser which defines the language compiled, it is in the parser that changes are needed to add new language constructs. Adding new types, such as arrays, structures, pointers, etc. would be done either in "decl.d" or in a new file which just parses types. As an example, parsing a Draco-style array type consists of something like: skip the '[' dimensions := 0; while parseExpression(); verify that expression is a compile-time integral constant dimensionTable[dimensions] := expression value; dimensions := dimensions + 1; token = ',' do skip the ',' od; verify and skip a ']' call 'parseType' to get the element type As was mentioned previously, adding compound types to Toy would require the addition of type structures, better type checking, etc. Also required for array types would be array indexing in the expression parser. This would also have to be available for the left-hand-side of assignment statements. Also needed are the two new tokens, '[' and ']'. A new construct, such as a 'case' statement, would be done by adding a new routine to 'statement.d', along with a test for "case" and a call to the routine in 'parseStatement'. A first decision here would be what the code for a case statement looks like. Most compilers use two (or more) variants. If the case indexes are closely spaced, a table of code offsets is generated, and the code produced uses the case indexing expression value to index into that table to yield an offset to branch to. This may require setting up a "default" case. If the case indexes are sparse, some sort of lookup is needed. Lattice C uses a linear search done directly inline. Draco calls a run-time-system routine to do a binary search of a sorted table. Other compilers might code a binary decision tree directly. Whatever methods are used, it is likely that additional entry points into the code generator would be needed. And of course, additional token types would be needed as well. Next Month Next month's article will discuss the code generator. We have seen most of its entry points in this article, but nothing of just what it does. We will go into the details of storage allocation and the construction of MC68000 instructions. Some trivial optimizations will be shown, and more details of the system interface will be discussed.