Toy - A Small Compiler for the Amiga Part 2 - Data Structures, Symbol Table and Tokenizer In part 1 of this series, I described the Toy language and briefly outlined the Toy compiler. In this installment I begin the detailed description of the compiler and its operation. With this article, as well as with future ones in the series, are accompanying listings of the compiler. The files listed are shown in their entirety, but some files are only described, and not listed. These files are available on the disk version of the magazine. This article has three main portions. The first describes the data structures that the compiler uses internally. This is done using the compiler's main include file, 'toy.g', shown in Listing 1. Not all of the structures will be fully described, since they are not encountered until later in the compiler. The second portion of the article will describe the symbol table routines used by the compiler. These are used for keeping track of the identifiers (variable, procedures, etc.) declared in the program. The final portion describes the tokenizer, that portion of the compiler which breaks the input text stream up into a stream of larger tokens which the compiler then parses. Data Structures 'toy.g' is included with all source files of the compiler. This means that all of the types and values declared in it are available throughout the compiler. On a somewhat larger project, it would prob- ably be better to split the information up so that most types are known only to those portions of the project that require them ("information hiding"). The second include file used by the compiler is also included in all source files. It is 'externs.g', and it includes declarations of all functions which are used outside of the file which defines them. This file is not listed here since it is straightforward. Such usage ensures that all such functions are used properly, since all uses are made with the same declaration in effect, and the file which defines the function also has that declaration in effect, so the compiler can check the declaration against the actual definition. This same tech- nique can also be used with ANSI C. Again, on a larger project, it may be better to split up the file. The first type defined in 'toy.g' is one that we will see a lot of throughout this and the next article. It is an enumeration type which lists all of the legal token values that the tokenizer will generate. Some of the tokens are accompanied by additional information. E.g. a 'tk_number' needs the value of the number. Others, such as 'tk_proc' require no additional information. For some compilers, it is useful to also define a structure which can represent any token completely. That way, token values can be readily passed around. There was no need for this in Toy, however. The second type, 'ResultType_t', (the company I work for has a programming convention in which all type names end in '_t', so I use that same convention in my recent non-work programming) is a much smaller enumeration of the types that the compiler knows about. Type 'rt_error' is used internally to cut down on the error messages issued in response to a single error. Most compilers have more complex types and need a more complex type structure. Next comes type 'SymbolKind_t', yet another enumeration, this time of all of the kinds of symbols that we can store in our symbol table. Note that Toy (like Draco) stores the language's reserved words in the main symbol table. The closely related type 'SymbolEntry_t' describes the structure of each symbol table entry. It contains a pointer to the name of the symbol, the length of that string (including the termin- ating '\e'), a value of type 'SymbolKind_t' saying what kind of symbol this is, the type of the symbol, and a couple of fields used only for some kinds of symbols. The length of the name need not be stored, but doing so allows slightly quicker equality checks in some cases, and avoids recomputing the length when the symbol is freed. The final type, 'RelocCode_t' I will not describe here, except to say that it has to do with code generation and emitting the final object file. Anyone who has worked with a real compiler will immediately see that there are no data structures to contain any compiler state infor- mation (except the symbol table of course), or to hold any kind of rep- resentation of a tokenized or parsed program. As mentioned in the first article, the Toy compiler doesn't do any of that - the parsing routines call the tokenizer directly, and also call the code generation routines directly, so no intermediate representations are needed. This is true to a lesser extent of the Draco compiler also. It does maintain state information about register usage, and a small buffer of instructions for the peephole optimizer, however. The Symbol Table Most compilers will have more than one symbol table. For example, a Pascal compiler will often create a new symbol table for each BEGIN - END block to contain the symbols declared in that block. Thus, when it is trying to find the meaning of a symbol, it starts with the table for the most local block and works outward until it finds a definition. Another way of handling this is to have a single table, but to stack the definitions in that table, so that earlier ones are hidden from view by more local definitions. A rule with both Toy and Draco is that such hiding is not allowed - there can be only one definition of a given symbol at any given time. With this rule, a single symbol table is all that is needed. Also, since Toy's special tokens like 'if', etc. are "reserved words" as opposed to "keywords", the user cannot declare identifiers that look like them. Thus, they too can be in the single symbol table with no conflict. The only requirement on the single table is that it be possible to remove local variables from it at the end of each procedure. Listing 2, 'symbol.d' is the symbol table code for Toy. The table itself is a 1023 (1023 being a nice prime number) slot hash table. The first routine, 'symbolInit' simply initializes all slots to 'sk_empty', indicating that they are unused. It is called by the per-file startup code. Global variable 'EntryCount' is used to count the number of entries in the table. If that limit is reached, the Toy compiler will abort. A fairly easy alternative to that aborting is to use a dynam- ically allocated table, and, if it gets too full, allocate a new one twice as big, and re-enter all symbols into the new table. 'se' is a variable declared as pointer to 'SymbolEntry_t'. It scans along the elements of the array, which are "sizeof(SymbolEntry_t)" bytes apart. 'memFree' is a routine which is defined in the system interface portion of the compiler. That portion is not shown here, but the part needed to test the tokenizer is available on disk. 'memFree' simply calls the Amiga Exec routine 'FreeMem' to free the memory. Other rou- tines from that system interface are used later in the symbol table code, and also in the tokenizer. 'symbolTerm' is the corresponding routine to clean up the hash table. Specifically, it frees the strings associated with all used entries. Note that it uses the stored 'se_length' field as the length in characters of the string. The multiplication by "sizeof(char)" is not really necessary since characters are one byte long. I am, how- ever, trying to prepare myself for the eventuality of multi-byte char- acters, such as the ANSI C people are discussing, so I try to include the required scaling where appropriate. 'symbolPurge' performs the function mentioned earlier - it removes all local (to a procedure) symbols from the table, ready for the next procedure. 'enter' is the central routine in the symbol table handler. It is used both to enter new symbols into the table, and to look up existing routines. This is in fact done in the tokenizer. This means that the remainder of the compiler itself does not have to worry about identi- fier tokens as special - they simply contain a pointer to the symbol table slot which they occupy. An implication of this is that undefined symbols are in fact entered into the table when they are first used. The first step in entry/lookup is to calculate a hash value for the string. This is simply the first character, shifted left 6 bits, along with the last character. For one-character names, these two are the same character. The resulting value is taken modulo the symbol table size to yield an index into the table at which to start the search for the symbol. Next comes a loop which starts at the given table position, and searches forward, either for an unused slot, or for one with the desired name. The forward search wraps around to the beginning of the table if it passes the end. If what is found is an empty slot, then the new symbol is added. Note that the name is copied into a dynamically allocated region of memory. This is consistent with freeing the string when the symbol is removed. In either case, a pointer to the found table slot is returned. A slot kind of 'sk_empty' tells the caller that the symbol has not been seen before. The final routine in the file is 'fixLocals'. This is used during code generation to fix up the offset from the frame pointer (stored in the 'se_value' field) at which this local variable or parameter can be found. The fix is required since the offset amount is not known until all local variables have been declared and their required space added up. The extra offset could be remembered elsewhere and added in as required, but this method makes other code simpler. The Tokenizer The tokenizer for the Toy compiler is in file 'lex.d', shown in Listing 3. This portion of the compiler is more complex than the symbol table routines and maintains more internal state. In other compilers I have written, the current input token is always available for examin- ation in a global variable. For the Toy compiler, however, I decided to work hard to keep things modular, thus there are no global variables. The symbol table handler communicate solely through routines, for example. The interface to the tokenizer that I chose works similarly. Routine 'nextToken' processes input characters to build up the next input token. The token is kept entirely within the tokenizer module (lex.d), however. The rest of the compiler calls 'getToken' to return the current token. 'getToken' can be called several times for a given token, and replaces the global variables I have used in the past. The tokenizer maintains several distinct state systems. The simp- lest is simply a buffer to contain input text. This is composed of 'SourceBuffer', 'SourcePos', and 'SourceMax', which are, respectively, the buffer for input characters, the position in the buffer of the next character to be processed, and the number of characters currently in the buffer. When the last character in the buffer has been processed, and more are needed, a call is made to the system interface portion of the compiler to read more characters. A second state set is that which is used to build up identifiers and string constants. I prefer not to place limitations on the length of string contants, so a dynamically sized buffer is used. Given the code to do that, there is no reason not to also use it to allow iden- tifiers of arbitrary length. The variables involved here are 'String- Buffer', 'StringBufferSize' and 'StringBufferPos', which are, respec- tively, a pointer to the dynamically allocated buffer, the size of the current buffer, and the position of the next character to be added to the buffer. Also maintained is information which can be used in error messages to identify the location of the error. These are the number of the current input line, and the column of the current input character. In the Toy language, there are tokens which cannot be unambiguously iden- tified without looking at the next input character. For example, if a '<' is seen, it is not known whether the token '<' is present, or if it is a '<='. This looking ahead at a single next character is called one-character-lookahead. To handle this, the compiler has pairs of variables, containing the characters themselves, the line they came from, and the column they were in. This lookahead also adds some comp- lexity to handling the end-of-file situation. The compiler should not get upset just because the lookahead is trying to read past the end of the input file. This is handled by variable 'EofState', which has the following three values: Eof_none - end of input has not been seen Eof_one - end of input seen for lookahead character Eof_two - end of input seen for current character Attempting to read another character when in the third state is an error. This typically happens with incomplete files or with comments that have not been closed. The final set of state variables are used to describe the current input token. These are 'CurrentToken', 'CurrentNumber', 'Current- String' and 'CurrentSymbol'. The first is always used and is a member of the 'TokenKind_t' enumeration from 'toy.g'. The second is used only for numeric constants. The third is used for string constants, and the fourth for identifiers. Recall from the discussion of the symbol table that all identifiers are inserted into the symbol table as soon as they are seen, so a pointer to their symbol table slot is sufficient to completely identify them. The first procedures in the tokenizer that we should examine are the last two in the file. They are placed there because they call on the earlier routines, and I wanted to avoid unneccessary declarations. (Many people prefer a strict top-down ordering within a file - they will have to bear with my mostly-bottom-up arrangement.) 'lexInit' is called on a per-file basis by the system interface routine. It is called AFTER 'symbolInit', since it enters reserved words into the symbol table. The big declaration at the start of the function declares 'RESERVED_WORDS' to be an array of RESERVED_WORD_COUNT structures of type 'reservedWord_t', the value of which is given directly at compile time. Note that this is a constant, not an initialized variable, as it would be in C. The difference usually doesn't matter, but is worth noting. This array contains the text of the various reserved words, the length of that text, and the token value that goes with it. 'lexInit' goes through the array and calls 'enter' to enter each one into the symbol table. They are given a kind of 'sk_reservedWord', and their token value as their 'se_value' (subtracting 'tk_eof' from their token value yields an integral value, which is compatible with the 'se_value' field). Note that the reserved words are likely to hash to different places in the table, hence they are likely to be found without a search when they are looked up. This is desireable, since most reserved words occur more often than user identifiers. Following the reserved word initialization, the string buffer is set up with a default sized buffer (here 256 bytes). This will be suf- ficient for nearly all programs. The input source buffer is started off as being empty (SourcePos = SourceMax), so the first attempt to get a character will have to read some from the input file. The first call to 'nextChar' will do this. At that point, due to the one character look- ahead, we will not yet have a value for the current character. The line and column counters are started to represent the position of the cur- rent character and it is obtained by the second call to 'nextChar'. The final step of initialization is to construct the representation for the first token by calling 'nextToken'. 'lexTerm', the tokenizer's termin- nation routine, simply frees the current string buffer. Now we go back to the beginning of the file. Routines 'errorHere' and 'errorHereThree' are used to generate error messages. They are called from the tokenizer itself, but are most used by the parsing routines to report syntactic and semantic errors. They are located in 'lex.d' so they can utilize the current position indicators in the messages they produce. They call on 'errorPosition' and 'printString', which are in the system interface. The former prints the name of the input file and the passed line and column numbers in some reasonable format. The latter simply prints the passed string. 'errorHereThree' is used for multipart messages, typically those containing a user program identifier. Note that 'PreviousLine' and 'PreviousColumn' are used as the position indicators. This is because the one character lookahead has read the next character, and ITS position is in 'CurrentLine' and 'CurrentColumn'. Next is routine 'nextChar' which reads the next input character from the input stream. Recall that we have a buffer of characters from the input stream, and it is from there that 'nextChar' should get the characters it needs. When that bufferful is exhausted, the system interface should be called on to get another bufferfull. The first action that 'nextChar' does is to copy the lookahead character and its position into the current character and position. The position indica- tors for the next character are updated. If the character was a new- line, then the column goes to 0, and the line is incremented. If the character was a tab, then the column is moved up to the next tab stop. (Toy assumes the standard 8 character tab stops.) If 'EofState' is 'Eof_one', then we have already encountered the end of the input file for the lookahead character. We move up to 'Eof_two'. Otherwise, we get the next character from the input buffer. If the input buffer is used up (SourcePos = SourceMax), then we call 'readSource' to read another bufferfull of characters. If none are available, then 'EofState' is set to 'Eof_one'. This type of activity could have been moved to the system interface file, but that would have required an additional subroutine call for each input character. Since the number of input characters read is the largest count of anything that a compiler has to deal with, it is best to keep tokenizing as efficient as possible. For simple compilers, the tokenizer can occupy up to 50% of the total time. Routine 'addChar' implements the arbitrary sized buffer used for string constants and identifiers. It is called to add a character to the buffer. If the buffer is full, it will allocate another buffer twice as big and copy over the current contents. "whitespace" is a term describing the spaces and empty lines in a file or document. Routine 'whiteSpace' skips over this stuff in the input stream. It also skips over comments. Any amount of such white- space can occur between any two input tokens. This routine implements nested comments, i.e. comments that can be totally inside another comment. This is handy when experimenting with a program, since an entire section can be "commented out" without worrying about whether it has any internal comments or not. This doesn't work with C programs (although #if can effectively do the same thing). Variable 'level' is a count of how many comments we are currently nested in. It starts at zero for each whitespace and is incremented when '/*' is found and decremented when '*/' is found. The controlling expression of the 'while' loop can be broken down as: - stop looping if we hit the end of the file - loop past spaces, tabs and newlines - we are inside a comment if the current character is '/' and the next character is '*' - loop while we are inside a comment Routine 'nextToken' is the heart of the tokenizer. It uses the auxilliary routines we have seen and breaks the input stream up into tokens for the rest of the compiler. The nature of the token found is stored in the previously mentioned token state variables. The first step in this process is to skip any white-space that separates the next token from the current one. Note that the convention is to skip the white-space at the beginning of getting a token, rather than at the end. This allows the current-position variables to describe the last character of the current token. The tokenizer uses single character lookahead, but does not provide any token-level lookahead to the rest of the compiler. The language has been designed to not require any. Again, since the tokenizer is a major time-consumer in a simple compiler, it should not spend too much time deciding what to do. The simplest way to handle this is used here - just do a big 'case' on the current input character to decide what kind of token we have. This builds a large case table, but it is worth the space in this instance. The local variable 'tk' is used throughout this routine instead of the file level 'CurrentToken'. This allows the use of a register variable for it. It is initialized to 'tk_illegal' to allow the default case to be that of an unrecognized character. The first case recognized is that of an identifier or a reserved word. They can begin with a capital or lower case letter, or with an underscore. The identifier is collected in the string buffer using 'addChar'. The 'while' loop could be faster if table lookup were used to identify valid identifier characters. Note that digits have been added to the set of valid characters. A final '\e' (end-of-string) character is added to the buffer to mark the end of the identifier. Next, 'enter' is called to look the name up in the symbol table. Recall that the routine will enter the symbol with kind 'sk_empty' if it is not found. This is not handled here, but is passed along to the parser. The two cases here are that of the name being a reserved word, in which case the token kind is that of that reserved word, and that of the name being a regular (or unknown) identifier. The next major case is that of a numeric constant. This is pretty straightforward - the value is built up and left in 'CurrentNumber'. Note that no check for overflow is made, and that only decimal integers are supported. The corresponding code for Draco is much more complex, involving possible calls to the IEEE floating point library to convert a string to an IEEE floating point value. The next case is the last major one. It is for string constants, which start with a '"'. The code here is a bit more complex than might appear necessary. It does a number of things. First, it handles string breaks, that is, it will join together multiple string constants, separated by whitespace, into a single, large string constant. This allows long string constants on short input lines. Second, escaped characters within the string are supported. These are things like '\n', representing a newline. The more complex cases are those like '\(23)', representing a character with decimal value 23. These are used to produce things like the escape sequences used in "shoot", described in the previous article. The outer 'while' loops through all the pieces of a string constant that are separated by string breaks. The next level loops through all of the characters in a given piece. It loops until it sees a closing '"' (or the end of the input file). Within the inner loop, '\' is handled specially - when one is seen, the next character is read, and it is handled as an escaped character. Only newlines are tabs are handled directly here - other possiblities are 'r' for a carriage return, 'b' for a backspace, etc. These can be obtained in Toy using the third escape alternative - a decimal number in parentheses. This form is used so as to maintain compatibility with Draco, which allows any constant expression to be within the parentheses. Again, no check for overflow is done on the decimal value. Here we see our first two error messages in the compiler. They are complaints about not finding a ')' after the digits, or not finding any digits after the '('. The "error recovery" here is that of doing absolutely nothing. When a char- acter has been read, whether escaped or not, is is added to the string buffer with 'addChar', and we move on to the next character. If it isn't a '"', collecting of the string piece continues. When a '"' is found, it is skipped via 'nextChar', and then whitespace is skipped as part of looking for another string piece. This violates the previously mentioned rule of skipping whitespace only at the start of a new token. As a consequence, an error message associated with a string constant will actually point just before the beginning of the NEXT token. When the string is fully collected, a new buffer for it is allocated. It is pointed to by 'CurrentString' and its length is in 'CurrentNumber'. The remainder of the cases are simple ones which handle one or more operator or punctuation tokens. Note that '~' is illegal by itself, but can start the valid token '~='. The remaining three routines represent the interace to the token- izer that is available to the rest of the compiler. 'getToken' returns any of the information pieces (CurrentSymbol, CurrentString, Current- Number) requested, along with the kind of the current token. In many situations, the compiler needs no more than the token kind, so 'get- SimpleToken' is an interface returning just that. 'skipToken' is similar to 'nextToken' in that it moves on to the next token. It, however, frees any storage (specifically, a string constant) associated with the token. It is used during error recovery, when a particular token is being looked for. Improvements The symbol table routines are adequate for this compiler, but likely not for some other purposes. An interpreter would probably need to be able to delete arbitrary definitions from the table. This would mess up the searching that 'enter' does. Other compiler designs might benefit from multiple symbol tables. This would require dynamic alloc- ation of the tables. Removing the limit on the number of symbols handled would also require that. The current hash function is not very good and would produce poor behavior (lots of collisions) if the table gets too full. A better one could combine bits from the first two and last two characters of the symbol. This would add special cases for one-character symbols. There isn't very much that can be done to a tokenizer to make it "better". A few changes could be done to speed it up, such as a table of characters types. Improvements would go along with changes to the language. For example, hexadecimal, octal and binary numeric constants are useful for some purposes. If floating point were added to Toy, the tokenizer would need to be able to convert strings of characters into floating point constants. Similarly, a character type would require character constants. Since they are a single occurrence of what is allowed inside string constants, it may then be desireable to extract that handling from the string processing and turn it into a separate routine. (Draco has a 'nextStringChar' routine, for example.) More keywords could be added by extending the 'TokenKind_t' enumeration and adding them to the table in 'lexInit'. Similarly, additional operators and punctuation tokens would require additions to 'TokenKind_t' and to the case statement in 'nextToken'. More ambitious changes would require major work elsewhere. Support- ing full escaped characters as in Draco would require full constant expression evaluation elsewhere in the compiler. Adding a preprocessor as in C would insert another layer between these tokenizing routines and the system interface routines (or a separate program, but that would result in triple the amount of text file I/O). Usage How are these routines used? A full example can be found in the 'tokenTest.d' file used to test the tokenizer. An outline of it goes something like this: proc main()void: *SymbolEntry_t se; *char buffer; ulong number; TokenKind_t tk; symbolInit(); lexInit(); while tk := getToken(&se, &buffer, &number); tk ~= tk_eof do case tk incase tk_number: write('<', number, '>'); incase tk_string: write('\"', buffer, '\"'); memFree(buffer, number); incase tk_id: se*.se_kind := sk_undefined; write('<', se*.se_name, '>'); ... esac; write(' '); nextToken(); od; writeln(); lexTerm(); symbolTerm(); corp; This program simply turns an input source file into a stream of tokens which it prints. 'tokenTest.d' is an incomplete version of 'system.d', which provides only those services needed for this test. Conclusion This article has presented some information on the data structures used by the Toy compiler, and has described the symbol table and tokenizer modules. The symbol table routines serve simply as a way to associate some information with a name. The tokenizer breaks a stream of input text into a stream of higher-level tokens, as requested by other parts of the compiler, and in accordance with the rules for tokens in the language. Neither job is overly complex, but both must be done right in order to make an acceptable compiler. Shown like this, they may seem obvious, but let me assure you that I didn't get them right the first time, even though I've done this a few times before! Next Month The next article in this series will present the parser for the Toy compiler. There are three main portions to it - that which parses decl- arations, including procedure headers, that which parses expressions, and that which parses statements. It is in the parsing routines that the input program is "understood" in some sense. It is also here that new language features can be added.