Toy - A Small Compiler for the Amiga
	 Part 2 - Data Structures, Symbol Table and Tokenizer


	In part 1 of this series, I described the Toy language and briefly
    outlined the Toy compiler. In this installment I begin the detailed
    description of the compiler and its operation. With this article, as
    well as with future ones in the series, are accompanying listings of
    the compiler. The files listed are shown in their entirety, but some
    files are only described, and not listed. These files are available on
    the disk version of the magazine.

	This article has three main portions. The first describes the data
    structures that the compiler uses internally. This is done using the
    compiler's main include file, 'toy.g', shown in Listing 1. Not all of
    the structures will be fully described, since they are not encountered
    until later in the compiler. The second portion of the article will
    describe the symbol table routines used by the compiler. These are used
    for keeping track of the identifiers (variables, procedures, etc.)
    declared in the program. The final portion describes the tokenizer,
    that portion of the compiler which breaks the input text stream up into
    a stream of larger tokens which the compiler then parses.


    Data Structures

	'toy.g' is included with all source files of the compiler. This
    means that all of the types and values declared in it are available
    throughout the compiler. On a somewhat larger project, it would prob-
    ably be better to split the information up so that most types are known
    only to those portions of the project that require them ("information
    hiding"). The second include file used by the compiler is also included
    in all source files. It is 'externs.g', and it includes declarations of
    all functions which are used outside of the file which defines them.
    This file is not listed here since it is straightforward. Such usage
    ensures that all such functions are used properly, since all uses are
    made with the same declaration in effect, and the file which defines
    the function also has that declaration in effect, so the compiler can
    check the declaration against the actual definition. This same tech-
    nique can also be used with ANSI C. Again, on a larger project, it may
    be better to split up the file.

	The first type defined in 'toy.g' is one that we will see a lot of
    throughout this and the next article. It is an enumeration type which
    lists all of the legal token values that the tokenizer will generate.
    Some of the tokens are accompanied by additional information. E.g. a
    'tk_number' needs the value of the number. Others, such as 'tk_proc'
    require no additional information. For some compilers, it is useful to
    also define a structure which can represent any token completely. That
    way, token values can be readily passed around. There was no need for
    this in Toy, however.

	The second type, 'ResultType_t', (the company I work for has a
    programming convention in which all type names end in '_t', so I use
    that same convention in my recent non-work programming) is a much
    smaller enumeration of the types that the compiler knows about. Type
    'rt_error' is used internally to cut down on the error messages issued
    in response to a single error. Most compilers have more complex types
    and need a more complex type structure.

	Next comes type 'SymbolKind_t', yet another enumeration, this time
    of all of the kinds of symbols that we can store in our symbol table.
    Note that Toy (like Draco) stores the language's reserved words in the
    main symbol table. The closely related type 'SymbolEntry_t' describes
    the structure of each symbol table entry. It contains a pointer to the
    name of the symbol, the length of that string (including the termin-
    ating '\e'), a value of type 'SymbolKind_t' saying what kind of symbol
    this is, the type of the symbol, and a couple of fields used only for
    some kinds of symbols. The length of the name need not be stored, but
    doing so allows slightly quicker equality checks in some cases, and
    avoids recomputing the length when the symbol is freed.

	The final type, 'RelocCode_t' I will not describe here, except to
    say that it has to do with code generation and emitting the final
    object file.

	Anyone who has worked with a real compiler will immediately see
    that there are no data structures to contain any compiler state infor-
    mation (except the symbol table of course), or to hold any kind of rep-
    resentation of a tokenized or parsed program. As mentioned in the first
    article, the Toy compiler doesn't do any of that - the parsing routines
    call the tokenizer directly, and also call the code generation routines
    directly, so no intermediate representations are needed. This is true
    to a lesser extent of the Draco compiler also. It does maintain state
    information about register usage, and a small buffer of instructions
    for the peephole optimizer, however.


    The Symbol Table

	Most compilers will have more than one symbol table. For example, a
    Pascal compiler will often create a new symbol table for each BEGIN -
    END block to contain the symbols declared in that block. Thus, when it
    is trying to find the meaning of a symbol, it starts with the table for
    the most local block and works outward until it finds a definition.
    Another way of handling this is to have a single table, but to stack
    the definitions in that table, so that earlier ones are hidden from
    view by more local definitions. A rule with both Toy and Draco is that
    such hiding is not allowed - there can be only one definition of a
    given symbol at any given time. With this rule, a single symbol table
    is all that is needed. Also, since Toy's special tokens like 'if', etc.
    are "reserved words" as opposed to "keywords", the user cannot declare
    identifiers that look like them. Thus, they too can be in the single
    symbol table with no conflict. The only requirement on the single table
    is that it be possible to remove local variables from it at the end of
    each procedure.

	Listing 2, 'symbol.d' is the symbol table code for Toy. The table
    itself is a 1023 (1023 being a nice prime number) slot hash table. The
    first routine, 'symbolInit' simply initializes all slots to 'sk_empty',
    indicating that they are unused. It is called by the per-file startup
    code. Global variable 'EntryCount' is used to count the number of
    entries in the table. If that limit is reached, the Toy compiler will
    abort. A fairly easy alternative to that aborting is to use a dynam-
    ically allocated table, and, if it gets too full, allocate a new one
    twice as big, and re-enter all symbols into the new table. 'se' is a
    variable declared as pointer to 'SymbolEntry_t'. It scans along the
    elements of the array, which are "sizeof(SymbolEntry_t)" bytes apart.

	'memFree' is a routine which is defined in the system interface
    portion of the compiler. That portion is not shown here, but the part
    needed to test the tokenizer is available on disk. 'memFree' simply
    calls the Amiga Exec routine 'FreeMem' to free the memory. Other rou-
    tines from that system interface are used later in the symbol table
    code, and also in the tokenizer.

	'symbolTerm' is the corresponding routine to clean up the hash
    table. Specifically, it frees the strings associated with all used
    entries. Note that it uses the stored 'se_length' field as the length
    in characters of the string. The multiplication by "sizeof(char)" is
    not really necessary since characters are one byte long. I am, how-
    ever, trying to prepare myself for the eventuality of multi-byte char-
    acters, such as the ANSI C people are discussing, so I try to include
    the required scaling where appropriate.

	'symbolPurge' performs the function mentioned earlier - it removes
    all local (to a procedure) symbols from the table, ready for the next
    procedure.

	'enter' is the central routine in the symbol table handler. It is
    used both to enter new symbols into the table, and to look up existing
    routines. This is in fact done in the tokenizer. This means that the
    remainder of the compiler itself does not have to worry about identi-
    fier tokens as special - they simply contain a pointer to the symbol
    table slot which they occupy. An implication of this is that undefined
    symbols are in fact entered into the table when they are first used.
    The first step in entry/lookup is to calculate a hash value for the
    string. This is simply the first character, shifted left 6 bits, along
    with the last character. For one-character names, these two are the
    same character. The resulting value is taken modulo the symbol table
    size to yield an index into the table at which to start the search for
    the symbol. Next comes a loop which starts at the given table position,
    and searches forward, either for an unused slot, or for one with the
    desired name. The forward search wraps around to the beginning of the
    table if it passes the end. If what is found is an empty slot, then the
    new symbol is added. Note that the name is copied into a dynamically
    allocated region of memory. This is consistent with freeing the string
    when the symbol is removed. In either case, a pointer to the found
    table slot is returned. A slot kind of 'sk_empty' tells the caller that
    the symbol has not been seen before.

	The final routine in the file is 'fixLocals'. This is used during
    code generation to fix up the offset from the frame pointer (stored in
    the 'se_value' field) at which this local variable or parameter can be
    found. The fix is required since the offset amount is not known until
    all local variables have been declared and their required space added
    up. The extra offset could be remembered elsewhere and added in as
    required, but this method makes other code simpler.


    The Tokenizer

	The tokenizer for the Toy compiler is in file 'lex.d', shown in
    Listing 3. This portion of the compiler is more complex than the symbol
    table routines and maintains more internal state. In other compilers I
    have written, the current input token is always available for examin-
    ation in a global variable. For the Toy compiler, however, I decided to
    work hard to keep things modular, thus there are no global variables.
    The symbol table handler communicate solely through routines, for
    example. The interface to the tokenizer that I chose works similarly.
    Routine 'nextToken' processes input characters to build up the next
    input token. The token is kept entirely within the tokenizer module
    (lex.d), however. The rest of the compiler calls 'getToken' to return
    the current token. 'getToken' can be called several times for a given
    token, and replaces the global variables I have used in the past.

	The tokenizer maintains several distinct state systems. The simp-
    lest is simply a buffer to contain input text. This is composed of
    'SourceBuffer', 'SourcePos', and 'SourceMax', which are, respectively,
    the buffer for input characters, the position in the buffer of the next
    character to be processed, and the number of characters currently in
    the buffer. When the last character in the buffer has been processed,
    and more are needed, a call is made to the system interface portion of
    the compiler to read more characters.

	A second state set is that which is used to build up identifiers
    and string constants. I prefer not to place limitations on the length
    of string contants, so a dynamically sized buffer is used. Given the
    code to do that, there is no reason not to also use it to allow iden-
    tifiers of arbitrary length. The variables involved here are 'String-
    Buffer', 'StringBufferSize' and 'StringBufferPos', which are, respec-
    tively, a pointer to the dynamically allocated buffer, the size of the
    current buffer, and the position of the next character to be added to
    the buffer.

	Also maintained is information which can be used in error messages
    to identify the location of the error. These are the number of the
    current input line, and the column of the current input character. In
    the Toy language, there are tokens which cannot be unambiguously iden-
    tified without looking at the next input character. For example, if a
    '<' is seen, it is not known whether the token '<' is present, or if
    it is a '<='. This looking ahead at a single next character is called
    one-character-lookahead. To handle this, the compiler has pairs of
    variables, containing the characters themselves, the line they came
    from, and the column they were in. This lookahead also adds some comp-
    lexity to handling the end-of-file situation. The compiler should not
    get upset just because the lookahead is trying to read past the end of
    the input file. This is handled by variable 'EofState', which has the
    following three values:

	Eof_none - end of input has not been seen
	Eof_one - end of input seen for lookahead character
	Eof_two - end of input seen for current character

    Attempting to read another character when in the third state is an
    error. This typically happens with incomplete files or with comments
    that have not been closed.

	The final set of state variables are used to describe the current
    input token. These are 'CurrentToken', 'CurrentNumber', 'Current-
    String' and 'CurrentSymbol'. The first is always used and is a member
    of the 'TokenKind_t' enumeration from 'toy.g'. The second is used only
    for numeric constants. The third is used for string constants, and the
    fourth for identifiers. Recall from the discussion of the symbol table
    that all identifiers are inserted into the symbol table as soon as they
    are seen, so a pointer to their symbol table slot is sufficient to
    completely identify them.

	The first procedures in the tokenizer that we should examine are
    the last two in the file. They are placed there because they call on
    the earlier routines, and I wanted to avoid unneccessary declarations.
    (Many people prefer a strict top-down ordering within a file - they
    will have to bear with my mostly-bottom-up arrangement.) 'lexInit' is
    called on a per-file basis by the system interface routine. It is
    called AFTER 'symbolInit', since it enters reserved words into the
    symbol table. The big declaration at the start of the function declares
    'RESERVED_WORDS' to be an array of RESERVED_WORD_COUNT structures of
    type 'reservedWord_t', the value of which is given directly at compile
    time. Note that this is a constant, not an initialized variable, as it
    would be in C. The difference usually doesn't matter, but is worth
    noting. This array contains the text of the various reserved words, the
    length of that text, and the token value that goes with it. 'lexInit'
    goes through the array and calls 'enter' to enter each one into the
    symbol table. They are given a kind of 'sk_reservedWord', and their
    token value as their 'se_value' (subtracting 'tk_eof' from their token
    value yields an integral value, which is compatible with the 'se_value'
    field). Note that the reserved words are likely to hash to different
    places in the table, hence they are likely to be found without a search
    when they are looked up. This is desireable, since most reserved words
    occur more often than user identifiers.

	Following the reserved word initialization, the string buffer is
    set up with a default sized buffer (here 256 bytes). This will be suf-
    ficient for nearly all programs. The input source buffer is started off
    as being empty (SourcePos = SourceMax), so the first attempt to get a
    character will have to read some from the input file. The first call to
    'nextChar' will do this. At that point, due to the one character look-
    ahead, we will not yet have a value for the current character. The line
    and column counters are started to represent the position of the cur-
    rent character and it is obtained by the second call to 'nextChar'. The
    final step of initialization is to construct the representation for the
    first token by calling 'nextToken'. 'lexTerm', the tokenizer's termin-
    nation routine, simply frees the current string buffer.

	Now we go back to the beginning of the file. Routines 'errorHere'
    and 'errorHereThree' are used to generate error messages. They are
    called from the tokenizer itself, but are most used by the parsing
    routines to report syntactic and semantic errors. They are located in
    'lex.d' so they can utilize the current position indicators in the
    messages they produce. They call on 'errorPosition' and 'printString',
    which are in the system interface. The former prints the name of the
    input file and the passed line and column numbers in some reasonable
    format. The latter simply prints the passed string. 'errorHereThree' is
    used for multipart messages, typically those containing a user program
    identifier. Note that 'PreviousLine' and 'PreviousColumn' are used as
    the position indicators. This is because the one character lookahead
    has read the next character, and ITS position is in 'CurrentLine' and
    'CurrentColumn'.

	Next is routine 'nextChar' which reads the next input character
    from the input stream. Recall that we have a buffer of characters from
    the input stream, and it is from there that 'nextChar' should get the
    characters it needs. When that bufferful is exhausted, the system
    interface should be called on to get another bufferfull. The first
    action that 'nextChar' does is to copy the lookahead character and its
    position into the current character and position. The position indica-
    tors for the next character are updated. If the character was a new-
    line, then the column goes to 0, and the line is incremented. If the
    character was a tab, then the column is moved up to the next tab stop.
    (Toy assumes the standard 8 character tab stops.) If 'EofState' is
    'Eof_one', then we have already encountered the end of the input file
    for the lookahead character. We move up to 'Eof_two'. Otherwise, we get
    the next character from the input buffer. If the input buffer is used
    up (SourcePos = SourceMax), then we call 'readSource' to read another
    bufferfull of characters. If none are available, then 'EofState' is set
    to 'Eof_one'. This type of activity could have been moved to the system
    interface file, but that would have required an additional subroutine
    call for each input character. Since the number of input characters
    read is the largest count of anything that a compiler has to deal with,
    it is best to keep tokenizing as efficient as possible. For simple
    compilers, the tokenizer can occupy up to 50% of the total time.

	Routine 'addChar' implements the arbitrary sized buffer used for
    string constants and identifiers. It is called to add a character to
    the buffer. If the buffer is full, it will allocate another buffer
    twice as big and copy over the current contents.

	"whitespace" is a term describing the spaces and empty lines in a
    file or document. Routine 'whiteSpace' skips over this stuff in the
    input stream. It also skips over comments. Any amount of such white-
    space can occur between any two input tokens. This routine implements
    nested comments, i.e. comments that can be totally inside another
    comment. This is handy when experimenting with a program, since an
    entire section can be "commented out" without worrying about whether
    it has any internal comments or not. This doesn't work with C programs
    (although #if can effectively do the same thing). Variable 'level' is
    a count of how many comments we are currently nested in. It starts at
    zero for each whitespace and is incremented when '/*' is found and
    decremented when '*/' is found. The controlling expression of the
    'while' loop can be broken down as:

	- stop looping if we hit the end of the file
	- loop past spaces, tabs and newlines
	- we are inside a comment if the current character is '/' and the
	    next character is '*'
	- loop while we are inside a comment

	Routine 'nextToken' is the heart of the tokenizer. It uses the
    auxilliary routines we have seen and breaks the input stream up into
    tokens for the rest of the compiler. The nature of the token found is
    stored in the previously mentioned token state variables. The first
    step in this process is to skip any white-space that separates the next
    token from the current one. Note that the convention is to skip the
    white-space at the beginning of getting a token, rather than at the
    end. This allows the current-position variables to describe the last
    character of the current token. The tokenizer uses single character
    lookahead, but does not provide any token-level lookahead to the rest
    of the compiler. The language has been designed to not require any.

	Again, since the tokenizer is a major time-consumer in a simple
    compiler, it should not spend too much time deciding what to do. The
    simplest way to handle this is used here - just do a big 'case' on the
    current input character to decide what kind of token we have. This
    builds a large case table, but it is worth the space in this instance.
    The local variable 'tk' is used throughout this routine instead of the
    file level 'CurrentToken'. This allows the use of a register variable
    for it. It is initialized to 'tk_illegal' to allow the default case to
    be that of an unrecognized character.

	The first case recognized is that of an identifier or a reserved
    word. They can begin with a capital or lower case letter, or with an
    underscore. The identifier is collected in the string buffer using
    'addChar'. The 'while' loop could be faster if table lookup were used
    to identify valid identifier characters. Note that digits have been
    added to the set of valid characters. A final '\e' (end-of-string)
    character is added to the buffer to mark the end of the identifier.
    Next, 'enter' is called to look the name up in the symbol table. Recall
    that the routine will enter the symbol with kind 'sk_empty' if it is
    not found. This is not handled here, but is passed along to the parser.
    The two cases here are that of the name being a reserved word, in which
    case the token kind is that of that reserved word, and that of the name
    being a regular (or unknown) identifier.

	The next major case is that of a numeric constant. This is pretty
    straightforward - the value is built up and left in 'CurrentNumber'.
    Note that no check for overflow is made, and that only decimal integers
    are supported. The corresponding code for Draco is much more complex,
    involving possible calls to the IEEE floating point library to convert
    a string to an IEEE floating point value.

	The next case is the last major one. It is for string constants,
    which start with a '"'. The code here is a bit more complex than might
    appear necessary. It does a number of things. First, it handles string
    breaks, that is, it will join together multiple string constants,
    separated by whitespace, into a single, large string constant. This
    allows long string constants on short input lines. Second, escaped
    characters within the string are supported. These are things like '\n',
    representing a newline. The more complex cases are those like '\(23)',
    representing a character with decimal value 23. These are used to
    produce things like the escape sequences used in "shoot", described in
    the previous article.

	The outer 'while' loops through all the pieces of a string constant
    that are separated by string breaks. The next level loops through all
    of the characters in a given piece. It loops until it sees a closing
    '"' (or the end of the input file). Within the inner loop, '\' is
    handled specially - when one is seen, the next character is read, and
    it is handled as an escaped character. Only newlines are tabs are
    handled directly here - other possiblities are 'r' for a carriage
    return, 'b' for a backspace, etc. These can be obtained in Toy using
    the third escape alternative - a decimal number in parentheses. This
    form is used so as to maintain compatibility with Draco, which allows
    any constant expression to be within the parentheses. Again, no check
    for overflow is done on the decimal value. Here we see our first two
    error messages in the compiler. They are complaints about not finding a
    ')' after the digits, or not finding any digits after the '('. The
    "error recovery" here is that of doing absolutely nothing. When a char-
    acter has been read, whether escaped or not, is is added to the string
    buffer with 'addChar', and we move on to the next character. If it
    isn't a '"', collecting of the string piece continues. When a '"' is
    found, it is skipped via 'nextChar', and then whitespace is skipped as
    part of looking for another string piece. This violates the previously
    mentioned rule of skipping whitespace only at the start of a new token.
    As a consequence, an error message associated with a string constant
    will actually point just before the beginning of the NEXT token. When
    the string is fully collected, a new buffer for it is allocated. It is
    pointed to by 'CurrentString' and its length is in 'CurrentNumber'.

	The remainder of the cases are simple ones which handle one or more
    operator or punctuation tokens. Note that '~' is illegal by itself, but
    can start the valid token '~='.

	The remaining three routines represent the interace to the token-
    izer that is available to the rest of the compiler. 'getToken' returns
    any of the information pieces (CurrentSymbol, CurrentString, Current-
    Number) requested, along with the kind of the current token. In many
    situations, the compiler needs no more than the token kind, so 'get-
    SimpleToken' is an interface returning just that. 'skipToken' is
    similar to 'nextToken' in that it moves on to the next token. It,
    however, frees any storage (specifically, a string constant) associated
    with the token. It is used during error recovery, when a particular
    token is being looked for.


    Improvements

	The symbol table routines are adequate for this compiler, but
    likely not for some other purposes. An interpreter would probably need
    to be able to delete arbitrary definitions from the table. This would
    mess up the searching that 'enter' does. Other compiler designs might
    benefit from multiple symbol tables. This would require dynamic alloc-
    ation of the tables. Removing the limit on the number of symbols
    handled would also require that. The current hash function is not very
    good and would produce poor behavior (lots of collisions) if the table
    gets too full. A better one could combine bits from the first two and
    last two characters of the symbol. This would add special cases for
    one-character symbols.

	There isn't very much that can be done to a tokenizer to make it
    "better". A few changes could be done to speed it up, such as a table
    of characters types. Improvements would go along with changes to the
    language. For example, hexadecimal, octal and binary numeric constants
    are useful for some purposes. If floating point were added to Toy, the
    tokenizer would need to be able to convert strings of characters into
    floating point constants. Similarly, a character type would require
    character constants. Since they are a single occurrence of what is
    allowed inside string constants, it may then be desireable to extract
    that handling from the string processing and turn it into a separate
    routine. (Draco has a 'nextStringChar' routine, for example.) More
    keywords could be added by extending the 'TokenKind_t' enumeration and
    adding them to the table in 'lexInit'. Similarly, additional operators
    and punctuation tokens would require additions to 'TokenKind_t' and to
    the case statement in 'nextToken'.

	More ambitious changes would require major work elsewhere. Support-
    ing full escaped characters as in Draco would require full constant
    expression evaluation elsewhere in the compiler. Adding a preprocessor
    as in C would insert another layer between these tokenizing routines
    and the system interface routines (or a separate program, but that
    would result in triple the amount of text file I/O).


    Usage

	How are these routines used? A full example can be found in the
    'tokenTest.d' file used to test the tokenizer. An outline of it goes
    something like this:

	proc main()void:
	    *SymbolEntry_t se;
	    *char buffer;
	    ulong number;
	    TokenKind_t tk;

	    symbolInit();
	    lexInit();
	    while
		tk := getToken(&se, &buffer, &number);
		tk ~= tk_eof
	    do
		case tk
		incase tk_number:
		    write('<', number, '>');
		incase tk_string:
		    write('\"', buffer, '\"');
		    memFree(buffer, number);
		incase tk_id:
		    se*.se_kind := sk_undefined;
		    write('<', se*.se_name, '>');
		...
		esac;
		write(' ');
		nextToken();
	    od;
	    writeln();
	    lexTerm();
	    symbolTerm();
	corp;

    This program simply turns an input source file into a stream of tokens
    which it prints. 'tokenTest.d' is an incomplete version of 'system.d',
    which provides only those services needed for this test.


    Conclusion

       This article has presented some information on the data structures
    used by the Toy compiler, and has described the symbol table and
    tokenizer modules. The symbol table routines serve simply as a way to
    associate some information with a name. The tokenizer breaks a stream
    of input text into a stream of higher-level tokens, as requested by
    other parts of the compiler, and in accordance with the rules for
    tokens in the language. Neither job is overly complex, but both must be
    done right in order to make an acceptable compiler. Shown like this,
    they may seem obvious, but let me assure you that I didn't get them
    right the first time, even though I've done this a few times before!


    Next Month

	The next article in this series will present the parser for the Toy
    compiler. There are three main portions to it - that which parses decl-
    arations, including procedure headers, that which parses expressions,
    and that which parses statements. It is in the parsing routines that
    the input program is "understood" in some sense. It is also here that
    new language features can be added.