AmigaMUD, Copyright 1998 by Chris Gray

The ByteCode System in AmigaMUD

What is Bytecode?

Bytecode is just a different internal representation for the actions
(functions) in the system. It is slightly more compact, but, most
importantly, it is easier to interpret, and so executes faster. I'm not
certain of the origin of the name, but it likely comes from the fact
that the "opcodes" in bytecode are always exactly one byte long.

The bytecode interpreter in AmigaMUD executes instructions for a simple
virtual machine, just as a Java virtual machine does. Also like the Java
virtual machine, the AmigaMUD virtual machine is a stack machine: it has
no registers (other than the program counter, the stack pointer and a
frame pointer), and so pushes and pops things from the stack a lot. The
speed increase comes from the simplicity of decoding the bytecode (each
instruction starts with a one-byte code that completely specifies what
it does), and from the fact that the bytecode "compiler" can make some
optimizations.
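As a concrete illustration, here is a minimal sketch, in C, of the kind
of switch-based dispatcher described above. The tiny instruction set
(op_push_byte, op_add, op_neg, op_halt) is invented for this example,
and the code is not the actual AmigaMUD engine, whose instruction set is
much richer; but the shape is the same: a one-byte opcode indexes a case
statement, and all operands live on a stack.

    #include <stdio.h>

    /* A tiny invented instruction set, for illustration only. */
    enum {op_push_byte, op_add, op_neg, op_halt};

    /* Run bytecode until op_halt; return the top-of-stack value. */
    long run(const unsigned char *pc) {
        long stack[256];
        long *sp = stack;           /* SP: next free stack slot */

        for (;;) {
            switch (*pc++) {        /* the one-byte opcode completely
                                       specifies the instruction */
            case op_push_byte:
                *sp++ = (signed char)*pc++;  /* one-byte signed operand */
                break;
            case op_add:
                sp -= 1;
                sp[-1] += sp[0];    /* pop two values, push their sum */
                break;
            case op_neg:
                sp[-1] = -sp[-1];   /* negate the top-of-stack value */
                break;
            case op_halt:
                return sp[-1];
            }
        }
    }

    int main(void) {
        /* Bytecode for -(2 + 3). */
        static const unsigned char code[] =
            {op_push_byte, 2, op_push_byte, 3, op_add, op_neg, op_halt};
        printf("%ld\n", run(code));   /* prints -5 */
        return 0;
    }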
Why Did I Add the Bytecode System?

The non-bytecode interpreter in AmigaMUD is fairly efficient, quite a
bit more so than the systems in many MUD servers. This is mostly due to
the strongly typed nature of the AmigaMUD programming language, and to
the effort I put into making it fast. However, when I wanted to
investigate fractal world generation, I found that it was far too slow
in AmigaMUD. So, I thought about ways of speeding it up. I could have
made the system generate native 68000 code, since I have experience in
68000 code generation. That would have given me the fastest execution.
However, I want my system to be portable to other architectures, and the
effort of doing native code generation on all of the various UNIX
platforms, as well as for the ubiquitous Intel x86 architecture, is very
large. Using a bytecode system keeps my system portable, but still gets
a good speedup.

Why Didn't I Use Java Bytecode?

There are three main reasons for this. I won't try to put them in any
order of importance, because I don't know the order:

- it was lots of fun to do my own system

- Java bytecode is a lot more complicated than mine (I don't have
  anything like Java's constant table), and has many features that I
  don't need

- my bytecode can run faster than Java's, mostly because I specifically
  designed it for fast execution. Also, many of the features that slow
  down Java are not present in the AmigaMUD programming language, so I
  could use simpler techniques.

Did I Reach My Goals?

At first, I was quite disappointed in the speedups I was getting with
bytecode. For my first fractal terrain generator, it was difficult to
reach even a 3 times speedup over the old interpreter. I had expected
much more, and could not explain the small gains. Then I looked at it
differently: perhaps, for that code, the original interpreter was faster
than I had given it credit for. I built a native code version of the
same program, and discovered that it ran only 6.5 times faster than the
bytecode version! That ratio is almost unheard-of for any kind of
interpretation, so I had no reason to be dissatisfied.

Here is one test program that I used. It is a solver for the "Towers of
Hanoi" problem, but without any printouts. Printouts would limit the
speed of any version to system-imposed speeds, so they have been left
out.

    hanoi: proc, owner SysAdmin, useCount 3:
        proc hanoi(int fromPeg, toPeg, usingPeg, n)void:
            if n ~= 0 then
                hanoi(fromPeg, usingPeg, toPeg, n - 1);
                hanoi(usingPeg, toPeg, fromPeg, n - 1);
            fi;
        corp

The timings I got for it are:

     n    Old    New  Factor  Native  Factor
    -----------------------------------------
    18     43      7     6.1    0.62    11.2
    19     87     15     5.8    1.26    11.9
    20    170     29     5.9    2.54    11.4

So, the bytecode interpreter runs about 6 times faster than the old
interpreter, and about 12 times slower than native code, for this
routine. The absolute times are in seconds, on a 25 MHz 68040.

The actual fractal terrain generator has somewhat different numbers:

256 x 256 fractal terrain generation:

    Old    New  Factor  Native  Factor
    ------------------------------------
     43     13     3.3     2.0     6.5

Here, the bytecode is 3.3 times faster than the old interpreter, and 6.5
times slower than native code. Why is there such a large difference in
the ratios? I believe this is an artifact of the nature of the code and
of the architecture of the 68000 family of CPUs, and that the ratios
would be quite different for other CPUs. The bytecode machine has 3
address registers (PC, SP, FP, as mentioned earlier), and needs up to 4
or 5 other 32-bit values in some instructions. These fit nicely into the
registers available in the 68000 family, and so the bytecode interpreter
runs well. The fractal program has a lot of local variables, more than
will fit into the registers, so it doesn't run very well (at least with
the compiler I used) as native code. So, the slowdown from native code
to bytecode is not that much.

On other CPU architectures, things are different. On the register-poor
Intel x86 architecture, the compiler may not be able to keep all of the
virtual machine's registers in real registers, and so the bytecode will
not be as effective. On RISC CPUs, which typically have 32 registers,
all of the local variables for the fractal routine will fit into real
registers, and it will run correspondingly faster. In both of those
cases, the slowdown from native code to bytecode will be larger than I
observed for the 68000 family.

Is More Speedup Possible?

There are some things that can be done to squeeze a bit more speedup out
of the bytecode interpreter, but the gains are probably pretty minimal.
One of the things that slows down the bytecode interpreter is that it
must fetch longer values (16 bit or 32 bit) from the instruction stream
one byte at a time. This is because those values may not be aligned, and
so the interpreter cannot use the 16 or 32 bit load instructions, which
require aligned operands on the 68000 family. On CPUs which do not
require such alignment, those instructions might be usable, so long as
there are no byte-ordering problems (the byte-order for AmigaMUD
bytecode is big-endian). Even with endian differences, the bytecode can
be optimized on its first execution, re-arranging the bytes into the
order required by the native CPU. On a CPU that does not allow unaligned
accesses, some of the operands will end up properly aligned just by
chance, and the system could detect those cases and replace the
instructions with equivalent ones that use a direct 16 or 32 bit fetch.
This is possible since, once the bytecode for a function is present in
the server's memory, it is not moved.

A more extreme possibility is to change from bytecode to wordcode, where
the instructions are 32-bit values, and everything ends up properly
aligned. This would greatly increase the size of the code, but might be
worthwhile in terms of speedup. With 16 bit or 32 bit instructions, the
interpreter would also not need to extend the opcode bytes before using
them to index into the compiler-generated table of offsets for the case
statement which is the core of the engine. Looking at the 68000
assembler code my compiler generated for the bytecode interpreter, there
are a couple of instructions, per bytecode instruction executed, that
could be removed. With larger opcodes (no longer just a byte), the
opcodes could all be made multiples of 2 (the likely size of the entries
in the compiler-generated jump table), so that the shift generated by
the compiler is not needed. It's not clear that a compiler could ever be
convinced of this, however, since it likely doesn't check that all
possible index values are multiples of 2, and will likely also insist on
handling all of the odd values. Since all bytecode is generated by the
system, I could assume that only valid bytecodes are present, and not
test the range of the bytecodes. Again, this is difficult to convince a
compiler of.
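To make the operand-fetch issue concrete, here is a small C sketch of
the three pieces just described: the portable byte-at-a-time fetch, the
single unaligned native-order load, and the first-execution byte-order
fixup. The function names are invented for this illustration; they are
not the actual server routines.

    #include <stdint.h>
    #include <string.h>

    /* Portable path: safe on any CPU and any alignment, reading the
       big-endian bytecode stream a byte at a time. This is the lengthy
       sequence that must be used on the 68000 family. */
    static uint32_t fetch32_portable(const unsigned char **pc) {
        const unsigned char *p = *pc;
        *pc += 4;
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    /* Fast path: a single, possibly unaligned, native-order load. On
       the x86, which permits unaligned accesses, the memcpy compiles
       down to one plain load instruction. */
    static uint32_t fetch32_native(const unsigned char **pc) {
        uint32_t v;
        memcpy(&v, *pc, 4);
        *pc += 4;
        return v;
    }

    /* First-execution fixup: rewrite a big-endian operand in place
       into native order (needed only on little-endian hosts such as
       the x86), so that the fast path can be used from then on. */
    static void fixup32(unsigned char *p) {
        unsigned char t;
        t = p[0]; p[0] = p[3]; p[3] = t;
        t = p[1]; p[1] = p[2]; p[2] = t;
    }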
Actual Experiments with an x86 Machine

The AmigaMUD server has been translated into ANSI C, and now runs on my
Pentium machine running Linux. An initial translation of the old
bytecode interpreter to C produced a working version, but not a very
fast one. On the x86, the code sequence needed to fetch an unaligned
value from memory, a byte at a time, and reassemble the value into a
longword, is quite lengthy. However, the x86 architecture, having grown
out of an 8-bit architecture, will do unaligned accesses. A quick
experiment showed that using those unaligned references on my system (a
300 MHz P-II) was considerably faster than the byte-referencing code
sequence. So, I added facilities to the bytecode system to allow for
unaligned references. As mentioned above, this required first changing
the byte-order from my virtual machine's big-endian to the x86's native
little-endian (only in memory for execution, not in the database). This
step produced a bytecode system almost twice as fast as the original.

Looking at a disassembly of the x86 code, I still saw a lot of
instructions being executed. This turned out to be a small problem with
my C skills. Adding a bunch of casts produced simpler sequences and a
bit more speed.

I was reasonably happy with the system at this point, but knew that, at
least for the Gnu compiler family, there was one more step I could take.
That step is to use the "label variable" technique. This uses a language
extension that allows a "goto" to a label indexed from a table of label
values. With an hour or so of work, I was able to produce a version of
the bytecode machine using this technique (conditional compilation and
macros keep the same source usable by other compilers). This last
improvement yielded another 50% speedup. Examining the code produced by
gcc with optimization enabled shows that there isn't a whole lot of room
left for more improvement. Some bytecode instructions (e.g. the one to
negate the top-of-stack value) compile to only 4 x86 instructions,
including the dispatch to the next bytecode instruction.
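For the curious, here is a sketch of what the "label variable" version
looks like, reusing the invented mini instruction set from the first
sketch. This is an illustration of the technique, not the actual server
source. The gcc extension used is "labels as values" (&&label and
goto *), and the macros fall back to an ordinary switch for other
compilers, in the manner described above.

    #include <stdio.h>

    enum {op_push_byte, op_add, op_neg, op_halt};

    long run(const unsigned char *pc) {
        long stack[256];
        long *sp = stack;

    #ifdef __GNUC__
        /* gcc: one indirect goto per instruction, with no range check
           on the opcode, using a table of label addresses. */
        static const void *target[] =
            {&&lbl_push_byte, &&lbl_add, &&lbl_neg, &&lbl_halt};
    #define NEXT()  goto *target[*pc++]
    #define OP(x)   lbl_##x
    #else
        /* Other compilers: fall back to an ordinary switch. */
    #define NEXT()  goto dispatch
    #define OP(x)   case op_##x
    #endif

        NEXT();
    #ifndef __GNUC__
    dispatch:
        switch (*pc++) {
    #endif

    OP(push_byte):
        *sp++ = (signed char)*pc++;
        NEXT();
    OP(add):
        sp -= 1;
        sp[-1] += sp[0];
        NEXT();
    OP(neg):
        sp[-1] = -sp[-1];
        NEXT();
    OP(halt):
        return sp[-1];

    #ifndef __GNUC__
        }
    #endif
        return 0;   /* not reached */
    }

    int main(void) {
        /* Computes -(2 + 3), as in the earlier sketch. */
        static const unsigned char code[] =
            {op_push_byte, 2, op_push_byte, 3, op_add, op_neg, op_halt};
        printf("%ld\n", run(code));
        return 0;
    }

Note how each handler ends by dispatching directly to the next one,
rather than branching back to a central switch; that is where the saved
instructions come from.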
The final results for this machine (the 300 MHz P-II running Linux) are
as follows. The test program in this case is 100 iterations of the
256x256 fractal terrain generation, done in a very straightforward way:

    Unoptimized MUD server:
        tree interpreter:   187 seconds
        bytecode machine:    77 seconds

    Optimized MUD server:
        tree interpreter:    93 seconds
        bytecode machine:    31 seconds

    Native code:
        unoptimized:          4 seconds
        optimized:            2 seconds

So, in this case, the bytecode machine is about 15 times slower than
native execution and only about 3 times faster than the tree
interpreter, fitting in with the conjectures above.

Repeating the original "Towers of Hanoi" experiment (times in seconds):

    Unoptimized MUD server:
        tree interpreter:   n = 22: 57      n = 23: 114
        bytecode machine:   n = 22: 14      n = 23: 29

    Optimized MUD server:
        tree interpreter:   n = 22: 34      n = 23: 68
        bytecode machine:   n = 22: 4       n = 23: 9

    Native code:
        unoptimized:        n = 22: 0.53    n = 23: 1.08
        optimized:          n = 22: 0.68    n = 23: 1.40

The native code times are weird: with proper optimization flags to gcc
(I just used '-O'), the optimized times should be better than the
unoptimized ones, not worse. With this example, bytecode is 4 to 8 times
faster than the tree interpreter, and 7 to 8 times slower than native
code.