AmigaMUD, Copyright 1998 by Chris Gray

The ByteCode System in AmigaMUD

What is Bytecode?

Bytecode is just a different internal representation for the actions
(functions) in the system. It is slightly more compact, but, most
importantly, it is easier to interpret, and so executes faster. I'm not
certain of the origin of the name, but it likely comes from the fact
that the "opcodes" in bytecode are always exactly one byte long.

The bytecode interpreter in AmigaMUD executes instructions for a simple
virtual machine, just as a Java virtual machine does. Also like the Java
virtual machine, the AmigaMUD virtual machine is a stack machine: it has
no registers (other than the program counter, the stack pointer and a
frame pointer), and so pushes and pops things from the stack a lot. The
speed increase comes from the simplicity of decoding the bytecode (each
instruction starts with a one-byte code that completely specifies what
it does), and from the fact that the bytecode "compiler" can make some
optimizations.
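As a concrete illustration, here is a minimal sketch, in C, of the kind
of switch-based dispatcher described above. The tiny instruction set
(op_push_byte, op_add, op_neg, op_halt) is invented for this example,
and the code is not the actual AmigaMUD engine, whose instruction set is
much richer; but the shape is the same: a one-byte opcode indexes a case
statement, and all operands live on a stack.

    #include <stdio.h>

    /* A tiny invented instruction set, for illustration only. */
    enum {op_push_byte, op_add, op_neg, op_halt};

    /* Run bytecode until op_halt; return the top-of-stack value. */
    long run(const unsigned char *pc) {
        long stack[256];
        long *sp = stack;           /* SP: next free stack slot */

        for (;;) {
            switch (*pc++) {        /* the one-byte opcode completely
                                       specifies the instruction */
            case op_push_byte:
                *sp++ = (signed char)*pc++;  /* one-byte signed operand */
                break;
            case op_add:
                sp -= 1;
                sp[-1] += sp[0];    /* pop two values, push their sum */
                break;
            case op_neg:
                sp[-1] = -sp[-1];   /* negate the top-of-stack value */
                break;
            case op_halt:
                return sp[-1];
            }
        }
    }

    int main(void) {
        /* Bytecode for -(2 + 3). */
        static const unsigned char code[] =
            {op_push_byte, 2, op_push_byte, 3, op_add, op_neg, op_halt};
        printf("%ld\n", run(code));   /* prints -5 */
        return 0;
    }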
Why Did I Add the Bytecode System?

The non-bytecode interpreter in AmigaMUD is fairly efficient, quite a
bit more so than the systems in many MUD servers. This is mostly due to
the strongly typed nature of the AmigaMUD programming language, and to
the effort I put into making it fast. However, when I wanted to
investigate fractal world generation, I found that it was far too slow
in AmigaMUD. So, I thought about ways of speeding it up. I could have
made the system generate native 68000 code, since I have experience in
68000 code generation. That would have given me the fastest execution.
However, I want my system to be portable to other architectures, and the
effort of doing native code generation on all of the various UNIX
platforms, as well as for the ubiquitous Intel x86 architecture, is very
large. Using a bytecode system keeps my system portable, but still gets
a good speedup.

Why Didn't I Use Java Bytecode?

There are three main reasons for this. I won't try to put them in any
order of importance, because I don't know the order:

- it was lots of fun to do my own system

- Java bytecode is a lot more complicated than mine (I don't have
  anything like Java's constant table), and has many features that I
  don't need

- my bytecode can run faster than Java's, mostly because I specifically
  designed it for fast execution. Also, many of the features that slow
  down Java are not present in the AmigaMUD programming language, so I
  could use simpler techniques.

Did I Reach My Goals?

At first, I was quite disappointed in the speedups I was getting with
bytecode. For my first fractal terrain generator, it was difficult to
reach even a 3 times speedup over the old interpreter. I had expected
much more, and could not explain the small gains. Then I looked at it
differently: perhaps, for that code, the original interpreter was faster
than I had given it credit for. I built a native code version of the
same program, and discovered that it ran only 6.5 times faster than the
bytecode version! That ratio is almost unheard-of for any kind of
interpretation, so I had no reason to be dissatisfied.

Here is one test program that I used. It is a solver for the "Towers of
Hanoi" problem, but without any printouts. Printouts would limit the
speed of any version to system-imposed speeds, so they have been left
out.

    hanoi: proc, owner SysAdmin, useCount 3:
        proc hanoi(int fromPeg, toPeg, usingPeg, n)void:
            if n ~= 0 then
                hanoi(fromPeg, usingPeg, toPeg, n - 1);
                hanoi(usingPeg, toPeg, fromPeg, n - 1);
            fi;
        corp

The timings I got for it are:

     n    Old    New  Factor  Native  Factor
    -----------------------------------------
    18     43      7     6.1    0.62    11.2
    19     87     15     5.8    1.26    11.9
    20    170     29     5.9    2.54    11.4

So, the bytecode interpreter runs about 6 times faster than the old
interpreter, and about 12 times slower than native code, for this
routine. The absolute times are in seconds, on a 25 MHz 68040.

The actual fractal terrain generator has somewhat different numbers:

256 x 256 fractal terrain generation:

    Old    New  Factor  Native  Factor
    ------------------------------------
     43     13     3.3     2.0     6.5

Here, the bytecode is 3.3 times faster than the old interpreter, and 6.5
times slower than native code. Why is there such a large difference in
the ratios? I believe this is an artifact of the nature of the code and
of the architecture of the 68000 family of CPUs, and that the ratios
would be quite different for other CPUs. The bytecode machine has 3
address registers (PC, SP, FP, as mentioned earlier), and needs up to 4
or 5 other 32-bit values in some instructions. These fit nicely into the
registers available in the 68000 family, and so the bytecode interpreter
runs well. The fractal program has a lot of local variables, more than
will fit into the registers, so it doesn't run very well (at least with
the compiler I used) as native code. So, the slowdown from native code
to bytecode is not that much.

On other CPU architectures, things are different. On the register-poor
Intel x86 architecture, the compiler may not be able to keep all of the
virtual machine's registers in real registers, and so the bytecode will
not be as effective. On RISC CPUs, which typically have 32 registers,
all of the local variables for the fractal routine will fit into real
registers, and it will run correspondingly faster. In both of those
cases, the slowdown from native code to bytecode will be larger than I
observed for the 68000 family.

Is More Speedup Possible?

There are some things that can be done to squeeze a bit more speedup out
of the bytecode interpreter, but the gains are probably pretty minimal.
One of the things that slows down the bytecode interpreter is that it
must fetch longer values (16 bit or 32 bit) from the instruction stream
one byte at a time. This is because those values may not be aligned, and
so the interpreter cannot use the 16 or 32 bit load instructions, which
require aligned operands on the 68000 family. On CPUs which do not
require such alignment, those instructions might be usable, so long as
there are no byte-ordering problems (the byte-order for AmigaMUD
bytecode is big-endian). Even with endian differences, the bytecode can
be optimized on its first execution, re-arranging the bytes into the
order required by the native CPU. On a CPU that does not allow unaligned
accesses, some of the operands will end up properly aligned just by
chance, and the system could detect those cases and replace the
instructions with equivalent ones that use a direct 16 or 32 bit fetch.
This is possible since, once the bytecode for a function is present in
the server's memory, it is not moved.

A more extreme possibility is to change from bytecode to wordcode, where
the instructions are 32-bit values, and everything ends up properly
aligned. This would greatly increase the size of the code, but might be
worthwhile in terms of speedup. With 16 bit or 32 bit instructions, the
interpreter would also not need to extend the opcode bytes before using
them to index into the compiler-generated table of offsets for the case
statement which is the core of the engine. Looking at the 68000
assembler code my compiler generated for the bytecode interpreter, there
are a couple of instructions, per bytecode instruction executed, that
could be removed. With larger opcodes (no longer just a byte), the
opcodes could all be made multiples of 2 (the likely size of the entries
in the compiler-generated jump table), so that the shift generated by
the compiler is not needed. It's not clear that a compiler could ever be
convinced of this, however, since it likely doesn't check that all
possible index values are multiples of 2, and will likely also insist on
handling all of the odd values. Since all bytecode is generated by the
system, I could assume that only valid bytecodes are present, and not
test the range of the bytecodes. Again, this is difficult to convince a
compiler of.
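To make the operand-fetch issue concrete, here is a small C sketch of
the three pieces just described: the portable byte-at-a-time fetch, the
single unaligned native-order load, and the first-execution byte-order
fixup. The function names are invented for this illustration; they are
not the actual server routines.

    #include <stdint.h>
    #include <string.h>

    /* Portable path: safe on any CPU and any alignment, reading the
       big-endian bytecode stream a byte at a time. This is the lengthy
       sequence that must be used on the 68000 family. */
    static uint32_t fetch32_portable(const unsigned char **pc) {
        const unsigned char *p = *pc;
        *pc += 4;
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    /* Fast path: a single, possibly unaligned, native-order load. On
       the x86, which permits unaligned accesses, the memcpy compiles
       down to one plain load instruction. */
    static uint32_t fetch32_native(const unsigned char **pc) {
        uint32_t v;
        memcpy(&v, *pc, 4);
        *pc += 4;
        return v;
    }

    /* First-execution fixup: rewrite a big-endian operand in place
       into native order (needed only on little-endian hosts such as
       the x86), so that the fast path can be used from then on. */
    static void fixup32(unsigned char *p) {
        unsigned char t;
        t = p[0]; p[0] = p[3]; p[3] = t;
        t = p[1]; p[1] = p[2]; p[2] = t;
    }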
Actual Experiments with an x86 Machine

The AmigaMUD server has been translated into ANSI C, and now runs on my
Pentium machine running Linux. An initial translation of the old
bytecode interpreter to C produced a working version, but not a very
fast one. On the x86, the code sequence needed to fetch an unaligned
value from memory, a byte at a time, and reassemble the value into a
longword, is quite lengthy. However, the x86 architecture, having grown
out of an 8-bit architecture, will do unaligned accesses. A quick
experiment showed that using those unaligned references on my system (a
300 MHz P-II) was considerably faster than the byte-referencing code
sequence. So, I added facilities to the bytecode system to allow for
unaligned references. As mentioned above, this required first changing
the byte-order from my virtual machine's big-endian to the x86's native
little-endian (only in memory for execution, not in the database). This
step produced a bytecode system almost twice as fast as the original.

Looking at a disassembly of the x86 code, I still saw a lot of
instructions being executed. This turned out to be a small problem with
my C skills. Adding a bunch of casts produced simpler sequences and a
bit more speed.

I was reasonably happy with the system at this point, but knew that, at
least for the Gnu compiler family, there was one more step I could take.
That step is to use the "label variable" technique. This uses a language
extension that allows a "goto" to a label indexed from a table of label
values. With an hour or so of work, I was able to produce a version of
the bytecode machine using this technique (conditional compilation and
macros keep the same source usable by other compilers). This last
improvement yielded another 50% speedup. Examining the code produced by
gcc with optimization enabled shows that there isn't a whole lot of room
left for more improvement. Some bytecode instructions (e.g. the one to
negate the top-of-stack value) compile to only 4 x86 instructions,
including the dispatch to the next bytecode instruction.
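For the curious, here is a sketch of what the "label variable" version
looks like, reusing the invented mini instruction set from the first
sketch. This is an illustration of the technique, not the actual server
source. The gcc extension used is "labels as values" (&&label and
goto *), and the macros fall back to an ordinary switch for other
compilers, in the manner described above.

    #include <stdio.h>

    enum {op_push_byte, op_add, op_neg, op_halt};

    long run(const unsigned char *pc) {
        long stack[256];
        long *sp = stack;

    #ifdef __GNUC__
        /* gcc: one indirect goto per instruction, with no range check
           on the opcode, using a table of label addresses. */
        static const void *target[] =
            {&&lbl_push_byte, &&lbl_add, &&lbl_neg, &&lbl_halt};
    #define NEXT()  goto *target[*pc++]
    #define OP(x)   lbl_##x
    #else
        /* Other compilers: fall back to an ordinary switch. */
    #define NEXT()  goto dispatch
    #define OP(x)   case op_##x
    #endif

        NEXT();
    #ifndef __GNUC__
    dispatch:
        switch (*pc++) {
    #endif

    OP(push_byte):
        *sp++ = (signed char)*pc++;
        NEXT();
    OP(add):
        sp -= 1;
        sp[-1] += sp[0];
        NEXT();
    OP(neg):
        sp[-1] = -sp[-1];
        NEXT();
    OP(halt):
        return sp[-1];

    #ifndef __GNUC__
        }
    #endif
        return 0;   /* not reached */
    }

    int main(void) {
        /* Computes -(2 + 3), as in the earlier sketch. */
        static const unsigned char code[] =
            {op_push_byte, 2, op_push_byte, 3, op_add, op_neg, op_halt};
        printf("%ld\n", run(code));
        return 0;
    }

Note how each handler ends by dispatching directly to the next one,
rather than branching back to a central switch; that is where the saved
instructions come from.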
The final results for this machine (the 300 MHz P-II running Linux) are
as follows. The test program in this case is 100 iterations of the
256x256 fractal terrain generation, done in a very straightforward way:

    Unoptimized MUD server:
        tree interpreter:   187 seconds
        bytecode machine:    77 seconds

    Optimized MUD server:
        tree interpreter:    93 seconds
        bytecode machine:    31 seconds

    Native code:
        unoptimized:          4 seconds
        optimized:            2 seconds

So, in this case, the bytecode machine is about 15 times slower than
native execution and only about 3 times faster than the tree
interpreter, fitting in with the conjectures above.

Repeating the original "Towers of Hanoi" experiment (times in seconds):

    Unoptimized MUD server:
        tree interpreter:   n = 22: 57      n = 23: 114
        bytecode machine:   n = 22: 14      n = 23: 29

    Optimized MUD server:
        tree interpreter:   n = 22: 34      n = 23: 68
        bytecode machine:   n = 22: 4       n = 23: 9

    Native code:
        unoptimized:        n = 22: 0.53    n = 23: 1.08
        optimized:          n = 22: 0.68    n = 23: 1.40

The native code times are weird: with proper optimization flags to gcc
(I just used '-O'), the optimized times should be better than the
unoptimized ones, not worse. With this example, bytecode is 4 to 8 times
faster than the tree interpreter, and 7 to 8 times slower than native
code.