| interpreter/compiler | arithmetic | bytecodes | sends |
| milliseconds | bytecodes/second | sends/second | |
| Exupery with atCache | 58 | 266 389 177 | 1 222 688 |
| Exupery without atCache | 66 | 14 466 546 | 1 211 120 |
| Normal Interpreter | 320 | 96 385 542 | 3 173 064 |
| Debugging Interpreter | 320 | 29 197 080 | 1 225 389 |
Two benchmarks are used in this analysis a very simple arithmetic benchmark and the bytecode benchmark. The send benchmark has been included but hasn't been compiled.
The performance numbers show the value of the atCache. Even with the cache the bytecode benchmark is still much slower than the arithmetic benchmark. My implementation of the atCache has not been tuned so this should be easy to improve.
arithmeticLoop | acc | acc := 0. 1 to: 2000000 do: [:each| acc := acc + 1]. ^ 5
The benchmark contains 19 bytecodes and about 75 machine code instructions. Looking at the code it should be very easy to improve. It's bad but it's still 5 times faster than the interpreter.
| 20 | Temporary loads and stores |
| 11 | Detagging constants |
| 9 | Boolean conversions |
So I should be able to double the performance with very simple techniques. First moving arguments and temporaries into registers then using the colouring register allocator. Second very simple constant propagation to avoid detagging constant integers. Third removing the conversions into and out of Smalltalk Booleans. All these optimisations need nothing more complex than tree traversals.
To easily remove conversions into and out of Smalltalk Booleans will require another intermediate language. By breaking bytecodes into a middle level intermediate that shows conversions we can cancel out convert then deconvert sequences. Optimising across bytecode boundaries is one major advantage a compiler has over an interpreter.
Very crudely, I use 6 potential instructions per bytecode (based on being 5 times faster than the interpreter which uses 10 cycles with 3 instructions per cycle). I execute 4 instructions per bytecode. So I'm issuing 2 instructions per cycle on average, the Athlon is supposed to manage 2.5 instructions per cycle. I'm doing well on this measure, especially given how bad the generated code is.
((block 1 (mov (8 ebp) eax) ;; Storing zero in acc (mov 1 (28 eax)) ;; Storing zero (mov (8 ebp) eax) ;; Storing 1 in each (mov 3 (32 eax)) ;; Storing 1 in each (jmp block3)) ;; Removable jump (block 2 (mov (8 ebp) eax) (mov (4 eax) ebx) ;; ebx is the senders context (mov 2 eax) ;; Increment the context's stack pointer (add (12 ebx) eax) ;; Incrementing (mov eax (12 ebx)) ;; Incrementing (sal 1 eax) ;; Calculating the address for the answer on stack (add ebx eax) ;; Address calculation (mov 11 (22 eax)) ;; Returning 5 as a hack (returnReceiver not implemented) (mov 1 eax) ;; 1 indicates return from method (not a send) (ret)) (block 3 ;; Start of Loop (mov (8 ebp) eax) ;; Loading each (mov (32 eax) eax) ;; Loading (and 1 eax) ;; tag checking each (jz block4) ;; tag checking (mov 4000001 eax) ;; Type checking constant (and 1 eax) ;; Type checking constant (jz block4) ;; Type checking constant (mov 4000001 ebx) ;; Detag constant (sar 1 ebx) ;; Detagging (mov (8 ebp) eax) ;; Loading each (mov (32 eax) eax) ;; Loading each (sar 1 eax) ;; Detagging each (sub ebx eax) ;; The payload comparison (should be cmp) (jg block5) ;; comparison jump (mov trueObj eax) ;; Conversion to a Boolean object (mov (eax) eax) ;; Conversion (jmp block6)) ;; Conversion (block 4) ;; Overflow block (block 5 (mov falseObj eax) ;; Conversion (mov (eax) eax) ;; Conversion (jmp block6)) ;; Conversion (block 6 (mov falseObj ebx) ;; Conversion From Boolean (sub (ebx) eax) ;; Conversion back (je block2) ;; Conversion back (mov (8 ebp) eax) ;; Loading each (mov (28 eax) eax) ;; Loading (and 1 eax) ;; Tag checking acc (jz block7) ;; Tag checking (mov 3 eax) ;; Load Constant (and 1 eax) ;; Tag checking constant (jz block7) ;; Tag checking (mov (8 ebp) eax) ;; Loading acc (mov (28 eax) ebx) ;; Loading (sar 1 ebx) ;; Detagging acc (mov 3 eax) ;; Load constant (sar 1 eax) ;; Detagging constant (add ebx eax) ;; Payload add (sal 1 eax) ;; Retag (part) (jo block7) ;; Check for overflow (add 1 eax) ;; Rest of reload (mov (8 ebp) ebx) ;; Storing acc (mov eax (28 ebx)) ;; Storing (mov (8 ebp) eax) ;; Loading each (mov (32 eax) eax) ;; Loading (and 1 eax) ;; Tag checking each (jz block8) ;; Tag checking (mov 3 eax) ;; Loading constant (and 1 eax) ;; Tag checking constant (jz block8) ;; Tag checking (mov (8 ebp) eax) ;; Loading each (mov (32 eax) ebx) ;; Loading (sar 1 ebx) ;; Detagging each (mov 3 eax) ;; Load constant (sar 1 eax) ;; Detagging constant (add ebx eax) ;; Payload add (sal 1 eax) ;; Retagging (jo block8) ;; Overflow check (add 1 eax) ;; Retagging (mov (8 ebp) ebx) ;; Storing each (mov eax (32 ebx)) ;; Storing (jmp block3) ;; Jump to the start of the loop (jmp block2)) ;; Pointless jump (dead code) (block 7) ;; Overflow block (block 8)) ;; Overflow block