Exupery's Performance

An analysis of Exupery's performance just before the 0.02 release. This is the first release that can compile the bytecode benchmark. For basic bytecode loops Exupery is currently 5 times faster than the interpreter.

interpreter/compiler arithmetic bytecodes sends
milliseconds bytecodes/second sends/second
Exupery with atCache 58 266 389 177 1 222 688
Exupery without atCache 66 14 466 546 1 211 120
Normal Interpreter 320 96 385 542 3 173 064
Debugging Interpreter 320 29 197 080 1 225 389

Two benchmarks are used in this analysis a very simple arithmetic benchmark and the bytecode benchmark. The send benchmark has been included but hasn't been compiled.

The performance numbers show the value of the atCache. Even with the cache the bytecode benchmark is still much slower than the arithmetic benchmark. My implementation of the atCache has not been tuned so this should be easy to improve.

The Arithmetic Benchmark

After looking at the bytecode benchmark results I developed another benchmark to explore how efficient the basic compiler is. This benchmark is a very simple arithmetic loop. What I'm interested in is just basic bytecode execution times. This avoids operations that the interpreter optimises that Exupery currently doesn't.

Source Code

arithmeticLoop
	| acc |
	acc := 0.
	1
		to: 2000000
		do: [:each| acc := acc + 1].
	^ 5

Analysis

The benchmark contains 19 bytecodes and about 75 machine code instructions. Looking at the code it should be very easy to improve. It's bad but it's still 5 times faster than the interpreter.

Removable instructions(75 total instructions)
20 Temporary loads and stores
11 Detagging constants
9 Boolean conversions

So I should be able to double the performance with very simple techniques. First moving arguments and temporaries into registers then using the colouring register allocator. Second very simple constant propagation to avoid detagging constant integers. Third removing the conversions into and out of Smalltalk Booleans. All these optimisations need nothing more complex than tree traversals.

To easily remove conversions into and out of Smalltalk Booleans will require another intermediate language. By breaking bytecodes into a middle level intermediate that shows conversions we can cancel out convert then deconvert sequences. Optimising across bytecode boundaries is one major advantage a compiler has over an interpreter.

Very crudely, I use 6 potential instructions per bytecode (based on being 5 times faster than the interpreter which uses 10 cycles with 3 instructions per cycle). I execute 4 instructions per bytecode. So I'm issuing 2 instructions per cycle on average, the Athlon is supposed to manage 2.5 instructions per cycle. I'm doing well on this measure, especially given how bad the generated code is.

Assembly Code

The assembly format is a cross between AT&T syntax and s-expressions. It is a written form that is used to assist debugging and testing.
((block 1 
  (mov (8 ebp) eax)	;; Storing zero in acc
  (mov 1 (28 eax))	;; Storing zero
  (mov (8 ebp) eax)	;; Storing 1 in each
  (mov 3 (32 eax))	;; Storing 1 in each
  (jmp block3))		;; Removable jump

(block 2 
  (mov (8 ebp) eax) 
  (mov (4 eax) ebx)	;; ebx is the senders context
  (mov 2 eax)		;; Increment the context's stack pointer
  (add (12 ebx) eax)	;; Incrementing
  (mov eax (12 ebx))	;; Incrementing
  (sal 1 eax)		;; Calculating the address for the answer on stack
  (add ebx eax)		;; Address calculation
  (mov 11 (22 eax))	;; Returning 5 as a hack (returnReceiver not 
				implemented)
  (mov 1 eax)		;; 1 indicates return from method (not a send)
  (ret)) 

(block 3		;; Start of Loop
  (mov (8 ebp) eax)	;; Loading each
  (mov (32 eax) eax)	;; Loading
  (and 1 eax)		;; tag checking each
  (jz block4)		;; tag checking
  (mov 4000001 eax)	;; Type checking constant
  (and 1 eax)		;; Type checking constant
  (jz block4)		;; Type checking constant
  (mov 4000001 ebx)	;; Detag constant
  (sar 1 ebx)		;; Detagging
  (mov (8 ebp) eax)	;; Loading each
  (mov (32 eax) eax)	;; Loading each
  (sar 1 eax)		;; Detagging each
  (sub ebx eax)		;; The payload comparison (should be cmp)
  (jg block5)		;; comparison jump
  (mov trueObj eax)	;; Conversion to a Boolean object
  (mov (eax) eax)	;; Conversion
  (jmp block6))		;; Conversion

(block 4)		;; Overflow block

(block 5 
  (mov falseObj eax)	;; Conversion
  (mov (eax) eax)	;; Conversion
  (jmp block6))		;; Conversion

(block 6 
  (mov falseObj ebx)    ;; Conversion From Boolean
  (sub (ebx) eax)       ;; Conversion back
  (je block2)		;; Conversion back
  (mov (8 ebp) eax)	;; Loading each
  (mov (28 eax) eax)	;; Loading
  (and 1 eax)		;; Tag checking acc
  (jz block7)		;; Tag checking
  (mov 3 eax)		;; Load Constant
  (and 1 eax)		;; Tag checking constant
  (jz block7)		;; Tag checking
  (mov (8 ebp) eax)	;; Loading acc
  (mov (28 eax) ebx)	;; Loading
  (sar 1 ebx)		;; Detagging acc
  (mov 3 eax)		;; Load constant
  (sar 1 eax)		;; Detagging constant
  (add ebx eax)		;; Payload add
  (sal 1 eax)		;; Retag (part)
  (jo block7)		;; Check for overflow
  (add 1 eax)		;; Rest of reload
  (mov (8 ebp) ebx)	;; Storing acc
  (mov eax (28 ebx))	;; Storing
  (mov (8 ebp) eax)	;; Loading each
  (mov (32 eax) eax)	;; Loading
  (and 1 eax)		;; Tag checking each
  (jz block8)		;; Tag checking
  (mov 3 eax)		;; Loading constant
  (and 1 eax)		;; Tag checking constant
  (jz block8)		;; Tag checking
  (mov (8 ebp) eax)	;; Loading each
  (mov (32 eax) ebx)	;; Loading
  (sar 1 ebx)		;; Detagging each
  (mov 3 eax)		;; Load constant
  (sar 1 eax)		;; Detagging constant
  (add ebx eax)		;; Payload add
  (sal 1 eax)		;; Retagging
  (jo block8)		;; Overflow check
  (add 1 eax)		;; Retagging
  (mov (8 ebp) ebx)	;; Storing each
  (mov eax (32 ebx))	;; Storing
  (jmp block3)		;; Jump to the start of the loop
  (jmp block2))		;; Pointless jump (dead code)

(block 7)		;; Overflow block

(block 8))		;; Overflow block