Testing Exupery

Introduction

Test driven development has been an invaluable tool when working in a domain with plenty of theory. The theory creates problems due to the wealth of published design choices. Without prior first-hand experience, how do we decide which design to use?

Tests are a tool to manage the complexity of details. The complexity of working on this code has stayed constant as the system grew from just decoding byte codes into a basic compiler, then a more complex compiler, and then integration with the interpreter. To develop software of arbitrary complexity it is necessary to keep the marginal complexity constant because we only have finite minds.

The tests are divided into customer tests and programmer tests. The programmer tests include unit testing and just enough integration testing to make changes safe. The programmer tests are quick and safe to run. They will not corrupt the image. The customer tests are end to end tests that test functionality in a way that should be meaningful for a customer but they can crash the system and can take many seconds to run.

The customer tests are split between TestExuperyCompiler and TestExuperyPlugin. These tests can take a while to run, currently about 12 seconds, and can also corrupt the image so I don't run them unless the programmer tests pass.

The programmer tests are split across several classes, each one testing a different section. They are currently undergoing a major refactoring. This strategic refactoring required a change to my programming environment which has both improved the system and also opened up a lot of smaller refactorings.

The Test Runner that comes with Squeak could run either all the tests or a single class. This might have been OK before the development system included so many tests, but my standard development environment already has several hours' worth of tests. By keeping the tests squashed together in a single class, I had a reasonable interface at the expense of a little ugliness in the code. That ugliness started to grow.

The Test Browser can run an entire category of tests at once rather than just a single class like the Test Runner. I've modified the Test Browser to suit my own personal preferences and now have an interface that is convenient.

The Customer Tests

The customer tests test things the customer would care about in a way that should make sense to a customer. The customer tests are split into two separate parts, testing the compiler in TestExuperyCompiler and testing the interpreter integration in TestExuperyPlugin. They may take a while to run and can crash the system because they sometimes test integration.

TestExuperyCompiler tests the basic compiler; each test compiles a small Smalltalk method into assembly then runs it by linking with a C stub. These were the first end to end tests.

TestExuperyPlugin tests compiling and running methods completely inside the image. It also tests the hooks into the interpreter. The interpreter hooks are part of this suite because they can crash Squeak which the programmer tests must never do.

Compiler testing

These tests are incomplete; they don't fully stress either the instruction selector or the register allocator. The incompleteness is an engineering compromise: the register allocator is easier to test in isolation. The tests are successful because they make it safe to change the code; if changes become too risky then it is time to find a new testing strategy.

The tests in TestExuperyCompiler compile byte codes from CompiledMethods into assembly suitable for gas (the GNU assembler), which is linked against a C test harness to produce an executable. This executable is run and its output checked. The executable's output is the top of the stack after the compiled method has run.

The acceptance test production mechanism isn't scalable, but it hasn't yet caused actual problems. It might be better to generate the C code and to use a set of examples to check that the interpreter is doing the same thing as the compiler. This would be an example of Cem Kaner's oracles from "Architectures of Test Automation".
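
Such a test might look something like this sketch, where the interpreter's own answer is the oracle for the compiled code. ExuperyCompiler and evaluate: are hypothetical names for illustration, not the project's actual interface:

testFactorialAgainstInterpreter
	"Oracle-style sketch: the interpreter's result is the expected value.
	 ExuperyCompiler and evaluate: are assumed names, not real project code."
	| expected actual |
	expected := 10 factorial.	"run normally by the interpreter"
	actual := ExuperyCompiler evaluate: [10 factorial].	"hypothetical compile-and-run hook"
	self should: [actual = expected]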

Interpreter Integration

The interpreter integration tests just cover the changes I have made. They cover both the basic interpreter integration modifications and full end to end testing using the compiler and assembler. These are the only customer tests that test the assembler.

The interpreter integration could be considered programmer testing, except that programmer tests must be safe to run. These tests are kept out of the programmer suite because of the risk of crashing the image.

The interpreter integration tests are an example of modifying legacy code using a test driven approach. The tests check the changes, and the modifications were kept small to minimize the risk of accidentally changing the behavior of the interpreter. Modifying untested code by testing just the changes is a neat way of driving tests into new areas. It is the changes that are going to be best understood. When working on untested code it will be necessary to make some changes without tests, if only to create easily testable interfaces.

Byte Code Reading and Intermediate Generation

The first part of the compiler is generating intermediate language from byte codes. This is currently done in three parts: byte code reading, intermediate generation, and intermediate emission. Byte code reading just decodes CompiledMethods and calls an intermediate emitter. Intermediate generation creates a mid-level abstraction halfway between byte codes and intermediate language. Intermediate emission generates intermediate language for the byte code building blocks.

The byte code mock.

The byte code mock was written before the code that it mocks out because I wrote the byte code reader first. The mock enabled me to defer thinking about what happened to the byte codes after they were decoded.

The method testPushTemp is typical of the current byte code reading tests.

testPushTemp
	|method generator|
	method := TestByteCodeCompiler compiledMethodAt: #pushTemp.
	generator := MockIntermediateGenerator new.
	generator setExpectedBytecodes: #(
		(popIntoTemp 0
			(pushConstant 1))
		(returnSelf)).
	ByteCodeReader 
		newMethod: method
		using: generator.
	self should: [generator byteCodeString = 'pushConstant 1 popIntoTemp 0 returnSelf']

The lines "method := ..." and "generator := ..." create the objects needed for the test. Now that byte code reading has been factored out into it's own class then they will be moved into a setUp method.

The statement "generator setExpectedBytecodes: ..." initializes the mock. It uses a nested array that represents the stack layout of the bytecodes. The mock keeps track of the calls and will signal a failure as soon the byte code reader fails to deliver the right value. This will bring up a debugger on the right method to edit.

The statement "ByteCodeReader ..." calls the byte code reader with the previously initialized mock.

The statement "self should: ..." tests that the byte code reader reads the right byte codes in order. This was the original check, the testing inside the mock now replaces it.

After refactoring in the MockIntermediateEmitter, which checks the calls immediately, I added similar behavior to the byte code reading tests and mock. This was done for symmetry and also because the new behavior is a better test. One of the things automated tests provide is the safety needed to make changes late in the project, after we have learned from doing.

As an experiment, try commenting out "bitAnd: 16rF" in ByteCodeReader>>pushTemp, then run the programmer tests and bring up a debugger on the failing test. The test fails in the "generator pushTempAt:" call, precisely where the error is. Testing with mocks brings the error right to the point of failure.

ByteCodeReader>>pushTemp
	|answer|
	answer := generator pushTempAt: (self currentByteCode bitAnd: 16rF).
	stack addLast: answer.
	^ answer

The intermediate generation mock.

Intermediate generation started out as a single layer with a single test. testGenerateReturnTop is a good example of the first few tests. This simple strategy made it very easy to go end to end early; it let the register allocator and instruction selector start to evolve.

testGenerateReturnTop
	|method generator|
	method := TestByteCodeCompiler compiledMethodAt: #return4.
	generator := IntermediateGenerator for: MachineX86 new.
	ByteCodeReader
		newMethod: method
		using: generator.
	self should: [generator intermediateString = #((block 1 
		#(mov (mem (add (mem (add ebp 8)) 4)) t1)  	"Senders Context"
		#(mov (add (mem (add t1 12)) 1) t2) 		"New Stack Ptr"
		#(mov t2 (mem (add t1 12))) 			"Save Stack Ptr"
		#(mov 9 (mem (add (add t1 (sal 2 t2)) 28))) 	"Place 4 on callers stack"
		#(ret)))].

However both the original code and the tests were brittle and hard to understand. Take the original testGenerateReturnTop above: it's getting hard to tell which code is dealing with general tag manipulation. There are two byte codes represented in the above code, push 4 and return top. What represents push 4? The 9 in "#(mov 9 (mem ..." is 4 tagged as a SmallInteger: with a tag bit of 1, the tagged form is (4 bitShift: 1) + 1, which is 9.
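
As a minimal sketch of the tagging arithmetic, assuming the conventions visible in the IL above (a shift by one and a tag bit of 1; the tag and untag blocks are purely illustrative):

	| tag untag |
	tag := [:value | (value bitShift: 1) + 1].	"4 becomes 9"
	untag := [:oop | oop bitShift: -1].	"9 becomes 4"
	tag value: (untag value: 9) + (untag value: 5)	"adding tagged 4 and tagged 2 answers 13, which is tagged 6"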

First the production code was refactored with "extract method"; however, this left the same problem in the tests. The IntermediateEmitter object was created so that the intermediate emission code could be mocked out. Creating the mock pushed this refactoring further than I'd have taken it without the tests.

testGenerateAddByteCode
	|generator code |
	generator := IntermediateGenerator for: MachineX86 new.
	code := generator
		byteCodeAddWith: (MedLiteral smallInteger: 1)
		and:	(MedLiteral smallInteger: 2).
	self should: [code intermediateForm =
		#(add 1 (jo block2 (sal 1 (add (sar 1 3) (sar 1 5)))))].
	self should: [generator intermediateString =
		#((block 1 (jz block2 (and 1 3)) (jz block2 (and 1 5)))
		  (block 2))].
	self should: [generator method numberOfBlocks = 2].

testGenerateMinusByteCode and testEmitSubtract show a major win in this test refactoring: the actual subtract code is only a single IL instruction because all the work happens in emitConvertToSmallInteger and emitIntegerValueOf. The two major benefits are removing a lot of duplication and separating out different layers of abstraction.

testGenerateMinusByteCode
	|generator emitter |
	emitter := MockIntermediateEmitter new.
	generator := IntermediateGenerator for: MachineX86 new.
	generator emitter: emitter.
	emitter setExpectedEmission: #(emitConvertToSmallInteger block2
					(emitSubtract
						(emitIntegerValueOf block2 5)
						(emitIntegerValueOf block2 3))).
									
	generator
		sendPrim: (MedLiteral smallInteger: 2)
		minus:	(MedLiteral smallInteger: 1).
	self should: [generator method numberOfBlocks = 2].

testEmitSubtract
	| generator emitter result machine |
	machine := MachineX86 new.
	generator := IntermediateGenerator for: machine.
	emitter := IntermediateEmitter 
		machine: machine 
		jumpTargets: #().
	generator emitter: emitter.
	result := emitter 
		emit: (MedLiteral literal: 1)
		minus: (MedLiteral literal: 2).
	self should: [result intermediateForm = #(#sub 2 1)]

The major win of this refactoring is that each component is small enough to be understandable. The process is just a simple tree rewrite, but I'm not smart enough to go from Smalltalk byte codes to just short of assembly language in a single step.

While the mock and the new two-object intermediate generation approach are cleaner and more extensible, I'm glad I started with the simple approach. It did run out of steam, but it meant that when I refactored it I had plenty of example code to draw on.

Register Allocation

The register allocator is made up of three components: liveness analysis, building the interference graph, and the register allocator itself. The actual register allocator is the algorithm from Appel's book.

Liveness Analysis

The tests for liveness analysis are fairly simple; each one analyses a small section of the code and tests liveness at a couple of critical points.

testBasicLivenessAnalysis
	|source analyser|
	source := MedMethod 
		createAssembly: #((block 1 (add esp t1)))
		for: MachineX86 new.
	analyser := LivenessAnalyser on: source.
	self should: 
		[analyser 
			isLive: #(esp t1)
			at: 1].

The above method is a good example; the rest are very similar but handle more complex cases. As a test it is OK; there is repetition in the "source :=" and "analyser :=" lines that should be factored out. Given the code is stable and the tests are understandable, I'm going to leave it for now. If the rest of the test suite improves or I need to change the liveness analysis then I'll refactor them.

The method isLive:at: is purely an interface for testing. There are a few helper methods for testing; they may indicate problems in the code, but at this time it's hard to say.
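
A sketch of what such a testing interface might look like; the body and the liveRegisterNamesAt: helper are assumptions for illustration, not the project's actual implementation:

isLive: registerNames at: anIndex
	"Testing-only interface, sketched: answer true when the named registers
	 are exactly those live at the given instruction."
	^ (self liveRegisterNamesAt: anIndex) asSet = registerNames asSet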

Building The Interference Graph

The tests for building the interference graph are similar to the tests for liveness analysis. They are a little more complex because there is more setup work: the interference graph generation needs the liveness information generated earlier, so there is setup code to generate it.

TestInterferenceGraphGeneration>>testRetInterferance is a good example of these tests. They do need some refactoring.

testRetInterferance
	|source |
	machine := MachineX86 new.
	source := MedMethod 
		createAssembly: #((block 1 (mov 1 t1) (mov 2 t2) (add t1 t2) (ret) ))
		for: machine.	
	allocator := ColouringRegisterAllocator new
		initialiseSource: source 
		machine: machine.
	allocator analyseLiveness.
	allocator buildInterferenceGraph.

	self should:
		[allocator
			interferes: (machine registerNamed: #t1)
			with: (machine registerNamed: #t2)].
	self should:
		[allocator
			interferes: (machine registerNamed: #t2)
			with: (machine registerNamed: #t1)].

The first section just runs the interference graph generator for a simple code sequence, "#((block 1 (mov 1 t1) (mov 2 t2) (add t1 t2) (ret) ))". That section should be refactored into something like:
testRetInterferance
       self generateInterferenceGraph: #((block 1 ...))

       self should: "check the interference graph"

When creating a suite of tests, the later tests are often copied and pasted from the first one. Until it becomes clear which code varies between tests and which is always the same, I'm fine with this. Test code tends to be more formulaic than production code, but both should be readable. Readability is more important for test code because only its clarity makes us trust it; there are no tests of test code.

TestInterferenceGraphVerification>>verifyInterferenceGraph is a consistency checking method. It is only called in testReturn42Interferance and checks the validity of the interference graph created. TestInstructionSelector>>verifySelector is a better example of a verify method. I'll probably add the verify message to all the tests when I next work on interference graph generation. Verify methods have really paid off when building shared structures.
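
As an illustration of the kind of invariant a verify method can check, here is a sketch built on the interferes:with: interface from the test above. The registers accessor is an assumed name, and the real verify methods may well check more:

verifyInterferenceGraph
	"Sketch: interference should be irreflexive and symmetric.
	 #registers is an assumed accessor, not from the actual code."
	machine registers do: [:a |
		self shouldnt: [allocator interferes: a with: a].
		machine registers do: [:b |
			(allocator interferes: a with: b) ifTrue:
				[self should: [allocator interferes: b with: a]]]]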

Register Allocation

The register allocator is a neat example of a high theory component. The first allocator was very simple, a basic LRU allocator that let the general system evolve. While it was replaced, building it was worthwhile because it at least halved the time required to get real code compiling and running. It also let me observe the difference that decent register allocation makes to the quality of the generated code. I was following Appel's simplest compiler that could possibly work, which simplifies a lot of problems by assuming that there is a decent coloring register allocator. It is possible that there are other simpler designs that don't need a coloring allocator. However I'm currently targeting the x86, which only has 6 registers, leaving very little room for wastage.

The register allocator is logically split into two stages: pushing the temporaries onto the stack, then popping them all off. The pushing stage makes preliminary decisions, including which registers are potential spill candidates. This split between the pushing phase and the popping phase creates a problem for testing because the exact order that the registers are pushed is non-deterministic. The order depends on the order Sets are iterated over, which depends on object IDs, which are different each time the code is run.

Stuffing values into the stack to make the popping phase deterministic is ugly, but I'm currently satisfied with this solution. Directly modifying the state of the register allocator is a sign that it should be broken into multiple objects; as register allocation is currently a highly cohesive unit, this can wait. This is a judgment call, but the code feels fine to work on. If it were the worst current problem with the tests then I'd fix it; repairing it opportunistically would take too much effort, and it isn't the current worst problem.

The tests of selection (with names like testSelect*) show the stack being set during test set up. testSelectWithSpilling is a good example.

testSelectWithSpilling
	self createAllocator: #((block 1
				   (add t1 t2) (add t3 t4) (add t5 t6) (add t6 t7) 
				   (add t3 t5) (add t5 t1)  (jmp block1) )
				(block 2 (add t8 t5) )).

	allocator analyseLiveness.
	allocator buildInterferenceGraph.
	allocator registerStack: #(  t1 t2 t3 t4 t5 t6 t7 t8).
	(allocator registerStack at: 1) markAsSpilled.

	allocator select.

	self should: [(allocator spiltRegisters at: 1) name = #t1].
	self verifyAllocator.

Only a few of the select tests need to explicitly set the stack (3 out of 5 at the moment, and the other two are simple).

The selection tests could be improved greatly, but even as they are they really helped during development. The test cases do show up my lack of understanding when I wrote this code. However, writing this way did help me understand the algorithm. A key point that I only understood after writing most of the code was how selection interacts with simplification. Simplification must mark potential spills even though we don't decide to spill until selection. This is to solve Preston Briggs's diamond allocation problem.

testSelectDoesntMarkAsSpilled was introduced during debugging; without the "allocator registerStack:" the bug is non-deterministic and only occurs in 1 run in 5. Being able to create a test that replicates the bug every time really helped track it down. (I think this bug was due to not understanding about marking nodes as potential spills.) Once I noticed why it was non-deterministic, I ran the test until I captured the offending stack and used that to build this test. This gave me a test that replicated the failure reliably, making debugging much easier.

High theory domains carry the risk of over design: the simple instruction selector is working fine while the first simple register allocator needed replacing. Sometimes a lot of theory in an area just means there is a lot that looks good on paper. Without personal experience it is hard to tell the two apart. Evolutionary development lets experience guide a team's first implementation.

Instruction Selection

The instruction selection tests are old and have been refactored. The tests are fine so far as they go, but there are a few missing details. These tests all rely on a single interface to the instruction selector and just vary the instructions passed in. They are probably some of the most mature tests in the system.

TestInstructionSelection>>testInstructionMemWithAddressing is a good example of these tests. It both shows their good points and a few little problems.

testInstructionMemWithAddressing
	selector := self selectInstructions: #((block 1 (mem (add ebp 8) ) )).
	self should: [selector intermediateForm = #((block 1 (mov (8 ebp) t1) ))].
	self verifySelector.

The method name dates back to when all the customer tests were in a single class. Assigning to selector is historic: selector used to be a method variable; now the assignment is done inside selectInstructions:.

The check "self should: [..." could be factored into a checking method. Currently I don't have any idioms for local checking methods. Checking methods really only became possible once the tests were in separate classes and the key objects where instance variables. Instance variables were used before the tests were factored into separate classes but their use was limited to the most important variables to avoid too much name pollution.

TestInstructionSelector>>verifySelector verifies that the intermediate generated is correctly formed assembly. Data-structure verification methods are starting to appear in this code. It's a nice way of checking that structural invariants are always obeyed.

The verify methods originated when chasing a bug due to using the wrong class in assembly code. Assembly uses MedAssembly where intermediate uses other classes. Introducing verify methods tracked down all the occurrences and created a single place to add checks for well-formedness. Verify methods are becoming a standard convention in this project.

Assembly Code Emission

Assembly code emission uses my standard strategy of feeding different information into a standard test interface.

It is a very similar operation to machine code emission. The two areas are interesting to compare, to see how styles change and what tests were needed each time. Machine code emission is still under development and misses a few obvious tests, such as checking that instructions are encoded in the correct order.

The tests look for substrings in the output. This is a way of avoiding the fixed call prolog instructions. Machine code emission added a special method basicAssemble: that doesn't generate the prolog, to solve the same problem.

I'll pick on a representative method to show how I'd refactor it now. The refactorings are only an improvement now that TestAssemblyGeneration is its own class. The refactoring relies on moving the variable result into an instance variable.

testEmitIndirectAddressingWithOffset
	| result |
	result := (self emitFor: #((block 1 (mov (-16 eax) ebx))) ).
	self 
		stringShould: result
		contain:   'movl -16(%eax), %ebx'.

testIndirectAddressingWithOffset
	self emitFor: #((block 1 (mov (-16 eax) ebx))).
	self checkHas: 'movl -16(%eax), %ebx'.

The first method is the original and the second is a possible improvement. The second just moves common repeated code into local helper methods. While this does increase the vocabulary needed to read a specific test, I often find myself skimming an entire category of tests. When looking at several methods, the second version is definitely better because it's easier to see what is different between each test.
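
The helpers themselves would be small. Here is a sketch, assuming the old emitFor: body moves into a renamed method and the output is kept in a result instance variable; checkHas: and assembleIntermediate: are hypothetical names, while stringShould:contain: appears in the original test:

emitFor: anIntermediateArray
	"Assemble the intermediate and keep the output for later checks."
	result := self assembleIntermediate: anIntermediateArray

checkHas: aString
	"Check that the captured assembly contains the given substring."
	self stringShould: result contain: aString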

If I ended up reading through the assembly generation tests while programming, I'd probably just refactor the entire suite. I'm not going to now because I've just split the programmer tests into separate classes and would rather push on and write some more production code.

Machine Code Emission

The machine code emission tests are the newest area. They are coming out much cleaner than the earlier sections. This is probably because I've learned some useful testing idioms for writing compilers.

Machine Code Emission started with a test case that assembled an empty method into an empty ByteCodeArray. Then it grew by adding different examples to test each individual assembly instruction and addressing mode used in "return43".
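
A hypothetical reconstruction of that first test might read as follows; the class and message names are assumptions, not the project's actual code:

testAssembleEmptyMethod
	"Sketch only: an empty method should assemble to empty machine code.
	 MachineCodeEmitter and assemble: are assumed names."
	| code |
	code := MachineCodeEmitter new assemble: #((block 1)).
	self should: [code isEmpty]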

The advantage of testing through the interface is that it has let the implementation change underneath. The assembler started off as a single visitor class and is slowly getting more complex. It currently includes a class to encode each instruction. The design hasn't yet stabilized.

Some people worry that test code makes it harder to evolve the program because the tests need to be changed as well as the production code. This isn't the case because both production and test code are being changed throughout development, so they will both evolve into sufficiently adaptable forms. The tests make it safer to change the production code, and that ability to change the code enables refactoring for more testability.

The machine code emission code is still under development. There are a lot of missing instructions, including jumps, which will take more than a single isolated instruction to test. The current tests do not check that the instructions are assembled in order.

Interpreter Integration

The interpreter integration testing is the weakest part. Currently the tests are good for defining an individual goal. They also provide a decent regression test.

This is testing legacy code. Yes, I know I could ask, and the people who know are very helpful, but the legacy code problem is interesting and I've been making solid progress.

The tests for the interpreter and plug-in are a little large, and they require the use of the debugger to check intermediate values. For example, for testRegisterMachineCode I'm going to need to use the debugger to check that the code cache is created (malloced), that the code is copied into the code cache, and that the method address is returned. The test just checks that the method given can be executed after being registered.

The tests are still very useful because they break programming into small separately testable episodes. The test is the final check at the end of the episode. I'd expect that the tests will have a smaller granularity as I discover more strategies for testing the interpreter.

Normally I will use a debugger to explore when doing test driven development. In Smalltalk it is very common for me to write code in the debugger, even without a "create missing method" button. Writing code in the debugger is an easy way to ensure that I'm not programming beyond the test.

How the Project Grew

The project was aiming to be as simple as possible but not naive. Because this area is theory rich, I followed Andrew Appel's "Modern Compiler Implementation in C". On individual details I would "time out" and do the quickest thing that might work, knowing that it was not good enough. Going end to end fast was more important than getting any individual detail right to begin with.

Iteration 1, the basic compiler

Iteration 1 started with the customer test "return42". The basic method was to write a customer test that failed, then based on the failure write the programmer tests needed.

The first part of the compiler was the ByteCodeReader. It started by being coded against the mock because the real code wasn't written.

The intermediate code generation started as a single method that generated intermediate for the entire original method. It was testGenerateReturnTop transcribed into intermediate generation calls. These early tests used real CompiledMethods.

The next step was to go end to end. This compiled into assembly which was then run. It involved a basic instruction selector, register allocator, and assembly emitter module. The focus was getting everything joined up as fast as possible at the expense of the individual parts.

The end to end system was then extended to handle more byte codes until it could handle SmallInteger factorials.

At this point, intermediate generation was getting brittle and hard to understand. The abstraction jump from byte codes to low-level intermediate language was too great. It was refactored using extract method to introduce the intermediate emitter methods. The unit tests still had the same fragility, so the emitter methods were moved into a separate object and the emitter mock created. This refactoring happened because my brain couldn't deal with that much abstraction in a single place. Object oriented code forces us to juggle several methods at any point in time, so each one must be very simple.

Iteration 2, generating reasonable code

The code being generated was too slow to be useful. Register pressure was the biggest problem. A little analysis of the code produced suggested that a factor of four improvement should be easily available by fixing the two worst current problems.

The simplistic register allocator was replaced with a coalescing register allocator written test first. Move coalescing is essential on the x86 when using Appel's trick of faking 3-address instructions with 2-address instructions.

A few extensions were added to the instruction selector, which still uses the maximal munch algorithm; they doubled performance. The improvement is due both to denser code that uses complex addressing modes and to reduced register pressure from not needing registers for address generation intermediate values.
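
As an illustration in the IL notation used above (the exact sequences are mine, not the compiler's actual output): without complex addressing modes the selector needs a temporary for the address, while with them the address computation folds into a single instruction:

	"Without addressing modes: compute the address into a temporary first."
	#((mov ebp t2) (add 8 t2) (mov (0 t2) t1))
	"With addressing modes: one instruction and no address temporary."
	#((mov (8 ebp) t1))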

The instruction selection changes took about a day to add compared with about 3 weeks for the new register allocator.

Iterative Factorial, 21 byte-codes.

Before anything:          401 lines
With register allocator:  263 lines
     (no move coalescing)
Full register allocation: 202 lines
With addressing modes:    104 lines

[With 17 lines of overhead code]

So my initial guess of a factor of four is about right. I'm getting about 5 instructions per byte code on average, which is not bad given this is the x86 with few registers and 2-operand instructions.

Five instructions on average is actually very good given that an add byte code in general needs 10 instructions: 4 instructions to tag check (one comparison and one branch per operand), 2 instructions to untag, 1 to add, and 3 to retag and check for overflow.
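
That count lines up with the IL shown earlier in testGenerateAddByteCode; annotating it (the annotations are mine, the IL is from the test above):

	(jz block2 (and 1 3))	"tag check first operand: 2 instructions"
	(jz block2 (and 1 5))	"tag check second operand: 2 instructions"
	(sar 1 3) (sar 1 5)	"untag both operands: 2 instructions"
	(add ...)	"the add itself: 1 instruction"
	(add 1 (jo block2 (sal 1 ...)))	"retag and check for overflow: 3 instructions"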

It is possible to improve the code. Changing the integer tag from 1 to 0 would save three instructions, because the tag bit would not need to be cleared before each add or subtract and set after. One instruction could be saved during the checks by using the trick Tim Rowledge uses in the interpreter. This would reduce the overhead to 5 instructions (of the 10, 1 is the add itself, so the overhead is 9; saving 3 + 1 leaves 5). It would be sensible to get some hard data before adding any of these improvements, especially changing the tagging scheme.

Iteration 3, moving towards a release

This iteration is taking the compiler fully end to end which involves integrating with the interpreter and also writing an assembler. Both involve a lot of exploring external systems.

I'm trying to get most of the integration code into the plug-in. At this point it's still a little early to say but the current hooks are very small and fairly free of policy. They use the hook mechanism Ian Piumarta introduced.

The interpreter and plug-in tests cover a little too much code but then they are testing a new environment and working with code that wasn't developed with automated testing in mind.

The following three tests were written in order; this is modifying legacy code by introducing tests.

testRegisterFunction tests the basic registering functionality. This is the beginning of the interpreter modification and the plug-in. This test took about a week to get passing but most of the time was spent understanding how the interpreter works.

testRegisterMachineCode tests registering some machine code. This is the beginning of the code cache.

testAssembledReturn43 tests compiling and registering a function. This is the first completely end to end test. The assembler is being developed normally with programmer tests.

Writing this document triggered a major test refactoring. Explaining the old style was too painful so I refactored it. The first thing I find when trying to explain something in writing as a programmer is the desire to fix the problem rather than explain it. I think there is a key here for combining writing with programming in a mutually beneficial way.

The results of the major refactoring are obvious in the number of smaller problems which are now fixable. The major refactoring only took about half a day after getting a satisfactory test browser. Playing with the Test Browser took about a day and a half. It's not the time that makes this change so large but the impact on the global structure.

Where it is going

The next stage is to work on the interpreter integration and the assembler. The aim is to get something useful to somebody, i.e. something that will speed up at least one real project.

The first step is to extend the amount of the language that it handles while aiming for fast byte code performance. The current target is to handle the byte code micro benchmark while providing a useful speed up. After the byte code benchmark the next goal is a real project.

Once it handles a decent amount of the language it's time to start speeding up sends. Send performance will probably be improved by adding polymorphic inline caches, then adding recompilation with type feedback.