Tests are a tool to manage the complexity of details. The complexity to work on this code has stayed constant as the system grew from a just decoding byte codes into a basic compiler, then a more complex compiler, and then integration with the interpreter. To develop software of arbitrary complexity it is necessary to keep the marginal complexity constant because we only have finite minds.
The tests are divided into customer tests and programmer tests. The programmer tests include unit testing and just enough integration testing to make changes safe. The programmer tests are quick and safe to run. They will not corrupt the image. The customer tests are end to end tests that test functionality in a way that should be meaningful for a customer but they can crash the system and can take many seconds to run.
The customer tests are split between TestExuperyCompiler and TestExuperyPlugin. These tests can take a while to run, currently about 12 seconds, and can also corrupt the image so I don't run them unless the programmer tests pass.
The programmer tests are split across several classes, each one testing a different section. They are currently undergoing a major refactoring. This strategic refactoring required a change to my programming environment which has both improved the system and also opened up a lot of smaller refactorings.
The Test Runner that comes with Squeak could run either all the tests or a single class, this might have been OK before the development system included so many tests however my standard development environment already has several hours of tests. By keeping the tests squashed together in a single class, I had a reasonable interface at the expense of a little ugliness in the code. That ugliness started to grow.
The Test Browser can run an entire category of tests at once rather than just a single class with the Test Runner. I've modified the Test Browser to suite my own personal preferences and now have an interface that is convenient.
The customer tests test things the customer would care about in a way that should make sense to a customer. The customer tests are split into two separate parts, testing the compiler in TestExuperyCompiler and testing the interpreter integration in TestExuperyPlugin. They may take a while to run and can crash the system because they sometimes test integration.
TestExuperyCompiler tests the basic compiler, each test compiles a small Smalltalk method into assembly then runs it by linking with a C stub. These were the first end to end tests.
TestExuperyPlugin tests compiling and running methods completely inside the image. It also tests the hooks into the interpreter. The interpreter hooks are part of this suite because they can crash Squeak which the programmer tests must never do.
The tests in TestExuperyCompiler compile byte codes from CompiledMethods into assembly suitable for gas (the GNU assembler) which is linked against a C test harness to produce an executable. This executable is run and it's output checked. The executable's output is the top of the stack after the compiled method has run.
The acceptance test production mechanism isn't scalable but it hasn't yet caused actual problems. It might be better to generate the C code and to use a set of examples to check that the interpreter is doing the same thing as the compiler. This would be Cem Kaner's Oracles in "Architectures of Test Automation".
The interpreter integration could be considered to be programmer testing except programmer tests must be safe to run. They are not in the programmer tests because of the risk of crashing the image.
The interpreter integration tests are an example of modifying legacy code using a test driven approach. The tests check the changes and the modifications were small to minimize the risk of accidentally changing the behavior of the interpreter. Modifying untested code by just testing the changes is a neat way of driving tests into new area's. It is the changes that are going to be most understood. When working on untested code it will be necessary to make changes with out tests if only to create easily testable interfaces.
The method testPushTemp is a standard current test for byte code reading.
testPushTemp |method generator| method := TestByteCodeCompiler compiledMethodAt: #pushTemp. generator := MockIntermediateGenerator new. generator setExpectedBytecodes: #( (popIntoTemp 0 (pushConstant 1)) (returnSelf)). ByteCodeReader newMethod: method using: generator. self should: [generator byteCodeString = 'pushConstant 1 popIntoTemp 0 returnSelf']
The lines "method := ..." and "generator := ..." create the objects needed for the test. Now that byte code reading has been factored out into it's own class then they will be moved into a setUp method.
The statement "generator setExpectedBytecodes: ..." initializes the mock. It uses a nested array that represents the stack layout of the bytecodes. The mock keeps track of the calls and will signal a failure as soon the byte code reader fails to deliver the right value. This will bring up a debugger on the right method to edit.
The statement "ByteCodeReader ..." calls the byte code reader with the previously initialized mock.
The statement "self should: ..." tests that the byte code reader reads the right byte codes in order. This was the original check, the testing inside the mock now replaces it.
After refactoring in the MockIntermediateEmittor which checked the calls immediately, I added similar behavior to the byte code reading tests and mock. This was done for symmetry and also because the new behavior is a better test. One of the things automated tests provide is the safety needed to make changes late in the project after we have learned from doing.
As an experiment try commenting out "bitAnd: 16rF" in ByteCodeReader>>pushTemp then run the programmer tests and bring up a debugger on the test. The test fails in the "generator pushTempAt:" call precisely where the error is. Testing with mocks brings the error right to the point of failure.
ByteCodeReader>>pushTemp |answer| answer := generator pushTempAt: (self currentByteCode bitAnd: 16rF). stack addLast: answer. ^ answer
testGenerateReturnTop |method generator| method := TestByteCodeCompiler compiledMethodAt: #return4. generator := IntermediateGenerator for: MachineX86 new. ByteCodeReader newMethod: method using: generator. self should: [generator intermediateString = #((block 1 #(mov (mem (add (mem (add ebp 8)) 4)) t1) "Senders Context" #(mov (add (mem (add t1 12)) 1) t2) "New Stack Ptr" #(mov t2 (mem (add t1 12))) "Save Stack Ptr" #(mov 9 (mem (add (add t1 (sal 2 t2)) 28))) "Place 4 on callers stack" #(ret)))].However both the original code and tests were both brittle and hard to understand. Take the original testGenerateReturnTop above, it's getting hard to tell what code is dealing with general tag manipulation. There are two byte code represented in the above code, push 4 and return top. What represents push 4? The 9 in "#(mov 9 (mem ..."
First the production code was refactored with "extract method" however this left the same problem in the tests. The IntermediateEmitter object was created so that the intermediate emission code could be mocked out. Creating the mock pushed this refactoring further than I'd have taken it without the tests.
testGenerateAddByteCode
|generator code |
generator := IntermediateGenerator for: MachineX86 new.
code := generator
byteCodeAddWith: (MedLiteral smallInteger: 1)
and: (MedLiteral smallInteger: 2).
self should: [code intermediateForm = #(add 1 (jo block2
(sal 1 (add (sar 1 3)
(sar 1 5)) )) )].
self should: [generator intermediateString = #((block 1 (jz block2 (and 1 3))
(jz block2 (and 1 5)) )
(block 2))].
self should: [generator method numberOfBlocks = 2].
testGenerateMinusByteCode and testEmitSubtract shows are major win in
this test refactoring, the actual subtract code is only a single IL
instruction because all the work happens in emitConvertToSmallInteger
and emitIntegerValueOf. The two major benefits are removing a lot of
duplication and separating out different layers of abstration.
testGenerateMinusByteCode |generator emitter | emitter := MockIntermediateEmitter new. generator := IntermediateGenerator for: MachineX86 new. generator emitter: emitter. emitter setExpectedEmission: #(emitConvertToSmallInteger block2 (emitSubtract (emitIntegerValueOf block2 5) (emitIntegerValueOf block2 3))). generator sendPrim: (MedLiteral smallInteger: 2) minus: (MedLiteral smallInteger: 1). self should: [generator method numberOfBlocks = 2].
testEmitSubtract | generator emitter result machine | machine := MachineX86 new. generator := IntermediateGenerator for: machine. emitter := IntermediateEmitter machine: machine jumpTargets: #(). generator emitter: emitter. result := emitter emit: (MedLiteral literal: 1) minus: (MedLiteral literal: 2). self should: [result intermediateForm = #(#sub 2 1)]The major win of this refactoring is each component is small enough to be understandable. The process is just a simple tree rewrite but I'm not smart enough to follow going from Smalltalk byte code to just under assembly language in a single step.
While the mock and the new two object intermediate generation approach is cleaner and more extendible, I'm glad I started with the simple approach. It did run out of steam but it meant that when I refactored it I had plenty of example code to draw on.
testBasicLivenessAnalysis |source analyser| source := MedMethod createAssembly: #((block 1 (add esp t1))) for: MachineX86 new. analyser := LivenessAnalyser on: source. self should: [analyser isLive: #(esp t1) at: 1].The above method is a good example, the rest are very similar but handle more complex cases. As a test it is OK, there is repetition that should be factored out in the "source :=" and "analyser :=" lines. Given the code is stable and the tests are understandable I'm going to leave it for now. If the rest of the test suite improves or I need to change the liveness analysis then I'll refactor them.
The method isLive:at: is purely an interface for testing. There are a few helper methods for testing, they may indicate problems in the code but at this time it's hard to say.
TestInterferenceGraphGeneration>>testRetInterferance is a good example these tests. They do need some refactoring.
testRetInterferance |source | machine := MachineX86 new. source := MedMethod createAssembly: #((block 1 (mov 1 t1) (mov 2 t2) (add t1 t2) (ret) )) for: machine. allocator := ColouringRegisterAllocator new initialiseSource: source machine: machine. allocator analyseLiveness. allocator buildInterferenceGraph. self should: [allocator interferes: (machine registerNamed: #t1) with: (machine registerNamed: #t2)]. self should: [allocator interferes: (machine registerNamed: #t2) with: (machine registerNamed: #t1)].The first section just runs the interference graph generator for a simple code sequence "#((block 1 (mov 1 t1) (mov 2 t2) (add t1 t2) (ret) ))". So that section should be refactored into something like:
testRetInterferance
self generateInterferenceGraph: #((block 1 ...))
self should: "check the interference graph"
When creating a suite of tests often the later tests are copy and
pasted from the first one. Until it becomes clear what code is
replicated and what code is always the same I'm fine with this. Test
code tends to be more formulaic than production code but both should
be readable. Readability is more important for test code because only
the clarity makes us trust it, there are no tests of test code.
TestInterferenceGraphVerification>>verifyInterferenceGraph is a
consistency checking method. It is only called in
testReturn42Interferance and checks the validity of the interference
graph created. TestInstructionSelector>>verifySelector is a better
example of a verify method. I'll probably add the verify message to
all the tests when I next work on interference graph generation. Verify
methods have really payed off when building shared structures.
The register allocator is logically split into two stages, pushing the temporaries onto the stack then popping them all off. The pushing stage makes preliminary decisions including which registers are potential register candidates. This split between the pushing phase then the popping phase creates a problem for testing because the exact order that the registers are pushed is non deterministic. The order depends on the order Sets are iterated over which depends on object ID's which are different each time the code is run.
Stuffing values into the stack to make the popping phase deterministic is ugly but I'm currently satisfied with this solution. Directly modifying the state of the register allocator is a sign that it should be broken into multiple objects. Currently register allocation is a highly cohesive unit so this can wait. This is a judgment call but the code feels fine to work on. If it was the worst current problem with the tests then I'd fix it. It would take too much effort to be optimistically repaired and isn't the current worst problem.
The tests of selection (with names like testSelect*) show the stack being set during test set up. testSelectWithSpilling is a good example.
testSelectWithSpilling self createAllocator: #((block 1 (add t1 t2) (add t3 t4) (add t5 t6) (add t6 t7) (add t3 t5) (add t5 t1) (jmp block1) ) (block 2 (add t8 t5) )). allocator analyseLiveness. allocator buildInterferenceGraph. allocator registerStack: #( t1 t2 t3 t4 t5 t6 t7 t8). (allocator registerStack at: 1) markAsSpilled. allocator select. self should: [(allocator spiltRegisters at: 1) name = #t1]. self verifyAllocator.
Only a few of the select tests need to explicitly set the stack. (3 at the moment out of 5 and the other two are simple).
The selection tests could be improved greatly but even as they are they really helped during development. The test cases do show up my lack of understanding as I wrote this code. However writing this way did help me understand the algorithm. A key point that I only understood after writing most of the code was how the selection interacts with simplification. Simplification must mark potential spills even though we don't decide to spill until selection. This is to solve Preston Brig's diamond allocation problem.
testSelectDoesntMarkAsSpilled was introduced during debugging, without the "allocator registerStack:" this bug is non-deterministic and only occurs in 1 run in 5. Being able to create a test that replicates the bug every time really helped track it down. (I think that this bug was due to not understanding about marking nodes as potential spills). Once I noticed why it was nondeterministic I ran the test until I captured the offending stack and used that to build this test. This gave me a test that replicated the failure reliably making debugging much easier.
High theory domains have the risk of over design, the simple instruction selector is working fine while the first simple register allocator needed replacing. Sometimes a lot of theory in an area just means there is a lot that looks good on paper. Without personal experience it is hard to tell the two apart. Evolutionary development lets experience guide the first implementation by any team.
TestInstructionSelection>>testInstructionMemWithAddressing is a good example of these tests. It both shows their good points and a few little problems.
testInstructionMemWithAddressing selector := self selectInstructions: #((block 1 (mem (add ebp 8) ) )). self should: [selector intermediateForm = #((block 1 (mov (8 ebp) t1) ))]. self verifySelector.The method name dates back to when all the customer tests were in a single class. Assigning to selector is historic, selector used to be a method variable, now the assignment is done inside selectInstructions:.
The check "self should: [..." could be factored into a checking method. Currently I don't have any idioms for local checking methods. Checking methods really only became possible once the tests were in separate classes and the key objects where instance variables. Instance variables were used before the tests were factored into separate classes but their use was limited to the most important variables to avoid too much name pollution.
TestInstructionSelector>>verifySelector verifies that the intermediate generated is correctly formed assembly. Data-structure verification methods are starting to appear in this code. It's a nice way of checking that structural invariants are always obeyed.
The verify methods originated when chasing a bug due to using the wrong class in assembly code. Assembly uses MedAssembly where intermediate uses other classes. Introducing verify methods tracked down all the occurrences and it creates a single place to add checks for well formedness. Verify methods are becoming a standard convention in this project.
It is very similar operation to machine code emission. The two areas are interesting to compare to see how styles change and also what tests were needed each time. Machine code emission is still under development and misses a few obvious tests such as checking instructions are encoded in the correct order.
The tests look for substrings in the output. This is a way of avoiding the fixed call prolog instructions. Machine code emission added a special method basicAssemble: that doesn't generate the code emission to solve the same problem.
I'll pick on a representative method to show how I'd refactor it now. The refactorings are only an improvement now that TestAssemblyGeneration is it's own class. The refactoring relies on moving the variable result into an instance variable.
testEmitIndirectAddressingWithOffset | result | result := (self emitFor: #((block 1 (mov (-16 eax) ebx))) ). self stringShould: result contain: 'movl -16(%eax), %ebx'.
testIndirectAddressingWithOffset
self emitFor: #( (block 1 (mov (-16 eax) ebx))) ).
self checkHas: 'movl -16(%eax), %ebx'.
The first method is the original and the second is a possible
improvement. The second just moves common repeated code into local
helper methods. While this does increase the vocabulary needed to read
a specific test, often I find myself skimming an entire category of
tests. When looking at several methods the second version is
definitely better because it's easier to see what is different between
each test.
If I ended up reading through the assembly generation tests while programming, I'd probably just refactor the entire suite. I'm not going to now because I've just split the programmer tests into separate classes and would rather push on and write some more production code.
Machine Code Emission started with a test case that assembled an empty method into an empty ByteCodeArray. Then it grew by adding different examples to test each individual assembly instruction and addressing mode used in "return43".
The advantage of testing the interface is it has let the implementation change underneath. The assembler started off as a single visitor class and is slowly getting more complex. It currently includes a class to encode each instruction. The design hasn't yet stabilized.
Some people worry that test code makes it harder to evolve the program because the tests need to be changed as well as the production code. This isn't the case because both production and test code is being changed throughout development so they will both evolve to sufficiently adaptable forms. The tests make it safer to change the production code and that ability to change the code enables refactoring for more testability.
The machine code emission code is still under development. There are a lot of missing instructions including jumps which will more than a single isolated instruction to test. The current tests do not check that the instructions are assembled in order.
This is testing legacy code, yes I know I could ask and the people who know are very helpful but the legacy code problem is interesting and I've been making solid progress.
The tests for the interpreter and plug-in are a little large. The tests require the use of the debugger to check intermediate values. e.g. For testRegisterMachineCode, I'm going to need to use the debugger to check that the code cache is created (malloced), that the code is copied into the code cache, that the method address is returned. The test just checks that the method given can be executed after being registered.
The tests are still very useful because they break programming into small separately testable episodes. The test is the final check at the end of the episode. I'd expect that the tests will have a smaller granularity as I discover more strategies for testing the interpreter.
Normally I will use a debugger to explore when doing test driven development. In Smalltalk it is very common for me to write code in the debugger, even without a create missing method button. Writing code in the debugger is an easy way to ensure that I'm not programming beyond the test.
The first part of the compiler was the ByteCodeReader. It started by being coded against the mock because the real code wasn't written.
The intermediate code generation started as a single method that generated the original method. It was testGenerateReturnTop transcribed into intermediate generation calls. These early tests used real CompiledMethods.
The next step was to go end to end. This compiled into assembly which was then run. This involved a basic instruction selector, register allocator, and assembly emittor module. The focus was getting everything joined up as fast as possible at the expense of the individual parts.
The end to end system was then extended to handle more byte codes until it could handle SmallInteger factorials.
At this point, intermediate generation was getting brittle and hard to understand. The abstraction jump from byte codes to low-level intermediate language was too great. It was refactored using extract method to introduce the intermediate emittor methods. The unit tests still had the same fragility so the emittor methods were moved into a separate object and the emittor mock created. This refactoring was because my brain couldn't deal with that much abstraction in a single place. Object oriented code forces us to juggle several methods at any point in time so each one must be very simple.
The simplistic register allocator was replaced that with a coleasing register allocator written test first. Move coalescing is essential on the x86 when using Appel's trick of faking 3 address with 2 address.
A few extensions were added to the instruction selector which still uses the maximal munch algorithm that doubled performance. The improvement is both due to denser code that uses complex addressing modes and also to reducing the register pressure by not needing registers for address generation intermediate values.
The instruction selection changes took about a day to add compared with about 3 weeks for the new register allocator.
Iterative Factorial, 21 byte-codes.
Before anything: 401 lines
With regiser allocator: 263 lines
no move coalescing
Full register allocation: 202 lines
With addressing modes: 104 lines
[With 17 lines of overhead code]
So my initial guess of a factor of four is about right. I'm getting
about 5 instructions per byte code on average. Which is not bad, given
this is the x86 with few registers and 2 operand instructions.
Five instructions on average is actually very good given a add byte code in general needs 10 instructions: 4 instructions to tag check (one comparison and one branch), 2 instructions to untag, 1 to add, 3 to retag and check for overflow.
It is possible to improve the code. Changing the integer tag from 1 to 0 would save three instructions because the tag bit would not need to be cleared before each add or subtract and set after. One instruction could be saved during the checks by using the trick Tim Rowledge uses in the interpreter. This would reduce the overhead to 5 instructions. It would be sensible to get some hard data before adding any of these improvements especially changing the tagging codes.
I'm trying to get most of the integration code into the plug-in. At this point it's still a little early to say but the current hooks are very small and fairly free of policy. They use the hook mechanism Ian Piumarta introduced.
The interpreter and plug-in tests cover a little too much code but then they are testing a new environment and working with code that wasn't developed with automated testing in mind.
The following three tests were written in order, this is modifying legacy code by introducing tests.
testRegisterFunction tests the basic registering functionality. This is the beginning of the interpreter modification and the plug-in. This test took about a week to get passing but most of the time was spent understanding how the interpreter works.
testRegisterMachineCode tests registering some machine code. This is the beginning of the code cache.
testAssembledReturn43 tests compiling and registering a function. This is the first completely end to end test. The assembler is being developed normally with programmer tests.
Writing this document triggered a major test refactoring. Explaining the old style was too painful so I refactored it. The first thing I find when trying to explain something in writing as a programmer is the desire to fix the problem rather than explain it. I think there is a key here for combining writing with programming in a mutually beneficial way.
The results of the major refactoring are obvious in the number of smaller problems which are now fixable. The major refactoring only took about half a day after getting a satisfactory test browser. Playing with the Test Browser took about a day and a half. It's not the time that makes this change so large but the impact on the global structure.
The first step is to extend the amount of the language that it handles while aiming for fast byte code performance. The current target is to handle the byte code micro benchmark while providing a useful speed up. After the byte code benchmark the next goal is a real project.
Once it handles a decent amount of the language it's time to start speeding up sends. Send performance will probably be improved by adding polymorphic inline caches then adding recompilation with type feed back.