Exupery, A Project Proposal

Introduction

What makes dynamic compilers interesting? Besides being the most promising way to implement object-oriented languages efficiently, dynamic compilation provides an interesting bridge between hardware and systems programming. The technology is old but has not been widely used, even though almost every field has interesting examples.

Ideally, it should be possible to use a dynamic compilation framework for a wide range of problems including database query execution, network packet filtering, as well as language implementation. This project proposes to build such a framework starting with a compiler for Squeak.

Such a framework would help with experiments in language design, hardware design, and systems programming.

The current release of Exupery is available from SqueakMap. There is some documentation available at http://www.kampjes.demon.co.uk/.

My personal project

My personal project is to build a dynamic compiler for Squeak that is both faster and simpler than the [Holzle94] compiler. I believe this is possible by removing a key constraint in the Holzle design: specifically, that the compiler may be invoked just before executing a method.

Removing the requirement that the compiler be ready to compile any method at any point in time, just before method execution, allows the compiler to be written in the source language. It also removes the design constraints of fast compile times and of avoiding deadlocks when the compiler might be needed to compile parts of itself.

A fast compiler might not be the right design choice in an adaptive optimizing system. Only the hot spot code needs to be compiled, and hopefully the hot spot will be called often in the future, amortizing the time required to produce good code. The initial execution could be done with an instrumented interpreter rather than a quick compiler. This provides faster "initial compilation" than a simple compiler and also an easier migration path from language design using an interpreter to an efficient system with dynamic compilation. It would be worthwhile to do empirical experiments to explore the trade-offs between background compilation and inline compilation.
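
To make the instrumentation half of that trade-off concrete, here is a minimal sketch in C (not Squeak) of an interpreter that counts method activations and hands hot methods to a background compiler. Method, HOT_THRESHOLD and compile_in_background are invented names for illustration, not Exupery's actual interface.

    /* Sketch of an instrumented interpreter feeding a background compiler.
     * All names here are hypothetical. */
    #include <stdio.h>

    #define HOT_THRESHOLD 1000

    typedef struct Method {
        const char *name;
        long invocation_count;   /* bumped by the interpreter on every activation */
        int queued;              /* already handed to the background compiler? */
        void *compiled_code;     /* filled in later by the background compiler */
    } Method;

    /* Stand-in for handing the method to a background compiler thread.
     * A real system would push onto a queue consumed by another thread. */
    static void compile_in_background(Method *m) {
        printf("queueing %s for background compilation\n", m->name);
        m->queued = 1;
    }

    /* Called on every method activation.  The counter is the instrumentation;
     * once it crosses the threshold the method is a hot spot and is queued,
     * while execution carries on in the interpreter. */
    static void interpret(Method *m) {
        m->invocation_count++;
        if (!m->queued && m->invocation_count >= HOT_THRESHOLD)
            compile_in_background(m);
        /* ... interpret the method's bytecodes here ... */
    }

    int main(void) {
        Method fib = { "fib", 0, 0, NULL };
        for (int i = 0; i < 2000; i++)
            interpret(&fib);
        return 0;
    }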

A fast compiler could be added later. Even with a fast compiler there is still reason for a slow background compiler: compiling for good speed is slow, yet interactive multimedia needs both a 10 ms response time [Future97] and fast code. We don't want compile pauses, but we do want the speed; final compiles can be done in the background.

It is possible to argue that a background compiler is a better design for an interactive system because it will not introduce delays and allows the use of slower algorithms [Lampson83]. A high-quality compiler will require some O(n^2) worst-case algorithms, whether for classical optimization or for conversion to SSA.

A background compiler alone might be all that is needed, because either the interpreter can be instrumented to gather type information or all the compiled code can be cached. There are about 3 million bytecodes in the system, so assuming 20 bytes of machine code per bytecode, caching everything would take about 60 MB, which is tolerable today.

There are plenty of CPU cycles to burn in modern desktop machines, even if we do need all of them at peak times. If the compiler is extended for efficient floating-point code, then vectorisation may be necessary for very high performance on commodity hardware (see the Intel NetBurst architecture).

Hardware

How do we partition systems between hardware and software? For instance, where should scheduling be done: in out-of-order hardware, or in software with a RAM cache?

Do we need binary compatibility? If the standard system software includes a dynamic compiler, then software could provide binary compatibility. Backwards compatibility is less important for games machines, specialist hardware such as network routers, and high-performance machines (supercomputers). [Dynamo99] suggests that assembly-to-assembly optimization is profitable, so software translation onto specially designed hardware should be efficient.

I'm suggesting that the key elements to budget in CPU design are heat, power, and off-chip bandwidth. A crude design would be a VLIW core with some instruction compression that is easily decoded when loading into the instruction cache. Ideally there would be a family of different CPUs, each using the same general architecture but having a different mix of pipelines and thus a different instruction set, because different applications will require a different mix of pipelines.

Hand-held devices for computer gaming or 3D environments will require both high performance and low power. Supercomputer designs can probably spend their heat and power budgets on better things than backwards compatibility, especially if compatibility can be provided in software.

This sketch of hardware design is meant to demonstrate that there are plenty of opportunities for fruitful discussion between dynamic compilation and CPU design.

The Croquet system would be nice on a very high performance tablet computer.

Databases

The database query engine looks remarkably like an interpreter. Databases are now CPU bound, especially high-performance OLAP databases. The first database optimization is to buy RAM and disks until the database becomes CPU bound. (Oracle tuning guides suggest that all high-performance databases should be CPU bound; I don't have the exact reference.)

Building a dynamic compiler for the database query engine should produce speed improvements similar to those seen in programming language implementation.

Dynamic compilation gets more appealing as pipelines grow and branches become relatively more expensive.

[GMUW00] suggests the use of iterators to implement the interfaces between relational operators. Just expanding the iterators inline will remove expensive indirect jumps and conditional branches. It is very likely that it costs much more to figure out what needs to be done than to do the actual work.
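
As a rough illustration of the point, the sketch below contrasts an iterator-style pipeline, with one indirect call per row, against the fused loop that inline expansion would produce. The Iterator structure and the toy query are invented for illustration; they are not taken from [GMUW00] or from any real engine.

    /* Iterator-style operators (indirect call per row) versus the fused
     * loop a dynamic compiler could generate for "scan, then select". */
    #include <stdio.h>

    typedef struct Iterator Iterator;
    struct Iterator {
        int (*next)(Iterator *self, int *row); /* one indirect call per row */
        Iterator *child;
        const int *table;
        int pos, len;
    };

    /* Table-scan operator: produce the next row, or 0 when exhausted. */
    static int scan_next(Iterator *it, int *row) {
        if (it->pos >= it->len) return 0;
        *row = it->table[it->pos++];
        return 1;
    }

    /* Selection operator (keep rows > 10), pulling from its child. */
    static int select_next(Iterator *it, int *row) {
        while (it->child->next(it->child, row))   /* indirect jump */
            if (*row > 10) return 1;
        return 0;
    }

    /* What inline expansion would produce: the operator boundary
     * disappears, and with it the indirect jumps. */
    static long fused_query(const int *table, int len) {
        long hits = 0;
        for (int i = 0; i < len; i++)
            if (table[i] > 10) hits++;
        return hits;
    }

    int main(void) {
        static const int t[] = { 3, 14, 9, 27, 11, 2 };
        Iterator scan = { scan_next, NULL, t, 0, 6 };
        Iterator sel  = { select_next, &scan, NULL, 0, 0 };
        int row; long hits = 0;
        while (sel.next(&sel, &row)) hits++;      /* interpreted plan */
        printf("iterator: %ld, fused: %ld\n", hits, fused_query(t, 6));
        return 0;
    }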

Until recently the cost of disk access has dominated database performance, and this outlook is still reflected in the literature. While it may still be true for changes that need to be flushed to permanent storage, it isn't true for most queries.

Network Packet Filtering

[LeoneLee95] use run-time code generation to optimize ML. One of their examples is the BSD packet-filtering interpreter.

Packet filtering would probably suit a slower, more optimizing compiler, because the filters don't change very often compared with the amount of traffic passing through them, especially in a network router. A firewall's rule set is probably changed only a few times a year.

A prototype system could generate Linux kernel modules as a user-space process and then load them into the kernel. This would be an easier compiler to write than the current dynamic compiler for Squeak.
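
Below is a minimal sketch of the user-space half of such a prototype, assuming a made-up rule format: it turns a small rule table into C source for a specialized filter predicate. A real system would wrap the generated function in kernel-module boilerplate, compile it, and load it; here the generated code is simply printed.

    /* Generate specialized filter code from a rule table.  The Rule
     * struct and its fields are invented for illustration. */
    #include <stdio.h>

    typedef struct Rule {
        unsigned src_addr;    /* match this source address ... */
        unsigned short port;  /* ... and this destination port */
        int accept;           /* 1 = accept, 0 = drop */
    } Rule;

    /* Emit a filter function with the rules compiled in as straight-line
     * comparisons, instead of a rule table interpreted per packet. */
    static void emit_filter(const Rule *rules, int n, FILE *out) {
        fprintf(out, "int filter(unsigned src_addr, unsigned short port)\n{\n");
        for (int i = 0; i < n; i++)
            fprintf(out,
                "    if (src_addr == 0x%08xu && port == %u) return %d;\n",
                rules[i].src_addr, rules[i].port, rules[i].accept);
        fprintf(out, "    return 0; /* default: drop */\n}\n");
    }

    int main(void) {
        Rule rules[] = {
            { 0x0a000001u, 80, 1 },   /* accept 10.0.0.1:80 */
            { 0x0a000002u, 22, 0 },   /* drop   10.0.0.2:22 */
        };
        emit_filter(rules, 2, stdout);
        return 0;
    }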

Regular Expressions

Going forward by going back to the original ideas: I think that compiling regular expressions and LR grammars was part of those original ideas. Making compilation similar to (hard) GUI development would be cool. Smalltalk could be faster than C by compiling the inner loops to custom machine code.

There are plenty of uses for fast regular expressions and fast parsers. HTML processing is an obvious example. Network protocol analysis often looks like a parsing problem, e.g. parsing SMTP or HTTP.
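
The sketch below shows what compiling the inner loop means for a single pattern, a b* c: a generic table-driven DFA step per input character versus the specialized branches a compiler could emit for that one pattern. The hand-built table and the function names are illustrative only.

    /* Table-driven DFA matching versus a specialized matcher for "ab*c". */
    #include <stdio.h>
    #include <string.h>

    enum { START, SEEN_A, ACCEPT, DEAD, NSTATES };

    /* transition[state][byte]: the generic matcher indexes this per character. */
    static unsigned char transition[NSTATES][256];

    static void build_table(void) {
        memset(transition, DEAD, sizeof transition);
        transition[START]['a']  = SEEN_A;
        transition[SEEN_A]['b'] = SEEN_A;
        transition[SEEN_A]['c'] = ACCEPT;
    }

    /* Generic inner loop: one table load per input byte. */
    static int match_table(const char *s) {
        int state = START;
        for (; *s; s++) state = transition[state][(unsigned char)*s];
        return state == ACCEPT;
    }

    /* Specialized inner loop for "ab*c": branches only, no table. */
    static int match_compiled(const char *s) {
        if (*s++ != 'a') return 0;
        while (*s == 'b') s++;
        return s[0] == 'c' && s[1] == '\0';
    }

    int main(void) {
        build_table();
        const char *tests[] = { "abbbc", "ac", "abcx", "bc" };
        for (int i = 0; i < 4; i++)
            printf("%-6s table=%d compiled=%d\n",
                   tests[i], match_table(tests[i]), match_compiled(tests[i]));
        return 0;
    }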

Software Engineering (Methodology)

The two biggest risks in compiler writing are interacting details and too much theory.

Dynamic compilation is a good environment in which to explore approaches to detail management. Many of the hard problems in writing a compiler are just keeping track of many interacting details. Locating a bug often involves bouncing between all the stages of the compiler; functionality runs at 90 degrees to the subsystem boundaries.

The problem of too much theory is the problem of having too many solutions for any given problem. Creating a compiler architecture is a problem of too much wealth: too many design ideas to choose from. This is the second-system effect, produced by reading rather than experience. Having so much theory makes it hard to pick out useful subsets. [Oaklisp91] observes that many compilers are over-engineered, and that with carefully chosen optimizations a compiler can be both simpler and faster.

One of the project goals is to construct a useful framework for dynamic compilation. The approach is to first build a dynamic compiler for Squeak then later build more compilers. The incremental approach to building frameworks is well described in [FooteOpdyke].

The incremental approach is also used for individual subsystems. The assembler provides a good example of design by incremental refinement. The question becomes "is this design sufficient?" rather than "which design is right?" The problem is organizing many details into an understandable system. The assembler has gone through three phases so far: first a simple visitor, then an assembling object for each instruction, and currently a list of encoding rules held by each instruction.
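
A small sketch of the encoding-rules idea, reduced to a few IA-32 instructions in C; the Rule fields and the chosen instructions are invented for illustration and do not match Exupery's actual assembler.

    /* Each encoding rule is data, not code: the assembler walks the rule
     * and appends bytes. */
    #include <stdio.h>

    typedef struct Rule {
        const char *mnemonic;
        unsigned char opcode;   /* single opcode byte */
        int plus_reg;           /* add the register number into the opcode? */
        int imm32;              /* followed by a 32-bit little-endian immediate? */
    } Rule;

    /* Encoding rules for a tiny subset of IA-32. */
    static const Rule rules[] = {
        { "nop",          0x90, 0, 0 },
        { "ret",          0xC3, 0, 0 },
        { "push r32",     0x50, 1, 0 },   /* 0x50 + reg */
        { "mov r32, imm", 0xB8, 1, 1 },   /* 0xB8 + reg, then imm32 */
    };

    /* Assemble one instruction by interpreting its rule; returns the
     * number of bytes written. */
    static int assemble(const Rule *r, int reg, unsigned imm,
                        unsigned char *out) {
        int n = 0;
        out[n++] = (unsigned char)(r->opcode + (r->plus_reg ? reg : 0));
        if (r->imm32)                      /* little-endian immediate */
            for (int i = 0; i < 4; i++)
                out[n++] = (unsigned char)(imm >> (8 * i));
        return n;
    }

    int main(void) {
        unsigned char buf[16];
        int n = assemble(&rules[3], 0 /* eax */, 42, buf);  /* mov eax, 42 */
        for (int i = 0; i < n; i++) printf("%02x ", buf[i]);
        printf("\n");                      /* prints: b8 2a 00 00 00 */
        return 0;
    }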

A design family for assemblers would start with a single visitor object and then build, through multiple stages, towards a special-purpose programming language. My hypothesis is that such families provide a useful, reusable set of designs. Deciding which design is right up front is much harder than deciding to introduce each design embellishment incrementally.

Keeping complexity linear with program size is the wrong goal ([EWD1308]). A better goal is to keep the marginal complexity constant: each small piece of development needs to be understandable by a human mind with a fixed capacity for understanding.

The trick to keeping marginal complexity constant, the complexity of making any individual change, is to let the system be viewed from multiple directions. Why? A given change needs the right view to simplify it, and that view is not necessarily the same one needed for a different change.

To build arbitrarily complex systems means building systems which cannot be fully understood, where understanding is limited to specific views. Each problem in the overall network may be understood, and so may each edge in the problem network. System building becomes the problem of decomposing the system into a network of problems and then reorganizing the network as we learn.

References

[Holzle94] Urs Holzle. Adaptive Optimization for Self: Reconciling High Performance with Exploratory Programming. PhD thesis, Stanford University, 1994.

[Appel99] Andrew W. Appel. Modern compiler implementation in C. Cambridge University Press 1999.

[LeoneLee95] Mark Leone and Peter Lee. Optimizing ML with Run-Time Code Generation. PLDI '96.

[EWD1308] Edsger W. Dijkstra. EWD1308 - "What led to 'Notes on Structured Programming'", available on the web.

[Lampson83] Butler W. Lampson. Hints for Computer System Design. Web address

[Hoare73] C.A.R. Hoare. Hints on programming language design. STAN-CS-73-403.

[Steele77] Guy Lewis Steele Jr. Fast Arithmetic in Maclisp. MIT AI Memo 421. September 1977.

[Piumarta] Ian Piumarta. J3 for Squeak. Web address.

[Dynamo99] Vasanth Bala, Evelyn Duesterwald, Sanjeev Banerjia. Transparent Dynamic Optimization. HP Laboratories Cambridge HPL-1999-77, June 1999.

[GMUW00] Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom. Database System Implementation. Prentice Hall 2000.

[Future97] Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. Back to the Future: The Story of Squeak, A Practical Smalltalk Written in Itself. Presented at OOPSLA '97. Web address.

[Ungar86] David Michael Ungar. The Design and Evaluation of a High Performance Smalltalk System. MIT Press 1986.

[FooteOpdyke] Brian Foote and William F. Opdyke. Life Cycle and Refactoring Patterns that Support Evolution and Reuse.

[Oaklisp91] Barak A. Pearlmutter and Kevin J. Lang. The Implementation of Oaklisp. In Topics in Advanced Language Implementation edited by Peter Lee 1991.

[Ungar92] Language Independent runtimes '92 abstract. Web address.