In order to show that puzzles can be effective and useful to register allocation, we have implemented a register allocator based on puzzles. We have implemented our register allocator in the LLVM compiler framework, version 1.9. Our tests are executed on a 32-bit x86 Intel(R) Xeon(TM), with a 3.06GHz cpu clock, 3GB of free memory and 512KB L1 cache running Red Hat Linux 3.3.3-7. Currently our allocator passes all the tests that LLVM passes using its default allocator. In this page we show results of our experiments. We have been able to compile and run over 1.3 million lines of code, and the results are encouraging. The next sequence of tables use the following key:
The LLVM test suite provides a large number of benchmark applications. In addition to these benchmarks, we have also compiled SPEC CPU 2000. Unfortunately, one of the benchmarks: eon, cannot be run in our programming environment - even when compiled with gcc. The other floating point benchmarks, e.g wupwise, swim, mgrid, applu, etc, cannot be compiled because they are FORTRAN applications, and our compiler only handles C/C++ code. Nevertheless, we could compile all the integer benchmarks, and all the four floating-point applications implemented in C. This is the data that we have used in our paper. Due to space constraints, we could not include all the other benchmarks in the LLVM test suite in the paper, so we show it in this page. In general, the results for SPEC 2000 mirror the general behavior of the other benchmarks. The table below contains links to all the raw data that we have been gathering while implementing and debugging the puzzle solver. The rows are chronologically ordered, so they give a good idea of how our implementation has evolved.
We compare our puzzle solver with three other register allocators, all implemented in LLVM 1.9 and all compiling and running the same benchmark suite of 1.3 million lines of C code. The first is LLVM's default algorithm, which is an industrial-strength version of linear scan. The algorithm does aggressive coalescing before register allocation and handles holes in live ranges by filling them with other variables whenever possible. We use ELS (Extended Linear Scan) to denote this register allocator.
The second register allocator is the iterated register coalescing of George and Appel with the extensions by Smith, Ramsey, and Holloway presented in PLDI 2004 for handling register aliasing. We use EIRC (Extended Iterated Register Coalescing) to denote this register allocator. The third register allocator is based on partitioned Boolean quadratic programming (PBQP). The algorithm runs in worst-case exponential time and does optimal spilling with respect to a set of Boolean constraints generated from the program text. We use this algorithm to gauge the potential for how good a register allocator can be. Lang Hames and Bernhard Scholz produced the implementations of EIRC and PBQP that we are using.
We have used SPEC CPU 2000 in most of our experiments. And some of the results we have found are summarized in the next charts. Notice that we do not present data for ten of the floating point programs. These benchmarks could not be compiled by LLVM in our machine. This is not a problem with the puzzle solver, because these floating point benchmarks could not be compiled even with the default allocator of LLVM. In fact, LLVM with its default allocator, or with the puzzle solver, pass the same amount of tests. I think the problem is support for sse floating point instructions in LLVM. Yet, the set of spec benchmarks that we use are the same that has been used in the paper "Demand-Driven Alias Analysis for C", by Zheng and Rugina.
This chart compares the maximum amount of space that each assembly program reserves on its call stack. The stack size gives an estimate of how many different variables are being spilled by each allocator. The puzzle solver and extended linear scan (LLVM's default) tend to spill more variables than the other two algorithms.
This chart shows the amount of spill code produced by each allocator when compiling SPEC2000. The puzzle solver inserts marginally fewer memory-access instructions than PBQP, 1.2 fewer memory-access instructions than EIRC, and 9.6% fewer memory-access instructions than extended linear scan (LLVM's default). Note that although the puzzle solver spills more variables than the other allocators, it removes only part of the live range of a spilled variable. The bars are normalized to the puzzle solver.
The chart below compares the register allocation time and the total compilation time of the puzzle solver and extended linear scan (LLVM's default). On average, extended linear scan (LLVM's default) is less than 1 faster than the puzzle solver. The total compilation time of LLVM with the default allocator is less than 3% faster than the total compilation time of LLVM with the puzzle solver. We note that LLVM is industrial-strength and highly tuned software, in contrast to our puzzle solver.
The next chart compares the ratio between the size of the assembly code produced by the compiler and the compilation time of the register assignment pass. We only compare two algorithm: the puzzle solver, and extended linear scan, the default algorithm of LLVM. The final column in the chart gives the standard deviation found for each algorithm. We see that the puzzle solver has a much more predictable behavior than LLVM's default, that is, the time spent by the puzzle solver in the register allocation process is much more proportional to the size of the assembly code produced than the time spent by ELS. The last column of the chart gives the standard deviation for each algorithm. The standard deviation for the puzzle solver is 66467.11244, whereas the standard deviation of LLVM's default is 386426.5572. In other words, the deviance of ELS is 5.83 times bigger than that of the puzzle solver.
This chart gives the run time of each benchmark compiled with different algorithms. The run time is normalized to gcc -O3, and it is the average of five runs of each benchmark. Considering all the benchmarks, the four allocators produce faster code than gcc; the fractions are: puzzle solver 0.944, extended linear scan (LLVM's default) 0.991, EIRC 0.954 and PBQP 0.929. If we remove the floating point benchmarks, i.e., mesa, amp, art, equake, then gcc -O3 is faster. The fractions are: puzzle Solver 1.015, extended linear scan (LLVM's default) 1.059, EIRC 1.025 and PBQP 1.008.
The chart below reports on the number of move and swap instructions necessary to implement parallel copies, fixing the coloring between different puzzles. As the chart shows, a very small number of moves is necessary between consecutive puzzles. This is mostly due to our allocation order, that starts at the root of the dominator tree, and goes down toward the leaves. Move moves are necessary to implement phi and pi functions than internal parallel copies, although there exist many more of these parallel copies than phi and pi functions.