# Jones-Optimal Partial Evaluation by Specialization-Safe Normalization

## 1 Introduction

This page describes the software artifact for the paper Jones-Optimal Partial Evaluation by Specialization-Safe Normalization by Matt Brown and Jens Palsberg (POPL'18). The full paper, including the appendix containing our proofs, is available here.

The paper presents a typed and Jones optimal self-applicable partial evaluator for F$$_\omega^{\mu i}$$, a language first presented in our POPL'17 paper that supports typed self-applicable meta-programming. The primary purpose of this artifact is to reproduce our experimental results presented in Section 7. It also includes the following tools for doing further experiments:

• An implementation of F$$_\omega^{\mu i}$$ in Haskell, consisting of:
  • a parser with syntactic sugar for defining abbreviations, loading definitions from other files, building representations, and beta-normalizing terms
  • a type checker
  • quoters for the Tagless-final HOAS (Boehm-Berarducci), Mogensen-Scott, and Tagless-final deBruijn representations described in the paper
  • evaluators that count steps of call-by-value reduction to a value, steps of normal-order reduction to a beta-normal form, and steps of memoized normal-order reduction to a beta-normal form
  • a beta-equivalence checker
• Source code in our concrete syntax for our partial evaluator and self-interpreters.
• A suite of F$$_\omega^{\mu i}$$ programs and libraries.

## 2 Step 1: Installation

### 2.1 Requirements

The tests should be run on a modern Unix-like system (e.g. Linux or Mac OS X). Our programs are memory intensive – we test on a computer with 192GB of RAM, and the Unix time utility reports a maximum resident set size of approximately 80GB. For reviewers that don't have access to a computer with at least 80GB of RAM, we provide instructions for running on Amazon EC2.

### 2.2 Running locally

If you have a computer with 80GB of RAM, you can run the artifact locally. The artifact uses standard POSIX tools like bash, awk, cat and diff. We also require that Python 3.X and the Haskell build tool stack are installed. Click here for stack installation instructions. Once these requirements are installed, download and extract the artifact sources:

$ wget http://compilers.cs.ucla.edu/popl18/popl18-artifact.zip
$ unzip popl18-artifact.zip

The following commands will download and compile all the Haskell dependencies, and build and test the F$$_\omega^{\mu i}$$ implementation. Make sure to run these from within the popl18-artifact directory.

$ stack setup
$ stack build
$ stack test

To run all the tests and measurements of the artifact, use the runAll.sh script. runAll.sh runs a series of speedup tests to produce the data presented in Section 7 of the paper. This will take some time (about 6 hours on our server) and produce a lot of output. We recommend you pipe the output to a file for inspection later.

./runAll.sh > ~/popl18.log

### 2.3 Running on Amazon EC2

If you don't have a computer with 80GB of RAM, you might consider running on Amazon EC2. We tested using an r4.4xlarge instance running Ubuntu 16.04. Unfortunately, the r4.4xlarge instance does not qualify for Amazon's free tier.

$ sudo apt-get update
$ sudo apt-get install -y python3 unzip
$ curl -sSL https://get.haskellstack.org/ | sh


Next, download and build the artifact sources, and run the unit test suite:

$ wget http://compilers.cs.ucla.edu/popl18/popl18-artifact.zip
$ unzip popl18-artifact.zip
$ cd popl18-artifact/
$ stack setup
$ stack build
$ stack test


To run all the tests and measurements of the artifact, use the runAll.sh script. runAll.sh runs a series of speedup tests to produce the data presented in Section 7 of the paper. This will take some time (about 6 hours on our server), and produce a lot of output. We recommend you pipe the output to a file for inspection later.

./runAll.sh > ~/popl18.log


If you're running on EC2, you may want to shut down the instance after the run completes. We supply a script, runAllEC2.sh, that saves the output from runAll.sh to ~/popl18.log and then halts the instance. Since running everything takes a while, we recommend you run it with nohup so it will continue if your ssh connection drops:

$ nohup ./runAllEC2.sh &

## 3 Step 2: Evaluation

### 3.1 Description of concrete syntax

Our parser supports the declaration of abbreviations via the decl directive. For example, the following (taken from fomega/lib/Id.fw) declares id to be the polymorphic identity function, and Id to be its type.

decl Id : * = ∀A:*. A → A;
decl id : Id = ΛA:*. λx:A. x;

We can load all the declarations from a file in fomega/lib/ with the load directive. For example, the following (taken from fomega/id.fw) loads the above declarations from fomega/lib/Id.fw and then applies id to itself.

load "Id";
id Id id

Note that fomega/id.fw ends with a term, while fomega/lib/Id.fw does not. The files in fomega/lib are libraries that consist of zero or more load and decl statements. The files in fomega/ are programs that consist of zero or more load and decl statements followed by a term to be evaluated.

An important difference between the artifact and the paper is the notation for representations of terms. The paper uses an overline notation $$\overline{\cdot}$$ to denote the representations of terms, while the artifact uses square brackets: for a term e, [e] or [TF|e] denotes its representation in Tagless-final HOAS, [MS|e] denotes its representation in Mogensen-Scott HOAS, and [DB|e] denotes its representation in Tagless-final deBruijn. The artifact does not provide syntax for constructing pre-representations of terms.

The syntax <e> denotes the beta-normal form of e. This is used in a few places as a convenience.

### 3.2 The Evaluators

The artifact includes an evaluator for each of the reductions considered in the paper – call-by-value, normal order, and memoized normal order. The evaluators are instrumented to measure steps of reduction to a normal form after erasing types. The normal order and memoized normal order evaluators count steps to β-normal form, and for call-by-value we count steps to a value (which, after type-erasure, is a λ-abstraction).
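For intuition about what these step counts measure, here is a minimal normal-order normalizer with a β-step counter, sketched in Python. The term encoding and function names are our own illustration, not the artifact's Haskell implementation, and binder names are kept distinct so substitution can ignore capture:

```python
# Sketch: count leftmost-outermost (normal-order) β-steps to β-normal form.
# Terms are tuples: ("var", x) | ("lam", x, body) | ("app", f, a).
# Illustration only; binder names are assumed globally distinct.

def subst(t, x, v):
    if t[0] == "var":
        return v if t[1] == x else t
    if t[0] == "lam":
        return t if t[1] == x else ("lam", t[1], subst(t[2], x, v))
    return ("app", subst(t[1], x, v), subst(t[2], x, v))

def step(t):
    """One normal-order β-step, or None if t is already β-normal."""
    if t[0] == "app" and t[1][0] == "lam":
        return subst(t[1][2], t[1][1], t[2])   # contract the head redex
    if t[0] == "app":
        s = step(t[1])
        if s is not None:
            return ("app", s, t[2])
        s = step(t[2])
        return None if s is None else ("app", t[1], s)
    if t[0] == "lam":                          # normal order reduces under λ
        s = step(t[2])
        return None if s is None else ("lam", t[1], s)
    return None

def normalize(t):
    steps = 0
    while (s := step(t)) is not None:
        t, steps = s, steps + 1
    return t, steps

# Church numerals: plus one one normalizes to two, i.e. λs.λz. s (s z).
plus = ("lam", "m", ("lam", "n", ("lam", "s", ("lam", "z",
       ("app", ("app", ("var", "m"), ("var", "s")),
               ("app", ("app", ("var", "n"), ("var", "s")), ("var", "z")))))))
one_a = ("lam", "s1", ("lam", "z1", ("app", ("var", "s1"), ("var", "z1"))))
one_b = ("lam", "s2", ("lam", "z2", ("app", ("var", "s2"), ("var", "z2"))))

nf, n = normalize(("app", ("app", plus, one_a), one_b))
print(n)   # 6 β-steps to normal form
print(nf)  # λs.λz. s (s z), the Church numeral two
```

The artifact's evaluators report exactly this kind of count (for much larger terms), which is what the Speedup columns below are computed from.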
We implement two evaluators for call-by-value – a simple one that performs poorly on large tests (the fourth Futamura projection in particular), and a more complex one that is also more efficient. The latter reduces to β-normal form, but counts reductions to a value. We use the simple evaluator primarily to validate the more efficient/complex one.

### 3.3 The Test Types

This artifact includes three types of test. Each type of test is run by a corresponding test runner. The types are:

#### 3.3.1 Speedup tests

Speedup tests compare the runtime of two terms on some input(s). Each test declares a term pgm and a Tagless-final HOAS representation rep. The expectation is that the β-normal form of rep represents a program equivalent to pgm. The test runner normalizes rep and recovers the represented term. Then it compares the runtime of that term with pgm, when applied to some arguments also specified in the test file.

Here is a typical example of a speedup test that measures specialization speedup. The code is from speedupTest-ssrmix-bbint-compile-fact-1.fw, one of the tests we use to experimentally demonstrate Jones-optimality of our partial evaluator:

load "BB";
load "SSRMix";
load "BBUnquote";
load "Church";

decl pgm : Nat → Nat = fact;

decl rep : Exp (Nat → Nat) =
  mix (Exp (Nat → Nat)) (Nat → Nat) [unquote (Nat → Nat)] [[fact]];

decl arg1 : Nat = one;

This test computes the speedup obtained by "compiling" the factorial function fact, when run with input one (the Church encoding of 1). Compilation is done by specializing the Tagless-final HOAS (Boehm-Berarducci) interpreter unquote to the representation of the factorial function fact.

We also use speedup tests to measure interpretational overhead.
Here's an example taken from speedupTest-bbint-overhead-fact-1.fw:

load "BB";
load "BBUnquote";
load "Church";

decl pgm : Nat → Nat = fact;

decl rep : Exp (Nat → Nat) = [unquote (Nat → Nat) [fact]];

decl arg1 : Nat = one;

This test compares the runtime of fact with that of unquote (Nat → Nat) [fact] on input one. Because rep is a representation, it is already a β-normal form, so the test runner simply extracts the represented term and runs it. To run this test, invoke the runSpeedupTest program:

$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-bbint-overhead-fact-1.fw


The stack exec runSpeedupTest part tells the Haskell build tool stack to execute the Haskell program runSpeedupTest. The remaining arguments are passed on to runSpeedupTest. The argument CBV specifies to count steps of call-by-value reduction to a value. The parameter 1 indicates that pgm and the program represented by rep should be run on one argument, namely arg1. Running this produces the following output:

$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-bbint-overhead-fact-1.fw
Running pgm
Computing specialized pgm
Running specialized pgm
fomega/speedupTest-bbint-overhead-fact-1.fw,PASSED,34,463,0.07

The last line is in CSV format with columns Filename,Check,pgm Steps,rep Steps,Speedup. The Filename column indicates the test file. The Check column indicates whether the outputs from pgm and the program represented by rep are β-equivalent on the input(s). The pgm Steps column contains the number of steps pgm takes on the input(s). The rep Steps column contains the number of steps the program represented by rep takes on the input(s). For CBV and SimpleCBV, the steps are to a value; for NormalOrder and MemoNormalOrder they are steps to a β-normal form.

The last column, Speedup, is $$\frac{\tt{pgm\ steps}}{\tt{rep\ steps}}$$ (see the Wikipedia article on speedup for details). A speedup greater than 1 indicates that the program represented by rep runs faster (i.e. in fewer steps) than pgm on the given input(s). A speedup less than 1 means pgm is faster. In this case we are measuring interpretational overhead, so rep represents the interpretation of pgm and we get a slowdown. Here we are not interested in the speedup but in the overhead, which is $$\frac{\tt{rep\ steps}\ -\ \tt{pgm\ steps}}{\tt{pgm\ steps}} * 100$$ or $$(\tt{speedup}^{-1}-1)*100$$.

An example of a failing (not β-equivalent) speedup test is fomega/speedupTest-not-equivalent.fw, which compares plus one and plus two on input one:

load "Church";
load "BB";

decl pgm : Nat → Nat = plus one;

decl rep : Exp (Nat → Nat) = [plus two];

decl arg1 : Nat = one;

When we run this test, we see that runSpeedupTest reports that the β-equivalence check failed:

$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-not-equivalent.fw
Running pgm
Computing specialized pgm
Running specialized pgm
fomega/speedupTest-not-equivalent.fw,FAILED,5,6,0.83
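The speedup and overhead formulas are simple enough to check by hand. As a sanity check, here is a small Python sketch (not part of the artifact) that recomputes both from the PASSED CSV line shown earlier:

```python
# Recompute Speedup and overhead from a runSpeedupTest CSV line.
# Uses the PASSED line shown above: pgm takes 34 steps, rep takes 463 steps.
line = "fomega/speedupTest-bbint-overhead-fact-1.fw,PASSED,34,463,0.07"

filename, check, pgm_steps, rep_steps, _ = line.split(",")
pgm_steps, rep_steps = int(pgm_steps), int(rep_steps)

speedup = pgm_steps / rep_steps                       # pgm steps / rep steps
overhead = (rep_steps - pgm_steps) / pgm_steps * 100  # percent overhead

print(f"{speedup:.2f}")    # 0.07, matching the Speedup column
print(f"{overhead:.0f}%")  # 1262%, the interpretational overhead
```

Note that (speedup⁻¹ − 1) × 100 gives the same overhead figure, since both are just (rep steps − pgm steps) / pgm steps expressed as a percentage.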


#### 3.3.2 Typecheck tests

Typecheck tests check whether a program typechecks successfully. In particular, the runner tests that the final term and its transitive dependencies all typecheck. A type error lurking in an unused declaration will not be detected. Examples:

$ stack exec runTypecheckTest fomega/typecheckTestSuccess.fw
Succeeded.

$ stack exec runTypecheckTest fomega/typecheckTestFailure.fw
Failed.
expected arrow type:
l = x0
ty l = A0


The second example attempts to typecheck the term \A:*. \x:A. x x, which fails. This demonstrates that the test runner does in fact detect and report type errors (i.e. it doesn't simply output "Succeeded." for all inputs). It also demonstrates that our type errors are not user-friendly. Our results do not rely on these error messages. We use typecheck tests only to determine whether a program type checks.

#### 3.3.3 Equivalence tests

Equivalence tests check if the normal forms of two terms or types are α-equivalent. The terms are given in a Church pair. The runner extracts each component, normalizes them, and tests that the normal forms are α-equivalent. These tests are run with runEquivTest.
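The artifact's checker is written in Haskell; as an illustration of the idea behind an α-equivalence check (our own term encoding and names, not the artifact's code), here is a minimal Python sketch that compares two λ-terms by converting them to de Bruijn indices, under which α-equivalent terms become syntactically identical:

```python
# Sketch: α-equivalence of λ-terms via conversion to de Bruijn indices.
# Terms are tuples: ("var", name) | ("lam", name, body) | ("app", f, a).
# Illustration only, not the artifact's Haskell implementation.

def to_debruijn(term, env=()):
    kind = term[0]
    if kind == "var":
        return ("var", env.index(term[1]))   # distance to the binding λ
    if kind == "lam":
        return ("lam", to_debruijn(term[2], (term[1],) + env))
    return ("app", to_debruijn(term[1], env), to_debruijn(term[2], env))

def alpha_equiv(t1, t2):
    return to_debruijn(t1) == to_debruijn(t2)

# λs.λz.z (Church 0) vs λa.λb.b: α-equivalent despite different names.
zero1 = ("lam", "s", ("lam", "z", ("var", "z")))
zero2 = ("lam", "a", ("lam", "b", ("var", "b")))
# λs.λz.s z (Church 1) is not α-equivalent to Church 0.
one = ("lam", "s", ("lam", "z", ("app", ("var", "s"), ("var", "z"))))

print(alpha_equiv(zero1, zero2))  # True
print(alpha_equiv(zero1, one))    # False
```

The second comparison is essentially the failing test shown below, where the encodings of 0 and 1 are reported as not α-equivalent.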

$ stack exec runEquivTest NormalOrder fomega/equivTestSuccess.fw
normalized fst in 6 steps
normalized snd in 13 steps
Succeeded.

$ stack exec runEquivTest NormalOrder fomega/equivTestFailure.fw
normalized fst in 0 steps
normalized snd in 3 steps
Failed.
Not alpha-equivalent:
λs9024:*. λz9025:*. z9025...
λs9026:*. λz9027:*. s9026 z9027...
Diff:
x9031
...
x9028 x9031
...


Again, the second test demonstrates that our equivalence test runner does catch and report failures when the two terms are not equivalent (though it doesn't provide very informative error messages). In this case, it fails because the encoding of 0 is not equivalent to the encoding of 1. Again, our results do not rely upon these error messages.

We can use runEquivTest to demonstrate the difference between the two call-by-value evaluators. The file fomega/equivTestSimpleCBVFailure.fw compares two values that are β-equivalent but not β-normal. The SimpleCBV evaluator reduces both terms to values in 0 steps and then the α-equivalence check fails.

$ stack exec runEquivTest SimpleCBV fomega/equivTestSimpleCBVFailure.fw
normalized fst in 0 steps
normalized snd in 0 steps
Failed.
...

The CBV evaluator also reports that the terms reduce to values in 0 steps. However, it continues reducing to β-normal form, so the α-equivalence check succeeds.

normalized fst in 0 steps
normalized snd in 0 steps
Succeeded.

### 3.4 The Tests

We now describe the various test files included in the artifact, and relate them to the results in the paper. You can run each test individually, or run them all via the command:

$ bin/popl18-artifact.sh


#### 3.4.1 Correctness of the CBV evaluator

As mentioned above, we implement two call-by-value evaluators. One is simple and quite standard, but takes a very long time on large benchmarks like the fourth Futamura projection. The other is more efficient, but reduces to beta normal form instead of stopping at a value. However, it only counts the number of call-by-value reduction steps to reach the value form; the subsequent normalization work is not counted.
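The distinction matters because a call-by-value value need not be β-normal. Here is a minimal Python sketch of a step-counting call-by-value evaluator in the spirit of the simple one (the encoding and names are illustrative, not the artifact's Haskell code; binder names are assumed distinct so substitution can skip capture-avoidance):

```python
# Sketch: call-by-value reduction to a value with a β-step counter.
# Terms are tuples: ("var", x) | ("lam", x, body) | ("app", f, a).
# Illustration only; reduction stops at a λ-abstraction, so redexes
# remaining under the λ are NOT reduced (unlike full normalization).

def subst(t, x, v):
    if t[0] == "var":
        return v if t[1] == x else t
    if t[0] == "lam":
        return t if t[1] == x else ("lam", t[1], subst(t[2], x, v))
    return ("app", subst(t[1], x, v), subst(t[2], x, v))

def eval_cbv(t, steps=0):
    """Reduce to a value (a λ-abstraction), returning (value, step count)."""
    if t[0] != "app":
        return t, steps
    f, steps = eval_cbv(t[1], steps)   # evaluate the function position
    a, steps = eval_cbv(t[2], steps)   # then the argument (call-by-value)
    return eval_cbv(subst(f[2], f[1], a), steps + 1)   # one β-step

# (λx. λy. (λz. z) y) (λw. w) reaches a value in one β-step, but that
# value, λy. (λz. z) y, still contains a redex, i.e. it is not β-normal.
term = ("app",
        ("lam", "x", ("lam", "y",
            ("app", ("lam", "z", ("var", "z")), ("var", "y")))),
        ("lam", "w", ("var", "w")))
value, steps = eval_cbv(term)
print(steps)  # 1
print(value)  # ('lam', 'y', ('app', ('lam', 'z', ('var', 'z')), ('var', 'y')))
```

This is why the efficient evaluator can keep normalizing past the value form while still reporting only the steps taken to reach it, and why comparing the two step counts is a meaningful validation.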

Our first test compares the counts reported by these two evaluators on our suite of first Futamura projection tests. We run the entire suite with each evaluator and check that the outputs are identical.

$ ./testCBVEvaluator.sh
PASSED

To inspect the output manually, you can use the show parameter. You will see output of the following form:

$ ./testCBVEvaluator.sh show
Output from CBV
...
Output from SimpleCBV
...


#### 3.4.2 Table 2: Interpretational Overhead

Table 2 measures the average overhead of interpreting a suite of programs with each of our interpreters. The script interpretationalOverhead.sh performs these measurements.

$ ./interpretationalOverhead.sh

#### 3.4.4 Table 4: Non-Optimality of Specialization by Beta-Normalization

Table 4 demonstrates that specialization by beta-normalization is not Jones optimal. It reports speedups obtained by the same specializations as in Table 3, but using a beta-normalizer as the partial evaluator. To perform these measurements, run JonesOptimality.sh with the argument beta.

$ ./JonesOptimality.sh beta

#### 3.4.5 Tables 5-7: Futamura Projection Speedups

Tables 5 through 7 show our speedups for the second, third, and fourth Futamura projections (Table 3 on Jones optimality covers the first projection). Run Futamura.sh with argument 2, 3 or 4 to perform the measurements for the second, third, or fourth projection, respectively.

$ ./Futamura.sh 2
...
$ ./Futamura.sh 3
...
$ ./Futamura.sh 4
...


Created: 2017-11-10 Fri 10:32
