Jones-Optimal Partial Evaluation by Specialization-Safe Normalization
1 Introduction
This page describes the software artifact for the paper Jones-Optimal Partial Evaluation by Specialization-Safe Normalization by Matt Brown and Jens Palsberg (POPL'18). The full paper, including the appendix containing our proofs, is available online.
The paper presents a typed and Jones optimal self-applicable partial evaluator for F\(_\omega^{\mu i}\), a language first presented in our POPL'17 paper that supports typed self-applicable meta-programming. The primary purpose of this artifact is to reproduce our experimental results presented in Section 7. It also includes the following tools for doing further experiments:
- An implementation of F\(_\omega^{\mu i}\) in Haskell consisting of:
- a parser with syntactic sugar for defining abbreviations, loading definitions from other files, building representations, and beta-normalizing terms.
- a type checker
- quoters for the Tagless-final HOAS (Boehm-Berarducci), Mogensen-Scott, and Tagless-final deBruijn representations described in the paper
- evaluators that count steps of call-by-value reduction to a value, steps of normal-order reduction to a beta-normal form, and steps of memoized normal-order reduction to a beta-normal form.
- a beta-equivalence checker.
- Source code in our concrete syntax for our partial evaluator and self-interpreters.
- A suite of F\(_\omega^{\mu i}\) programs and libraries.
2 Step 1: Installation
2.1 Requirements
The tests should be run on a modern Unix-like system (e.g. Linux or Mac OS X). Our programs are memory-intensive – we test on a computer with 192GB of RAM, and the Unix time utility reports a maximum resident set size of approximately 80GB. For reviewers who don't have access to a computer with at least 80GB of RAM, we provide instructions for running on Amazon EC2.
2.2 Running locally
If you have a computer with 80GB of RAM, you can run the artifact locally. The artifact uses standard POSIX tools like bash, awk, cat and diff. We also require that Python 3.x and the Haskell build tool stack are installed; see the stack documentation for installation instructions. Once these requirements are installed, download and extract the artifact sources:
- Download from http://compilers.cs.ucla.edu/popl18/popl18-artifact.zip
- Extract:
$ unzip popl18-artifact.zip
$ cd popl18-artifact/
The following commands will download and compile all the Haskell dependencies, and build and test the F\(_\omega^{\mu i}\) implementation. Make sure to run these from within the popl18-artifact directory.
$ stack setup
$ stack build
$ stack test
To run all the tests and measurements of the artifact, use the runAll.sh script. This will take some time (about 6 hours on our server), and produce a lot of output. runAll.sh runs a series of speedup tests to produce the data presented in Section 7 of the paper. We recommend you pipe the output to a file for inspection later.
$ ./runAll.sh > ~/popl18.log
2.3 Running on Amazon EC2
If you don't have a computer with 80GB of RAM, you might consider running on Amazon EC2. We tested using an r4.4xlarge instance running Ubuntu 16.04. Unfortunately, the r4.4xlarge instance does not qualify for Amazon's free tier. After launching and connecting to the instance, install the prerequisites:
$ sudo apt-get update
$ sudo apt-get install -y python3 unzip
$ curl -sSL https://get.haskellstack.org/ | sh
Next, download and build the artifact sources, and run the unit test suite:
$ wget http://compilers.cs.ucla.edu/popl18/popl18-artifact.zip
$ unzip popl18-artifact.zip
$ cd popl18-artifact/
$ stack setup
$ stack build
$ stack test
To run all the tests and measurements of the artifact, use the runAll.sh script. runAll.sh runs a series of speedup tests to produce the data presented in Section 7 of the paper. This will take some time (about 6 hours on our server), and produce a lot of output. We recommend you pipe the output to a file for inspection later.
$ ./runAll.sh > ~/popl18.log
If you're running on EC2, you may want to shut down the instance after the run completes. We supply a script runAllEC2.sh that saves the output from runAll.sh to ~/popl18.log and then halts the instance. Since running everything takes a while, we recommend you run it with nohup so it will continue if your ssh connection drops:
$ nohup ./runAllEC2.sh &
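For reference, a minimal sketch of what such a wrapper could look like (the actual runAllEC2.sh shipped with the artifact may differ in details):

#!/bin/bash
# Run the full test suite and save all output (including errors) to ~/popl18.log.
./runAll.sh > ~/popl18.log 2>&1
# Halt the instance once the run completes, so it stops accruing charges.
sudo shutdown -h now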
3 Step 2: Evaluation
3.1 Description of concrete syntax
Our parser supports the declaration of abbreviations via the decl directive. For example, the following (taken from fomega/lib/Id.fw) declares id to be the polymorphic identity function, and Id to be its type.
decl Id : * = ∀A:*. A → A;
decl id : Id = ΛA:*. λx:A. x;
We can load all the declarations from a file in fomega/lib/ with the load directive. For example, the following (taken from fomega/id.fw) loads the above declarations from fomega/lib/Id.fw and then applies id to itself.
load "Id"; id Id id
Note that fomega/id.fw ends with a term, while fomega/lib/Id.fw does not. The files in fomega/lib are libraries that consist of zero or more load and decl statements. The files in fomega/ are programs that consist of zero or more load and decl statements followed by a term to be evaluated.
An important difference between the artifact and the paper is the notation for representations of terms. The paper uses an overline notation \(\overline{\cdot}\) to denote the representations of terms, while the artifact uses square brackets: for a term e, [e] or [TF|e] denotes its representation in Tagless-final HOAS, [MS|e] denotes its representation in Mogensen-Scott HOAS, and [DB|e] denotes its representation in Tagless-final deBruijn. The artifact does not provide syntax for constructing pre-representations of terms.
The syntax <e> denotes the beta-normal form of e. This is used in a few places as a convenience.
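To illustrate these notations, here is a small hypothetical program (the names idRep and idNorm are ours and not part of the artifact; Exp is the representation type provided by the BB library, as in the speedup tests below). It builds the Tagless-final HOAS representation of id and binds the β-normal form of id Id id:

load "Id"; load "BB";
decl idRep : Exp Id = [TF|id];
decl idNorm : Id = <id Id id>;
idNorm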
3.2 The Evaluators
The artifact includes an evaluator for each of the reductions considered in the paper – call-by-value, normal order, and memoized normal order. The evaluators are instrumented to measure steps of reduction to a normal form after erasing types. The normal order and memoized normal order evaluators count steps to a β-normal form, and for call-by-value we count steps to a value (which, after type-erasure, is a λ-abstraction). We implement two evaluators for call-by-value – a simple one that performs poorly on large tests (the fourth Futamura projection in particular), and a more complex one that is also more efficient. The latter reduces to β-normal form, but counts reductions to a value. We use the simple evaluator primarily to validate the more efficient/complex one.
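Each evaluator is selected by passing its name (CBV, SimpleCBV, NormalOrder, or MemoNormalOrder) as the first argument to the test runners described below. For example, invocations like the following (using the overhead test file described below) run the same speedup test under the memoized normal-order and simple call-by-value evaluators:

$ stack exec runSpeedupTest MemoNormalOrder 1 fomega/speedupTest-bbint-overhead-fact-1.fw
$ stack exec runSpeedupTest SimpleCBV 1 fomega/speedupTest-bbint-overhead-fact-1.fw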
3.3 The Test Types
This artifact includes three types of test, each of which is run by a corresponding test runner. The types are:
3.3.1 Speedup tests
Speedup tests compare the runtime of two terms on some input(s). Each test declares a term pgm and a Tagless-final HOAS representation rep. The expectation is that the β-normal form of rep represents a program equivalent to pgm. The test runner normalizes rep and recovers the represented term. Then it compares the runtime of that term with pgm, when applied to some arguments also specified in the test file.
Here is a typical example of a speedup test that measures specialization speedup. The code is from speedupTest-ssrmix-bbint-compile-fact-1.fw, one of the tests we use to experimentally demonstrate Jones-optimality of our partial evaluator:
load "BB"; load "SSRMix"; load "BBUnquote"; load "Church"; decl pgm : Nat → Nat = fact ; decl rep : Exp (Nat → Nat) = mix (Exp (Nat → Nat)) (Nat → Nat) [unquote (Nat → Nat)] [[fact]] ; decl arg1 : Nat = one;
This test computes the speedup obtained by "compiling" the factorial function fact, when run with input one (the Church encoding of 1). Compilation is done by specializing the Tagless-final HOAS (Boehm-Berarducci) interpreter unquote to the representation of the factorial function fact.
We also use speedup tests to measure interpretational overhead. Here's an example taken from speedupTest-bbint-overhead-fact-1.fw:
load "BB"; load "BBUnquote"; load "Church"; decl pgm : Nat → Nat = fact ; decl rep : Exp (Nat → Nat) = [unquote (Nat → Nat) [fact]] ; decl arg1 : Nat = one;
This test compares the runtime of fact with that of unquote (Nat → Nat) [fact] on input one. Because rep is a representation, it is already a β-normal form, so the test runner simply extracts the represented term and runs it.
To run this test, invoke the runSpeedupTest program:
$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-bbint-overhead-fact-1.fw
The `stack exec runSpeedupTest` part tells the Haskell build tool stack to execute the Haskell program runSpeedupTest. The remaining arguments are passed on to runSpeedupTest. The argument CBV specifies to count steps of call-by-value reduction to a value. The parameter 1 indicates that pgm and the program represented by rep should be run on one argument, namely arg1. Running this produces the following output:
$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-bbint-overhead-fact-1.fw
Running pgm
Computing specialized pgm
Running specialized pgm
fomega/speedupTest-bbint-overhead-fact-1.fw,PASSED,34,463,0.07
The last line is in CSV format with columns Filename, Check, pgm Steps, rep Steps, Speedup. The Filename column indicates the test file. The Check column indicates whether the outputs from pgm and the program represented by rep are β-equivalent on the input(s). The pgm Steps column contains the number of steps pgm takes on the input(s). The rep Steps column contains the number of steps the program represented by rep takes on the input(s). For CBV and SimpleCBV, the steps are to a value; for NormalOrder and MemoNormalOrder they are steps to a β-normal form. The last column, Speedup, is \(\frac{\tt{pgm\ steps}}{\tt{rep\ steps}}\) (see the Wikipedia article on speedup for details). A speedup greater than 1 indicates that the program represented by rep runs faster (i.e. in fewer steps) than pgm on the given input(s). A speedup less than 1 means pgm is faster. In this example we are measuring interpretational overhead, so rep represents the interpretation of pgm and we get a slowdown. Here we are not interested in the speedup but in the overhead, which is \(\frac{\tt{rep\ steps}\ -\ \tt{pgm\ steps}}{\tt{pgm\ steps}} \times 100\) or \((\tt{speedup}^{-1}-1) \times 100\).
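For example, plugging the numbers from the run above into these formulas (34 steps for pgm, 463 steps for rep) gives \(\tt{speedup} = \frac{34}{463} \approx 0.07\) and \(\tt{overhead} = \frac{463 - 34}{34} \times 100 \approx 1262\%\); that is, the interpreted program takes roughly 13.6 times as many steps as fact itself.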
An example of a failing (not β-equivalent) speedup test is fomega/speedupTest-not-equivalent.fw, which compares plus one and plus two on input one:
load "Church"; load "BB"; decl pgm : Nat → Nat = plus one ; decl rep : Exp (Nat → Nat) = [plus two] ; decl arg1 : Nat = one ;
When we run this test, we see that runSpeedupTest reports that the β-equivalence check failed:
$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-not-equivalent.fw
Running pgm
Computing specialized pgm
Running specialized pgm
fomega/speedupTest-not-equivalent.fw,FAILED,5,6,0.83
3.3.2 Typecheck tests
Typecheck tests check whether a program typechecks successfully. In particular, a typecheck test checks that the final term and its transitive dependencies all typecheck; a type error lurking in an unused declaration will not be detected. Examples:
$ stack exec runTypecheckTest fomega/typecheckTestSuccess.fw
Succeeded.

$ stack exec runTypecheckTest fomega/typecheckTestFailure.fw
Failed. expected arrow type: l = x0 ty l = A0
The second example attempts to typecheck the term \A:*. \x:A. x x, which fails. This demonstrates that the test runner does in fact detect and report type errors (i.e. it doesn't simply output "Succeeded." for all inputs). It also demonstrates that our type errors are not user-friendly. Our results do not rely on these error messages. We use typecheck tests only to determine whether a program type checks.
3.3.3 Equivalence tests
Equivalence tests check if the normal forms of two terms or types are α-equivalent. The terms are given in a Church pair. The runner extracts each component, normalizes them, and tests that the normal forms are α-equivalent. These tests are run with runEquivTest.
$ stack exec runEquivTest NormalOrder fomega/equivTestSuccess.fw
normalized fst in 6 steps
normalized snd in 13 steps
Succeeded.

$ stack exec runEquivTest NormalOrder fomega/equivTestFailure.fw
normalized fst in 0 steps
normalized snd in 3 steps
Failed. Not alpha-equivalent:
λs9024:*. λz9025:*. z9025...
λs9026:*. λz9027:*. s9026 z9027...
Diff:
x9031 ...
x9028 x9031 ...
Again, the second test demonstrates that our equivalence test runner does catch and report failures when the two terms are not equivalent (though it doesn't provide very informative error messages). In this case, it fails because the encoding of 0 is not equivalent to the encoding of 1. Again, our results do not rely upon these error messages.
We can use runEquivTest to demonstrate the difference between the two call-by-value evaluators. The file fomega/equivTestSimpleCBVFailure.fw compares two values that are β-equivalent but not β-normal. The SimpleCBV evaluator reduces both terms to values in 0 steps and then the α-equivalence check fails.
$ stack exec runEquivTest SimpleCBV fomega/equivTestSimpleCBVFailure.fw
normalized fst in 0 steps
normalized snd in 0 steps
Failed.
...
The CBV evaluator reports that the terms reduce to values in 0 steps. However, it continues reducing to β-normal form, so the α-equivalence check succeeds.
$ stack exec runEquivTest CBV fomega/equivTestSimpleCBVFailure.fw
normalized fst in 0 steps
normalized snd in 0 steps
Succeeded.
3.4 The Tests
We now describe the various test files included in the artifact, and relate them to the results in the paper. You can run each test individually, or run them all via the command:
$ bin/popl18-artifact.sh
3.4.1 Correctness of the CBV evaluator
As mentioned above, we implement two call-by-value evaluators. One is simple and quite standard, but takes a very long time on large benchmarks like the fourth Futamura projection. The other is more efficient, but reduces to β-normal form instead of stopping at a value; however, it only counts the number of call-by-value reduction steps needed to reach the value, and the subsequent normalization work is not counted.
Our first test compares the counts reported by these two evaluators on our suite of first Futamura projection tests. We run the entire suite with each evaluator and check that the outputs are identical.
$ ./testCBVEvaluator.sh
PASSED
To inspect the output manually, you can use the show parameter. You will see output of the following form:
$ ./testCBVEvaluator.sh show
Output from CBV
...
Output from SimpleCBV
...
3.4.2 Table 2: Interpretational Overhead
Table 2 measures the average overhead of interpreting a suite of programs with each of our interpreters. The script interpretationalOverhead.sh performs these measurements.
$ ./interpretationalOverhead.sh
3.4.3 Table 3: Jones Optimality
Table 3 demonstrates Jones optimality of our partial evaluator based on specialization-safe normalization. We specialize our self-interpreters to the cube, fact, and ackermann functions on Church numerals, and measure speedups on a range of inputs for each. To perform these measurements, run JonesOptimality.sh with the argument ssr, specifying to use the Jones optimal partial evaluator based on specialization-safe normalization.
$ ./JonesOptimality.sh ssr
3.4.4 Table 4: Non-Optimality of Specialization by Beta-Normalization
Table 4 demonstrates that specialization by beta-normalization is not Jones optimal. It reports speedups obtained by the same specializations as in Table 3, but using a beta-normalizer as the partial evaluator. To perform these measurements, run JonesOptimality.sh with the argument beta.
$ ./JonesOptimality.sh beta
3.4.5 Tables 5-7: Futamura Projection Speedups
Tables 5 through 7 show our speedups for the second, third, and fourth Futamura projections (Table 3 on Jones optimality covers the first projection). Run Futamura.sh with argument 2, 3 or 4 to perform the measurements for the second, third, or fourth projection, respectively.
$ ./Futamura.sh 2
...
$ ./Futamura.sh 3
...
$ ./Futamura.sh 4
...