# Jones-Optimal Partial Evaluation by Specialization-Safe Normalization

## 1 Introduction

This page describes the software artifact for the paper *Jones-Optimal Partial Evaluation by Specialization-Safe Normalization* by Matt Brown and Jens Palsberg (POPL'18). The full paper, including the appendix containing our proofs, is available here.

The paper presents a typed and Jones optimal self-applicable partial evaluator for F\(_\omega^{\mu i}\), a language first presented in our POPL'17 paper that supports typed self-applicable meta-programming. The primary purpose of this artifact is to reproduce our experimental results presented in Section 7. It also includes the following tools for doing further experiments:

- An implementation of F\(_\omega^{\mu i}\) in Haskell, consisting of:
  - a parser with syntactic sugar for defining abbreviations, loading definitions from other files, building representations, and beta-normalizing terms
  - a type checker
  - quoters for the Tagless-final HOAS (Boehm-Berarducci), Mogensen-Scott, and Tagless-final deBruijn representations described in the paper
  - evaluators that count steps of call-by-value reduction to a value, steps of normal-order reduction to a beta-normal form, and steps of memoized normal-order reduction to a beta-normal form
  - a beta-equivalence checker

- Source code in our concrete syntax for our partial evaluator and self-interpreters.
- A suite of F\(_\omega^{\mu i}\) programs and libraries.

## 2 Step 1: Installation

### 2.1 Requirements

The tests should be run on a modern Unix-like system (e.g. Linux or Mac OS X). Our programs are memory intensive – we test on a computer with 192GB of RAM, and the Unix time utility reports a maximum resident set size of approximately 80GB. For reviewers that don't have access to a computer with at least 80GB of RAM, we provide instructions for running on Amazon EC2.

### 2.2 Running locally

If you have a computer with 80GB of RAM, you can run the artifact locally. The artifact uses standard POSIX tools like `bash`, `awk`, `cat`, and `diff`. We also require that Python 3.X and the Haskell build tool `stack` are installed. Click here for `stack` installation instructions. Once these requirements are installed, download and extract the artifact sources:

- Download from http://compilers.cs.ucla.edu/popl18/popl18-artifact.zip
- Extract:

```
$ unzip popl18-artifact.zip
$ cd popl18-artifact/
```

The following commands will download and compile all the Haskell dependencies, and build and test the F\(_\omega^{\mu i}\) implementation. Make sure to run these from within the `popl18-artifact` directory.

```
$ stack setup
$ stack build
$ stack test
```

To run all the tests and measurements of the artifact, use the `runAll.sh` script. This will take some time (about 6 hours on our server) and produce a lot of output. `runAll.sh` runs a series of speedup tests to produce the data presented in Section 7 of the paper. We recommend you pipe the output to a file for inspection later.

```
$ ./runAll.sh > ~/popl18.log
```

### 2.3 Running on Amazon EC2

If you don't have a computer with 80GB of RAM, you might consider running on Amazon EC2. We tested using a r4.4xlarge instance running Ubuntu 16.04. Unfortunately, the r4.4xlarge instance does not qualify for Amazon's free tier.

```
$ sudo apt-get update
$ sudo apt-get install -y python3 unzip
$ curl -sSL https://get.haskellstack.org/ | sh
```

Next, download and build the artifact sources, and run the unit test suite:

```
$ wget http://compilers.cs.ucla.edu/popl18/popl18-artifact.zip
$ unzip popl18-artifact.zip
$ cd popl18-artifact/
$ stack setup
$ stack build
$ stack test
```

To run all the tests and measurements of the artifact, use the `runAll.sh` script. `runAll.sh` runs a series of speedup tests to produce the data presented in Section 7 of the paper. This will take some time (about 6 hours on our server) and produce a lot of output. We recommend you pipe the output to a file for inspection later.

```
$ ./runAll.sh > ~/popl18.log
```

If you're running on EC2, you may want to shut down the instance after the run completes. We supply a script `runAllEC2.sh` that saves the output from `runAll.sh` to `~/popl18.log` and then halts the instance. Since running everything takes a while, we recommend you run it with `nohup` so it will continue if your ssh connection drops:

```
$ nohup ./runAllEC2.sh &
```

## 3 Step 2: Evaluation

### 3.1 Description of concrete syntax

Our parser supports the declaration of abbreviations via the `decl` directive. For example, the following example (taken from `fomega/lib/Id.fw`) declares `id` to be the polymorphic identity function, and `Id` to be its type.

```
decl Id : * = ∀A:*. A → A;
decl id : Id = ΛA:*. λx:A. x;
```

We can load all the declarations from a file in `fomega/lib/` with the `load` directive. For example, the following example (taken from `fomega/id.fw`) loads the above declarations from `fomega/lib/Id.fw` and then applies `id` to itself.

```
load "Id";
id Id id
```

Note that `fomega/id.fw` ends with a term, while `fomega/lib/Id.fw` does not. The files in `fomega/lib` are *libraries* that consist of zero or more `load` and `decl` statements. The files in `fomega/` are *programs* that consist of zero or more `load` and `decl` statements followed by a term to be evaluated.

An important difference between the artifact and the paper is the notation for representations of terms. The paper uses an overline notation \(\overline{\cdot}\) to denote the representations of terms, while the artifact uses square brackets: for a term `e`, `[e]` or `[TF|e]` denotes its representation in Tagless-final HOAS, `[MS|e]` denotes its representation in Mogensen-Scott HOAS, and `[DB|e]` denotes its representation in Tagless-final deBruijn. The artifact does not provide syntax for constructing pre-representations of terms.

The syntax `<e>` denotes the beta-normal form of `e`. This is used in a few places as a convenience.

### 3.2 The Evaluators

The artifact includes an evaluator for each of the reductions considered in the paper – call-by-value, normal order, and memoized normal order. The evaluators are instrumented to measure steps of reduction to a normal form after erasing types. The normal order and memoized normal order evaluators count steps to β-normal form; for call-by-value we count steps to a value (which, after type-erasure, is a λ-abstraction). We implement two evaluators for call-by-value – a simple one that performs poorly on large tests (the fourth Futamura projection in particular), and a more complex one that is also more efficient. The latter reduces to β-normal form, but counts reductions to a value. We use the simple evaluator primarily to validate the more efficient, more complex one.
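To illustrate what "counting steps of call-by-value reduction to a value" means, here is a minimal Haskell sketch of a step-counting call-by-value evaluator for untyped lambda terms with de Bruijn indices. This is our own illustration, not the artifact's implementation; the `Term` type and helper names are assumptions.

```haskell
-- Untyped lambda terms with de Bruijn indices.
data Term = Var Int | Lam Term | App Term Term deriving (Eq, Show)

-- Standard de Bruijn shifting: adjust indices >= cutoff c by d.
shift :: Int -> Int -> Term -> Term
shift d c (Var k)   = Var (if k >= c then k + d else k)
shift d c (Lam t)   = Lam (shift d (c + 1) t)
shift d c (App f a) = App (shift d c f) (shift d c a)

-- Substitute s for variable j.
subst :: Int -> Term -> Term -> Term
subst j s (Var k)   = if k == j then s else Var k
subst j s (Lam t)   = Lam (subst (j + 1) (shift 1 0 s) t)
subst j s (App f a) = App (subst j s f) (subst j s a)

-- Call-by-value evaluation to a value (a lambda), counting beta steps.
-- Assumes a closed, well-typed input so the function position reduces
-- to a lambda.
eval :: Term -> (Term, Int)
eval (App f a) =
  let (Lam body, n1) = eval f
      (v, n2)        = eval a
      (r, n3)        = eval (shift (-1) 0 (subst 0 (shift 1 0 v) body))
  in (r, n1 + n2 + n3 + 1)
eval t = (t, 0)   -- variables and lambdas are already values
```

For instance, `eval (App (App i i) i)` with `i = Lam (Var 0)` returns the identity together with a count of 2 beta steps.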

### 3.3 The Test Types

This artifact includes three types of test. Each type of test is to be
run by the corresponding *test runner*. The types are:

#### 3.3.1 Speedup tests

Speedup tests compare the runtime of two terms on some input(s). Each test declares a term `pgm` and a Tagless-final HOAS representation `rep`. The expectation is that the β-normal form of `rep` represents a program equivalent to `pgm`. The test runner normalizes `rep` and recovers the represented term. Then it compares the runtime of that term with `pgm`, when applied to some arguments also specified in the test file.

Here is a typical example of a speedup test that measures specialization speedup. The code is from `speedupTest-ssrmix-bbint-compile-fact-1.fw`, one of the tests we use to experimentally demonstrate Jones-optimality of our partial evaluator:

```
load "BB";
load "SSRMix";
load "BBUnquote";
load "Church";
decl pgm : Nat → Nat = fact;
decl rep : Exp (Nat → Nat) =
  mix (Exp (Nat → Nat)) (Nat → Nat) [unquote (Nat → Nat)] [[fact]];
decl arg1 : Nat = one;
```

This test computes the speedup obtained by "compiling" the factorial function `fact`, when run with input `one` (the Church encoding of `1`). Compilation is done by specializing the Tagless-final HOAS (Boehm-Berarducci) interpreter `unquote` to the representation of the factorial function `fact`.

We also use speedup tests to measure interpretational overhead. Here's an example taken from `speedupTest-bbint-overhead-fact-1.fw`:

```
load "BB";
load "BBUnquote";
load "Church";
decl pgm : Nat → Nat = fact;
decl rep : Exp (Nat → Nat) = [unquote (Nat → Nat) [fact]];
decl arg1 : Nat = one;
```

This test compares the runtime of `fact` with that of `unquote (Nat → Nat) [fact]` on input `one`. Because `rep` is a representation, it is already a β-normal form, so the test runner simply extracts the represented term and runs it.

To run this test, invoke the `runSpeedupTest` program:

```
$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-bbint-overhead-fact-1.fw
```

The `stack exec runSpeedupTest` part tells the Haskell build tool `stack` to execute the Haskell program `runSpeedupTest`. The remaining arguments are passed on to `runSpeedupTest`. The argument `CBV` specifies to count steps of call-by-value reduction to a value. The parameter `1` indicates that `pgm` and the program represented by `rep` should be run on one argument, namely `arg1`. Running this produces the following output:

```
$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-bbint-overhead-fact-1.fw
Running pgm
Computing specialized pgm
Running specialized pgm
fomega/speedupTest-bbint-overhead-fact-1.fw,PASSED,34,463,0.07
```

The last line is in CSV format with columns `Filename,Check,pgm Steps,rep Steps,Speedup`. The `Filename` column indicates the test file. The `Check` column indicates whether the outputs from `pgm` and the program represented by `rep` are β-equivalent on the input(s). The `pgm Steps` column contains the number of steps `pgm` takes on the input(s). The `rep Steps` column contains the number of steps the program represented by `rep` takes on the input(s). For `CBV` and `SimpleCBV`, the steps are to a value; for `NormalOrder` and `MemoNormalOrder` they are steps to a β-normal form. The last column, `Speedup`, is \(\frac{\tt{pgm\ steps}}{\tt{rep\ steps}}\) (see the Wikipedia article on speedup for details). A speedup greater than 1 indicates that the program represented by `rep` runs faster (i.e. in fewer steps) than `pgm` on the given input(s). A speedup less than 1 means `pgm` is faster. In this case we are measuring the interpretational overhead, so `rep` represents the interpretation of `pgm` and we get a slowdown. Here we are not interested in the speedup but in the overhead, which is \(\frac{\tt{rep\ steps}\ -\ \tt{pgm\ steps}}{\tt{pgm\ steps}} \times 100\), or equivalently \((\tt{speedup}^{-1}-1) \times 100\).
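The arithmetic behind those last two quantities can be sketched as follows (illustrative helper names, not part of the artifact):

```haskell
-- Speedup as reported in the CSV: pgm steps divided by rep steps.
speedup :: Double -> Double -> Double
speedup pgmSteps repSteps = pgmSteps / repSteps

-- Interpretational overhead as a percentage.
-- Algebraically equal to (speedup^(-1) - 1) * 100.
overheadPercent :: Double -> Double -> Double
overheadPercent pgmSteps repSteps = (repSteps - pgmSteps) / pgmSteps * 100
```

For the run above, `speedup 34 463` is about `0.073` (which the test runner rounds to `0.07`), and `overheadPercent 34 463` is about `1262`, i.e. interpretation costs roughly 12.6x the direct run.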

An example of a failing (not β-equivalent) speedup test is `fomega/speedupTest-not-equivalent.fw`, which compares `plus one` and `plus two` on input `one`:

```
load "Church";
load "BB";
decl pgm : Nat → Nat = plus one;
decl rep : Exp (Nat → Nat) = [plus two];
decl arg1 : Nat = one;
```

When we run this test, we see that `runSpeedupTest` reports that the β-equivalence check failed:

```
$ stack exec runSpeedupTest CBV 1 fomega/speedupTest-not-equivalent.fw
Running pgm
Computing specialized pgm
Running specialized pgm
fomega/speedupTest-not-equivalent.fw,FAILED,5,6,0.83
```

#### 3.3.2 Typecheck tests

Typecheck tests check whether a program typechecks successfully. In particular, they test that the final term and its transitive dependencies all typecheck; a type error lurking in an unused declaration will not be detected. Examples:

```
$ stack exec runTypecheckTest fomega/typecheckTestSuccess.fw
Succeeded.
$ stack exec runTypecheckTest fomega/typecheckTestFailure.fw
Failed. expected arrow type: l = x0 ty l = A0
```

The second example attempts to typecheck the term `\A:*. \x:A. x x`, which fails. This demonstrates that the test runner does in fact detect and report type errors (i.e. it doesn't simply output "Succeeded." for all inputs). It also demonstrates that our type errors are not user-friendly. Our results do not rely on these error messages. We use typecheck tests only to determine whether a program typechecks.

#### 3.3.3 Equivalence tests

Equivalence tests check if the normal forms of two terms or types are α-equivalent. The terms are given in a *Church pair*. The runner extracts each component, normalizes them, and tests that the normal forms are α-equivalent. These tests are run with `runEquivTest`.

```
$ stack exec runEquivTest NormalOrder fomega/equivTestSuccess.fw
normalized fst in 6 steps
normalized snd in 13 steps
Succeeded.
$ stack exec runEquivTest NormalOrder fomega/equivTestFailure.fw
normalized fst in 0 steps
normalized snd in 3 steps
Failed. Not alpha-equivalent:
λs9024:*. λz9025:*. z9025...
λs9026:*. λz9027:*. s9026 z9027...
Diff:
x9031 ...
x9028 x9031 ...
```

Again, the second test demonstrates that our equivalence test runner does catch and report failures when the two terms are not equivalent (though it doesn't provide very informative error messages). In this case, it fails because the encoding of `0` is not equivalent to the encoding of `1`. Again, our results do not rely upon these error messages.
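The α-equivalence check itself can be sketched in Haskell. The following is our own illustration with assumed names (`NTerm`, `alphaEq`), not the artifact's checker: two terms are compared under an environment pairing the binders crossed so far, innermost first.

```haskell
-- Named lambda terms.
data NTerm = NVar String | NLam String NTerm | NApp NTerm NTerm

-- alphaEq env s t: are s and t alpha-equivalent, given that the i-th
-- enclosing binder of s is paired with the i-th enclosing binder of t?
alphaEq :: [(String, String)] -> NTerm -> NTerm -> Bool
alphaEq env (NVar x) (NVar y) =
  case lookup x env of
    Just y' -> y == y'                           -- x bound: must match partner
    Nothing -> x == y && notElem y (map snd env) -- both free, y not captured
alphaEq env (NLam x b) (NLam y c) = alphaEq ((x, y) : env) b c
alphaEq env (NApp f a) (NApp g b) = alphaEq env f g && alphaEq env a b
alphaEq _ _ _ = False
```

On the failing example above, `alphaEq []` applied to the two Church numerals (`λs.λz.z` versus `λs.λz.s z`) returns `False`, while renaming the binders of `λs.λz.z` leaves it α-equivalent to itself.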

We can use `runEquivTest` to demonstrate the difference between the two call-by-value evaluators. The file `fomega/equivTestSimpleCBVFailure.fw` compares two values that are β-equivalent but not β-normal. The SimpleCBV evaluator reduces both terms to values in 0 steps, and then the α-equivalence check fails.

```
$ stack exec runEquivTest SimpleCBV fomega/equivTestSimpleCBVFailure.fw
normalized fst in 0 steps
normalized snd in 0 steps
Failed. ...
```

The CBV evaluator reports that the terms reduce to values in 0 steps. However, it continues reducing to β-normal form, so the α-equivalence check succeeds.

```
normalized fst in 0 steps
normalized snd in 0 steps
Succeeded.
```

### 3.4 The Tests

We now describe the various test files included in the artifact, and relate them to the results in the paper. You can run each test individually, or run them all via the command:

```
$ bin/popl18-artifact.sh
```

#### 3.4.1 Correctness of the CBV evaluator

As mentioned above, we implement two call-by-value evaluators. One is simple and quite standard, but takes a very long time on large benchmarks like the fourth Futamura projection. The other is more efficient, but reduces to β-normal form instead of stopping at a value. However, it only counts the number of call-by-value reduction steps to reach the value; the subsequent normalization work is not counted.

Our first test compares the counts reported by these two evaluators on our suite of first Futamura projection tests. We run the entire suite with each evaluator, and compare that the outputs are identical.

```
$ ./testCBVEvaluator.sh
PASSED
```

To inspect the output manually, you can use the `show` parameter. You will see output of the following form:

```
$ ./testCBVEvaluator.sh show
Output from CBV
...
Output from SimpleCBV
...
```

#### 3.4.2 Table 2: Interpretational Overhead

Table 2 measures the average overhead of interpreting a suite of programs with each of our interpreters. The script `interpretationalOverhead.sh` performs these measurements.

```
$ ./interpretationalOverhead.sh
```

#### 3.4.3 Table 3: Jones Optimality

Table 3 demonstrates Jones optimality of our partial evaluator based on specialization-safe normalization. We specialize our self-interpreters to the `cube`, `fact`, and `ackermann` functions on Church numerals, and measure speedups on a range of inputs for each. To perform these measurements, run `JonesOptimality.sh` with the argument `ssr`, specifying to use the Jones-optimal partial evaluator based on specialization-safe normalization.

```
$ ./JonesOptimality.sh ssr
```

#### 3.4.4 Table 4: Non-Optimality of Specialization by Beta-Normalization

Table 4 demonstrates that specialization by beta-normalization is not Jones optimal. It reports speedups obtained by the same specializations as in Table 3, but using a beta-normalizer as the partial evaluator. To perform these measurements, run `JonesOptimality.sh` with the argument `beta`.

```
$ ./JonesOptimality.sh beta
```

#### 3.4.5 Tables 5-7: Futamura Projection Speedups

Tables 5 through 7 show our speedups for the second, third, and fourth Futamura projections (Table 3 on Jones optimality covers the first projection). Run `Futamura.sh` with argument `2`, `3`, or `4` to perform the measurements for the second, third, or fourth projection, respectively.

```
$ ./Futamura.sh 2
...
$ ./Futamura.sh 3
...
$ ./Futamura.sh 4
...
```