========================================================================
                        JIT Generation in PyPy
========================================================================

.. contents::
.. sectnum::


------------------------------------------------------------------------
                           Usage and Status
------------------------------------------------------------------------

Status
======

A foreword of warning about the JIT of PyPy as of March 2007: single
functions doing integer arithmetic get great speed-ups; about anything
else will be a bit slower with the JIT than without.  We are working
on this - you can even expect quick progress, because it is mostly a
matter of adding a few careful hints in the source code of the Python
interpreter of PyPy.

By construction, the JIT is supposed to work correctly on absolutely any
kind of Python code: generators, nested scopes, ``exec`` statements,
``sys._getframe().f_back.f_back.f_locals``, etc. (the latter is an
example of expression that Psyco_ cannot emulate correctly).  However,
there are a couple of known issues for now (see Caveats_).


In more details
---------------

So far there is little point in trying the JIT on anything else than
integer arithmetic-intensive functions (unless you want to help find bugs).
For small examples, you can also look at the machine code it produces, but
if you do please keep in mind that the assembler will look fundamentally
different after we extend the range of PyPy that the JIT generator
processes.

* The produced machine code is kind of good, in the sense that the
  backends perform some reasonable register allocation.  The 386 backend
  computes the lifetime of values within blocks.  The PPC backend does
  not, but the fact that the processor has plenty of registers mitigates
  this problem to some extent.  The PPC backend has at least one known
  bug left.  An LLVM_ backend is started but blocked half-way on a hard
  problem that might not get solved any time soon.

* The *timeshifter*, which produces the JIT frontend, is able to handle
  rather incredibly tricky situations successfully.

* The remaining work is to continue the necessary adjustments of the
  PyPy interpreter source code so that the timeshifter can process more
  of it.  At the moment, the timeshifter sees the interpreter's main
  dispatch loop, integer arithmetic, and a bit of function call logic.
  This means that the produced JIT can remove the bytecode
  interpretation overhead and do a good job with integer arithmetic, but
  cannot optimize at all the manipulations of any other type of objects.

.. _LLVM: http://llvm.org/


How to compile a pypy-c with a JIT
==================================

Go to ``pypy/translator/goal/`` and run::

    ./translate.py --jit targetpypystandalone

Please read the Status_ section above first.

This will produce the C code for a version pypy-c that includes both a
regular interpreter and an automatically generated JIT compiler.  This
pypy-c uses its interpreter by default, and due to some overhead we
expect this interpreter to be a bit slower than the one found in a
pypy-c compiled without JIT.

In addition to ``--jit``, you can also pass the normal options to
``translate.py`` to compile different flavors of PyPy with a JIT.  See
the `compatibility matrix`_ for the combinations known to be working
right now.  (The combination of the JIT with the thunk or taint object
spaces probably works too, but we don't expect it to generate good code
before we drop a few extra hints in the source code of the object
spaces.)

.. _`compatibility matrix`: image/compat-matrix.png

Usage
=====

You can mark one or many code objects as candidates for being run by
the JIT as follows::

    >>>> def f(x): return x*5
    >>>> import pypyjit
    >>>> pypyjit.enable(f.func_code)
    >>>> f(7)
    # the JIT runs here
    35
    >>>> f(8)
    # machine code already generated, no more jitting occurs here
    40

A few examples of this kind can be found in `demo/jit/`_.  The script
`demo/jit/f1.py`_ shows a function that becomes seriously faster with
the JIT - only 10% to 20% slower than what ``gcc -O0`` produces from the
obvious equivalent C code, a result similar to Psyco.  Although the JIT
generation process is well-tested, we only have a few tests directly for
the final ``pypy-c``.  Try::

    pypy-c test_all.py module/pypyjit/test/test_pypy_c.py -A --nomagic

You can get a dump of the generated machine code by setting the
environment variable ``PYPYJITLOG`` to a file name before you start
pypy-c.  See `In more details`_ above.  To inspect this file, use the
following tool::

    python  pypy/jit/codegen/i386/viewcode.py  dumpfilename

The viewcode.py script is based on the Linux tool ``objdump`` to produce
a disassembly.  It should be easy to port to OS/X.  If you want to port
the tool to Windows, have a look at
http://codespeak.net/svn/psyco/dist/py-utils/xam.py : this is the tool
from which viewcode.py was derived in the first place, but
Windows-specific parts were omitted for lack of a Windows machine to try
them on.

Caveats
-------

When running JIT'ed code, the bytecode tracing hook is not invoked.  This
should be the only visible effect of the JIT, aside from the debug
prints and the speed/memory impact.  In practice, of course, it still
has also got rough edges.

One of them is that all compile-time errors are fatal for now, because
it is hard to recover from them.  Clearly, the compiler is not
*supposed* to fail, but it can occur because the memory runs out, a bug
hits, or for more subtle reasons.  For example, overflowing the stack is
likely to cause the JIT compiler to try to compile the app-level
handlers for the RuntimeError, and compiling takes up stack space too -
so the compiler, running on top of the already-full stack, might hit the
stack limit again before it has got a chance to generate code for the
app-level handlers.


------------------------------------------------------------------------
                     Make your own JIT compiler
------------------------------------------------------------------------

Introduction
============

The central idea of the PyPy JIT is to be *completely independent from
the Python language*.  We did not write down a JIT compiler by hand.
Instead, we generate it anew during each translation of pypy-c.

This means that the same technique works out of the box for any other
language for which we have an interpreter written in RPython.  The
technique works on interpreters of any size, from tiny to PyPy.

Aside from the obvious advantage, it means that we can show all the
basic ideas of the technique on a tiny interpreter.  The fact that we
have done the same on the whole of PyPy shows that the approach scales
well.  So we will follow in the sequel the example of small interpreters
and insert a JIT compiler into them during translation.

The important terms:

* *Translation time*: while you are running ``translate.py`` to produce
  the static executable.

* *Compile time*: when the JIT compiler runs.  This really occurs at
  runtime, as it is a Just-In-Time compiler, but we need a consistent
  way of naming the part of runtime that is occupied by running the
  JIT support code and generating more machine code.

* *Run time*: the execution of the user program.  This can mean either
  when the interpreter runs (for parts of the user program that have not
  been JIT-compiled), or when the generated machine code runs.


A first example
===============

Source
------

Let's consider a very small interpreter-like example::

        def ll_plus_minus(s, x, y):
            acc = x
            pc = 0
            while pc < len(s):
                op = s[pc]
                hint(op, concrete=True)
                if op == '+':
                    acc += y
                elif op == '-':
                    acc -= y
                pc += 1
            return acc

Here, ``s`` is an input program which is simply a string of ``'+'`` or
``'-'``.  The ``x`` and ``y`` are integer input arguments.  The source
code of this example is in `pypy/jit/tl/tiny1.py`_.

Hint
----

Ideally, turning an interpreter into a JIT compiler is only a matter of
adding a few hints.  In practice, the current JIT generation framework
has many limitations and rough edges requiring workarounds.  On the
above example, though, it works out of the box.  We only need one hint,
the central hint that all interpreters need.  In the source, it is the line::

    hint(op, concrete=True)

This hint says: "at this point in time, ensure that ``op`` is a
compile-time constant".  The motivation for such a hint is that the most
important source of inefficiency in a small interpreter is the switch on
the next opcode, here ``op``.  If ``op`` is known at the time where the
JIT compiler runs, then the whole switch dispatch can be constant-folded
away; only the case that applies remains.

The way the ``concrete=True`` hint works is by setting a constraint: it
requires ``op`` to be a compile-time constant.  During translation, a
phase called *hint-annotation* processes these hints and tries to
satisfy the constraints.  Without further hints, the only way that
``op`` could be a compile-time constant is if all the other values that
``op`` depends on are also compile-time constants.  So the
hint-annotator will also mark ``s`` and ``pc`` as compile-time
constants.

Colors
------

You can see the results of the hint-annotator with the following
commands::

    cd pypy/jit/tl
    python ../../translator/goal/translate.py --hintannotate targettiny1.py

Click on ``ll_plus_minus`` in the Pygame viewer to get a nicely colored
graph of that function.  The graph contains the low-level operations
produced by RTyping.  The *green* variables are the ones that have been
forced to be compile-time constants by the hint-annotator.  The *red*
variables are the ones that will generally not be compile-time
constants, although the JIT compiler is also able to do constant
propagation of red variables if they contain compile-time constants in
the first place.

In this example, when the JIT runs, it generates machine code that is
simply a list of additions and subtractions, as indicated by the ``'+'``
and ``'-'`` characters in the input string.  To understand why it does
this, consider the colored graph of ``ll_plus_minus`` more closely.  A
way to understand this graph is to consider that it is no longer the
graph of the ``ll_plus_minus`` interpreter, but really the graph of the
JIT compiler itself.  All operations involving only green variables will
be performed by the JIT compiler at compile-time.  In this case, the
whole looping code only involves the green variables ``s`` and ``pc``,
so the JIT compiler itself will loop over all the opcodes of the
bytecode string ``s``, fetch the characters and do the switch on them.
The only operations involving red variables are the ``int_add`` and
``int_sub`` operations in the implementation of the ``'+'`` and ``'-'``
opcodes respectively.  These are the operations that will be generated
as machine code by the JIT.

Over-simplifying, we can say that at the end of translation, in the
actual implementation of the JIT compiler, operations involving only
green variables are kept unchanged, and operations involving red
variables have been replaced by calls to helpers.  These helpers contain
the logic to generate a copy of the original operation, as machine code,
directly into memory.

Translating
-----------

Now try translating ``tiny1.py`` with a JIT without stopping at the
hint-annotation viewer::

    python ../../translator/goal/translate.py --jit targettiny1.py

Test it::

    ./targettiny1-c +++-+++ 100 10
    150

What occurred here is that the colored graph seen above was turned into
a JIT compiler, and the original ``ll_plus_minus`` function was patched.
Whenever that function is called from the rest of the program (in this
case, from ``entry_point()`` in `pypy/jit/tl/targettiny1.py`_), then
instead of the original code performing the interpretation, the patched
function performs the following operations:

* It looks up the value of its green argument ``s`` in a cache (the red
  ``x`` and ``y`` are not considered here).

* If the cache does not contain a corresponding entry, the JIT compiler
  is called to produce machine code.  At this point, we pass to the JIT
  compiler the value of ``s`` as a compile-time constant, but ``x`` and
  ``y`` remain variables.

* Finally, the machine code (either just produced or retrieved from the
  cache) is invoked with the actual values of ``x`` and ``y``.

The idea is that interpreting the same bytecode over and over again with
different values of ``x`` and ``y`` should be the fast path: the
compilation step is only required the first time.

On 386-compatible processors running Linux, you can inspect the
generated machine code as follows::

    PYPYJITLOG=log ./targettiny1-c +++-+++ 100 10
    python ../../jit/codegen/i386/viewcode.py log

If you are familiar with GNU-style 386 assembler, you will notice that
the code is a single block with no jump, containing the three additions,
the subtraction, and the three further additions.  The machine code is
not particularly optimal in this example because all the values are
input arguments of the function, so they are reloaded and stored back in
the stack at every operation.  The current backend tends to use
registers in a (slightly) more reasonable way on more complicated
examples.


A slightly less tiny interpreter
================================

The interpreter in `pypy/jit/tl/tiny2.py`_ is a reasonably good example
of the difficulties that we meet when scaling up this approach, and how
we solve them - or work around them.  For more details, see the comments
in the source code.  With more work on the JIT generator, we hope to be
eventually able to remove the need for the workarounds.

Promotion
---------

The most powerful hint introduced in this example is ``promote=True``.
It is applied to a value that is usually not a compile-time constant,
but which we would like to become a compile-time constant "just in
time".  Its meaning is to instruct the JIT compiler to stop compiling at
this point, wait until the runtime actually reaches that point, grab the
value that arrived here at runtime, and go on compiling with the value
now considered as a compile-time constant.  If the same point is reached
at runtime several times with several different values, the compiler
will produce one code path for each, with a switch in the generated
code.  This is a process that is never "finished": in general, new
values can always show up later during runtime, causing more code paths
to be compiled and the switch in the generated code to be extended.

Promotion is the essential new feature introduced in PyPy when compared
to existing partial evaluation techniques (it was actually first
introduced in Psyco [JITSPEC]_, which is strictly speaking not a partial
evaluator).

Another way to understand the effect of promotion is to consider it as a
complement to the ``concrete=True`` hint.  The latter tells the
hint-annotator that the value that arrives here is required to be a
compile-time constant (i.e. green).  In general, this is a very strong
constraint, because it forces "backwards" a potentially large number of
values to be green as well - all the values that this one depends on.
In general, it does not work at all, because the value ultimately
depends on an operation that cannot be constant-folded at all by the JIT
compiler, e.g. because it depends on external input or reads from
non-immutable memory.

The ``promote=True`` hint can take an arbitrary red value and returns it
as a green variable, so it can be used to bound the set of values that
need to be forced to green.  A common idiom is to put a
``concrete=True`` hint at the precise point where a compile-time
constant would be useful (e.g. on the value on which a complex switch
dispatches), and then put a few ``promote=True`` hints to copy specific
values into green variables *before* the ``concrete=True``.

The ``promote=True`` hints should be applied where we expect not too
many different values to arrive at runtime; here are typical examples:

* Where we expect a small integer, the integer can be promoted if each
  specialized version can be optimized (e.g. lists of known length can
  be optimized by the JIT compiler).

* The interpreter-level class of an object can be promoted before an
  indirect method call, if it is useful for the JIT compiler to look
  inside the called method.  If the method call is indirect, the JIT
  compiler merely produces a similar indirect method call in the
  generated code.  But if the class is a compile-time constant, then it
  knows which method is called, and compiles its operations (effectively
  inlining it from the point of the view of the generated code).

* Whole objects can be occasionally promoted, with care.  For example,
  in an interpreter for a language which has function calls, it might be
  useful to know exactly which Function object is called (as opposed to
  just the fact that we call an object of class Function).

Other hints
-----------

The other hints mentioned in `pypy/jit/tl/tiny2.py`_ are "global merge
points" and "deepfreeze".  For more information, please refer to the
explanations there.

We should also mention a technique not used in ``tiny2.py``, which is
the notion of *virtualizable* objects.  In PyPy, the Python frame
objects are virtualizable.  Such objects assume that they will be mostly
read and mutated by the JIT'ed code - this is typical of frame objects
in most interpreters: they are either not visible at all for the
interpreted programs, or (as in Python) you have to access them using
some reflection API.  The ``_virtualizable_`` hint allows the object to
escape (e.g. in PyPy, the Python frame object is pushed on the
globally-accessible frame stack) while still remaining efficient to
access from JIT'ed code.


------------------------------------------------------------------------
                  JIT Compiler Generation - Theory
------------------------------------------------------------------------

.. _warning:

    This section is work in progress!

Introduction
============

One the of the central goals of the PyPy project is to automatically
produce a Just in Time Compiler from the interpreter, with as little
as possible intervention on the interpreter codebase itself.  The Just
in Time Compiler should be another aspect as much as possible
transparently introduced by and during the translation process.

Partial evaluation techniques should, at least theoretically, allow
such a derivation of a compiler from an interpreter.

.. XXX references

The forest of flow graphs that the translation process generates and
transforms constitutes a reasonable base for the necessary analyses.
That's a further reason why having an high-level runnable and
analysable description of the language was always a central tenet of
the project.

Transforming an interpreter into a compiler involves constructing a so
called *generating extension*, which takes input programs to the
interpreter, and produces what would be the output of partially
evaluating the interpreter, with the input program fixed and the input
data left as a variable. The generating extension is essentially
capable of compiling the input programs.

Generating extensions can be produced by self-applying partial evaluators,
but this approach may lead to not optimal results or be not scalable.

.. XXX expand this argument

For PyPy, our approach aims at producing the generating extension more
directly from the analysed interpreter in the form of a forest of flow
graphs. We call such process *timeshifting*.

To be able to achieve this, gathering *binding time* information is
crucial.  This means distinguishing values in the data-flow of the
interpreter which are compile-time bound and immutable at run-time,
versus respectively runtime values.

Currently we base the binding time computation on propagating the
information based on a few hint inserted in the interpreter. Propagation
is implemented by reusing our `annotation/type inference framework`__.

__ annotator_

The code produced by a generating extension for an input program may
not be good, especially for a dynamic language, because essentially
the input program doesn't contain enough information to generate good
code. What is really desired is not a generating extension doing
static compilation, but one capable of dynamic compilation, exploiting
runtime information in its result. Compilation should be able to
suspend, let the produced code run to collect run-time information
(for example language-level types), and then resume with this extra
information.  This allow it to generate code optimised for the effective
run-time behaviour of the program.

Inspired by Psyco, which is in some sense a hand-written generating
extension for Python, we added support
for so-called *promotion* to our framework for producing generating
extensions.

Simply put, promotion on a value stops compilation and waits until the
runtime reaches this point.  When it does, the actual runtime value is
promoted into a compile-time constant, and compilation resumes with this
extra information.  Concretely, the promotion expands into a switch in
the generated code.  The switch contains one case for each runtime value
encountered so far, to chose which of the specialized code paths the
runtime execution should follow.  The switch initially contains no case
at all, but only a fall-back path.  The fall-back invokes the compiler
with the newly seen runtime value.  The compiler produces the new
specialized code path.  It then patches the switch to add this new case,
so that the next time the same value is encountered at runtime, the
execution can directly jump to the correct specialized code path.

This can also be thought of as a generalisation of polymorphic inline
caches.

.. XXX reference


Partial Evaluation
==================

Partial evaluation is the process of evaluating a function, say ``f(x,
y)``, with only partial information about the value of its arguments,
say the value of the ``x`` argument only.  This produces a *residual*
function ``g(y)``, which takes less arguments than the original - only
the information not specified during the partial evaluation process need
to be provided to the residual function, in this example the ``y``
argument.

Partial evaluation (PE) comes in two flavors:

* *On-line* PE: a compiler-like algorithm takes the source code of the
  function ``f(x, y)`` (or its intermediate representation, i.e. its
  control flow graph in PyPy's terminology), and some partial
  information, e.g. ``x = 5``.  From this, it produces the residual
  function ``g(y)`` directly, by following in which operations the
  knowledge ``x = 5`` can be used, which loops can be unrolled, etc.

* *Off-line* PE: in many cases, the goal of partial evaluation is to
  improve performance in a specific application.  Assume that we have a
  single known function ``f(x, y)`` in which we think that the value of
  ``x`` will change slowly during the execution of our program - much
  more slowly than the value of ``y``.  An obvious example is a loop
  that calls ``f(x, y)`` many times with always the same value ``x``.
  We could then use an on-line partial evaluator to produce a ``g(y)``
  for each new value of ``x``.  In practice, the overhead of the partial
  evaluator might be too large for it to be executed at run-time.
  However, if we know the function ``f`` in advance, and if we know
  *which* arguments are the ones that we will want to partially evaluate
  ``f`` with, then we do not need a full compiler-like analysis of ``f``
  every time the value of ``x`` changes.  We can precompute off-line a
  specialized function ``f1(x)``, which when called produces a residual
  function ``g(y)``.

Off-line partial evaluation is based on *binding-time analysis*, which
is the process of determining among the variables used in a function (or
a set of functions) which ones are going to be known in advance and which
ones are not.  In the above example, such an analysis would be able to
infer that the constantness of the argument ``x`` implies the
constantness of many intermediate values used in the function.  The
*binding time* of a variable determines how early the value of the
variable will be known.

The PyPy JIT is generated using off-line partial evaluation.  As such,
there are three distinct phases:

* *Translation time*: during the normal translation of an RPython
  program like PyPy, we perform binding-time analysis and off-line
  specialization.  This produces a new set of functions (``f1(x)`` in
  our running example) which are linked with the rest of the program.

* *Compile time*: during the execution of the program, when a new value
  for ``x`` is found, ``f1(x)`` is invoked.  All the computations
  performed by ``f1(x)`` are called compile-time computations.  This is
  justified by the fact that ``f1(x)`` is in some sense a compiler,
  whose sole effect is to produce residual code.

* *Run time*: the normal execution of the program.

The binding-time terminology that we are using in PyPy is based on the
colors that we use when displaying the control flow graphs:

* *Green* variables contain values that are known at compile-time -
  e.g. ``x``.

* *Red* variables contain values that are not known until run-time -
  e.g. ``y``.


For more information
====================

The `expanded version of the present document`_ may be of interest to
you if you are already familiar with the domain of Partial Evaluation
and are looking for a quick overview of some of our techniques.

.. _`expanded version of the present document`: discussion/jit-draft.html

---------------


.. _VMC: http://codespeak.net/svn/pypy/extradoc/talk/dls2006/pypy-vm-construction.pdf
.. _`RPython`: coding-guide.html#rpython
.. _`RPython Typer`: translation.html#rpython-typer
.. _`low-level graphs`: rtyper.html
.. _`pointer-and-structures-like objects`: rtyper.html#low-level-types 
.. _`annotator`: dynamic-language-translation.html
.. _`specialization of functions`: dynamic-language-translation.html#specialization
.. _Psyco: http://psyco.sourceforge.net
.. _`PyPy Standard Interpreter`: architecture.html#standard-interpreter
.. _`exception transformer`: translation.html#making-exception-handling-explicit
.. [JITSPEC] Representation-Based Just-In-Time Specialization and the
           Psyco Prototype for Python, ACM SIGPLAN PEPM'04, August 24-26, 2004,
           Verona, Italy.
           http://psyco.sourceforge.net/psyco-pepm-a.ps.gz

.. include:: _ref.txt
