Possible optimizations for the CLI backend
==========================================

Stack push/pop optimization
---------------------------

The CLI's VM is a stack-based machine, which doesn't play nicely
with the SSI form the flowgraphs are generated in. At the moment
gencli does a literal translation of the SSI statements, allocating a
new local variable for each variable of the flowgraph.

For example, consider the following RPython code and the corresponding
flowgraph::

  def bar(x, y):
      foo(x+y, x-y)


  inputargs: x_0 y_0
  v0 = int_add(x_0, y_0)
  v1 = int_sub(x_0, y_0)
  v2 = directcall((sm foo), v0, v1)

This is the IL code generated by the CLI backend::

  .locals init (int32 v0, int32 v1, int32 v2)
    
  block0:
    ldarg 'x_0'
    ldarg 'y_0'
    add 
    stloc 'v0'
    ldarg 'x_0'
    ldarg 'y_0'
    sub 
    stloc 'v1'
    ldloc 'v0'
    ldloc 'v1'
    call int32 foo(int32, int32)
    stloc 'v2'

As you can see, the results of 'add' and 'sub' are stored in v0 and
v1, respectively, then v0 and v1 are reloaded onto the stack. These
store/load pairs are redundant, since the code would work just as well
without them::

  .locals init (int32 v2)
    
  block0:
    ldarg 'x_0'
    ldarg 'y_0'
    add 
    ldarg 'x_0'
    ldarg 'y_0'
    sub 
    call int32 foo(int32, int32)
    stloc 'v2'

I've checked the native code generated by the Mono JIT on x86 and I've
seen that it does not optimize it. I haven't checked the native code
generated by the Microsoft CLR yet.

Thus, we might consider optimizing it manually; it should not be very
difficult, but it is not trivial because we have to make sure that the
dropped locals are used only once.
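
The check described above can be sketched as a small peephole pass. This is
only an illustration under stated assumptions, not gencli code: instructions
are modelled as hypothetical ``(opcode, operand)`` tuples, the method is
assumed to consist of a single block, and only a handful of opcodes are
understood (anything else is treated as a barrier). A store/load pair is
dropped when the local is stored once, loaded once, and the instructions in
between leave the stack exactly as they found it.

```python
# (pops, pushes) for the few opcodes this sketch understands.
# Any opcode missing from this table is treated as a barrier.
STACK_EFFECT = {
    "ldarg": (0, 1),
    "ldloc": (0, 1),
    "ldc.i4": (0, 1),
    "stloc": (1, 0),
    "add": (2, 1),
    "sub": (2, 1),
}


def is_stack_neutral(instrs):
    """True if instrs never touch values below their starting stack depth
    and leave the depth unchanged at the end."""
    depth = 0
    for opcode, _ in instrs:
        if opcode not in STACK_EFFECT:
            return False
        pops, pushes = STACK_EFFECT[opcode]
        depth -= pops
        if depth < 0:
            return False
        depth += pushes
    return depth == 0


def drop_redundant_locals(block):
    """Remove stloc/ldloc pairs for locals that are stored and loaded once."""
    block = list(block)
    changed = True
    while changed:
        changed = False
        for i, (opcode, var) in enumerate(block):
            if opcode != "stloc":
                continue
            stores = [k for k, ins in enumerate(block) if ins == ("stloc", var)]
            loads = [k for k, ins in enumerate(block) if ins == ("ldloc", var)]
            if len(stores) != 1 or len(loads) != 1 or loads[0] < i:
                continue
            j = loads[0]
            if is_stack_neutral(block[i + 1:j]):
                del block[j]  # delete the load first so index i stays valid
                del block[i]
                changed = True
                break
    return block


# The block0 example from above, in the hypothetical tuple form.
block0 = [
    ("ldarg", "x_0"), ("ldarg", "y_0"), ("add", None), ("stloc", "v0"),
    ("ldarg", "x_0"), ("ldarg", "y_0"), ("sub", None), ("stloc", "v1"),
    ("ldloc", "v0"), ("ldloc", "v1"),
    ("call", "int32 foo(int32, int32)"), ("stloc", "v2"),
]
optimized = drop_redundant_locals(block0)
```

On this input the pass removes the v0 and v1 pairs and keeps the final
``stloc 'v2'``, since v2 is never loaded. A real implementation would also
have to count uses across all blocks of the flowgraph.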


Mapping RPython exceptions to native CLI exceptions
---------------------------------------------------

Both RPython and CLI have their own set of exception classes: some of
these are pretty similar; e.g., we have OverflowError,
ZeroDivisionError and IndexError on the first side and
OverflowException, DivideByZeroException and IndexOutOfRangeException
on the other side.

The first attempt was to map RPython classes to their corresponding
CLI ones: this worked for simple cases, but it would have triggered
subtle bugs in more complex ones, because the two exception
hierarchies don't completely overlap.

For now I've chosen to build an RPython exception hierarchy
completely independent from the CLI one, but this means that we can't
rely on exceptions raised by standard operations. The currently
implemented solution is to do an exception translation on-the-fly; for
example, the 'int_add_ovf' operation is translated into the following
IL code::

  .try 
  { 
      ldarg 'x_0'
      ldarg 'y_0'
      add.ovf 
      stloc 'v1'
      leave __check_block_2 
  } 
  catch [mscorlib]System.OverflowException 
  { 
      newobj instance void class exceptions.OverflowError::.ctor() 
      dup 
      ldsfld class Object_meta pypy.runtime.Constants::exceptions_OverflowError_meta 
      stfld class Object_meta Object::meta 
      throw 
  } 

I.e., it catches the builtin OverflowException and raises an RPython
OverflowError.

I haven't measured timings yet, but I guess that this machinery incurs
some performance penalty even in the non-overflow case; a possible
optimization is to do the on-the-fly translation only when it is
strictly necessary, and to skip it when the except clause catches an
exception class whose subclass hierarchy is compatible with the
builtin one. As an example, consider the following RPython code::

  try:
    return mylist[0]
  except IndexError:
    return -1

Given that IndexError has no subclasses, we can map it to
IndexOutOfRangeException and directly catch this one::

  .try
  {
    ldloc 'mylist'
    ldc.i4 0
    call int32 getitem(MyListType, int32)
    ...
  }
  catch [mscorlib]System.IndexOutOfRangeException
  {
    // return -1
    ...
  }

By contrast we can't do so if the except clause catches classes that
don't directly map to any builtin class, such as LookupError::

  try:
    return mylist[0]
  except LookupError:
    return -1

This has to be translated in the old way::

  .try 
  { 
    ldloc 'mylist'
    ldc.i4 0

    .try 
    {
        call int32 getitem(MyListType, int32)
    }
    catch [mscorlib]System.IndexOutOfRangeException
    {
        // translate IndexOutOfRangeException into IndexError
        newobj instance void class exceptions.IndexError::.ctor() 
        dup 
        ldsfld class Object_meta pypy.runtime.Constants::exceptions_IndexError_meta 
        stfld class Object_meta Object::meta 
        throw 
    }
    ...
  }
  catch exceptions.LookupError
  {
    // return -1
    ...
  }
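
The decision between the two translations could be sketched as follows. This
is a hedged illustration, not backend code: the ``PARENT`` table is a
hypothetical stand-in for the RPython exception hierarchy, the function names
are invented for the example, and the native names are the similar classes
listed at the start of this section.

```python
# Hypothetical stand-in for the RPython exception hierarchy: child -> parent.
PARENT = {
    "LookupError": "Exception",
    "IndexError": "LookupError",
    "KeyError": "LookupError",
    "ArithmeticError": "Exception",
    "OverflowError": "ArithmeticError",
    "ZeroDivisionError": "ArithmeticError",
}

# RPython classes that have a similar builtin CLI class.
NATIVE_COUNTERPART = {
    "OverflowError": "[mscorlib]System.OverflowException",
    "ZeroDivisionError": "[mscorlib]System.DivideByZeroException",
    "IndexError": "[mscorlib]System.IndexOutOfRangeException",
}


def has_subclasses(name):
    """True if some class in the table lists `name` as its parent."""
    return name in PARENT.values()


def direct_catch_type(name):
    """Return the native class an except clause could catch directly,
    or None when the on-the-fly translation is still required."""
    if has_subclasses(name):
        return None
    return NATIVE_COUNTERPART.get(name)
```

With this table, ``direct_catch_type("IndexError")`` yields the native class,
while ``"LookupError"`` (which has subclasses) and ``"KeyError"`` (which has
no counterpart in the table) both fall back to the old translation.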


Specializing methods of List
----------------------------

Most methods of RPython lists are implemented by ll_* helpers placed
in rpython/rlist.py. For some of those we have a direct counterpart
already implemented in .NET List<>; we could use the oopspec attribute
to replace these low-level helpers on-the-fly with their builtin
counterparts. As an example, the 'append' method is already mapped to
pypylib.List.append. Thanks to Armin Rigo for the idea of using
oopspec.
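
A minimal sketch of such a replacement table is shown below. It assumes that
a helper carries an ``oopspec`` string of the form
``'list.append(l, newitem)'``; the table, the function
``builtin_replacement`` and the two helper bodies are hypothetical and exist
only for illustration.

```python
# Hypothetical table: oopspec operation name -> builtin method to use instead.
BUILTIN_LIST_METHODS = {
    "list.append": "pypylib.List.append",  # the mapping named in the text
}


def builtin_replacement(helper):
    """Return the builtin counterpart of a low-level helper, or None."""
    spec = getattr(helper, "oopspec", None)
    if spec is None:
        return None
    # Assumed format: "list.append(l, newitem)" -> "list.append"
    opname = spec.split("(", 1)[0].strip()
    return BUILTIN_LIST_METHODS.get(opname)


def ll_append(l, newitem):
    # Stand-in for the real helper in rpython/rlist.py.
    l.append(newitem)
ll_append.oopspec = "list.append(l, newitem)"


def ll_no_spec(l):
    # A helper without an oopspec attribute is left untouched.
    return len(l)
```

Helpers without an ``oopspec`` attribute, or with one that is missing from
the table, simply keep their generic low-level implementation.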


Doing some caching on Dict
--------------------------

The current implementations of ll_dict_getitem and ll_dict_get in
ootypesystem.rdict do two consecutive lookups (calling ll_contains and
ll_get) on the same key. We might cache the result of
pypylib.Dict.ll_contains so that the subsequent ll_get doesn't need a
lookup. We need some profiling before choosing the best way, though.
Alternatively, we could directly refactor ootypesystem.rdict to do a
single lookup.
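
The caching idea can be illustrated with a toy model in Python. This is a
sketch under assumptions, not the pypylib implementation: ``CachingDict`` and
its ``lookups`` counter are invented for the example, and only the method
names ``ll_contains`` and ``ll_get`` come from the text.

```python
class CachingDict:
    """Toy stand-in for pypylib.Dict that remembers the last successful
    ll_contains lookup."""

    def __init__(self):
        self._data = {}
        self._has_cache = False
        self._cached_key = None
        self._cached_value = None
        self.lookups = 0  # counts real lookups, for illustration only

    def _invalidate(self):
        self._has_cache = False
        self._cached_key = None
        self._cached_value = None

    def ll_set(self, key, value):
        self._invalidate()  # any mutation makes the cached entry suspect
        self._data[key] = value

    def ll_contains(self, key):
        self.lookups += 1
        try:
            value = self._data[key]
        except KeyError:
            self._invalidate()
            return False
        self._has_cache = True
        self._cached_key = key
        self._cached_value = value
        return True

    def ll_get(self, key):
        if self._has_cache and self._cached_key == key:
            return self._cached_value  # served from the cache
        self.lookups += 1
        return self._data[key]


def ll_dict_get(d, key, default):
    # Same contains-then-get shape as the double lookup described above.
    if d.ll_contains(key):
        return d.ll_get(key)
    return default
```

In this model a successful ``ll_dict_get`` costs one real lookup instead of
two. Whether the extra bookkeeping pays off in practice is exactly what
profiling would have to show.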

XXX
I tried it on revision 32917 and performance got worse! pypy.net
pystone.py is slower by 17%, and pypy.net richards.py is slower by 71%
(!!!). I don't know why; this needs to be investigated further.


Optimize StaticMethod
---------------------

::

  2006-10-02, 13:41

  <pedronis> antocuni: do you try to not wrap static methods that are just called and not passed around
  <antocuni> no
             I think I don't know how to detect them
  <pedronis> antocuni: you should try to render them just as static methods not as instances when possible
             you need to track what appears only in direct_calls vs other places
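
The tracking suggested in the conversation above might look roughly like
this. It is a sketch under assumptions: operations are modelled as
hypothetical ``(opname, args)`` tuples in which the callee of a
``direct_call`` is the first argument, and ``only_directly_called`` is an
invented name.

```python
def only_directly_called(operations, static_methods):
    """Return the static methods that appear only as the callee of a
    direct_call and are never passed around as values."""
    escaping = set()
    for opname, args in operations:
        # In a direct_call the callee sits in args[0]; every other
        # position means the method is used as a value.
        start = 1 if opname == "direct_call" else 0
        for arg in args[start:]:
            if arg in static_methods:
                escaping.add(arg)
    return set(static_methods) - escaping


# Hypothetical operations: 'bar' is passed as an argument and 'baz' is
# stored in a field, so both still need to be wrapped as instances.
ops = [
    ("direct_call", ["foo", "v0", "v1"]),
    ("direct_call", ["apply", "bar", "v2"]),
    ("setfield", ["obj", "callback", "baz"]),
    ("direct_call", ["bar"]),
]
plain_static = only_directly_called(ops, {"foo", "bar", "baz", "apply"})
```

Methods in the resulting set could be rendered as plain static methods; the
others would keep their current wrapped form.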


Optimize Unicode
----------------

We should try to use native .NET unicode facilities instead of our
own. This should save both time (especially startup time) and memory.

On 2006-10-02 I got these benchmarks:

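===================  ============  ===========
Pypy.NET             Startup time  Memory used
===================  ============  ===========
with unicodedata     ~12 sec       112508 Kb
without unicodedata  ~6 sec        79004 Kb
===================  ============  ===========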

The version without unicodedata is buggy, of course.

Unfortunately it seems that .NET doesn't expose all the things we
need, so we will still need some data. For example there is no way to
get the unicode name of a char.
