
==================
PyPy Parser 
==================

WARNING: this document is out-of-date and unfinished.  Our compiler
is done by now, unlike explained below.


Overview
========

The PyPy parser includes a tokenizer and a recursive descent parser.

Tokenizer
---------

The tokenizer accepts  as string as input and provides tokens through
a ``next()`` and a ``peek()`` method. The tokenizer is implemented as a finite
automata like lex.

Parser
------

The parser is a tree of grammar rules. EBNF grammar rules are decomposed
as a tree of objects.
Looking at a grammar rule one can see it is composed of a set of four basic
subrules. In the following exemple we have all four of them: 

S <- A '+' (C | D) +


The previous line says S is a sequence of symbol A, token '+', a subrule + which
matches one or more of an alternative between symbol C and symbol D.
Thus the four basic grammar rule types are :
* sequence 
* alternative
* multiplicity (called kleen star after the * multiplicity type)
* token

The four types are represented by a class in pyparser/grammar.py
( Sequence, Alternative, KleeneStar, Token) all classes have a ``match()`` method
accepting a source (the tokenizer) and a builder (an object responsible for
building something out of the grammar).

Here's a basic exemple and how the grammar is represented::

    S <- A ('+'|'-') A
    A <- V ( ('*'|'/') V )*
    V <- 'x' | 'y'
    
    In python:
    V = Alternative( Token('x'), Token('y') )
    A = Sequence( V,
           KleeneStar(
               Sequence(
                  Alternative( Token('*'), Token('/') ), V
                       )
                    )
                 )
    S = Sequence( A, Alternative( Token('+'), Token('-') ), A )


Status
======

See README.compiling on the status of the parser(s) implementation of PyPy


Detailed design
===============

Building the Python grammar
---------------------------

The python grammar is built at startup from the pristine CPython grammar file.
The grammar framework is first used to build a simple grammar to parse the
grammar itself.
The builder provided to the parser generates another grammar which is the Python
grammar itself.
The grammar file should represent an LL(1) grammar. LL(k) should still work since
the parser supports backtracking through the use of source and builder contexts
(The memento patterns for those who like Design Patterns)

The match function for a sequence is pretty simple::

    for each rule in the sequence:
       save the source and builder context
       if the rule doesn't match:
           restore the source and builder context
           return false
    call the builder method to build the sequence
    return true

Now this really is an LL(0) grammar since it explores the whole tree of rule
possibilities.
In fact there is another member of the rule objects which is built once the
grammar is complete.
This member is a set of the tokens that match the begining of each rule. Like
the grammar it is precomputed at startup.
Then each rule starts by the following test::

    if source.peek() not in self.firstset: return false


Efficiency should be similar (not too worse) to an automata based grammar since it is
basicly building an automata but it uses the execution stack to store its parsing state.
This also means that recursion in the grammar are directly translated as recursive calls.


Redisigning the parser to remove recursion shouldn't be difficult but would make the code
less obvious. (patches welcome). The basic idea


Parsing
-------

This grammar is then used to parse Python input and transform it into a syntax tree.

As of now the syntax tree is built as a tuple to match the output of the parser module
and feed it to the compiler package

the compiler package uses the Transformer class to transform this tuple tree into an
abstract syntax tree.

sticking to our previous example, the syntax tree for x+x*y would be::

    Rule('S', nodes=[
         Rule('A',nodes=[Rule('V', nodes=[Token('x')])]),
         Token('+'),
         Rule('A',nodes=[
            Rule('V', nodes=[Token('x')]),
            Token('*'),
            Rule('V', nodes=[Token('y')])
           ])
         ])


The abstract syntax tree for the same expression would look like::

    Add(Var('x'),Mul(Var('x'),Var('y')))



Examples using the parser within PyPy
-------------------------------------

The four parser variants are used from within interpreter/pycompiler.py

API Quickref
------------

Modules
~~~~~~~

* interpreter/pyparser contains the tokenizer and grammar parser
* interpreter/astcompiler contains the soon to be rpythonic compiler
* interpreter/stablecompiler contains the compiler used by default
* lib/_stablecompiler contains a modified version of stablecompiler capable
  of running at application level (that is not recursively calling itself
  while compiling)


Main facade functions
~~~~~~~~~~~~~~~~~~~~~

still empty, but non-empty for ReST

Grammar
~~~~~~~

still empty, but non-empty for ReST

Long term goals
===============

having enough flexibility in the design that allows us to change the
grammar at runtime. Such modification needs to be able to:
* modify the grammar representation
* modify the ast builder so it produces existing or new AST nodes
* add new AST nodes
* modify the compiler so that it knows how to deal with new AST nodes


parser implementation
---------------------

We can now build AST trees directly from the parser.
Still missing is the ability to easily provide/change building functions
easily. The functions are referenced at interpreter level through a dictionnary
mapping that has rule names as keys.


compiler implementation
-----------------------

For now we are working at translating the existing compiler module without
changing its design too much. That means we won't have enough flexibility
to be able to handle new AST nodes at runtime.

parser module
-------------

enhance the parser module interface so that it allows acces to the internal grammar
representation, and allows to modify it too.

compiler module
---------------

same as above
