----------------------------------------------------------------------
m2parsergen
----------------------------------------------------------------------

This is a parser generator for top-down (or recursively descending) parsers.
The input file must be structured as follows:

---------------------------------------- Begin of file

<OCAML TEXT ("preamble")>

%%

<DECLARATIONS>

%%

<RULES>

%%

<OCAML TEXT ("postamble")>

---------------------------------------- End of file

The two-character combination %% separates the various sections. The
text before the first %% and after the last %% will be copied verbatim
to the output file.

Within the declarations and rules sections you must use /* ... */ as
comment braces.

There are two types of declarations:

%token Name

declares that Name is a token without associated value, and

%token <> Name

declares that Name is a token with associated value (i.e. Name x).

In contrast to ocamlyacc, you need not to specify a type. This is a
fundamental difference, because m2parsergen will not generate a type
declaration for a "token" type; you must do this yourself.

You need not to declare start symbols; every grammar rule may be used
as start symbol.

The rules look like:

name_of_rule(arg1, arg2, ...):
  label1:symbol1 label2:symbol2 ... {{ CODE }}
| label1:symbol1 label2:symbol2 ... {{ CODE }}
...
| label1:symbol1 label2:symbol2 ... {{ CODE }}

The rules may have arguments (note that you must write the
parantheses, even if the rule does not have arguments). Here, arg1,
arg2, ... are the formal names of the arguments; you may refer to them
in OCaml code.

Furthermore, the symbols may have labels (you can leave the labels
out). You can refer to the value associated with a symbol by its
label, i.e. there is an OCaml variable with the same name as the label
prescribes, and this variable contains the value.

The OCaml code must be embraced by {{ and }}, and these separators
must not occur within the code.

EXAMPLE:

prefix_term():
  Plus_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren
    {{ v1 + v2 }}
| Times_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren
    {{ v1 * v2 }}
| n:Number
    {{ n }}

As you can see in the example, you must pass values for the arguments
if you call non-terminal symbols (here, the argument list is empty: ()).

The generated parsers behave as follows:

- A rule is applicable to a token sequence if the first token is
  matched by the rule.

  In the example: prefix_term is applicable if the first token of a
  sequence is either Plus_symbol, Times_symbol, or Number.

- One branch of the applicable rule is selected: it is the first
  branch that matches the first token. THE OTHER TOKENS DO NOT HAVE
  ANY EFFECT ON BRANCH SELECTION!

  For instance, in the following rule the second branch is never
  selected, because only the A is used to select the branch:

  a():
    A B {{ ... }}
  | A C {{ ... }}

- Once a branch is selected, it is checked whether the branch matches
  the token sequence. If this check succeeds, the code section of the
  branch is executed, and the resulting value is returned to the
  caller.
  If the check fails, the exception Parsing.Parse_error is raised.
  Normally, this exception is not caught, and will force the parser
  to stop.

  The check in detail:

  If the rule demands a terminal, there a must be exactly this
  terminal at the corresponding location in the token sequence.

  If the rule demands a non-terminal, it is checked whether the rule
  for to this non-terminal is applicable. If so, the branch
  is selected, and recursively checked. If the rule is not applicable,
  the check fails immediately.

- THERE IS NO BACKTRACKING! 

  Note that the following works (but the construction is resolved at
  generation time):

  rule1() =
     rule2() A B ... {{ ... }}

  rule2() =
     C {{ ... }}
   | D {{ ... }}

  In this case, the (only) branch of rule1 is selected if the next
  token is C or D.

---



*** Options and repetitions ***

Symbols can be tagged as being optional, or to occur repeatedly:

rule():
  Name whitespace()* Question_mark?

- "*": The symbol matches zero or more occurrences.

- "?": The symbol matches zero or one occurrence.

This is done as follows:

- terminal*: The maximum number of consecutive tokens <terminal> are
             matched.
- non-terminal*: The maximum number of the subsequences matching
                 <non-terminal> are matched. Before another
                 subsequence is matched, it is checked whether the
                 rule for <non-terminal> is applicable. If so, the
                 rule is invoked and must succeed (otherwise Parsing.
		 Parse_error). If not, the loop is exited.

- terminal?: If the next token is <terminal>, it is matched. If not,
             no token is matched.

- non-terminal?: It is checked whether the rule for <non-terminal>
                 is applicable. If so, the rule is invoked, and
                 matches a sequence of tokens. If not, no token is
		 matched.

You may refer to repeated or optional symbols by labels. In this case,
the label is associated with lists of values, or optional values, 
respectively:

rule():
  A  lab:other()*  lab':unlikely()?
    {{ let n = List.length lab in ... 
       match lab' with
         None -> ...
       | Some v -> ... 
    }}

A different scheme is applied if the symbol is a token without
associated value (%token Name, and NOT %token <> Name):

rule():
  A lab:B* lab':C?

Here, "lab" becomes an integer variable counting the number of Bs, and
"lab'" becomes a boolean variable denoting whether there is a C or not.


*** Early let-binding ***

You may put some OCaml code directly after the first symbol of a
branch:

rule():
  A $ {{ let-binding }} C D ... {{ ... }}

The code brace {{ let-binding }} must be preceded by a dollar
sign. You can put "let ... = ... in" statements into this brace:

rule1():
  n:A $ {{ let twice = 2 * n in }} rule2(twice) {{ ... }}

This code is executed once the branch is selected.


*** Very early let-binding ***

This is also possible:

rule():
  $ {{ CODE }}
  A
  ...

The CODE is executed right when the branch is selected, and before any
other happens. (Only for hacks!)



*** Computed rules ***

rule():
  A $ {{ let followup = ... some function ... in }} [ followup ]() 
    {{ ... }}

Between [ and ], you can refer to the O'Caml name of *any* function.
Here, the function "followup" is bound in the let-binding.


*** Error handling ***

If a branch is already selected, but the check fails whether the other
symbols of the branch match, it is possible to catch the resulting
exception and to find out at which position the failure has occurred.

rule():
  x:A y:B z:C {{ ... }} ? {{ ERROR-CODE }}

After a question mark, it is allowed to append another code
brace. This code is executed if the branch check fails (but not if the
branch is not selected nor if no branches are selected). The string
variable !yy_position contains the label of the symbol that caused the
failure (or it contains the empty string if the symbol does not have a
label). 

Example:

rule():
  x:A y:B z:C {{ print_endline "SUCCESS" }} ? {{ print_endline !yy_position }}

If the token sequence is A B C, "SUCCESS" will be printed. If the
sequence is A C, the second symbol fails, and "y" will be printed. If
the sequence is A B D, the third symbol fails, and "z" will be
printed. If the sequence is B, the rule will be never selected because
it is not applicable.



*** Error recovery ***

You may call the functions yy_current, yy_get_next, or one of the
parse_* functions in the error brace to recover from the error
(e.g. to move ahead until a certain token is reached). See below.



*** How to call the parser ***

The rules are rewritten into a OCaml let-binding:

let rec parse_<rule1> ... = ...
    and parse_<rule2> ... = ...
    ...
    and parse_<ruleN> ... = ...
in

i.e. there are lots of functions, and the name of the functions are
"parse_" plus the name of the rules. You can call every function.

The first two arguments of the functions have a special meaning; the
other arguments are the arguments coming from the rule description:

rule(a,b):
  ...

===>

let rec parse_rule yy_current yy_get_next a b = ...

The first argument, yy_current, is a function that returns the current
token. The second arguments, yy_get_next, is a function that switches
to the next token, and returns it.

If the tokens are stored in a list, this may be a definition:

let input = ref [ Token1; Token2; ... ] in
let yy_current() = List.hd !input in
let yy_get_next () =
  input := List.tl !input;
  List.hd !input

When you call one of the parser functions, the current token must
already be loaded, i.e. yy_current returns the first token to match by
the function.

After the functions has returned, the current token is the token
following the sequence of tokens that have been matched by the
function.

The function returns the value computed by the OCaml code brace of the
rule (or the value of the error brace).

If the rule is not applicable, the exception Not_found is raised.

If the rule is applicable, but it does not match, the exception
Parsing.Parse_error is raised.
