Brief notes on the design

THIS HAS BEEN PLACED INTO /HOME/MPI/NOTES/ADI2IMPL/NOTE.TEX.  MAKE ANY CHANGES
THERE.

The goals are to allow for multi-device, multi-message-protocol
implementations of the Abstract Device without (significantly) impacting
performance.  

Here are the steps that occur when sending and receiving a message.

SENDING:
First, we determine the device to use, based on the destination
    (if multiple sets, this can be an array of pointers to structures, indexed
     by global rank)

Next, we determine which protocol to use (short, eager, rendezvous, get).
This is probably based on the size of the message (but could be based on
something else, like the number of pending completions, datatype, etc.)

We then start the send with the appropriate operation.  

Now, if the operation is NOT already complete (short or eager blocking send),
we need to know how to

    test for completion
    wait for completion
    cancel 
    push

The polling design of the MPICH implementation puts the "push" partly into
the test call, so that the MPICH usage is really

    test for completion
    push-and-then-test for completion

(There is also a common push-device, but it doesn't keep track of each
individual item)

The most general way to handle this case is to store the FUNCTIONS in the
request (the device-specific part, of course).  
An alternate method, in use in the first generation system, is to store an
integer that selects the function.  This limits the choices and makes it hard
to add new approaches (see below).

Note that we can combine the test and push-test with the choice that a 
null push-test function means that the request is complete.

RECEIVING:
This is a little harder, because to start with we don't know even know the 
device that a message is ariving on.  So we start could start with commands
to check to see if a message is available, or to wait for one.  Put this
leaves us with the problem of what to do with the message if checking for the
message might receive the envelope.  So, instead of checking for a whether a
message is available, we just use 

     check-device

which has blocking and non-blocking versions FOR EACH DEVICE.

The multi-device version just calls the for each of the devices.  A wait may
then turn into a poll.  Note that the multi-device code could optimize for the
case of a single ACTIVE device, and use a wait when a blocking check-device 
is needed.  

This requires each device to provide its own "check-device"; this really isn't
so bad as the message-channel (eager/rendezvous) versions really need
different code from the shared-memory (get) versions.

Note that this solves a number of problems:

     It doesn't matter whether to low-level code wants to receive a POINTER
     to a control packet or the actual packet.

     The packets are likely different in detail, though we can 
     probably restrict a few fields.  Note that for shared memory, we 
     usually want the address of the data, whereas for "stream/channel"
     interfaces, we receive the data into local data

     Heterogenity information will be needed by only some of the
     packets.

     Note that we may receive forwarding packets that are intended to send
     data to another node.  This is just a data packet, so again the actual
     form is unimportant.

The check device code itself could use either a lookup table based on the
packet type field (small integer) or a switch statement.  In the lookup table
case, this selects the routine that matches the kind of send that 
was used.  The table is initialized when the device starts up.  
Using a switch is simpler but a little less flexible.  Still, since the
check-device code itself is accessed through a function pointer, we could
eventually add a more complex version.

Now the situation is much like the send case.  If the operation is not
complete, we set up pointers to functions to

    test for completion
    wait for completion
    cancel 
    push
	
(again in the device part of the handle).  Note that the push operation can
change the others as a way to implement a finite-state-machine approach to
message delivery.

Now, for receives, there are two situations: the message is expected, and
the message is unexpected.  The above handles the case of an expected message.
In the unexpected case, we also need another routine:

    accept unexpected message

This can start the process of receiving the data for an unexpected message
that is now expected.  The test/wait/cancel/push routines should be setup
by the "accept" appropriately.  By having a separate "accept", we allow
redezvous and get protocols to defer transfering the unexpected message until
the user is ready to receive it.  

Note that in fact we could save the "accept" in the "push", reducing the
number of function pointers.

SYNCHRONOUS SENDS:
Rather than support a completely separate operation, this will just use a
transfer with a rendezvous semantic (rendezvous or get).  In order to ensure
that MPI_Ssend( .., cnt = 0, .. ) works, we must be careful about handling
0-length messages.  

    Details: For heterogenity, there needs to be a agreed form of "send_id" 
    that can be mapped back into the sender's request.  An easy solution 
    would be to use 8 bytes for the id, and transmit it without modification.
    Another is to use 4 bytes, and to convert longer address as in the
    Fortran version.

    This is really a device choice, and we can limit the 8-byte version to 
    the heterogeneous systems.

    Details: A special form of the eager message, the "within the packet"
    form, is also used.  There needs to be a "get" and "channel" version of
    this.

FREEING REQUESTS:
The MPI standard allos

	MPI_Isend( ..., &request )
	MPI_Request_free( &request )
and
	MPI_Irecv( ..., &request )
	MPI_Request_free( &request )

In order for us to properly complete this, we need to continue to "push" the
requests until they complete.  For this to work, we need to put the request on
a global list of these requests, and run through this list "frequently".  
At the very least, the "device push" should process this list.  For this, we
need a list "next" pointer in the request.  This should be in the device part.

SINGLE DEVICE OPTIMIZATIONS:
Much of the function pointer interface can be removed in the single device 
case:
	No lookup on send or on receive
	Switch statements can replace function pointers for test/wait/push

The first of these is fairly easy and may help on fast systems.  The second
may not be worth the trouble, since if there is a function to call, the
overhead is already relatively high.

REQUESTS:
This one is hard.  Each device may have its own needs.  Should we define a
common form?  This is pretty much true now, with the possible exception of the
"nonblocking id", which may be an int (for nx), a small structure (IBM MPL or
aioread/aiowrite), or something else.  The "send_id" may be 4 or 8 bytes, 
depending on the system.

One approach is to provide the basics in the structure (and make them part of
the MPIR handle rather than MPID handle) and use a pointer for any other data.
This is the most flexible but potentially introduces an additional memory
allocation.  Again, the single device and multi-device could use slightly
different forms, though I don't want to maintain that.

What to do?

THE LAYERS:
There is an MPI to "channel" layer that selects the protocol based on message
size.  This could even be a macro, using a global (or device specific?) value.

The "check-incoming" routine remains about the same, with unexpected message
processing doing this:
	
	save information on the incoming message (tag, lrank, len, context)
	based on device/packet type, save additional information
	    (e.g., data for eager send, send_id for get/rendezvous.  Careful
	     of rendezvous for Ssend).


So, a send looks something like this

send( ..., dest, ... ) -> 

    { MPID_Dev *dev = devset->dev[dest];
      if (len < dev->long_len) 
	(*(dev->short->send))( ... );
      else if (len < dev->vlong_len)
	(*(dev->long->send))( ... );
      else 
	(*(dev->vlong->send))( ... );
     }

The single device case can use just the if-else list (dev is fixed, use
MPID_dev), and the fixed protocol case can use the explicit routines.

A receive is more complex, since MPI guarentees that a blocking receive on one
communicator does not affect any other communication.  So a multi-device
blocking receive looks something like

     Is message already available?
     Yes ->
	Call recv_unexpected code (see below)
     No -> 
	(test has already added message to list of posted receives (for
         thread safety))
     (a nonblocking receive exits here)
     while (message not received)
	if (req->push) (req->push)( req )
	for each device
	    check_device( (ndev > 1) ? non-blocking : blocking )

This assumes that, at least in the single device case, you can block in 
check device to await a message.  If this is NOT true, then the check_device
routine should return.  

In the case of finding an unexpected message,

recv_unexpected
    Save matched data (tag, source)
    Check for message data available (short or eager delivery)
    Save functions (test,wait,push)
    if (req->push) (req->push)( req )
    
Note that this does NOT necessarily complete the message; this allows
non-blocking receives that match an unexpected message to act as non-blocking
operations.  

Remaining concerns:

    If the lowest levels send a message with a non-blocking operation,
    is it ever safe to "wait" on it?  Or do we have to spin in a test
    loop?  Should the wait entry be defined ONLY when it can be safely
    called?   For example, in a rendezvous system, the non-blocking
    operation will be started only in response to a direct request;
    could you wait then?  Do we leave this to the device?  Implementation
    of push?



