			MPI Message Queue Dumping
                        -------------------------

Author : James Cownie <jcownie@etnus.com>
Version: 1.06
Date   : 9 October 2000
Status : 

Change log :
October 9 2000:JHC Added mqs_get_comm_group, MPI 2 support functions
	           and discussion.
March 21 2000: JHC Added the new function mqs_dll_taddr_width
June 1 1999:   JHC Added eurompi-paper.ps.gz to files
May 23 1998:   JHC Added extra information results on operations
May 21 1998:   JHC Added name for library TV uses on AIX.

This document describes the interface between TotalView (or any other
debugger) and a dynamic library which is used for extracting message
queue information from an MPI program. Further details are contained in
comments in the "mpi_interface.h" include file, which should be taken as
definitive.

If you want to use this interface with debuggers other than TotalView,
please talk to us. It is extremely unlikely that we will object, however
we would like to know who is using it, so that we can consult them before
making changes to the interface (and it's bound to change !).

The fundamental principle here is that the debugger dynamically loads a
library *into* the debugger itself. Functions in the library are called
by the debugger whenever it needs information about message queues in MPI
processes. Since the library is running inside the debugger, it makes use
of callbacks to the debugger to read target process store, and to convert
the target byte ordered data into the format of the host on which the
debugger is running (remember, we could be cross debugging...).

The debugger also provides callbacks which allow the dll to extract
information about the sizes of fundamental target types, and to look up
structure types defined in the debug information in the executable image
being debugged and find out field offsets within them.

Interface for MPI 1 support
---------------------------

This interface provides the basic support required to handle MPI
message queue dumping in a static process, MPI-1 world.  The functions
and types here are all declared in the mpi_interface.h header no
matter what #define symbols are defined.

The extensions for MPI 2 are discussed below, they are upwards
compatible with the MPI 2 interface, so a DLL built to this interface
can continue to be used with a debugger which supports the full MPI 2
interface (though, of course, the MPI 2 features would not be
available).

Startup
-------

Before any functions in the dll can be called, the debugger has to load
the dynamic library. To do this it must find the name of the library
to be loaded.

The way that this works is outside the scope of the specification for
the message queue dumping interface, however since this information is
likely to be useful to people working with TotalView, here is a
description of the way that we have chosen to implement it.

After it has started up (or attached to) a parallel program, TotalView
looks for a global symbol

char MPIR_dll_name[];

in the target image (*NOT* the dll of course, since that hasn't been
loaded yet). If this symbol exists, then its value is expected to be
the name of a shared library to be loaded.

If this symbol does not exist (or its value is an empty string), then
TotalView will use a default library name which depends on the process
attachment mechanism which was used, as follows

AIX poe attachment		libtvibmmpi.so then libtvmpich.so
All platforms MPICH attachment  libtvmpich.so
SGI SGI/MPI attachment		libxmpi.so

If no library can be found, then MPI message queue dumping is disabled.

On some platforms it may be necessary to take additional steps at
compile or link time to prevent the apparently unreferenced global
variable from being removed. This is compiler/linker specific.

Startup calls in the DLL
------------------------

Once the debugger has loaded the library it will first check that all the
functions it expects are present in the library, and that the library
compatibility version number is that which it requires. 

If any of these tests fail, then it will issue a complaint and disable
message queue dumping.

These tests occur through the functions 

char *mqs_version_string ();
int   mqs_version_compatibility();
int   mqs_dll_taddr_width();

in the DLL.

mqs_dll_taddr_width() should _always_ be implemented as
/* So the debugger can tell what interface width the library was compiled with */
int mqs_dll_taddr_width ()
{
  return sizeof (mqs_taddr_t);
} /* mqs_dll_taddr_width */

Once these tests have been successfully passed, the debugger will call
the initialisation function

void mqs_setup_basic_callbacks (const mqs_basic_callbacks *);

passing it an initialised structure containing the addresses of the
call back functions in the debugger. This function is called *only* when
the library is loaded, the DLL should save the pointer to the callback
structure so that it can call the functions therein as required.

For each executable image which is used by the processes that go to
make up the parallel program TotalView will call the function

int mqs_setup_image (mqs_image *, const mqs_image_callbacks *);

This call should save image specific information by using the
callback the debugger callback

mqs_put_image_info (mqs_image *, mqs_image_info *);

if it requires any, and return mqs_ok, or an error value.

Provided that mqs_setup_image completes successfully,
the debugger will then call

int mqs_image_has_queues (mqs_image *, char **);

mqs_image_has_queues returns mqs_ok or an error indicator to show
whether the image has queues or not, and an error string to be used in
a pop-up complaint to the user, as if in 

printf (error_string,name_of_image); 

The pop-up display is independent of the result. (So you can silently
disable things, or loudly enable them).

For each image which passes the mqs_image_has_queues test, the debugger
will call 

int mqs_setup_process (mqs_process *, const mqs_process_callbacks *);

on each process which is an instance of the image. This should be used
to setup process specific information using 

void mqs_put_process_info (mqs_process *, mqs_process_info *);

If this succeeds, then, to allow a final verification that the process
has all of the necessary data in it to allow message queue display, 

int mqs_process_has_queues (mqs_process *, char **);

will be called in an exactly analogous way to mqs_image_has_queues.

Whenever an error code is returned from the DLL, the debugger should
use the function char * mqs_dll_error_string (int errcode) to obtain a
printable representation of the error, and provide it to the user in
some appropriate way.

That completes the initialisation.

Queue display
-------------

Before displaying the message queues the debugger will call the function

void mqs_update_communicator_list (mqs_process *);

The DLL can use this to check that it has an up to date model of the
communicators in the process and (if necessary) update its list of
communicators.

The debugger will then iterate over each communicator, and within each
communicator over each of the pending operations in each of the
message queues using the functions

int mqs_setup_communicator_iterator (mqs_process *);
int mqs_get_communicator (mqs_process *, mqs_communicator *);
int mqs_next_communicator (mqs_process *);

to iterate over the communicators and

int mqs_setup_operation_iterator (mqs_process *, int);
int mqs_next_operation (mqs_process *, mqs_pending_operation *);

to iterate over each of the operations on a specific queue for the
current communicator.

mqs_setup_*_iterator should return either
mqs_ok			     if it has information
mqs_no_information	     if no information is available about the
			     requested queue
or, if it can tell 
mqs_end_of_list		     if it has information but there are no operations

It is permissible to return mqs_ok, but then have mqs_next_operation
immediately return mqs_end_of_list, and similarly the debugger is at
liberty to call mqs_next_* even if the initialisation function
returned mqs_end_of_list.

mqs_next_* should return 

   mqs_ok           if there is another element to look at (in the case of the
   	            operation iterator it has returned the operation),

   mqs_end_of_list  if there are no more elements (in the case of the
		    operation iterator it has *not* returned anything
		    in the pending_operation).

   some other error code if it detects an error


The useful information is returned in the structures
/* A structure to represent a communicator */
typedef struct
{
  taddr_t unique_id;				/* A unique tag for the communicator */
  tword_t local_rank;				/* The rank of this process (Comm_rank) */
  tword_t size;					/* Comm_size  */
  char    name[64];				/* the name if it has one */
} mqs_communicator;

and


typedef struct
{
  /* Fields for all messages */
  int status;					/* Status of the message (really enum mqs_status) */
  mqs_tword_t desired_local_rank;		/* Rank of target/source -1 for ANY */
  mqs_tword_t desired_global_rank;		/* As above but in COMM_WORLD  */
  int tag_wild;					/* Flag for wildcard receive  */
  mqs_tword_t desired_tag;			/* Only if !tag_wild */
  mqs_tword_t desired_length;			/* Length of the message buffer */
  int system_buffer;				/* Is it a system or user buffer ? */
  mqs_taddr_t buffer;				/* Where data is */

  /* Fields valid if status >= matched or it's a send */
  mqs_tword_t actual_local_rank;		/* Actual local rank */
  mqs_tword_t actual_global_rank;		/* As above but in COMM_WORLD */
  mqs_tword_t actual_tag;				
  mqs_tword_t actual_length;
  
  /* Additional strings which can be filled in if the DLL has more
   * info.  (Uninterpreted by the debugger, simply displayed to the
   * user).  
   *
   * Can be used to give the name of the function causing this request,
   * for instance.
   *
   * Up to five lines each of 64 characters.
   */
  char extra_text[5][64];
} mqs_pending_operation;

The debugger will display communicators in the order in which they are
returned by the communicator iterator, and pending operations in the
order returned by the operation iterator. It is up to the DLL to
ensure that communicators are sorted into a sensible order, and that
operations are shown in the order in which MPI would match them.

Note that the name string can be used by the library to display any
useful information about the communicator (the debugger does not
interpret it at all, so you can put whatever string you like into the
"name"); 

Similarly any additional information about a pending operation can be
returned in the extra_text strings. For instance, if it is available, the
name of the function which created the request could be returned.

The extra_text strings are displayed until a zero length one is
encountered, or all 5 have been displayed, so if any extra information
has been returned the library must ensure that it nulls out the next
string e.g.

       strcpy (res->extra_text[0],"Funny message");
       res->extra_text[1][0] = 0;

Otherwise uninitialized store will likely be displayed as characters :-(

The function mqs_get_comm_group may be called by the debugger to
extract from the DLL the set of processes which are used in a
communicator. This call is not used while displaying the message
queues (it is not required, since the message and communicator
descriptions include global indices), however it will be used to
support partial job acquisition (in which a user acquires only the
processes in a given communicator, rather than all of the processes in
the whole job).

If mqs_get_comm_group is not present, the debugger should continue to
function but will be unable to implement partial job acquisition.

Shutdown
--------

When the debugger terminates a process, or unloads information about an
executable image, it will call back into the DLL to allow it to clean
up any information it had associated with the process or image.

Note that once loaded, the DLL itself is never unloaded, so you cannot
rely on static variables in the DLL being re-initialized.

MPI 2 extensions to the interface
---------------------------------

The objective is to provide the current functionality in an MPI-2
environment where new processes may be being created by Comm_spawn (or
Comm_spawn_multiple).

Support for displaying the state of one-sided communications is not
considered, though it may be added later.

There are two distinct problems here :-

1) Determining that MPI_Spawn has occurred and locating the new
   processes.

2) Providing a unique name for each of the new processes.

All of the functions and datatypes for MPI 2 support are only defined
by mpi_interface.h if the preprocessor symbol FOR_MPI2 is defined to a
non-zero value, otherwise the header only defines the MPI 1
interface. (This is to allow DLL authors who do not need the MPI-2
interface to continue to compile with the correct version number for
the MPI-1 only interface).

Collecting new processes
------------------------

The debugger must be informed by the MPI run time when new processes
are created, and the DLL must also have a call to enable the debugger
to enquire about processes which exist so that it can correctly locate
all processes when attaching to an existing job which has already
performed spawning operations.

To handle the reporting of new processes, we add two new calls to the
DLL from the debugger.

extern int mqs_spawn_breakpoint_name (mqs_process *, char [], int);

This function returns the name of a function on which the debugger
should set a breakpoint. The breakpoint will be hit (i.e. that
function will be called in the target process) when a spawn operation
has completed and the DLL can extract from the target the information
about the newly created processes. The result is returned in the
second argument as a string, the amount of space allocated to hold the
result is passed in the third argument. If the space is too small, an
appropriate error should be returned.

Once a process has hit the breakpoint above, then the debugger will
call into the DLL to discover the new processes.  As elsewhere in the
debug interface, we use an iterator and iterate over the new
processes.

typedef struct mqs_process_location {
  long pid;
  char image_name [FILENAME_MAX];
  char host_name  [64];
} mqs_process_location;

extern int mqs_setup_new_process_iterator (mqs_process *);
extern int next_new_process (mqs_process *, mqs_process_location *);

When attaching to a parallel job other than at job startup, the
debugger will call mqs_setup_new_process_iterator () on each process
which it acquires to collect any new processes which were spawned by
that process.  If the process has not spawned (or connected) to any
new processes, then mqs_end_of_list should be returned.

Once the debugger has acquired the new (spawned) processes, it will
call mqs_set_process_identity (described below) to name the new process.

The Process naming problem
--------------------------

The fundamental problem we have is that in the interface between the
debug DLL and the debugger the index of a process in COMM_WORLD is
used as the "real" name of a process. For instance when dealing with a
pending operation, the DLL fills in an mqs_pending_operation struct
which contains the fields

  /* Fields for all messages */
  mqs_tword_t desired_local_rank;		/* Rank of target/source -1 for ANY */
  mqs_tword_t desired_global_rank;		/* As above but in COMM_WORLD  */

  /* Fields valid if status >= matched or it's a send */
  mqs_tword_t actual_local_rank;		/* Actual local rank */
  mqs_tword_t actual_global_rank;		/* As above but in COMM_WORLD */

to provide information about the source or target of a message.

However, in MPI-2 as soon as the program has used Comm_spawn, there
will be processes in the MPI job which do not have an index in
COMM_WORLD, since they are not members of COMM_WORLD. It then becomes
impossible to use the current interface without extension.

The core issue here is that in a dynamic MPI-2 environment MPI does
not force a process to have a unique name. Rather processes are
addressed within communicators which are formed hierarchically, and,
indeed, simultabeous process creation may be going on in two different
communicators simultaneously without synchronisation between them.

The Solution
------------

The proposed solution is to have the debugger generate a unique
identifier (integer) for each process, and pass that down into the DLL
at the time that the newly created process is registered with the
debugger as a result of a Comm_spawn. 

This integer is then used in the results of the enquiry functions as
though it were the index of the new process in COMM_WORLD. If you
like, you can think of this as changing the global_rank fields to be
indices into the debugger's concept of COMM_UNIVERSE, within which the
debugger is allocating the indices.

All of the current enquiry functions to the DLL can then continue to
return exactly the same data structures, all that changes is the
interpretation of the global rank fields (and that remains the same
for MPI-1 programs).

However, to make this workable from the debug library writer's
perspective, we do need to add an additional object which is missing
in the current library interface, i.e. a "job" object. This is
necessary because the debug library writer needs some object on which
to hang the translation table which will now be required to handle the
translation form whatever internal name is used inside the MPI system
for the spawned processes back to the index which the debugger wants
to see. This cannot be held on the individual process object, since the
name will turn up when looking at message queues in other processes,
therefore it must be held on the job object which is accessible from
all of the process objects.

Therefore we have to extend the interface to add these new types,
functions and callbacks :-

/*
 * New types	
 */

/* Information hung from a job object */
typedef struct _mqs_job_info mqs_job_info;

/* A job object */
typedef struct mqs_job_ mqs_job;

/* 
 * New callbacks
 */

/* Given a job and an index return the process object */
typedef mqs_process * (*mqs_get_process_ft) (mqs_job *, int);

typedef struct mqs_job_callbacks {
  mqs_get_process_ft	mqs_get_process_fp;
} mqs_job_callbacks;

/* Hang information on the job */
typedef void (*mqs_put_job_info_ft) (mqs_job *, mqs_job_info *);
/* Get it back */
typedef mqs_job_info * (*mqs_get_job_info_ft) (mqs_job *);

/* Given a process, return the job it belongs to */
typedef mqs_job * (*mqs_get_process_job_ft) (mqs_job *);
typedef int       (*mqs_get_process_identity_ft) (mqs_job *);

/* The callback tables are extended so that the basic callbacks
 * includes 
 */
mqs_put_job_info_ft    mqs_put_job_info_fp;
mqs_get_job_info_ft    mqs_get_job_info_fp;

/*
 * and the process_callbacks include
 */
mqs_get_process_job_ft       mqs_get_process_job_fp;
mqs_get_process_identity_ft  mqs_get_process_identity_fp;

/*
 * New calls from the debugger into the DLL are defined
 * for managing the job object.
 */

extern int mqs_setup_job (mqs_job *, const mqs_job_callbacks *);
extern int mqs_destroy_job_info (mqs_job_info *);

/*
 * Finally we have the additional function we really need !
 */
extern int mqs_set_process_identity (mqs_process *, int); 

Compatibility
-------------
Although there may seem to be many changes here, in fact they remain
upwards compatible with the existing interface, and MPI-1 libraries do
not need to implement any of these new MPI-2 relevant calls.

This is possible because in an MPI-1 environment the process
identities remain the same as before these changes, and still
represent the index of the process in COMM_WORLD. The new model that
the process identity is actually allocated by the debugger and
represents an index into a debugger maintained concept of
COMM_UNIVERSE does not affect the processes in COMM_WORLD, since the
debugger can arrange that its model of COMM_UNIVERSE includes
COMM_WORLD as a prefix and, therefore, that the identities of
processes in COMM_WORLD remains the same.

When the debugger knows that the DLL supports the new MPI-2 interface,
then it will call mqs_setup_job and mqs_set_process_identity when it
attaches to an MPI job. When detaching, or when the job exits
mqs_destroy_job_info will be called (after all of the processes have
been destroyed). 

An MPI-1 only DLL need not provide any of these functions (and should
use the mpi_interface.h file without the FOR_MPI2 symbol #defined (or
defined as 0).

Store allocation
----------------

Callbacks are provided to allow the DLL to use the debugger's heap
package. These should be used in preference to calling malloc/free
directly, since better debug information from store corruption may be
possible.

Type safety across the interface
--------------------------------
All of the functions in the interface have complete prototypes, and there
are opaque structures defined for all of the different types.

This means that inside the functions in the library you will need
explicit casts from these opaque types to the corresponding concrete
types that you require. 

The use of explicit casts is unpleasant, however it seems hard to avoid
this without mixing implementation and interface issues in the
mpi_interface.h file.


Miscellaneous issues
--------------------

The debugger may be written in C++, and may not have been compiled
with the "default" C++ compiler for the host on which it is
running. You are extremely likely to run into library compatibility
and version problems if you try to write the DLL in C++ as well.

The dll should *not* perform operations which could cause it to block
(such as reading from sockets, calling sleep or wait()), since this
could cause the whole debugger to be blocked and unresponsive to user
input. 

The dll should avoid declarations of external functions, ideally by
using static functions. If this is not possible (because the DLL
becomes larger than one file), then it should use external names with
a suitable prefix, such as the vendor of the MPI library,
e.g. "Mqs_sgi_", or "Mqs_ibm_". 

Sample code
-----------

With this README you should also have

eurompi-paper.ps.gz  A paper by James Cownie and Bill Gropp which 
		     describes the interface, presented at the
		     "EuroPVM and MPI 99" conference.

mpi_interface.h	     The header which defines the interface

Makefile.simple	     A simple makefile to build a dynamic library on Solaris
dll_mpich.c	     Sample example code which uses this interface to
		     provide message queue dumping for MPICH 1.x
mpich_dll_defs.h     Header for the sample code.

The last three files are not a part of the interface, or definition,
but rather a sample user of the interface, provided on an as is basis
so that you can see how the interface can actually be used.

mpich-attach.txt     Documentation on the process acquisition
                     technique used with MPICH (and recommended for
                     other third party MPI implementations).
	
