In porting this device to other systems, there are a number of issues
to consider. Some of these are correctness issues and are particularly
important, because they deal with ensuring that there are no race
conditions in loads and stores.  Such correctness problems may show up
only under high load or with large numbers of processes, and so it is
important to be
careful with them.

Other issues address performance problems.  With this device, very
good performance is possible.  But a single misstep (say an extra
cache miss caused by an out-of-order load or store) can cause a
significant performance degradation.

Choice of device: 

If it is possible to make ALL memory (including the
stack) visible from other "processes" (process in the MPI sense, see
below), then use a modification of this device that copies directly
from source to destination in a single copy, at least for long
(rendezvous) messages.

Correctness:

Is the system (hardware + software) sequentially consistent?  If not,
can it be forced to be (e.g., through the use of "sync" instructions
and special compiler options)?  Is a subset of instructions sequentially
consistent (e.g., vector move instructions)?
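
On a system that is not sequentially consistent, one portable way to
express the required ordering is C11 atomics with release/acquire
semantics.  The following is only a sketch: mailbox_t, mailbox_put and
mailbox_try_get are hypothetical names, not part of this device, and it
assumes that atomic_int is lock-free and usable in memory shared
between processes on the target system.

```c
#include <stdatomic.h>

/* Hypothetical one-slot mailbox placed in shared memory. */
typedef struct {
    int        payload;   /* message data, written before "ready" */
    atomic_int ready;     /* 0 = empty, 1 = payload is valid */
} mailbox_t;

/* Writer: store the data, then publish it with release ordering so the
   payload store cannot be reordered after the flag store. */
static void mailbox_put(mailbox_t *m, int value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Reader: returns 1 and fills *out if a message was ready, else 0.
   The acquire load keeps the payload load from moving ahead of it. */
static int mailbox_try_get(mailbox_t *m, int *out)
{
    if (atomic_load_explicit(&m->ready, memory_order_acquire) == 0)
        return 0;
    *out = m->payload;
    atomic_store_explicit(&m->ready, 0, memory_order_release);
    return 1;
}
```

On hardware with explicit "sync" instructions the compiler is expected
to emit them for these operations; whether that is cheaper than
hand-written assembly needs to be measured.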

Is the system cache-coherent?  If not, is there a way to avoid forcing
cache flushes (for example, using cache bypass instructions for all read/write
operations)?  Systems with vector instructions (and assembly routines) may
find those useful.

Performance:

These are not so much a series of steps as things to think about.

What is the cache line size (both to avoid false sharing in locks and to
attempt to reduce the number of separate cache lines touched)?
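
To avoid false sharing, each lock can be given a cache line of its own.
A minimal sketch, assuming a 64-byte line (CACHE_LINE_SIZE and
padded_lock_t are hypothetical names; the real line size must be
checked on the target hardware):

```c
#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE_SIZE 64   /* assumption: check the target hardware */

/* One lock per cache line: the alignment also rounds sizeof() up to a
   full line, so adjacent locks in an array never share a line. */
typedef struct {
    alignas(CACHE_LINE_SIZE) atomic_flag flag;
} padded_lock_t;

_Static_assert(sizeof(padded_lock_t) == CACHE_LINE_SIZE,
               "lock must occupy exactly one cache line");
```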

What is the best way to transfer a cache line from one processor to another?

Is the cache write-through or write-back?

What is the best way (in terms of performance) to allocate shared
memory (for example, is shmat preferable to mmap?  mmap with special
options?)?

Are there ways to implement lock-free single-consumer/multiple-writer
queues and/or stacks?
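
One possibility, where a compare-and-swap is available, is a stack in
which writers push with a CAS loop and the single consumer detaches the
whole list with one atomic exchange (which sidesteps the ABA problem of
a per-element pop).  The sketch below uses hypothetical names (node_t,
mw_stack_t, mw_push, mw_take_all) and raw pointers, which assumes the
shared region is mapped at the same address in every process; otherwise
offsets would be needed.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    int          value;
} node_t;

typedef struct {
    _Atomic(node_t *) head;   /* NULL when the stack is empty */
} mw_stack_t;

/* Any writer: link the node in front of the current head.  On CAS
   failure "old" is reloaded with the current head and we retry. */
static void mw_push(mw_stack_t *s, node_t *n)
{
    node_t *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Single consumer: take every pushed node at once (newest first). */
static node_t *mw_take_all(mw_stack_t *s)
{
    return atomic_exchange_explicit(&s->head, (node_t *)NULL,
                                    memory_order_acquire);
}
```

The consumer receives the nodes in last-in, first-out order and must
reverse the list if arrival order matters.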

What is the fastest way to perform a memcpy?  Can special
alignment/length rules help?

What are the preferred locks?  Can special locks be written that are
faster (e.g., not persistent like SYSV or not nestable like some
thread locks)?

Are there performance improvement tools?  What is the preferred way of tuning 
a program?  Are there special restrictions (e.g., not on forked programs
or threaded programs)?

What is the penalty for a cache miss?  Are there TLB issues?  

What is the proper way to implement a spin lock in an exclusive-use
environment?  In an over-subscribed environment?
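
A sketch of one common answer: spin on a test-and-set flag, and in the
over-subscribed case give up the processor after a bounded number of
spins so that the lock holder can run.  The names and the value of
SPINS_BEFORE_YIELD are hypothetical and would need tuning.

```c
#include <sched.h>
#include <stdatomic.h>

#define SPINS_BEFORE_YIELD 100   /* hypothetical tuning value */

typedef struct { atomic_flag held; } spinlock_t;

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static int spin_trylock(spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* Exclusive use: pure spin.  Over-subscribed: yield periodically. */
static void spin_lock(spinlock_t *l, int oversubscribed)
{
    int spins = 0;
    while (!spin_trylock(l)) {
        if (oversubscribed && ++spins >= SPINS_BEFORE_YIELD) {
            sched_yield();
            spins = 0;
        }
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```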

How are "processes" scheduled?  Can they be gang-scheduled?  Note that
this may suggest different strategies for creating the processes (most
MPI applications are more efficient if gang-scheduled).

Is there a high-resolution, low-overhead clock?  In an SMP, is the
same value returned by all processors (so that MPI_WTIME_IS_GLOBAL can
be set true)?
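
Where POSIX clocks are available, clock_gettime is one candidate.  The
sketch below (wtime is a hypothetical name) returns seconds as a
double; whether CLOCK_MONOTONIC is consistent across processors, and
how expensive the call is, depends on the system and must be checked.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime */
#endif
#include <time.h>

/* Seconds since an arbitrary fixed point in the past. */
static double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}
```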

Miscellany:

If you DO use the SYSV routines, be sure that the users know about the
dangers.  The script util/cleanipcs can be used to remove all ipcs.
WARNING: if another application is using the SYSV ipcs deliberately (i.e.,
is still running or is exploiting their persistence), cleanipcs will
break that application.  Users should use ipcs and ipcrm in that case.

The "ready" value that is used is a single 0/1 value.  Since an entire
word is used, this could contain more data as long as an empty
indication was always easy to test for (probably must be 0 for that,
at least in some bit).  One possibility is to store the address of the
receivable packet instead of "1"; in this case, the user can run
through the packet ready list (possibly with a vector or
cache-optimized loop) and, once a ready packet is found, need not
compute the address of the packet.
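
The address-in-the-ready-word idea might look like the sketch below.
All names (packet_t, mark_ready, next_ready) are hypothetical, 0 is
the empty indication, and storing a raw address assumes the shared
region is mapped at the same address in every process; an offset from
the region base would be needed otherwise.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int len; char data[64]; } packet_t;

/* Sender: publish the packet by storing its address (never 0). */
static void mark_ready(_Atomic uintptr_t *slot, packet_t *pkt)
{
    atomic_store_explicit(slot, (uintptr_t)pkt, memory_order_release);
}

/* Receiver: scan the ready list; on the first non-empty slot, clear
   it and return the packet directly.  Returns NULL if all are empty. */
static packet_t *next_ready(_Atomic uintptr_t ready[], int nslots)
{
    for (int i = 0; i < nslots; i++) {
        uintptr_t v = atomic_load_explicit(&ready[i],
                                           memory_order_acquire);
        if (v != 0) {
            atomic_store_explicit(&ready[i], 0, memory_order_release);
            return (packet_t *)v;
        }
    }
    return NULL;
}
```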

It may (or may not!) be a good idea to put the "ready" indicator into
the packet.
