


		     Notes on using p4 with Fiber Channel


We are just beginning to experiment with fiber channel links on our SP1, and
are currently using p4 as a testbed.  p4 can use the fiber channel links in
two ways:  TCP/IP, and "direct".  This interface is also being tried on some
other systems.  The file lib/p4_fc.c has the code.


TCP/IP
------

To use the TCP/IP interface, all you need to do is change the hostnames in
a standard p4 procgroup file.  Note that you have to use a specific hostname
in the first line, rather than "local" so that the fiber channel port on the
machine where the program is originally started can be identified.  On the
Argonne SP1, every fourth node starting with node 2 has a fiber channel
connection, so a valid procgroup file might look like:

fcnode2 1 /sphome/lusk/p4test/systest
fcnode6 1 /sphome/lusk/p4test/systest

Then start systest on spnode2.

Our initial measurements (using the systest ring test with two processes) show
a bandwidth of about 2.5 Mbytes/sec.

Direct
------

To use the "direct" interface, you need to use three additional calls in your
p4 program, at least until we have it better integrated into p4.  After
calling p4_initenv and p4_create_procgroup to start the processes and set
up a few initial socket links, you must call p4_initfc() to initialize the
fiber channel data structures.  Then instead of p4_send and p4_recv, use
p4_sendfc and p4_recvfc, with the same arguments as p4_send and p4_recv.  You
must supply a buffer for the receive (rather than having p4 allocate it for
you), and you cannot use a wildcard for the sender (you must specify
the source of the message you are receiving).  The program systest_fc.c in the
messages directory is an example.
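
The calling pattern described above can be sketched in C as follows.  This
is only an illustration, not the code in systest_fc.c.  The helper names
(fc_setup, fc_send_block, fc_recv_block), the message type 100, and the
USE_P4 macro are hypothetical.  The sketch also assumes that p4_recvfc takes
the buffer capacity in its length argument and returns the received length
there, and that p4.h declares the fc calls.  Without USE_P4 the block uses
small in-process stand-ins so that it compiles on a machine without p4; they
mimic only the argument order, not the real behavior.

```c
#include <assert.h>
#include <string.h>

#ifdef USE_P4
#include "p4.h"   /* real library; assumed to declare the *fc calls */
#else
/* Stand-ins (NOT p4): a one-slot, in-process mailbox. */
static char slot[64];
static int slot_type, slot_len, fc_ready;

static void p4_initfc(void) { fc_ready = 1; }

static void p4_sendfc(int type, int to, char *msg, int len)
{
    (void)to;
    assert(fc_ready && len <= (int)sizeof slot);
    memcpy(slot, msg, (size_t)len);
    slot_type = type;
    slot_len = len;
}

static void p4_recvfc(int *type, int *from, char **msg, int *len)
{
    /* explicit source, caller-supplied buffer, enough room */
    assert(fc_ready && *from >= 0 && *msg != NULL && *len >= slot_len);
    memcpy(*msg, slot, (size_t)slot_len);
    *type = slot_type;
    *len = slot_len;
}
#endif

#define MSG_TYPE 100   /* hypothetical message type */

/* Call once, after p4_initenv and p4_create_procgroup have returned. */
void fc_setup(void)
{
    p4_initfc();
}

/* Send len bytes to process `peer` over the direct interface. */
void fc_send_block(int peer, char *data, int len)
{
    p4_sendfc(MSG_TYPE, peer, data, len);
}

/* Receive one message from `peer` into buf; returns the byte count. */
int fc_recv_block(int peer, char *buf, int capacity)
{
    int type = MSG_TYPE;
    int from = peer;      /* the sender must be named, no wildcard */
    char *msg = buf;      /* buffer supplied by the caller, not by p4 */
    int len = capacity;   /* assumption: capacity in, length out */

    p4_recvfc(&type, &from, &msg, &len);
    return len;
}
```

In a real program the receiving process should call fc_recv_block before
the matching fc_send_block is issued on the sending process.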

The procgroup file should have spnodes, not fcnodes or swnodes.
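
For example, the direct-interface counterpart of the TCP/IP procgroup file
shown earlier might look like the following.  The node names follow the
numbering used above; the path to the systest_fc executable is an assumption.

spnode2 1 /sphome/lusk/p4test/systest_fc
spnode6 1 /sphome/lusk/p4test/systest_fc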

Using the direct interface, we measured a bandwidth of about 17 Mbytes/sec,
with message sizes from 256K to 1M.

Once high bandwidth has been achieved, the FC card seems to go into a bad
state, and cannot handle even small messages.

Both measurements were done with a load of about 2.0 on each node.
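
As a rough guide to how figures like these can be derived, the following C
sketch converts a timed transfer into Mbytes/sec.  The function name, the
hop-count parameter, and the choice of 1 Mbyte = 1048576 bytes are
assumptions for illustration; this is not necessarily how systest computes
its numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth in Mbytes/sec when a message of msg_bytes bytes is passed
 * `hops` times (for example, around a ring) in `seconds` of wall time.
 * Assumes 1 Mbyte = 1024 * 1024 bytes. */
double ring_bandwidth_mb(size_t msg_bytes, int hops, double seconds)
{
    assert(hops > 0 && seconds > 0.0);
    return ((double)msg_bytes * hops) / seconds / (1024.0 * 1024.0);
}
```

For instance, 34 transfers of a 1M message completed in 2.0 seconds
correspond to 17 Mbytes/sec.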

This is still a very preliminary implementation, not thoroughly tested or
tuned. 

We are currently relying on the programmer to have receives ready when
sends occur.  This has not been a problem with our current examples.  In the
long run we are going to use the IBM DCE thread package to keep an outstanding
receive going all the time.
