NEC-SX4

Cache is not coherent.
Vector instructions bypass the cache.
The cache is write-through.
Shared memory pages not cached.
Spin loops need nops.
Use -hacct and set the environment variable C_PROGINF to DETAIL to get some
   performance info.  Also can use prof (but no gprof!)
8ns clock; 10-20 cycles for a (scalar) memory ref (guess)
Function calls are expesive (large register set); perhaps 300ns.  Assembly
routines can "cheat" (callee-saves) and be careful of register usage.

No plans for mmap, but shmat will have "delete on not ref'ed" in a
later release.

128 byte cache line

memcpy is fast and USUALLY uses vector instructions (yes if 8 byte aligned and
length a multiple of 8 bytes)

There are test-and-set instructions (e.g., TS1AM) that enforce sequentiallity
(ordering) of writes.

There is a single (fast) clock for each node (a node may have upto 32 CPUs)

Make sure 

    setenv FLMOD float0

is set.

    stty intr ^c erase ^\? kill ^u echoe

is important in .login.

The following options were used in the NEC thread-based version

    -DSX4 -D_REENTRANT -hfloat0 -hansi -hmap -hobjlst

The last two give a thread-variable use map (I think) and the assembly listing
(in foo.O, not foo.s).

A load map uses -Wl -d -Wl -m

The compiler supports inlining of functions with the -S
filename,... option, as well as with #pragma odir (names-of-routines).  
See the manual (we did not try this)

In using prof, we used the default and the -t and -f option.  -f didn't seem to
give much useful information.

