2000/02/25 : The ordering of stores was rescheduled in gemm_EV6_k.S.

2000/02/23 : The timing routine in the benchmark was fixed.
             The "WH64" instruction does not have much effect, even
             though the 21264 hardware reference manual says that WH64
             improves memory performance dramatically....

2000/02/16 : A bug in ZGEMM_BETA was fixed (thanks, Wen Hu).
             Benchmarking/checking programs are now included.
             The ZGEMM SMP routine is also released.
             Documentation has moved to the doc directory (under development).

             Performance on the 21264 has gone down in exchange for
             more stable performance, but the 21164 is faster than
             before.

             To get higher performance, I added the WH64 instruction to
             the inner copy routine as a test.

2000/02/05 : The ZGEMM_BETA routine has been written in assembler.
             Now all significant routines are written in assembler.
             There is very little room left for further optimization.

             ToDo : ZGEMM SMP routine.

2000/01/02 : The EV5 and EV6 routines are merged.
             The prefetch addresses have changed.
             Now dgemm reaches 1.2 GFlops (90% of peak)!!  I think
             this is the theoretical value for the 21264.  Do you know
             what the jump latency is on EV6?  1 clock, or hidden?
               --> The answer is "hidden". The innermost loop's
                   efficiency is 97%, according to actual measurement.

99/10/23 : Added a routine that handles the "R" case.
	    by Wen Hu - Compaq <hu@mathaump.zko.dec.com>

99/10/14 : Quick return condition in zgemm.c was fixed.
	    by Wen Hu - Compaq <hu@mathaump.zko.dec.com>
	   zgemm_k.S is common for EV6 and EV5.

99/10/12 : The argument of .prologue was fixed.
	    by Wen Hu - Compaq <hu@mathaump.zko.dec.com>

99/10/10 : _IO_STDERR changed to stderr.

99/10/07 : PAL CALL bug was fixed.
	    by Wen Hu - Compaq <hu@mathaump.zko.dec.com>

99/10/01 : dgemm/sgemm/zgemm/cgemm routines for the 21264 are available.

           Initial performance (21264 500MHz, 2MB L2 cache, 2000x2000)
 
              SGEMM  : 897 MFlops
              DGEMM  : 868 MFlops
              CGEMM  : 878 MFlops
              ZGEMM  : 853 MFlops

          Hmm,  SGEMM/DGEMM are fast enough, but CGEMM/ZGEMM are not.
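
For reference, MFlops figures like the ones above are normally derived
from the matrix dimensions and the measured wall-clock time. The sketch
below is not the benchmark program shipped with this package; the function
name gemm_mflops and its parameters are hypothetical. It assumes the usual
BLAS convention of 2*m*n*k floating-point operations for a real GEMM and
8*m*n*k for a complex GEMM.

```c
/* Hypothetical helper, not part of this package: converts a measured
 * GEMM run time into MFlops using the conventional operation counts. */
double gemm_mflops(double m, double n, double k,
                   int is_complex, double seconds)
{
    /* Real multiply-add: 2 flops.  Complex multiply-add: 8 flops
     * (4 multiplies + 4 adds). */
    double flops_per_madd = is_complex ? 8.0 : 2.0;
    double flops = flops_per_madd * m * n * k;

    /* Guard against a zero or negative timer reading. */
    if (seconds <= 0.0)
        return 0.0;

    return flops / seconds / 1.0e6;
}
```

With this convention, a real 2000x2000x2000 multiply performs 1.6e10
operations, so a run that takes 16 seconds corresponds to 1000 MFlops.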
