Copyright 1997, 1999, 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.





This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

The Ultra I/II pipeline executes up to two simple integer arithmetic operations
per cycle.  The 64-bit integer multiply instruction mulx takes from 5 to 35
cycles, depending on the position of the most significant bit of the first
source operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We therefore avoid mulx
and use floating-point operations instead.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load (or
some similarly odd restriction).  We don't use these instructions.

Integer branches can issue with two integer arithmetic instructions.  Likewise
for integer loads.  Up to four instructions may issue in one cycle (iop, iop,
ld/st/fop, branch/fop), but only if a branch or fop is last.

STATUS

Timings on UltraSPARC-1/2:

* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.

* add_n, sub_n: The current code runs at 4 cycles/limb.

* mul_1/addmul_1/submul_1: The current code runs at about 33 cycles/limb.  By
  splitting the invariant operand into 16-bit chunks and the other operand
  into 32-bit chunks, we could reach 14 cycles/limb.
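
The chunk sizes mentioned for mul_1 can be illustrated in C.  A 16-bit chunk
times a 32-bit chunk is below 2^48, so the product is exact in an IEEE double
(53-bit significand) and can be converted back to an integer without rounding.
The sketch below is only a model of that arithmetic, not the assembly code in
this directory; the names mul_1_fp_sketch and acc_shifted are hypothetical, and
the way partial products are summed here is an assumption chosen for clarity,
not for speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add (t << s) into the 128-bit accumulator hi:lo.
   Assumes t < 2^48 and s is one of 0, 16, 32, 48, 64, 80. */
static void acc_shifted(uint64_t *hi, uint64_t *lo, uint64_t t, unsigned s)
{
    uint64_t add_lo, add_hi;

    if (s == 0) {
        add_lo = t;
        add_hi = 0;
    } else if (s < 64) {
        add_lo = t << s;           /* bits that stay in the low word */
        add_hi = t >> (64 - s);    /* bits that spill into the high word */
    } else {
        add_lo = 0;
        add_hi = t << (s - 64);
    }
    *lo += add_lo;
    *hi += add_hi + (*lo < add_lo);   /* carry out of the low word */
}

/* Hypothetical model of mul_1: rp[0..n-1] = up[0..n-1] * v.
   Returns the most significant (carry-out) limb. */
uint64_t mul_1_fp_sketch(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    double vd[4];
    uint64_t carry = 0;
    size_t i;
    unsigned j, k;

    /* The invariant operand is split once, into four 16-bit chunks. */
    for (j = 0; j < 4; j++)
        vd[j] = (double) ((v >> (16 * j)) & 0xffff);

    for (i = 0; i < n; i++) {
        /* Each limb of the other operand is split into two 32-bit chunks. */
        double ud[2];
        uint64_t hi = 0, lo = carry;

        ud[0] = (double) (up[i] & 0xffffffff);
        ud[1] = (double) (up[i] >> 32);

        for (k = 0; k < 2; k++) {
            for (j = 0; j < 4; j++) {
                double p = vd[j] * ud[k];        /* exact: p < 2^48 */
                assert(p < 281474976710656.0);   /* 2^48 */
                acc_shifted(&hi, &lo, (uint64_t) p, 16 * j + 32 * k);
            }
        }
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

Each limb costs eight floating-point multiplies in this model.  Splitting v
once outside the loop mirrors the fact that it is the invariant operand.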
