DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT
XBLAS: A Proposal to the BLAST Forum for Extra Precise BLAS
James Demmel
Greg Henry
W. Kahan
4 Sept. 1997
Section 1: Introduction

We propose two kinds of extensions to the current BLAS, one that allows
mixing of operand matrices of different precisions, and a second kind that
allows arithmetic inside the BLAS to be carried out to higher precision
than the operands, when doing so confers some advantage. These extensions
are motivated by the following facts:
A number of important linear algebra algorithms can become more
accurate and sometimes faster if internal computations carry range
and/or precision wider than carried by input and output arguments.
These include linear system solving, least squares problems, and
eigenvalue problems. Often the benefits of wider arithmetic cost only
a small fractional addition to the total work.
For single precision input, the computer's native double precision is a
way to achieve these benefits easily on all commercially significant
computers, at least when only a few extra-precision operations are
needed. Crays, Convex, and their emulators implement 64-bit single
in hardware and much slower 128-bit double in software, so if
many double precision operations are needed, these machines will
slow down significantly.
The overwhelming majority of computers on desktops, containing Intel
processors or their AMD and Cyrix clones, are designed to run fastest
performing arithmetic to the full width, wider than double precision,
of their internal registers. These computers confer some benefits of
wider arithmetic at little or no performance penalty. Some BLAS on
these computers already perform wider arithmetic internally but,
without knowing this for sure, programmers cannot exploit it.
All computers can simulate quadruple precision or something like it in
software at the cost of arithmetic slower than double precision by
at worst an order of magnitude. Less slowdown is incurred for a
rough doubled-double precision on machines (IBM RS/6000, PowerPC/Mac,
SGI/MIPS R8000, HP PA-RISC 2.0) with special fused
multiply-accumulate instructions. The slowdown is practically
negligible for a few algorithms that require very little of this kind
of arithmetic to surpass the speeds of competitive algorithms
that avoid extra-precise arithmetic entirely.
In this proposal we shall explain our motivation in more detail and
present detailed specifications for our proposed extra precise BLAS, which
we call XBLAS. The greatest challenge to our design is the large number
of possible combinations of input types (including precisions), output
type, and internal precision, not all of which are useful. We must name
all useful possibilities, provide reasonable defaults so they are easier
to use, make them callable from different languages, and arrange these
numerous routines into a smaller number of groups each of which may be
automatically generated, tested and even optimized from a few prototypes.
Here are three examples to illustrate the opportunities and challenges.
First, extra-precisely accumulated matrix-vector products provide an
inexpensive way to refine solutions of linear systems so well that the
error is small independently of the system's condition number, unless
the condition number is extremely big.
The cost of the extra-precise arithmetic is negligible for moderate-sized
and larger matrices.
Second, multiplying a complex matrix by a real matrix is so much faster
than multiplying two complex matrices that a special LAPACK auxiliary
routine CLACRM was invented to do this instead of promoting real to complex
and calling CGEMM. This routine is usually the bottleneck in what is
currently the fastest Hermitian tridiagonal eigensolver, CSTEDC.
Thus allowing a mixture of input types enhances speed.
Accuracy can be enhanced too if extra-precise accumulation is allowed.
Third, the algorithm in Example 2.1 below needs extra precision in just
a few operations, and could obtain that at negligible cost from the
80-bit extended format on Intel/AMD/Cyrix platforms. For portability,
it should permit other platforms to use doubled-double instead of extended.
The extra precision need only be used inside the BLAS, so no
user variables need to be declared in the high precision format.
On the other hand, if the user's input data is double, the algorithms
in Examples 2.4, 2.5 and 2.7 would require running a high level LAPACK
subroutine with variables declared to be in the high precision format;
not all compilers support such declarations. What is the right
tradeoff between portability and functionality for libraries like LAPACK?
To accommodate all these possibilities, we have chosen a simple naming
scheme that permits ALL possible combinations of types and precisions to be
specified, even uninteresting ones. We do not expect all combinations to
be implemented. The naming scheme would be no simpler if we were to
explicitly limit the combinations.
Our specification prioritizes the most important routines
to be implemented, and leaves open the possibility of implementing others
as well. In addition, we specify an environmental inquiry function to
determine the properties of extra precise BLAS, and whether they exist at
all. (For example, the user might be told that no doubled-double BLAS
exist on his Intel platform.) Optional arguments are used judiciously so
that the user can specify as little or as much detail about precisions of
input arguments and internal precision as desired.
For example, a call to matrix multiplication could appear as
XDGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC, PREC )
where D indicates that the precision of the output floating point variable C
is double.
The only difference between this and the standard BLAS calling sequence
is the optional string PREC. If PREC is left blank, this would indicate that
all input variables are also double, and that the widest indigenous
floating point format should be used internally (extended on Pentiums,
and double on most other machines). If PREC='E', it indicates that the
input variables are also double, and that a format
wider than double should be used; on Pentiums this would be extended, and
on other machines probably doubled-double. If PREC='ES', it indicates
in addition that all the input variables are in fact single precision.
If it is simpler or
cheaper to carry more precision internally than the user requested, the
BLAS implementor is free to do so provided the environmental inquiry
routine returns the actual precision used.
In this spirit we invite implementors of the standard BLAS library to carry
extra precision internally whenever this is insignificantly slower than not
carrying extra precision, since this is permitted by the BLAS specification
now and usually gives users better results. We also propose that our
environmental inquiry report the precision actually used in the standard
BLAS, along with other information like the use of Strassen's matrix multiply.
Left unsolved is the problem of obtaining from one platform exactly the same
result as was generated by another, since this problem is already intractable
in the face of BLAS optimized for speed on platforms with diverse memory
management, opportunities for concurrency (pipelines), and heterogeneous
collections of processors.
Though we concentrate on extending the dense BLAS to accommodate extra
precision, we believe our general scheme will accommodate other extensions,
for example to sparse or parallel BLAS, to interval BLAS, and to other
routines we discuss below.
The rest of this proposal is organized as follows.
Section 2 lists compelling examples to motivate our proposal. Most involve
extra-precise or mixed-type versions of existing BLAS, and some involve
new routines.
Section 3 surveys implementation technicalities for the higher-than-double
precisions supported by current hardware.
Section 4 describes the naming scheme and calling sequences, including
optional arguments and defaults.
Section 5 describes the environmental inquiries used to describe the
highprecision BLAS, including the accuracy model they must satisfy.
Section 6 presents the prioritized list of extra precision BLAS that we
propose all developers implement.
Section 7 describes how one can automatically generate these routines.
Section 8 describes how one can automatically test these routines.
Section 9 gives sample implementations of some basic routines.
Section 10 will eventually give sample applications showing how useful
these routines are. It currently just discusses how to pick these
applications, and what we should expect from compiler writers.
Section 11 compares this to a previous proposal, which was part of the
BLAS 2 specification [1].
Section 2: Example Applications

We present a variety of examples that illustrate different advantages of
our proposal, including improved accuracy, improved speed, and improved
reliability. Not all features of our proposal are used by all examples:
Examples 2.1, 2.2, 2.4, 2.5, 2.6, 2.7 and 2.9 require higher precision.
Examples 2.3, 2.4, and 2.5 require wider range. They can also
take additional advantage of IEEE exception handling internally,
but do not require it.
Examples 2.7 and 2.8 require mixed-type versions of existing BLAS;
2.8 requires no extra precision or range.
Most examples have well-understood benefits, but examples 2.6, 2.9
and the last part of example 2.2 are research questions.
Most examples use mixed or higher precision versions of existing BLAS, but
examples 2.4, 2.5 and 2.7 would require new BLAS (actually the promotion
of certain LAPACK or ScaLAPACK routines to BLAS status). They are easy
to implement when the input precision is single and the extra precision
is double. We include them to illustrate the broader utility of mixed
precision computations.
Example 2.1: Iterative Refinement of Linear Systems

Given an LU or other stable factorization of a matrix A, we can solve Ax=b
by forward and back substitution. The relative error of this algorithm is
bounded by
norm(x_computed - x_true) / norm(x_true) <= O( N * cond(A) * macheps )
where cond(A) >= 1 is the condition number of A and N its dimension, and
macheps is the machine epsilon for the input and output data, which we call
input/output precision. This error bound is not excessively pessimistic;
errors about that big are encountered occasionally.
The accuracy of the computed solution x may be improved by iterative
refinement roughly as follows:
repeat
Compute the residual r = A*x - b
Solve A*d = r for d using the existing factorization
Update the solution x = x - d
until { the residual is small enough or stops decreasing, or
a maximum iteration count is exceeded }.
LAPACK currently implements this algorithm entirely in the input/output
precision, in routine xGERFS. What it accomplishes, according to a
well-understood error analysis, is essentially to replace the condition
number cond(A) in the error bound by another possibly smaller one,
which can be as small as
min cond(D*A)
where the minimum is over all diagonal matrices D. In
effect, this algorithm compensates, up to a point, for bad row-scaling.
The residual is never worsened but the solution x, though usually improved,
frequently gets worse if this last condition number is very big.
An improved version of iterative refinement differs from LAPACK's in two
places. First, the residual r = A*x - b is accumulated to at least twice
the input/output precision and then, after massive cancellation has taken
place, is rounded to input/output precision. Second, refinement is stopped
after x appears to have settled down about as much as it ever will. Now
another classical error analysis says that the ultimate relative error will
be bounded by O(macheps), essentially independent of the condition number
and dimension of A. This assumes that N * cond(A) is not comparable with
or bigger than 1/macheps (in which case A could be made exactly singular
by a few rounding errors in its components). In short, the
improved version of iterative refinement almost always produces the correct
solution x, and almost always gives fair warning otherwise. This is a much
simpler and stronger result than LAPACK provides now.
If the improved algorithm's residual r is accumulated to sufficiently more
than input/output precision but less than twice as much, its improvement
over what LAPACK obtains currently becomes proportionately weakened.
In extensive experiments on machines with 80-bit registers, i.e.
11 more bits of precision than double, iterative refinement either
returned the solution correct to all but the last few bits, or with
at least 10 more correct bits than without refinement
[2].
The current LAPACK algorithm does not have to change much to benefit from a
more accurate residual: macheps has to be replaced in a few places by a
roundoff threshold that reflects the possibly extra-precise accumulation of
r; and if this latter threshold is sufficiently smaller than macheps then
the iteration's stopping criterion should be changed from testing r to
testing x and d, and an estimate of the error in x should be obtained not
from the current condition estimator but instead (at lower cost) from the
last few vectors d.
The indispensable requirement is an extra-precise version of GEMV in the
Level 2 BLAS. One iteration of refinement costs only 4*N^2 flops per
right-hand side column b. If, as usual, there are only a few columns b and
at most a few iterations suffice, their cost is asymptotically negligible
compared to the O(N^3) operations required to compute the LU or other
factorization in the first place, even if extra-precise operations run
somewhat slowly. Therefore, we want to insist that GEMV use extra precision,
even if it is expensive.
Example 2.2: Iterative refinement of other linear algebra problems

The availability of extra-precise residuals would make it possible to solve
many other linear algebra problems more accurately as well. The simplest
example is the overdetermined least squares problem
choose x to minimize norm(A*x  b)
This can be formulated as the linear system
[ I  A ] * [ r ] = [ b ]
[ A' 0 ]   [ x ]   [ 0 ]
and refined using the QR decomposition of A from which the initial solution
x was obtained. Again, the final error depends much less upon the
condition number of A (though the situation is a bit more complicated than
a simple linear system). Again, an extra-precise GEMV is indispensable,
and its cost is asymptotically small provided that A is not too far from
square.
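A sketch of this augmented-system refinement (our construction, not LAPACK code): float32 data, float64 residuals, and corrections obtained from the thin QR factorization A = Q*R already computed for the initial solve. The Vandermonde test matrix is our example data.

```python
import numpy as np

def ls_refine(A, b, steps=3):
    """Refine min ||A*x - b||_2 via the augmented system
       [ I  A ] * [ r ] = [ b ]
       [ A' 0 ]   [ x ]   [ 0 ]
    A, b are float32; the augmented residuals are accumulated in float64."""
    Q, R = np.linalg.qr(A)               # thin QR, float32
    x = np.linalg.solve(R, Q.T @ b)      # initial least squares solution
    r = b - A @ x                        # initial residual vector
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    for _ in range(steps):
        # extra-precise residuals of the two block equations
        f = (b64 - r.astype(np.float64)
             - A64 @ x.astype(np.float64)).astype(np.float32)
        g = (-A64.T @ r.astype(np.float64)).astype(np.float32)
        # corrections from the existing QR factorization:
        #   dx = R^{-1} (Q'f - R^{-T} g),  dr = f - A dx
        dx = np.linalg.solve(R, Q.T @ f - np.linalg.solve(R.T, g))
        dr = f - A @ dx
        x, r = x + dx, r + dr
    return x

# moderately ill-conditioned tall example matrix
A64 = np.vander(np.linspace(0.1, 1.1, 12), 6)
b64 = A64 @ np.ones(6)
A32, b32 = A64.astype(np.float32), b64.astype(np.float32)
Q, R = np.linalg.qr(A32)
x0 = np.linalg.solve(R, Q.T @ b32)   # unrefined single precision solution
x1 = ls_refine(A32, b32)
```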
The foregoing is most appropriate for double precision data A and b,
since it only requires extra precision hidden within BLAS.
For single precision data Example 2.7 provides an attractive alternative,
where we use more mixed precision.
Iterative refinement for eigenproblems is still a topic for research
encouraged by better results from better mathematical algorithms than have
been published in previous years. The indispensable requirement is a
matrix residual
R = A*X - Y*B = [ A -Y ] * [ X ]
                           [ B ]
accumulated extra-precisely and rounded back to input/output precision
after massive cancellation has taken place. Sometimes Y = X; sometimes
B is diagonal or triangular. Such residuals can be computed from one
call upon a GEMM that accumulates extra-precisely, but a lot of wasted
motion can be avoided if versions of GEMM that accept and produce
extra-precise matrices are available too. Initially these
versions of GEMM could be simulated by calls upon extra-precise versions
of GEMV instead, just to get the ball rolling sooner.
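The stopgap just suggested, simulating an extra-precise GEMM by one extra-precise GEMV per column, can be sketched as follows. Float32 stands in for the input/output type, float64 for the extra-precise accumulation, and the function names are ours.

```python
import numpy as np

def xgemv(A, x):
    # stand-in for an extra-precise GEMV: float32 inputs, float64
    # accumulation, result rounded back to float32 on return
    return (A.astype(np.float64) @ x.astype(np.float64)).astype(np.float32)

def xgemm_via_xgemv(A, B):
    # simulate an extra-precise GEMM with one GEMV call per column of B
    return np.stack([xgemv(A, B[:, j]) for j in range(B.shape[1])], axis=1)
```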
Iterative refinement of nonsymmetric eigenproblems, other than Schur
decompositions, depends crucially upon how well iterative refinement works
for extremely nearly singular linear systems. For best results, the LU
factorization should ideally be performed by Crout factorization with
extra-precisely accumulated scalar products, but how much good this would
do is not yet clear. Exception handling is important too because infinity
and 0/0 turn up in the better mathematical algorithms mentioned above
unless entirely algebraic computations are replaced in spots by invocations
of transcendental functions elementwise upon arrays. The last question
remaining is how much precision would be needed to perform an entire
eigencalculation by conventional means, without refinement, to obtain
results as good and as soon as are obtained from refinement with a little
extra-precise arithmetic in the program; attempts to answer this question
empirically have occasionally generated astonishment.
Example 2.3: Solving Ill-conditioned Triangular Systems

The LAPACK subroutine SLATRS solves a triangular linear system T*x = scale*b
with careful scaling to prevent over/underflow.
It is used in place
of the Level 2 BLAS routine STRSV when we expect very ill-conditioned
triangular systems, and want to get the solution vector (with a scale
factor) despite the fact that it may be very large or very small.
This is the case with condition estimation (SGECON, SPOCON, STRCON)
and inverse iteration to find eigenvectors of nonsymmetric matrices
(SLAEIN). SLATRS does careful scaling in the innermost loops to
prevent over/underflow, and is significantly slower than an optimized
STRSV. Also, it contains about 320 executable statements (versus
160 for the unoptimized version of STRSV), a measure of the
complexity of producing such code.
There are three ways to accelerate SLATRS.
The first way is to use extra exponent range to guarantee that
over/underflow cannot occur for a given number of iterations of
the outermost loop of triangular solve:
Solve L*x = b ... L is lower triangular
for i = 1 to n
x(i) = [ b(i) - ( sum_{j=1}^{i-1} l(i,j)*x(j) ) ] / l(i,i)
For example, if the entries of L and b are represented in IEEE
single, and no diagonal entry of L is denormalized, and native
80-bit Pentium arithmetic is used to implement this algorithm,
testing the scaling is needed only once every 64
values of i. In particular, this could be implemented by calling
an optimized SGEMV on blocks of size 64, which would get much of
the benefit of the optimized SGEMV. Double is not so good, since
we could only do blocks of 8. Alternatively, we could use scaling code
like that in SLATRS to compute the largest block size we could get
away with. This requires information about the norms of columns of L.
In the case of SLAEIN, where we call SLATRS many times, the cost of
this information is small. In the case of condition estimation, the
cost is larger.
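The block structure just described can be sketched as follows; the scaling test itself is elided (a real code would insert it once per block, with the block size chosen from the exponent-range argument above), and a GEMV does the bulk of the update.

```python
import numpy as np

def blocked_lower_solve(L, b, nb=64):
    """Solve L*x = b (L lower triangular) in blocks of nb unknowns.
    A real implementation would test for over/underflow once per block
    at the top of this loop; here the test is omitted."""
    x = b.astype(np.float64)
    n = len(b)
    for k in range(0, n, nb):
        j = min(k + nb, n)
        for i in range(k, j):               # small in-block substitution
            x[i] = (x[i] - L[i, k:i] @ x[k:i]) / L[i, i]
        if j < n:                           # one GEMV updates the rest of b
            x[j:] -= L[j:, k:j] @ x[k:j]
    return x
```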
The second way to accelerate SLATRS was explored in the paper
"Faster Numerical Algorithms via Exception Handling" [3].
The IEEE exception flags can be used to hide over/underflow problems
as follows: We first run using the optimized STRSV, and then check the
exception flags. Since over/underflow is rare, this will almost always
indicate that the solution is fine. Only in the rare cases that an
exception occurs does one need to recompute more carefully. This is
definitely the way to go on Pentium. Speedups reported in [3]
ranged from 1.4 to 6.5.
The third way to accelerate SLATRS, which could complement the above
ways, is to more cleverly estimate how big a block we can call STRSV on,
based on some data about sizes of off-diagonal entries of the matrix.
SLATRS currently does this in a naive way, and we have explored better
alternatives as well.
The complex analog CLATRS is used in the above routines as well as
in some other LAPACK routines like CTREVC, which computes eigenvectors
of triangular matrices. Speedups of CTREVC using exception handling
reported in [3] ranged from 1.38 to 1.65. The overall
routine CGEEV for the complex nonsymmetric eigenproblem sped up by 8%,
since CTREVC is just part of the calculation.
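The try-fast, recompute-rarely pattern of the second approach can be sketched with numpy, which lets us turn IEEE overflow into a trappable exception. The "careful" fallback here simply re-solves in the wider float64 range, standing in for SLATRS-style scaled code; all names are ours.

```python
import numpy as np

def forward_subst(L, b):
    # plain unscaled substitution, the analogue of the optimized STRSV
    x = np.zeros_like(b)
    for i in range(len(b)):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def solve_lower_fast(L, b):
    """Run the fast path first; recompute carefully only if an IEEE
    exception (overflow, invalid, divide-by-zero) was raised."""
    with np.errstate(over="raise", invalid="raise", divide="raise"):
        try:
            return forward_subst(L, b), False     # fast path succeeded
        except FloatingPointError:
            pass
    # rare slow path: redo the solve in the wider float64 range
    return forward_subst(L.astype(np.float64), b.astype(np.float64)), True
```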
Example 2.4: Eigenvalue calculation using Bisection

This is an inner loop in the ScaLAPACK routine PDLAIECT to compute
eigenvalues of symmetric matrices. The simplest version is as follows
Count the number of eigenvalues less than x of the symmetric tridiagonal
matrix T with diagonal entries a(1:n) and off-diagonals b(1:n-1)
count = 0
d(0) = 1
for i = 1 to n
d(i) = a(i) - x - b(i-1)^2/d(i-1) ... b(0) = 0
if d(i)<0, count = count + 1
In practice d(i) overwrites d(i-1), and we replace the inner loop by
d = a(i) - x - b(i-1)^2/d.
The trouble with this is that if d is small or zero, overflow occurs.
The standard fix is to test to see if d is too small, and explicitly
set it to a tiny safe value if it is:
for i = 1 to n
d = a(i) - x - b(i-1)^2/d
if (abs(d) < too_small) d = too_small
if d<0, count = count + 1
A faster way used by PDLAIECT if the user is on a machine
implementing IEEE infinity arithmetic (i.e. most of the time)
is to go ahead and compute an
infinite d, since the next time through the loop
b(i-1)^2/d will be zero, and the recurrence will continue uneventfully.
This also requires updating count slightly differently, to account for 0s.
Speedups reported in [3] range from 1.18 to 1.47.
for i = 1 to n
d = a(i) - x - b(i-1)^2/d
count = count + signbit(d)
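Both counting loops can be sketched and cross-checked as follows. Numpy supplies the IEEE infinity semantics (Python's own floats raise on division by zero), and signbit counts -0.0 and -inf as negative, which is one way of "accounting for 0s" as mentioned above.

```python
import numpy as np

def count_clamped(a, b, x, tiny=1e-12):
    # standard fix: clamp small pivots to a tiny safe value
    count, d = 0, 1.0
    for i in range(len(a)):
        off = b[i - 1] ** 2 if i > 0 else 0.0
        d = a[i] - x - off / d
        if abs(d) < tiny:
            d = tiny
        if d < 0:
            count += 1
    return count

def count_ieee(a, b, x):
    # IEEE variant: let d become infinite; off/d is then zero next
    # time and the recurrence continues uneventfully
    count, d = 0, np.float64(1.0)
    with np.errstate(divide="ignore", invalid="ignore"):
        for i in range(len(a)):
            off = np.float64(b[i - 1]) ** 2 if i > 0 else np.float64(0.0)
            d = np.float64(a[i]) - np.float64(x) - off / d
            count += int(np.signbit(d))
    return count
```

For the tridiagonal matrix with diagonal 2 and off-diagonals 1, whose eigenvalues are known in closed form, both loops return the same eigenvalue counts.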
One can further accelerate this recurrence by reorganizing it to
remove the division, which can cost 6 times or more as much as the
other floating point operations. The reorganized loop looks like
for i = 1 to n
p(i) = (a(i)-x)*p(i-1) - b(i-1)^2*p(i-2)
if (p(i) and p(i-1) have opposite signs) count = count + 1
The relationship between these loops is that d(i) = p(i)/p(i-1).
Again we can remove the subscript on p(i) if we use two temporaries:
for i = 1 to n step 2
p = (a(i)-x)*p1 - b(i-1)^2*p2
if (p and p1 have opposite signs) count = count + 1
p1 = (a(i+1)-x)*p - b(i)^2*p1
if (p and p1 have opposite signs) count = count + 1
p2 = p
This loop has no division, and so may be much faster than the first one.
Not all machines have such slow division, so this will be machine dependent.
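A sketch of the division-free recurrence, for matrices scaled so that the p(i) stay in range; the sign-change test between consecutive p's plays the role of d(i) < 0.

```python
def count_no_division(a, b, x):
    """Count eigenvalues of the tridiagonal (a, b) below x using
    p(i) = (a(i)-x)*p(i-1) - b(i-1)^2*p(i-2), with no division."""
    count = 0
    p2, p1 = 0.0, 1.0            # p(-1), p(0); b(0) = 0 eliminates p(-1)
    for i in range(len(a)):
        off = b[i - 1] ** 2 if i > 0 else 0.0
        p = (a[i] - x) * p1 - off * p2
        if (p < 0.0) != (p1 < 0.0):   # sign change <=> d(i) = p/p1 < 0
            count += 1
        p2, p1 = p1, p
    return count
```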
However this loop is much more susceptible to over/underflow,
because p(i) is the determinant of the leading i-by-i submatrix of T - x*I,
and determinants easily over/underflow. For example det(x*eye(n)) = x^n
in Matlab notation, and neither x nor n has to be very large for
overflow to occur. We again have two
approaches as in Example 2.3: use extra exponent range in the inner
loop (which can reduce the number of scaling tests in the inner loop
to one every 128 iterations in single or 16 in double),
or IEEE exception handling.
There are various ways to accelerate bisection by using Newton's
method or related ideas, but all involve inner loops similar to
the one above. [many refs]
This computation could not be done with extra-precise versions
of existing BLAS alone, but would require new code with a small
number of explicitly
declared extra precise variables. If PDLAIECT is "promoted" to a BLAS,
these could be hidden from the user.
Example 2.5: Orthogonal Eigenvector calculation with Inverse Iteration

Given the eigenvalues computed from bisection, the standard algorithm
to find their corresponding eigenvectors is inverse iteration,
which involves solving a tridiagonal system of linear equations.
In exact arithmetic, this algorithm would cost O(n) work per eigenvector,
i.e. an optimal O(1) work per solution component, and be embarrassingly
parallel. Until very recently, roundoff (from reorthogonalization) has
made this algorithm cost as much as O(n^2) per eigenvector and be serial,
depending on the eigenvalue distribution. Recent work by Parlett and
Dhillon promises to attain the complexity of the infinitely
precise algorithm, using only input/output precision. Currently their
algorithm works better than they can explain, but a complete proof may
remain elusive for a while. In the meantime, it is straightforward to
drastically reduce the incidence of the "hard cases" (when eigenvalues
are tightly clustered) by using higher intermediate precision.
Briefly, if computations are done in b bits, and an eigenvalue shares s

From the last section, we can see the possibility of the following
6 additional types (which may or may not be supported by a particular
compiler):
Table 3: Possible High Precision Data Types
X (extended precision real)
Y (extended precision complex)
W (doubled-double precision real)
U (doubled-double precision complex)
Q (quadruple precision real)
P (quadruple precision complex)
As mentioned in section 3, one can imagine that type X takes 10 bytes to
store, but that it would be more efficient to store it on 12-byte or
16-byte boundaries because of the memory system. We assume there is
only one way to store it per machine, and do not distinguish among these,
though they could differ from machine to machine.
In principle, the type of each input and output parameter could be
specified independently.
Independent of the input and output types, we could specify the
internal precision to be used during the matrix multiplication itself.
Clearly the following are all possibilities:
Table 4: Possible Internal Precisions  Preliminary List
S (single precision)
D (double precision)
X (extended precision)
W (doubled-double precision)
Q (quadruple precision)
However, there are several other possibilities we should consider.
First, for most calculations we would be happiest if the internal
precision were the widest one done in hardware (i.e. fast) by the machine,
which we call "native arithmetic".
On the Pentium and its clones, this means precision X (80-bit arithmetic).
On other machines (from SUN, HP, IBM and DEC) it means precision D
(double precision).
So we might want to designate internal precision I for indigenous arithmetic,
which would be either X or D depending on the machine.
Second, for iterative refinement and some other applications where extra
precision is essential and its cost low, we want to use whatever
extra-precise arithmetic is available, and do not care whether it is
X (on Pentiums) or W (on most other machines). So we might want to
designate E for Extra precision, which would be either X or W, depending
on the machine.
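For concreteness: a doubled-double value is an unevaluated sum hi + lo of two ordinary doubles, giving roughly twice the precision. A minimal Python sketch of summation in this style, built on Knuth's error-free two_sum (a full doubled-double arithmetic would also compensate the products, e.g. using the fused multiply-accumulate instructions mentioned in Section 1):

```python
def two_sum(x, y):
    # Knuth's error-free transformation: s + e equals x + y exactly
    s = x + y
    t = s - x
    e = (x - (s - t)) + (y - t)
    return s, e

def dd_sum(values):
    # accumulate a sum as a doubled-double (hi, lo); round at the end
    hi, lo = 0.0, 0.0
    for v in values:
        s, e = two_sum(hi, v)
        lo += e
        hi, lo = two_sum(s, lo)   # renormalize
    return hi + lo
```

Adding 1.0 to 1e16 and then subtracting 1e16 loses the 1.0 in plain double arithmetic but not in the doubled-double accumulation.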
But these are still not the only possibilities that one might want. The
expression tree for xGEMM is
ALPHA*(op_1(A)*op_2(B)) + BETA*C


                 +
               /   \
             *       *
           /   \    /  \
      ALPHA     * BETA   C
              /   \
        op_1(A)   op_2(B)
The traditional way to evaluate this or any other expression in
Fortran or C is to use the wider precision of either operand
of any binary operator. So if ALPHA, A, B, and C are single precision
real, but BETA is double precision, BETA*C and the addition will
be done in double precision, and the other two multiplications in single.
We denote this so-called "strict" evaluation scheme by T for traditional
(since S is already taken).
Finally, a perhaps more natural way to evaluate such a mixed
precision expression is to do all operations in the widest precision of any
type in the expression. In the example of the last paragraph, this means
all operations would be in double precision. This presumes the existence
of a total ordering on all the precisions in Table 4. A possible difficulty
here is that if a machine has both X and W arithmetic, it is conceivable that
X might have less precision but a wider exponent range than W, making them
hard to order. However, we doubt that any machine would have 3 distinct
precisions D, X and W that were in addition not totally ordered, and in any
case leave it to the implementation to impose a total order.
The natural letter to designate this "evaluate to widest" scheme is
W, which is already taken by doubleddouble, so we use M for
"evaluate using the Most precise type appearing."
Finally, we note that a few machines like the IBM 4361 actually support
exact inner products. This entails simulating a register wide enough
(hundreds or thousands of bits)
so that the square of the smallest floating point number can be added
to the square of the largest floating point number without roundoff.
We mention this possibility not because it is common or because we recommend
it, but to list all the possibilities that we might want to accommodate.
We indicate this internal precision by L for Long, since E (for Exact)
is already taken.
Altogether, this yields the following table of possible "internal precisions":
Table 5: Possible Internal Precisions  Final List
S (single precision)
D (double precision)
I (indigenous precision, either D or X)
X (extended precision)
E (extra precision, either X or W)
W (doubled-double precision)
Q (quadruple precision)
T (traditional)
M (most precise type appearing in expression)
L (long)
Finally, we can present our proposed calling sequence for extended
precision matrix multiplication:
XxGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC, PREC )
The differences between this calling sequence and the one for xGEMM above
are as follows:
1) The first letter is X, and the prefix x can be taken from either Table 2
or Table 3. x specifies the precision of the output variable
(in this case the matrix C). Not all precisions from Table 3 may be
implemented.
2) The final (optional) parameter PREC is a character string specifying the
internal precision and input variable types. It is interpreted as follows.
2.1) If PREC is a null string (or missing), then the types of all the input
variables default to be the same as the output variable as specified
by x. The internal precision must be AT LEAST as high as the
precision associated with x. The actual precision is available
from the enironmental enquiry specified below. Note that this
assumes a total ordering on the precisions available in Table 5
(except for T and M, which are equivalent to one of the other
precisions in this case, since all input and output types
are identical to x).
2.2) If PREC contains a single character, it must be from Table 5
(otherwise it is an error) and specifies the internal precision.
Again, the implementation is free to use at least as much precision
as specified, as reported by the environmental enquiry routine.
Since the output and input precisions are all the same (x),
choosing either PREC=T or PREC=M is equivalent to choosing PREC=x.
2.3) If PREC contains more characters, these specify the input types,
associating with the input variables from left to right. For
example PREC = "XSDCZQ" means that the internal precision is
(at least) X, and the types of ALPHA, A, B, BETA and C are
S, D, C, Z and Q, respectively. If there are fewer characters
than input variables, then the rightmost character applies to
the remaining input variables. Thus
XDGEMM( ... , 'DS' )
returns a double precision product from single precision inputs,
using (at least) double precision internally.
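Rules 2.1-2.3 are easy to mechanize. The sketch below (the names are ours, not part of the proposal) returns the internal precision and per-argument input types for a routine with five floating point arguments (ALPHA, A, B, BETA, C):

```python
INTERNAL_PRECISIONS = set("SDIXEWQTML")   # the letters of Table 5

def parse_prec(prec, output_type, n_inputs=5):
    """Interpret the optional PREC string of an XxGEMM-like routine.
    Returns (internal_precision, list of input types)."""
    if not prec:
        # rule 2.1: inputs default to the output type x; the internal
        # precision must then be at least x (x returned here)
        return output_type, [output_type] * n_inputs
    internal = prec[0]                    # rule 2.2: first character
    if internal not in INTERNAL_PRECISIONS:
        raise ValueError("unknown internal precision " + repr(internal))
    rest = prec[1:]
    if not rest:
        return internal, [output_type] * n_inputs
    # rule 2.3: remaining characters name input types left to right;
    # the rightmost character applies to any remaining inputs
    types = list(rest) + [rest[-1]] * (n_inputs - len(rest))
    return internal, types[:n_inputs]
```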
We believe that this scheme requires the least change (a single extra
character) to get most of the desired functionality, but permits detailed
control over precision if desired. If a system does not permit the
user to declare variables of type Q, say, then this will be obvious
at load time because XQGEMM will not exist. If the user asks for
internal precision W but the system better supports the wider precision Q,
then the implementor is free to use Q instead.
One possible exception to the rule that the system may use precision
at least as high as requested by the user might be T. We cannot think
of a situation where the user would explicitly specify traditional
evaluation, as well as differing individual input precisions, without
wanting this specific evaluation scheme.
What about C++, Java, F90?
Section 5: Environmental Enquiries

We propose to extend xLAMCH to handle enquiries about the internal
precision used. Ideally, one would use the same routine name with extra
arguments, which if present, would have the meaning specified below.
If this is not possible or desirable, then we would use XxLAMCH instead.
The calling sequence is as follows:
xLAMCH( CMACH, PREC, ROUTINE )
CMACH has the same meaning as in the current xLAMCH, except that the arguments
S (safe minimum, smallest number such that 1/safmin does not overflow)
U (underflow threshold)
O (overflow threshold)
are not permitted, because their values may not be representable in
precision x (for example, suppose the internal precision is D and x=S).
Information about these values is available from arguments
M (exponent of U) and L (exponent of O). This way, all data about the
widest precision will be representable in the narrowest precision.
PREC has the same meaning as in the last section. If the precision
is not implemented, xLAMCH returns -1 for any CMACH.
ROUTINE, if present, is a string with the name of the routine being
asked about (e.g. 'XSGEMM'). This provides a way to ask about the
implementation of a particular routine. If the routine is not implemented
with the precision PREC, xLAMCH returns -1 for any CMACH.
If ROUTINE is not present, only the first character of PREC is relevant.
If ROUTINE is present, the other characters of PREC are interpreted as
in the last section.
Here is what the values returned by xLAMCH mean.
If xLAMCH( CMACH, PREC, 'XxGEMM' ) returns
EPSz, BASEz, EMINz and EMAXz for internal precision PREC, and
xLAMCH( CMACH, 'x' ) returns
EPSx, BASEx, EMINx and EMAXx for output type x,
then this means that XxGEMM satisfies the following error bound.
Let
Uz = BASEz**(EMINz-1) ... underflow threshold for internal precision
Ux = BASEx**(EMINx-1) ... underflow threshold for input/output type
Oz = BASEz**(EMAXz+1)*(1-EPSz) ... overflow threshold for internal precision
Ox = BASEx**(EMAXx+1)*(1-EPSx) ... overflow threshold for input/output type
Then the error bound is
|computed C(i,j) - true C(i,j)| <=
(K+2)*EPSz* \sum_{k=1}^K |ALPHA*op_1(A)(i,k)*op_2(B)(k,j)|
+(K+1)*EPSz* |BETA*initial C(i,j)|
... roundoff from internal precision
+EPSx* |computed C(i,j)|
... roundoff from rounding result to output precision
+(2*K*|ALPHA|+3)*Uz
... underflow from internal precision
+Ux
... underflow from rounding result to output precision
This bound reflects the following assumptions:
1) Strassen's or other fast methods may not be used, but the terms
in the dot products may be accumulated in any order.
2) The parentheses in ALPHA*(op_1(A)*op_2(B)) are respected.
3) Gradual Underflow or Flush-to-Zero is permitted.
4) No intermediate value exceeded Oz in magnitude, and the final
result does not exceed Ox in magnitude.
Analogous formulas exist for other BLAS.
Since the standard BLAS library is not forbidden from using
extra precision, we propose that this enquiry also apply to
the standard BLAS. In other words, one could equally well call
xLAMCH(CMACH, 'S', 'SGEMM') as xLAMCH(CMACH, 'X', 'XSGEMM'), say.
Furthermore, we encourage developers to use higher precision if
the performance penalty is insignificant.
This gives us added flexibility in enquiring about the internals
of the BLAS. For example, we could have an argument CMACH='A'
and return data about the algorithm used. The return value might be
1 if standard matrix multiplication is used, with the order of
summation guaranteed to be some particular order;
2, 3, ... same as 1 with other orders, or with no order guaranteed;
4 if Strassen's method is used.
Section 6: Prioritized List of Extra Precision BLAS

In this section we describe the most important extra precision
BLAS to implement.
XxDOT(N, X, INCX, Y, INCY, PREC)
The dot product is the basis of most other high precision operations.
Initially just a single character PREC would be supported, but
eventually it should be possible to compute and return the result
to higher precision than the inputs, to support using the normal
equations for least squares problems. Also, permitting one input
to be real and the other complex would support Hermitian eigenvalue
algorithms based on divide-and-conquer; a simple implementation
would consist of two calls to the existing BLAS xDOT.
XxDOTG(TRANS, N, ALPHA, X, INCX, Y, INCY, BETA, C, PREC )
This implements
C <- ALPHA*sum_i op(X(1+i*INCX))*Y(1+i*INCY) + BETA*C
and is intended to be the natural building block for XxGEMV and XxGEMM
below. The G in DOTG stands for "general".
XxGEMV( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY, PREC )
Matrix-vector multiplication is the most natural routine with which
to implement iterative refinement as described in section 2. It
cannot be implemented directly using XxDOT for general ALPHA and
BETA, which is why we proposed XxDOTG.
A sparse version of this routine (or of DOT), where A is lower precision
than X and Y, would support Lanczos.
XxTRSV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX, SCALE, PREC )
Solution of triangular systems can also be based on XxDOT, but
can take special advantage of exception handling as described above.
The routine solves op(A)*Y=SCALE*X (where the solution Y overwrites X),
and SCALE is chosen to avoid over/underflow. Alternatively, the
name and calling sequence of the LAPACK auxiliary routine xLATRS
could be used instead.
Section 7: Automatic Generation of These Routines

We assume that C is our implementation language. In this
case, all the floating point variables would be passed in
as void*. There would be a big case statement at the beginning
based on PREC to branch to the right set of nested loops.
Each set of loops would differ only in the casts of the
void* variables, and could be generated automatically
(optimization would require use of PHIPAC or a similar system).
What about C++, Java, F90?
Section 8: Automatic Testing of These Routines

We could generate test cases whose answers we know exactly
as a function of the intermediate precision. These matrices
would be chosen to give different answers as a function of
the internal precision, and so serve as a signature of that precision.
The same examples could be used to implement xLAMCH.
Section 9: Sample Implementations

We could just do XxDOTG in C assuming single precision input as an
illustration. This should be quite short.
Section 10: Sample Applications

The first example to do is iterative refinement for linear systems
and least squares, since the effort is so small and the payoff so large.
This is particularly true for single precision input. We would code
these without respecting the detailed specifications above, just
quick-and-dirty, to measure the speed and accuracy.
Some of the greatest benefits described above for least squares and the
symmetric eigenvalue problem will require new routines not currently part
of the BLAS. Writing them would require a few variables
that are higher precision and/or range than the input/output variables.
When the input variables are single, we can simply use double, and this
should be the next thing we do.
When the input is double, we have a problem, because most compilers
do not support the declaration of higher precision variables.
This is even true on Pentiums, which have hardware extended precision.
How should we write such code?
One way is to propose new BLAS like XXMUL(A,B,C) which would compute
the product C=A*B of extended precision variables, where A, B and C
are arrays of appropriate length. Given similar routines for add, subtract,
divide, sqrt and comparison, we could essentially write all our routines
in assembly language. This approach has two big drawbacks: it encourages
us to produce inscrutable, machine-dependent, hard-to-maintain and
hard-to-port code, i.e. life before IEEE 754. Second, it largely removes
pressure from compiler writers to accommodate us.
The second way is to just write relatively simple code using double as the
extended precision when the input is single, and perhaps one impressive
(if painful) benchmark with double precision input, to hold out a carrot
to compiler writers to provide appropriate support for extended variables.
So we might write
XxMUL, XxADD, XxSUB, XxDIV, XxSQRT, XxCMP, XxCOPY
for our own use, but not propose them as a standard.
Section 11: Comparison to a previous proposal

Appendix B of the BLAS2 paper [1] specifies a set of extra-precise
BLAS, which the authors called EBLAS. Our proposal naturally has
a lot in common with that earlier one, but there are some
important differences: the earlier proposal made several
more limiting assumptions about uses of the BLAS, machine architectures,
and compilers than we have here. They assumed that it was necessary to
have at least one extended precision argument if the internal precision
were to be extended. In other words, not all uses of extra precision
could be hidden in the BLAS, so the ability to declare REAL*16 and
COMPLEX*32 variables was required. In contrast, we can get a lot of
use out of extra precision BLAS in cases where no such declarations
need to be made (iterative refinement). Furthermore, they did not
specify environmental enquiries, which are important to get error
bounds. We believe our current updated proposal better matches
current understanding of algorithms and the limitations and opportunities
of modern architectures and compilers.
Bibliography

[1] J. Dongarra, J. Du Croz, S. Hammarling and R. Hanson,
"An Extended Set of FORTRAN Basic Linear Algebra Subprograms",
ACM TOMS, v. 14, n. 1, March 1988
[2] W. Kahan, M. Ivory, personal communication
[3] J. Demmel, X. Li, "Faster Numerical Algorithms via Exception Handling",
IEEE Trans. Comp., v. 43, n. 8, 1994