%%
%% LAST MODIFIED by Susan Blackford on August 12, 1997
%%
\documentclass[12pt]{article}
\begin{document}
\bibliographystyle{plain}
\title{Linear Algebra Kernel Functionality}
\author{The BLAST Forum
\thanks{This document so far edited by Susan Blackford, Jack Dongarra,
Sven Hammarling, Linda Kaufman, Antoine Petitet and Barry Smith} }
\date{30 November 1997}
\maketitle
\begin{center}
http://www.netlib.org/utk/papers/blast-forum.html
\end{center}
\begin{abstract}
This chapter lists the perceived functionality of numerical linear algebra
library kernels as discussed at meetings of the BLAS Technical Forum. Included
are kernels needed for dense matrices, sparse direct methods, and sparse
iterative methods, as well as related kernels needed in non-traditional linear
algebra computations such as CFD and finite elements. This document covers
overall functionality without regard to whether a function is required in a
particular context, such as sparsity.
The focus is on sequential and shared memory computers, but functionality for
the parallel case has been borne in mind as well.
This version of the document gives a brief introduction and then presents tables
listing the functionality. Further information on the BLAS Technical Forum can
be found at the URL given above. Comments on this document and active
participation in the Forum are encouraged.
\end{abstract}
% --------------------------------------------------------------------------
\newpage
\section{Introduction}
$<${\bf This section still needs some re-writing}$>$
A technical forum has been established to consider expanding the Basic Linear
Algebra Subprograms (BLAS) in a number of directions in the light of modern
software, language, and hardware developments.
Various working groups have been established to consider issues such as the
overall functionality, language interfaces, sparse BLAS, parallel BLAS and
extensions to the existing BLAS.
This document summarizes, in tabular form, the discussions of the Functionality
Working Group. Note that this document does not consider issues such as storage
formats or data types.
The list is fairly exhaustive, and it is not assumed that everything will
necessarily be included in the final report. Some indication of the priority of
the functions is given in the tables, together with some justification for
their inclusion. The list includes the functionality covered by the
existing Level 1, 2 and 3 BLAS
\cite{LHKK:TOMS:79,DGL:TOMS:91,DDHH:TOMS:88,DDDH:TOMS:90}.
This document considers overall functionality, irrespective of whether a
function is required in the dense or the sparse case. Many of the operations
could also usefully be made available as multiple instances; this is
discussed in a separate section.
In this version of the document, the functionality is categorized more or less
along traditional BLAS lines, but the Forum discussed other schemes and it may
be appropriate to categorize the functionality in more than one way in the
final document.
If the functionality you need is not listed here, please comment and, if
possible, join the Forum to let us know what you need.
\section{Justifications for extensions to the BLAS}
$<${\bf Susan added this section from email correspondence with Linda.
Perhaps this section should be renamed, re-written, or moved. In this
version this section has been revised by Sven.}$>$
In the original BLAS, each level was categorized by the type of operation:
Level 1 covered scalar and vector operations, Level 2 covered matrix-vector
operations, and Level 3 covered matrix-matrix operations. The functionality
tables in this document are categorized in a similar manner, but with some
additional categories covering operations that were missing from the
original BLAS.
The motivation for each new operation is demonstrated need: no function was to
be added merely because it fit neatly into a framework, and no subroutine was
to be discarded because it did not fit into a particular framework.
Many of the new functions are auxiliary routines in LAPACK
\cite{ABBDDDGHMOS:95}. Only after the LAPACK project had begun was it realized
that operations such as copying a matrix, computing the norm of a matrix, and
generating Householder transformations occurred so often that it was
worthwhile to provide separate subroutines for them.
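As an illustration of why Householder generation deserves its own kernel, the following sketch constructs the pair $(\alpha, u)$ with $H = I - \alpha u u^T$ so that $Hx$ is zero below its first component. The routine name and conventions here are illustrative only, not a proposed interface.

```python
import numpy as np

def house(x):
    """Return (alpha, u) such that H = I - alpha*u*u^T is orthogonal and
    H @ x has zeros below its first component."""
    x = np.asarray(x, dtype=float)
    sigma = np.linalg.norm(x)
    if sigma == 0.0:
        return 0.0, np.zeros_like(x)        # x is zero; H = I
    beta = -np.copysign(sigma, x[0])        # sign chosen to avoid cancellation
    u = x.copy()
    u[0] -= beta                            # u = x - beta*e_1
    alpha = 1.0 / (sigma**2 - beta * x[0])  # equals 2 / (u^T u)
    return alpha, u
```

Applying the resulting $H$ to $x$ yields $\beta e_1$ with $|\beta| = \|x\|_2$.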
A second group of extensions broadens the functionality of some of the
existing BLAS. For example, the Level 3 BLAS routine for the rank-$k$ update
of a symmetric matrix allows only a positive update, which means that it
cannot be used for the reduction of a symmetric matrix to tridiagonal form
(to facilitate the computation of the eigensystem of a symmetric matrix), for
the decomposition of an indefinite matrix, or for a quasi-Newton update in an
optimization routine.
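A dense reference for the more general update $C \leftarrow \alpha A D A^T + \beta C$ can be sketched as follows; with negative entries on the diagonal of $D$ it expresses the indefinite updates that the existing symmetric rank-$k$ routine cannot. The routine name is illustrative.

```python
import numpy as np

def sydrk(alpha, A, d, beta, C):
    """Reference computation of C <- alpha*A*diag(d)*A^T + beta*C.
    Allowing negative entries in d covers the indefinite updates
    (tridiagonal reduction, indefinite decomposition, quasi-Newton)
    mentioned above.  A*d scales the columns of A by diag(d)."""
    return alpha * (A * d) @ A.T + beta * C
```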
Some of the new BLAS might be considered special cases of the original BLAS.
The number of Level 2 and 3 BLAS was purposely kept small, but each routine
had a number of parameters that could extend its functionality.
In practice, most of these parameters were either 1.0 or 0.0.
Adding these parameters, rather than specifying different subroutines,
increased the opportunities for the user to make mistakes and increased the
overhead of a call to the routine. In this document many of the functions are
presented in their general form, but we make recommendations about special
cases in the Language Independent Specification chapter.
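The trade-off can be seen in even a small sketch of the Level 2 matrix-vector product: the general form must carry and test the scaling parameters, while the common $\alpha = 1$, $\beta = 0$ case needs neither. The code below is purely illustrative and follows the existing BLAS convention that $y$ is not referenced when $\beta = 0$.

```python
import numpy as np

def gemv(alpha, A, x, beta, y=None):
    """General form y <- alpha*A*x + beta*y.  When beta == 0 the vector y
    is not referenced (it may be uninitialized)."""
    if beta == 0.0:
        return alpha * (A @ x)
    return alpha * (A @ x) + beta * y

def mv(A, x):
    """The common special case alpha = 1, beta = 0: fewer arguments,
    fewer opportunities for error, and no scaling overhead."""
    return A @ x
```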
The extensions considered in the section on multiple instances can be thought
of as multiple instances of the Level 1 BLAS. They tend to arise when working
with narrow banded, sparse, or structured matrices. The canonical example is
the decomposition of many narrow banded matrices simultaneously. This type of
problem arises when finding several eigenvectors by inverse iteration, when
using parallel bisection to determine the eigenvalues of a symmetric matrix,
when solving two-dimensional separable partial differential equations, or
when preconditioning an iterative method for a nonseparable differential
equation.
Certainly, on a shared memory multiprocessor using a compiler with
parallelization directives, one could specify that various Level 1 BLAS such
as IxAMAX, xSWAP, and xAXPY be executed simultaneously. When that is not the
case and portable code is desired, however, the existence of BLAS covering
this situation is helpful.
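A portable multiple-instance AXPY of the kind argued for above might look as follows, with one instance per row; the interface is a sketch, not a proposal.

```python
import numpy as np

def axpy_batch(alphas, X, Y):
    """Multiple-instance AXPY: Y[i, :] <- alphas[i]*X[i, :] + Y[i, :]
    for every instance i at once, e.g. one instance per narrow banded
    system being decomposed simultaneously."""
    return alphas[:, None] * X + Y
```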
Other contenders for multiple-instance routines are routines for generating
and applying Givens transformations, used, for example, to reduce a banded
matrix to tridiagonal form in order to solve the symmetric eigenvalue problem.
Simultaneous Givens transformations are also useful when solving the
tridiagonal symmetric positive definite generalized eigenvalue problem and
when using the QR algorithm to solve the standard symmetric tridiagonal
eigenvalue problem.
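For reference, the classical construction of a single Givens rotation is sketched below; a multiple-instance version would generate many such $(c, s)$ pairs at once. The sign and scaling conventions of the legacy xROTG are not reproduced here.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b)
    gives (r, 0), with r = hypot(a, b)."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)  # sqrt(a^2 + b^2) without intermediate overflow
    return a / r, b / r
```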
For large matrix problems, it might not be feasible or desirable to store the
full matrix. In this situation one cannot simply use the matrix-vector
multiplication routines in the present BLAS. However, one may be able to form
the product with simultaneous inner product routines, because such matrices
contain repeated submatrices. This is particularly true of the Toeplitz and
Hankel matrices used in signal and image processing.
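As a concrete instance of the structure exploitation just described, a Toeplitz matrix-vector product can be formed from the $2n-1$ generating entries alone, each output component being an inner product of $x$ with a shifted window; the helper below is illustrative.

```python
import numpy as np

def toeplitz_matvec(col, row, x):
    """y = T x where T is Toeplitz with first column `col` and first row
    `row` (col[0] == row[0]).  T is never formed: each y[i] is an inner
    product of x with a length-n window of the 2n-1 generators."""
    n = len(x)
    g = np.concatenate([col[::-1], row[1:]])  # g[n-1+d] holds T[i, i+d]
    return np.array([g[n - 1 - i : 2 * n - 1 - i] @ x for i in range(n)])
```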
%\bibliography{svens}
\begin{thebibliography}{1}
\bibitem{ABBDDDGHMOS:95}
E.~Anderson, Z.~Bai, C.~H. Bischof, J.~Demmel, J.~J. Dongarra, J.~Du~Croz,
A.~Greenbaum, S.~Hammarling, A.~McKenney, S.~Ostrouchov, and D.~C. Sorensen.
\newblock {\em {LAPACK} Users' Guide}.
\newblock SIAM, Philadelphia, PA, USA, 2nd edition, 1995.
\newblock (URL: http://www.netlib.org/lapack/lug/lapack{\_}lug.html).
\bibitem{DGL:TOMS:91}
D.~S. Dodson, R.~G. Grimes, and J.~G. Lewis.
\newblock Sparse extensions to the {FORTRAN} {Basic Linear Algebra
Subprograms}.
\newblock {\em ACM Trans. Math. Software}, 17:253--272, 1991.
\newblock (Algorithm 692).
\bibitem{DDDH:TOMS:90}
J.~J. Dongarra, J.~Du~Croz, I.~S. Duff, and S.~Hammarling.
\newblock A set of {Level~3 Basic Linear Algebra Subprograms}.
\newblock {\em ACM Trans. Math. Software}, 16:1--28, 1990.
\newblock (Algorithm 679).
\bibitem{DDHH:TOMS:88}
J.~J. Dongarra, J.~Du~Croz, S.~Hammarling, and R.~J. Hanson.
\newblock An extended set of {FORTRAN} {Basic Linear Algebra Subprograms}.
\newblock {\em ACM Trans. Math. Software}, 14:1--32, 1988.
\newblock (Algorithm 656).
\bibitem{KW}
L.~Kaufman and D.~Warner.
\newblock High-order, fast direct methods for separable elliptic equations.
\newblock {\em SIAM Journal on Numerical Analysis}, 21:672--695, 1984.
\bibitem{K}
L.~Kaufman.
\newblock Implementing the {$MDM^{T}$} decomposition.
\newblock {\em ACM Trans. Math. Software}, 21:476--489, 1995.
\bibitem{LHKK:TOMS:79}
C.~L. Lawson, R.~J. Hanson, D.~Kincaid, and F.~T. Krogh.
\newblock {Basic Linear Algebra Subprograms} for {FORTRAN} usage.
\newblock {\em ACM Trans. Math. Software}, 5:308--323, 1979.
\newblock (Algorithm 539).
\bibitem{SHJ}
C.~Smith, B.~Hendrickson, and E.~Jessup.
\newblock A parallel algorithm for {Householder} tridiagonalization.
\newblock In {\em Proceedings of the Fifth SIAM Conference on Applied
Linear Algebra}, pages 361--365, 1994.
\end{thebibliography}
% --------------------------------------------------------------------------
\newpage
\appendix
\section{Notation}
The following notation is used throughout the tables in the appendices.
\begin{itemize}
\item $x_i$ - an element of a one-dimensional array
\item $x_I$ - subset of entries of $x$ indexed by $I$
\item $y[i]$ - an array from an array of arrays
\item $x_{\mbox{c}}$ - compressed (sparse) array, only non-zeros are stored,
associated with an index set $I$
\item $e$ - vector whose elements are all unity
\end{itemize}
The last column of each table gives the justification for a routine's
inclusion.
\begin{itemize}
\item ``LEGACY'' refers to the fact that it is a legacy BLAS,
\item ``AUX'' means that it has appeared as an auxiliary routine
in LAPACK, reference numbers refer to items in the bibliography,
\item ``DERIV'' means that it is a commonly used subset of an existing
BLAS which deserves its own subroutine, and
\item ``COMMON'' means that it appears so commonly in existing software,
that it is surprising that it is not presently a BLAS.
\end{itemize}
A question mark in the last column suggests that it will probably be
a new BLAS if there is sufficient support, and a ``-'' sign suggests
that it will probably never be in the final list.
Not all of the auxiliary subroutines in LAPACK are included, and some
have question marks or ``-'' signs next to them because their utility
outside of LAPACK has not yet been sufficiently demonstrated.
% --------------------------------------------------------------------------
\newpage
\section{Scalar and Vector Operations}
This Appendix gives the operations corresponding to the traditional Level 1 BLAS
operations. Table \ref{tab:red} lists the scalar and vector reduction
operations, Table \ref{tab:vec} lists the vector operations, and Table
\ref{tab:movev} lists those vector operations that involve only data movement.
The operations are intended to apply to both real and complex arguments. For the
sake of compactness the complex operators are omitted, so that whenever a
transpose operation is given the conjugate transpose should also be assumed for
the complex case.
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Inner product & $r \leftarrow x^{T} y,$ & Legacy \\
& $r \leftarrow r + x^{T} y$ & Legacy \\
& $r \leftarrow \alpha x^{T} y,$ & - \\
& $r \leftarrow r + \alpha x^{T} y$ & - \\
Vector norms & $r \leftarrow ||x||_1,$ & Legacy \\
& $ r \leftarrow ||x||_{\infty}$ & Common \\
& $r \leftarrow ||x||_2, $ & Legacy \\
& $ r \leftarrow ||x||_{p}$ & - \\
Sum & $r \leftarrow \sum_{i} x_{i}$ & Common \\
Location of max value & $i = \mbox{arg}\max_i |x_i|$ & Legacy \\
Max value \& location & $i, x_i; \; i = \mbox{arg}\max_i x_i$ & Common \\
Max absolute value \&
location & $i, |x_i|; \; i = \mbox{arg}\max_i |x_i|$ & Common \\
Sum of squares & $(a,b) \leftarrow \sum x_i^2,$ & Common \\
& $ a \cdot b^2 = \sum x_i^2 $ & - \\
Generate Givens
rotation & $(c, s) \leftarrow \mbox{rot}(a, b)$ & Legacy \\
Generate Modified
Givens rotation & $(z, r) \leftarrow \mbox{mrot}(a, b)$ & Legacy \\
Generate Jacobi
rotation & $(c, s) \leftarrow \mbox{jrot}(a, b)$ & ? \\
Generate Householder
transform & $(\alpha, u) \leftarrow \mbox{house}(x),$ & \\
& $H = I - \alpha u u^T$ & Aux \\
\hline
\end{tabular}
\end{center}
\caption{Reduction Operations} \label{tab:red}
\end{table}
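The sum-of-squares entry in Table \ref{tab:red} returns a pair rather than a single number so that the reduction can be carried out without overflow or underflow, in the manner of LAPACK's auxiliary routine xLASSQ. The sketch below uses the scaled-pair convention $\mathit{scale}^2 \cdot \mathit{ssq} = \sum x_i^2$; the exact pairing intended by the table is left open.

```python
def sumsq(x, scale=1.0, ssq=0.0):
    """Overflow-guarded sum of squares.  Returns (scale, ssq) with
    scale**2 * ssq == sum(x_i**2); the pair can seed a further call
    to continue the reduction over another segment of data."""
    for xi in x:
        axi = abs(xi)
        if axi == 0.0:
            continue
        if scale < axi:
            ssq = 1.0 + ssq * (scale / axi) ** 2  # rescale to new pivot
            scale = axi
        else:
            ssq += (axi / scale) ** 2
    return scale, ssq
```

Squaring the elements directly would overflow for entries near the range limit; the running rescaling avoids this.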
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Zero & $x \leftarrow 0$ & Common \\
Constant & $x \leftarrow \alpha e$ & Common \\
Conjugate copy & $y \leftarrow \bar{x}$ & Common \\
Scale & $x \leftarrow \alpha x,$ & Legacy \\
& $y \leftarrow \alpha x$ & Common \\
Group scale & $x_{I_i} \leftarrow a_i x_{I_i}$ & Common,\\
& & diagonal
scaling \\
Sum & $y \leftarrow \pm x + y, $ & ? \\
& $w \leftarrow \pm x + y$ & ? \\
Shift & $x \leftarrow x + \sigma e, $ & ? \\
& $y \leftarrow x + \sigma e$ & ? \\
AXPY \& AXPY like & $y \leftarrow \alpha x + y, $ & Legacy \\
& $w \leftarrow \alpha x + y$ & Common \\
& $y \leftarrow \alpha x + \beta y,$ & ? \\
& $w \leftarrow \alpha x + \beta y$ & Common,\\
& & optimization \\
& $y \leftarrow \alpha x + \beta y + \sigma e,$ & ? \\
& $w \leftarrow \alpha x + \beta y + \sigma e$ & ? \\
Absolute value & $x_i \leftarrow |x_{i}|$ & \\
& $y_i \leftarrow |x_{i}|$ & Common \\
Elementwise
multiply & $x_{i} \leftarrow x_{i} y_{i}, $ & Common,\\
& & diagonal
scaling \\
& $w_{i} \leftarrow x_{i} y_{i}$ & ? \\
Elementwise divide & $x_{i} \leftarrow x_{i}/y_{i}, $ & Common \\
& $w_{i} \leftarrow x_{i}/y_{i}$ & ? \\
Apply plane
rotation & $(\begin{array}{cc}x & y\end{array})
\leftarrow
(\begin{array}{cc}x & y\end{array})R$ & Legacy \\
Apply modified
plane rotation & $(\begin{array}{cc}\tilde{x} &
\tilde{y}\end{array})D
\leftarrow
(\begin{array}{cc}\tilde{x} &
\tilde{y}\end{array})D R$ & Legacy \\
\hline
\end{tabular}
\end{center}
\caption{Vector Operations} \label{tab:vec}
\end{table}
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Copy & $y \leftarrow x$ & Common \\
Swap & $y \leftrightarrow x$ & Legacy \\
Sort into order \&
return index vector & $(iv, y) \leftarrow \mbox{sort}(x)$ & Common \\
Sort according to
index vector & $y \leftarrow \mbox{sort}(x, iv)$ & Common \\
\hline
\end{tabular}
\end{center}
\caption{Data Movement with Vectors} \label{tab:movev}
\end{table}
% --------------------------------------------------------------------------
\clearpage
\newpage
\section{Matrix Vector Operations}
This Appendix gives the operations corresponding to the traditional Level 2 BLAS
operations. Table \ref{tab:matvec} lists the matrix-vector operations. The
operations are intended to apply to both real and complex arguments, and the
matrix arguments $A, T, L \mbox{ and } U$ are intended to be dense or banded or
sparse. In addition, where appropriate, the matrix $A$ can be symmetric
(Hermitian) or triangular or general. The matrix $T$ specifically represents a
triangular matrix, either upper or lower, $L$ represents a lower triangular
matrix and $U$ an upper triangular matrix, all three of which can be unit or
non-unit triangular.
$D$ represents at least a diagonal matrix, and maybe also a $2 \times 2$ blocked
diagonal or a tridiagonal matrix. $Q$ represents an orthogonal (unitary) matrix
and it is an open question as to how $Q$ should be stored.
For the sake of compactness the complex operators are omitted, so that whenever
a transpose operation is given both the conjugate and conjugate transpose should
also be assumed for the complex case.
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Matrix vector product & $x \leftarrow A x, \; x \leftarrow
A^T x$ & Deriv \\
& $y \leftarrow \pm A x, \;
y \leftarrow \pm A^T x$ & Deriv \\
& $y \leftarrow \alpha A x + \beta y, \;
y \leftarrow \alpha A^T x + \beta y$ & Legacy \\
Triangular solve & $x \leftarrow T^{-1}x, \;
x \leftarrow T^{-T}x$ & Legacy \\
& $x \leftarrow \alpha T^{-1}x, \;
x \leftarrow \alpha T^{-T}x$ & - \\
& $y \leftarrow \alpha T^{-1}x, \;
y \leftarrow \alpha T^{-T}x$ & - \\
Rank one and symmetric
($A = A^T$) & $A \leftarrow \beta x y^T + \alpha A$ & ? \\
~~rank one \& two updates & $A \leftarrow \beta x y^T + A$ & Legacy \\
& $A \leftarrow \beta x x^T + A$ & Legacy \\
& $A \leftarrow \beta x x^T + \alpha A, \;
A \leftarrow \beta x D x^T + \alpha A$ & ? \\
& $A \leftarrow (\beta x) y^T +
y (\beta x)^T + \alpha A$ & - \\
& $A \leftarrow (\beta x D) y^T +
y (\beta x D)^T + \alpha A$ & - \\
Application of $Q$ from & $x \leftarrow Q x, \;
x \leftarrow Q^T x$ & Aux \\
~~$QR$ factorization & $y \leftarrow Q x, \;
y \leftarrow Q^T x$ & Aux \\
\hline
\end{tabular}
\end{center}
\caption{Matrix Vector Operations} \label{tab:matvec}
\end{table}
% --------------------------------------------------------------------------
\clearpage
\newpage
\section{Matrix Operations}
This Appendix gives the operations corresponding to the traditional Level 3 BLAS
operations, as well as a number of additional single matrix operations. Table
\ref{tab:mat} lists single matrix operations and matrix operations that
involve $O(n^2)$ operations, Table \ref{tab:matmat} lists the $O(n^3)$
matrix-matrix operations and Table \ref{tab:movem} lists those matrix operations
that involve only data movement. The operations are intended to apply to both
real and complex arguments. Where appropriate one or more of the matrices can
also be symmetric (Hermitian) or triangular or general. The matrix $T$
specifically represents a triangular matrix, either upper or lower, $L$
represents a lower triangular matrix and $U$ an upper triangular matrix, all
three of which can be unit or non-unit triangular.
$D$ represents at least a diagonal matrix, and maybe also a $2 \times 2$ blocked
diagonal or a tridiagonal matrix. $Q$ represents an orthogonal (unitary) matrix
and it is an open question as to how $Q$ should be stored.
For the sake of compactness the complex operators are omitted, so that whenever
a transpose operation is given both the conjugate and conjugate transpose should
also be assumed for the complex case.
\newpage
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Matrix set & $A \leftarrow \alpha I$, & Common \\
& $A \leftarrow \alpha I + \beta(e e^T - I)$ & - \\
Matrix norms & $r \leftarrow ||A||_1,
r \leftarrow ||A||_F,
r \leftarrow ||A||_{\infty}$ & Aux \\
Maximum absolute value & $r \leftarrow \max |a_{i,j}|$ & Common \\
Matrix scaling & $A \leftarrow \alpha A, \;
B \leftarrow \alpha A$ & Aux \\
Diagonal scaling & $A \leftarrow D A, \;
A \leftarrow A D, \;
A \leftarrow D_1 A D_2$ & Common \\
Matrix add & $A \leftarrow \pm A + B, \;
C \leftarrow \pm A + B$ & Common \\
Matrix add and scale & $A \leftarrow \alpha A + \beta B, \;
C \leftarrow \alpha A + \beta B$ & ? \\
Elementwise multiply & $a_{ij} \leftarrow a_{ij} b_{ij}, \;
c_{ij} \leftarrow a_{ij} b_{ij}$ & - \\
Elementwise divide & $a_{ij} \leftarrow a_{ij}/b_{ij}, \;
c_{ij} \leftarrow a_{ij}/b_{ij}$ & - \\
\hline
\end{tabular}
\end{center}
\caption{Matrix Operations -- $O(n^2)$} \label{tab:mat}
\end{table}
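The diagonal-scaling entries $A \leftarrow D_1 A D_2$ in Table \ref{tab:mat} reduce, in a dense model, to independent row and column scalings; the following sketch is illustrative.

```python
import numpy as np

def diag_scale(d1, A, d2):
    """Return D1 @ A @ D2 with D1 = diag(d1) and D2 = diag(d2),
    computed as row scaling by d1 and column scaling by d2 without
    forming the diagonal matrices."""
    return d1[:, None] * A * d2[None, :]
```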
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Matrix matrix product & $A \leftarrow A B, \;
A \leftarrow A^T B, \;
A \leftarrow A B^T$ & ? \\
& $C \leftarrow A B, \;
C \leftarrow A^T B, \;
C \leftarrow A B^T$ & Deriv \\
& $C \leftarrow \alpha A B + \beta C, $ & Legacy \\
& $C \leftarrow \alpha A B + \beta C,
(C = C^T)$ & - \\
& $F \leftarrow \alpha A B + \beta C$ & - \\
& $C \leftarrow \alpha A^T B + \beta C $ & Legacy \\
& $F \leftarrow \alpha A^T B + \beta C$ & - \\
& $C \leftarrow \alpha A B^T + \beta C$ & Legacy \\
& $F \leftarrow \alpha A B^T + \beta C$ & - \\
Triangular & $B \leftarrow \alpha T^{-1} B$ & Legacy \\
& $C \leftarrow \alpha T^{-1} B$ & - \\
& $B \leftarrow \alpha B T^{-1}, \;
C \leftarrow \alpha B T^{-1}$ & - \\
Symmetric rank $k$ \& $2k$ & $C \leftarrow \alpha A A^T + \beta C, \;
C \leftarrow \alpha A^T A + \beta C$ & Legacy \\
~~updates ($C = C^T$) & $C \leftarrow \alpha A D A^T + \beta C, \;
C \leftarrow \alpha A^T D A + \beta C$ & \\
& $C \leftarrow (\alpha A) B^T +
B (\alpha A)^T + \beta C$ & Legacy \\
& $C \leftarrow (\alpha A)^T B +
B^T (\alpha A) + \beta C$ & - \\
& $C \leftarrow (\alpha A D) B^T +
B (\alpha A D)^T + \beta C$ & ? \\
& $C \leftarrow (\alpha D A)^T B +
B^T (\alpha D A) + \beta C$ & - \\
Application of $Q$ from & $X \leftarrow Q X, \; X \leftarrow X Q, \;
X \leftarrow Q^T X, \;
X \leftarrow X Q^T$ & ? \\
~~$QR$ factorization & $Y \leftarrow Q X, \; Y \leftarrow X Q, \;
Y \leftarrow Q^T X, \;
Y \leftarrow X Q^T$ & ? \\
\hline
\end{tabular}
\end{center}
\caption{Matrix Matrix Operations} \label{tab:matmat}
\end{table}
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Matrix copy & $B \leftarrow A$ & Aux \\
Matrix transpose & $A \leftarrow A^T, \; B \leftarrow A^T$& ? \\
Get and put submatrix & $B \leftarrow a_{I,J}$ & Aux, Common \\
~~(dense and sparse) & $b_{I,J} \leftarrow A, \;
b_{I,J} \leftarrow A + b_{I,J}$ & Aux, Common \\
Permute Matrix & $A \leftarrow P A$, $A \leftarrow A P$ & Aux, Common \\
& $A \leftarrow Q A P$ & Aux, Common \\
Diagonal extract & $d \leftarrow \mbox{diag}( A ) $ & ? \\
\hline
\end{tabular}
\end{center}
\caption{Data Movement with Matrices} \label{tab:movem}
\end{table}
\end{document}