HPL_pdpancrN(3)

HPL_pdpancrN(3)              HPL Library Functions             HPL_pdpancrN(3)

NAME

       HPL_pdpancrN - Crout panel factorization.

SYNOPSIS

       #include "hpl.h"

       void HPL_pdpancrN( HPL_T_panel * PANEL, const int M, const int N,
       const int ICOFF, double * WORK );

DESCRIPTION

       HPL_pdpancrN factorizes a panel of columns that is a sub-array of a
       larger one-dimensional panel A using the Crout variant of the usual
       one-dimensional algorithm. The lower triangular N0-by-N0 upper block
       of the panel is stored in no-transpose form (i.e. just like the input
       matrix itself).
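       
       The Crout ordering mentioned above can be illustrated by a serial,
       unblocked sketch. This is not HPL's implementation: the name
       crout_panel_lu, its signature and its return convention are
       hypothetical, there is no communication, and the panel is assumed to
       live in a single column-major array. At step j it forms column j of L
       from the columns already computed, selects a pivot, swaps rows, and
       only then forms row j of U.

```c
#include <stddef.h>

/* Column-major element access; relies on `a` and `lda` being in scope. */
#define ELEM(i, j) a[(size_t)(j) * (size_t)lda + (size_t)(i)]

/*
 * Hypothetical serial helper, not part of HPL: Crout-ordered LU with
 * partial pivoting of an m-by-n column-major panel. On return the unit
 * lower factor L (below the diagonal) and U (on and above it) overwrite a,
 * and ipiv[j] is the row that was swapped with row j.
 * Returns 0 on success, or j + 1 if column j has no nonzero pivot.
 */
int crout_panel_lu(int m, int n, double *a, int lda, int *ipiv)
{
    for (int j = 0; j < n && j < m; j++) {
        /* Column j of L: subtract the contributions of earlier columns. */
        for (int i = j; i < m; i++) {
            double s = ELEM(i, j);
            for (int k = 0; k < j; k++)
                s -= ELEM(i, k) * ELEM(k, j);
            ELEM(i, j) = s;
        }

        /* Partial pivoting: largest absolute value in rows j..m-1. */
        int p = j;
        double vmax = 0.0;
        for (int i = j; i < m; i++) {
            double v = ELEM(i, j) < 0.0 ? -ELEM(i, j) : ELEM(i, j);
            if (v > vmax) { vmax = v; p = i; }
        }
        ipiv[j] = p;
        if (vmax == 0.0)
            return j + 1;

        /* Swap rows j and p across the whole panel. */
        if (p != j) {
            for (int c = 0; c < n; c++) {
                double t = ELEM(j, c);
                ELEM(j, c) = ELEM(p, c);
                ELEM(p, c) = t;
            }
        }

        /* Row j of U, columns j+1..n-1, from rows of L and columns of U. */
        for (int c = j + 1; c < n; c++) {
            double s = ELEM(j, c);
            for (int k = 0; k < j; k++)
                s -= ELEM(j, k) * ELEM(k, c);
            ELEM(j, c) = s;
        }

        /* Scale the sub-diagonal part of column j by the pivot. */
        double inv = 1.0 / ELEM(j, j);
        for (int i = j + 1; i < m; i++)
            ELEM(i, j) *= inv;
    }
    return 0;
}
```

       Both inner updates are matrix-vector products over the part of the
       panel that is already factored, which is what distinguishes the Crout
       ordering from variants that update the trailing columns eagerly.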
       
       Bi-directional exchange is used to perform the swap::broadcast
       operations at once for one column in the panel. This results in a
       lower number of slightly larger messages than usual. On P processes
       and assuming bi-directional links, the running time of this function
       can be approximated by (when N is equal to N0):

          N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
          N0^2 * ( M - N0/3 ) * gam2-3
       
       where M is the local number of rows of the panel, lat and bdwth are
       the latency and bandwidth of the network for double precision real
       words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
       rate of execution. The recursive algorithm makes it possible to come
       close to Level 3 BLAS performance in the panel factorization. On a
       large number of modern machines, however, this operation is latency
       bound, meaning that its cost can be estimated by the latency portion
       alone, N0 * log_2(P) * lat. Mono-directional links will double this
       communication cost.
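       
       The estimate above can be evaluated with a few lines of C. This is a
       sketch only: the function names are hypothetical, gam2-3 is taken to
       be a time per floating-point operation (an assumption, chosen so that
       both terms are times), and P enters only through log_2(P), which the
       caller passes in directly so that no math library is needed.

```c
/*
 * Hypothetical helpers, not part of HPL.
 *   m       local number of rows of the panel (M)
 *   n0      panel width (N0), with N == N0 assumed
 *   log2_p  log_2(P) for P processes
 *   lat     network latency per message
 *   bdwth   network bandwidth in double precision words per unit time
 *   gam23   assumed time per flop at the Level 2 / Level 3 BLAS rate
 */
double pancr_time_estimate(int m, int n0, double log2_p,
                           double lat, double bdwth, double gam23)
{
    /* N0 * log_2(P) * ( lat + (2*N0 + 4) / bdwth ) */
    double comm = n0 * log2_p * (lat + (2.0 * n0 + 4.0) / bdwth);
    /* N0^2 * ( M - N0/3 ) * gam2-3 */
    double comp = (double)n0 * n0 * (m - n0 / 3.0) * gam23;
    return comm + comp;
}

/* Latency-bound approximation: N0 * log_2(P) * lat. */
double pancr_latency_bound(int n0, double log2_p, double lat)
{
    return n0 * log2_p * lat;
}
```

       For example, with P = 4 the caller passes log2_p = 2.0. Doubling the
       communication term models mono-directional links.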
       
       Note that one iteration of the main loop is unrolled. The local
       computation of the absolute value max of the next column is performed
       just after its update by the current column. This makes it possible
       to bring the current column through cache only once at each step. The
       current implementation does not perform any blocking for this
       sequence of BLAS operations; however, the design allows for plugging
       in an optimal (machine-specific) specialized BLAS-like kernel. This
       idea was suggested to us by Fred Gustavson, IBM T.J. Watson Research
       Center.
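       
       The idea of searching for the maximum right after the update can be
       sketched as a single fused loop. The helper below is hypothetical and
       is not the kernel HPL uses: it applies a simple update y := y -
       alpha * x and tracks the entry of largest absolute value of the
       updated y in the same pass, so x and y are each read only once.

```c
#include <stddef.h>

/*
 * Hypothetical fused kernel, not part of HPL: updates y in place with
 * y[i] -= alpha * x[i] and returns the index of the entry of the updated
 * y that has the largest absolute value (the first one on ties).
 * Returns 0 when m is 0.
 */
size_t update_and_argmax(size_t m, double alpha, const double *x, double *y)
{
    size_t imax = 0;
    double vmax = -1.0;   /* below any absolute value */

    for (size_t i = 0; i < m; i++) {
        y[i] -= alpha * x[i];
        double v = y[i] < 0.0 ? -y[i] : y[i];
        if (v > vmax) {
            vmax = v;
            imax = i;
        }
    }
    return imax;
}
```

       Done as two separate BLAS calls, the same work would stream the
       updated column through cache a second time for the search.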

ARGUMENTS

       PANEL   (local input/output)    HPL_T_panel *
               On entry, PANEL points to the data structure containing the
               panel information.

       M       (local input)           const int
               On entry, M specifies the local number of rows of sub(A).

       N       (local input)           const int
               On entry, N specifies the local number of columns of sub(A).

       ICOFF   (global input)          const int
               On entry, ICOFF specifies the row and column offset of sub(A)
               in A.

       WORK    (local workspace)       double *
               On entry, WORK is a work array of size at least 2*(4+2*N0).
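       
       The minimum size of WORK can be computed as follows. The helper
       names are hypothetical, and the sketch only illustrates the size
       stated above; it makes no claim about how HPL itself obtains this
       workspace.

```c
#include <stdlib.h>

/* Minimum length of WORK, in doubles: 2 * (4 + 2 * N0). */
size_t pancr_work_len(size_t n0)
{
    return 2 * (4 + 2 * n0);
}

/* Allocates a buffer of that length; the caller frees it with free(). */
double *pancr_work_alloc(size_t n0)
{
    return malloc(pancr_work_len(n0) * sizeof(double));
}
```

       For a panel width N0 of 64 this gives 264 doubles.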
