HPL_pdpancrT(3)              HPL Library Functions             HPL_pdpancrT(3)

NAME

       HPL_pdpancrT - Crout panel factorization.

SYNOPSIS

       #include "hpl.h"

       void HPL_pdpancrT( HPL_T_panel * PANEL, const int M, const int N,
                          const int ICOFF, double * WORK );

DESCRIPTION

       HPL_pdpancrT factorizes a panel of columns that is a sub-array of a
       larger one-dimensional panel A, using the Crout variant of the usual
       one-dimensional algorithm. The lower triangular N0-by-N0 upper block
       of the panel is stored in transposed form.

       Bi-directional exchange is used to perform the swap::broadcast
       operations at once for one column in the panel. This results in a
       smaller number of slightly larger messages than usual. On P processes,
       and assuming bi-directional links, the running time of this function
       can be approximated by (when N is equal to N0):

          N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
          N0^2 * ( M - N0/3 ) * gam2-3

       where M is the local number of rows of the panel, lat and bdwth are
       the latency and bandwidth of the network for double precision real
       words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
       rate of execution. The recursive algorithm makes it possible to
       achieve nearly Level 3 BLAS performance in the panel factorization.
       On a large number of modern machines, however, this operation is
       latency bound, meaning that its cost can be estimated by the latency
       portion alone, N0 * log_2(P) * lat. Mono-directional links will
       double this communication cost.

       Note that one iteration of the main loop is unrolled. The local
       computation of the absolute value max of the next column is performed
       just after its update by the current column. This brings the
       current column through cache only once at each step. The current
       implementation does not perform any blocking for this sequence of
       BLAS operations; however, the design allows for plugging in an optimal
       (machine-specific) specialized BLAS-like kernel. This idea was
       suggested to us by Fred Gustavson, IBM T.J. Watson Research Center.
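
       The idea of searching for the pivot candidate right after the update
       can be illustrated by the simplified sketch below. It is not the HPL
       implementation, which relies on BLAS calls and HPL_dlocmax; the
       helper name update_and_locmax and the single-column (axpy-like)
       update are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not part of HPL.
 * Computes y := y - alpha * x over m local rows and, in the same pass,
 * returns the index of the updated entry of y with the largest absolute
 * value (the first such index on ties). */
static size_t update_and_locmax(size_t m, double alpha,
                                const double *x, double *y)
{
    assert(m > 0 && x != NULL && y != NULL);

    size_t imax = 0;
    double vmax = -1.0;

    for (size_t i = 0; i < m; i++) {
        y[i] -= alpha * x[i];      /* update the next column         */
        double a = fabs(y[i]);     /* inspect it while still in cache */
        if (a > vmax) {
            vmax = a;
            imax = i;
        }
    }
    return imax;
}
```

       Fusing the two steps touches each entry once, instead of a second
       sweep over the column for the max search.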

ARGUMENTS

       PANEL   (local input/output)    HPL_T_panel *
               On entry,  PANEL  points to the data structure  containing  the
               panel information.

       M       (local input)           const int
               On entry,  M specifies the local number of rows of sub(A).

       N       (local input)           const int
               On entry,  N specifies the local number of columns of sub(A).

       ICOFF   (global input)          const int
               On  entry,  ICOFF specifies the row and column offset of sub(A)
               in A.

       WORK    (local workspace)       double *
               On entry, WORK is a work array of size at least 2*(4+2*N0).
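
       As a minimal sketch, the lower bound on the size of WORK can be
       computed as follows. The helper names are hypothetical and not part
       of HPL; n0 stands for the panel width N0 used in the description
       above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: minimum length of WORK in doubles, 2*(4+2*N0). */
static size_t pancrT_work_len(int n0)
{
    assert(n0 >= 0);
    return 2u * (4u + 2u * (size_t)n0);
}

/* Hypothetical helper: allocate a work array of the minimum size.
 * Returns NULL on allocation failure; the caller must free() it. */
static double *pancrT_work_alloc(int n0)
{
    return malloc(pancrT_work_len(n0) * sizeof(double));
}
```

       For example, a panel width of 64 gives a minimum of 264 doubles.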

SEE ALSO

       HPL_dlocmax (3), HPL_dlocswpN (3),  HPL_dlocswpT (3),  HPL_pdmxswp (3),
       HPL_pdpancrN (3), HPL_pdpanllN (3), HPL_pdpanllT (3), HPL_pdpanrlN (3),
       HPL_pdpanrlT (3).


HPL 2.1                        October 26, 2012                HPL_pdpancrT(3)