HPL_pdpancrN(3)              HPL Library Functions             HPL_pdpancrN(3)

NAME
HPL_pdpancrN - Crout panel factorization.

SYNOPSIS
#include "hpl.h"

void HPL_pdpancrN( HPL_T_panel * PANEL, const int M, const int N, const
int ICOFF, double * WORK );

PURPOSE
HPL_pdpancrN factorizes a panel of columns that is a sub-array of a
larger one-dimensional panel A using the Crout variant of the usual
one-dimensional algorithm. The lower triangular N0-by-N0 upper block
of the panel is stored in no-transpose form (i.e. just like the input
matrix itself).

Bi-directional exchange is used to perform the swap::broadcast
operations at once for one column in the panel. This results in
fewer, slightly larger messages than usual. On P processes and
assuming bi-directional links, the running time of this function,
when N is equal to N0, can be approximated by:

   N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
   N0^2 * ( M - N0/3 ) * gam2-3

where M is the local number of rows of the panel, lat and bdwth are
the latency and bandwidth of the network for double precision real
words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
rate of execution. The recursive algorithm indeed makes it possible
to almost achieve Level 3 BLAS performance in the panel
factorization. On a large number of modern machines, however, this
operation is latency bound, meaning that its cost can be estimated
by the latency portion alone, N0 * log_2(P) * lat. Mono-directional
links will double this communication cost.

Note that one iteration of the main loop is unrolled. The local
computation of the absolute value max of the next column is performed
just after its update by the current column. This allows the current
column to be brought through cache only once at each step. The
current implementation does not perform any blocking for this
sequence of BLAS operations; however, the design allows for plugging
in an optimal (machine-specific) specialized BLAS-like kernel. This
idea was suggested to us by Fred Gustavson, IBM T.J. Watson Research
Center.

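The idea of updating a column and searching for its absolute value
max in the same pass can be sketched as a single fused loop. This is
an illustration only, not the HPL implementation: the function name
update_and_absmax and its argument list are hypothetical, and the
update is simplified to y := y - alpha * x on contiguous vectors.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/*
 * Hypothetical fused kernel (not HPL code): performs y := y - alpha * x
 * over m entries and returns the index of the entry of the updated y
 * with the largest absolute value. The update and the max search share
 * one loop, so each entry of x and y is touched once.
 * Ties are resolved in favor of the lowest index.
 */
size_t update_and_absmax(size_t m, double alpha, const double *x, double *y)
{
    assert(m > 0 && x != NULL && y != NULL);
    size_t imax = 0;
    double vmax = -1.0;   /* smaller than any absolute value */
    for (size_t i = 0; i < m; i++) {
        y[i] -= alpha * x[i];      /* update of the next column */
        double v = fabs(y[i]);     /* candidate pivot magnitude */
        if (v > vmax) {
            vmax = v;
            imax = i;
        }
    }
    return imax;
}
```

Calling the update and the max search as two separate BLAS-style
operations would traverse the column twice. The fused form traverses it
once, which is the cache benefit described above.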
ARGUMENTS
PANEL   (local input/output)    HPL_T_panel *
        On entry, PANEL points to the data structure containing the
        panel information.

M       (local input)           const int
        On entry, M specifies the local number of rows of sub(A).

N       (local input)           const int
        On entry, N specifies the local number of columns of sub(A).

ICOFF   (global input)          const int
        On entry, ICOFF specifies the row and column offset of sub(A)
        in A.

WORK    (local workspace)       double *
        On entry, WORK is a workarray of size at least 2*(4+2*N0).

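The minimum size stated for WORK can be computed as shown below. This
is a minimal sketch: pancr_work_len and pancr_work_alloc are
hypothetical helpers, not HPL routines, and how the workspace is
actually obtained in a given HPL build is not described here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: minimum WORK length, in doubles, 2*(4+2*N0). */
size_t pancr_work_len(int n0)
{
    assert(n0 >= 0);
    return 2u * (4u + 2u * (size_t)n0);
}

/*
 * Hypothetical helper: allocates a workspace of the minimum length.
 * Returns NULL on allocation failure; the caller releases it with free().
 */
double *pancr_work_alloc(int n0)
{
    return malloc(pancr_work_len(n0) * sizeof(double));
}
```

For N0 = 64, for instance, the formula gives 2*(4+128) = 264 doubles.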
SEE ALSO
HPL_dlocmax (3), HPL_dlocswpN (3), HPL_dlocswpT (3), HPL_pdmxswp (3),
HPL_pdpancrT (3), HPL_pdpanllN (3), HPL_pdpanllT (3), HPL_pdpanrlN (3),
HPL_pdpanrlT (3).


HPL 2.1                        October 26, 2012                HPL_pdpancrN(3)