GMX-NONBONDED-BENCHMARK(1)          GROMACS         GMX-NONBONDED-BENCHMARK(1)

NAME

       gmx-nonbonded-benchmark - Benchmarking tool for the non-bonded pair
       kernels.

SYNOPSIS

          gmx nonbonded-benchmark [-o [<.csv>]] [-size <int>] [-nt <int>]
                       [-simd <enum>] [-coulomb <enum>] [-[no]table]
                       [-combrule <enum>] [-[no]halflj] [-[no]energy]
                       [-[no]all] [-cutoff <real>] [-iter <int>]
                       [-warmup <int>] [-[no]cycles] [-[no]time]

DESCRIPTION

       gmx nonbonded-benchmark runs benchmarks for one or more so-called
       Nbnxm non-bonded pair kernels. The non-bonded pair kernels are the
       most compute-intensive part of MD simulations and usually comprise
       60 to 90 percent of the runtime. For this reason they are highly
       optimized and several different setups are available to compute the
       same physical interactions. In addition, there are different
       physical treatments of Coulomb interactions and optimizations for
       atoms without Lennard-Jones interactions. There are also different
       physical treatments of Lennard-Jones interactions, but only a plain
       cut-off is supported in this tool, as that is by far the most common
       treatment. Finally, while force output is always necessary, energy
       output is only required at certain steps. In total there are 12
       relevant combinations of options. The combinations double to 24 when
       two different SIMD setups are supported. These combinations can be
       run with a single invocation using the -all option. The behavior of
       each kernel is affected by caching behavior, which is determined by
       the hardware used together with the system size and the cut-off
       radius. The larger the number of atoms per thread, the more L1 cache
       is needed to avoid L1 cache misses. The cut-off radius mainly
       affects data reuse: a larger cut-off results in more data reuse and
       makes the kernel less sensitive to cache misses.
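
       For example, all supported kernel combinations can be benchmarked in
       a single run; the system size and cut-off below are illustrative and
       should be adapted to the hardware under test:

          gmx nonbonded-benchmark -all -size 10 -cutoff 1.0
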
       OpenMP parallelization is used to utilize multiple hardware threads
       within a compute node. In these benchmarks there is no explicit
       interaction between threads, apart from starting and closing a
       single OpenMP parallel region per iteration. Threads do, however,
       interact implicitly by sharing and evicting data in shared caches.
       The number of threads to use is set with the -nt option. Thread
       affinity is important, especially with SMT and shared caches.
       Affinities can be set through the OpenMP library using the
       GOMP_CPU_AFFINITY environment variable.
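
       For example, with GNU OpenMP, four benchmark threads could be pinned
       to the first four cores like this (the core list is illustrative and
       hardware dependent):

          GOMP_CPU_AFFINITY="0 1 2 3" gmx nonbonded-benchmark -nt 4
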
       The benchmark tool times one or more kernels by running them
       repeatedly for a number of iterations set by the -iter option. An
       initial kernel call is done to avoid additional initial cache
       misses. Times are recorded in cycles read from efficient,
       high-accuracy counters in the CPU. Note that these often do not
       correspond to actual clock cycles. For each kernel, the tool reports
       the total number of cycles, cycles per iteration, and (total and
       useful) pair interactions per cycle. Because a cluster pair list is
       used instead of an atom pair list, interactions are also computed
       for some atom pairs that are beyond the cut-off distance. These
       pairs are not useful (except for additional buffering, but that is
       not of interest here), only a side effect of the cluster-pair setup.
       The SIMD 2xMM kernel has a higher useful pair ratio than the 4xM
       kernel due to a smaller cluster size, but a lower total pair
       throughput. It is best to run this, or for that matter any,
       benchmark with locked CPU clocks, as thermal throttling can
       significantly affect performance. If that is not an option, the
       -warmup option can be used to run initial, untimed iterations to
       warm up the processor.
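
       When clocks cannot be locked, untimed warmup iterations can precede
       the timed ones; the iteration counts below are illustrative:

          gmx nonbonded-benchmark -warmup 100 -iter 1000
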
       The most relevant regime is between 0.1 and 1 millisecond per
       iteration. Thus it is useful to run with system sizes that cover
       both ends of this regime.
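
       Since the system contains 3000 atoms times the -size factor, both
       ends of the regime can be covered by timing a small and a large
       system; the factors below are only a starting point:

          gmx nonbonded-benchmark -size 1
          gmx nonbonded-benchmark -size 30
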
       The -simd and -table options select different implementations to
       compute the same physics. The choice of these options should ideally
       be optimized for the target hardware. Historically, we only found
       tabulated Ewald correction to be useful on 2-wide SIMD or 4-wide
       SIMD without FMA support. As all modern architectures are wider and
       support FMA, we do not use tables by default. The only exceptions
       are kernels without SIMD, which only support tables. The options
       -coulomb, -combrule and -halflj depend on the force field and
       composition of the simulated system. The optimization of computing
       Lennard-Jones interactions for only half of the atoms in a cluster
       is useful for water, which does not use Lennard-Jones on hydrogen
       atoms in most water models. In the MD engine, any clusters where at
       most half of the atoms have LJ interactions will automatically use
       this kernel. Finally, the -energy option selects the computation of
       energies, which are usually only needed infrequently.
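
       As an example, a water-like system could be approximated with
       reaction-field electrostatics and the half-LJ optimization (the
       choice of options here is illustrative, not a recommendation):

          gmx nonbonded-benchmark -coulomb reaction-field -halflj
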

OPTIONS

       Options to specify output files:

       -o [<.csv>] (nonbonded-benchmark.csv) (Optional)
              Also output results in CSV format

       Other options:

       -size <int> (1)
              The system size is 3000 atoms times this value

       -nt <int> (1)
              The number of OpenMP threads to use

       -simd <enum> (auto)
              SIMD type, auto runs all supported SIMD setups or no SIMD
              when SIMD is not supported: auto, no, 4xm, 2xmm

       -coulomb <enum> (ewald)
              The functional form for the Coulomb interactions: ewald,
              reaction-field

       -[no]table (no)
              Use lookup table for Ewald correction instead of analytical

       -combrule <enum> (geometric)
              The LJ combination rule: geometric, lb, none

       -[no]halflj (no)
              Use optimization for LJ on half of the atoms

       -[no]energy (no)
              Compute energies in addition to forces

       -[no]all (no)
              Run all 12 combinations of options for coulomb, halflj,
              combrule

       -cutoff <real> (1)
              Pair-list and interaction cut-off distance

       -iter <int> (100)
              The number of iterations for each kernel

       -warmup <int> (0)
              The number of iterations for initial warmup

       -[no]cycles (no)
              Report cycles/pair instead of pairs/cycle

       -[no]time (no)
              Report microseconds instead of cycles
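
       A complete invocation combining several of these options, with the
       results additionally written to a CSV file (the file name below is
       arbitrary), might look like:

          gmx nonbonded-benchmark -nt 8 -all -o results.csv
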

SEE ALSO

       gmx(1)

       More information about GROMACS is available at
       <http://www.gromacs.org/>.

       2022, GROMACS development team

2022.2                           Jun 16, 2022       GMX-NONBONDED-BENCHMARK(1)