GMX-NONBONDED-BENCHMARK(1)          GROMACS          GMX-NONBONDED-BENCHMARK(1)

NAME
       gmx-nonbonded-benchmark - Benchmarking tool for the non-bonded pair
       kernels.

SYNOPSIS
       gmx nonbonded-benchmark [-o [<.csv>]] [-size <int>] [-nt <int>]
                   [-simd <enum>] [-coulomb <enum>] [-[no]table]
                   [-combrule <enum>] [-[no]halflj] [-[no]energy]
                   [-[no]all] [-cutoff <real>] [-iter <int>]
                   [-warmup <int>] [-[no]cycles] [-[no]time]

DESCRIPTION
       gmx nonbonded-benchmark runs benchmarks for one or more so-called
       Nbnxm non-bonded pair kernels. The non-bonded pair kernels are the
       most compute-intensive part of MD simulations and usually comprise 60
       to 90 percent of the runtime. For this reason they are highly
       optimized and several different setups are available to compute the
       same physical interactions. In addition, there are different physical
       treatments of Coulomb interactions and optimizations for atoms
       without Lennard-Jones interactions. There are also different physical
       treatments of Lennard-Jones interactions, but only a plain cut-off is
       supported in this tool, as that is by far the most common treatment.
       Finally, while force output is always necessary, energy output is
       only required at certain steps. In total there are 12 relevant
       combinations of options. The combinations double to 24 when two
       different SIMD setups are supported. These combinations can be run
       with a single invocation using the -all option. The behavior of each
       kernel is affected by caching behavior, which is determined by the
       hardware used together with the system size and the cut-off radius.
       The larger the number of atoms per thread, the more L1 cache is
       needed to avoid L1 cache misses. The cut-off radius mainly affects
       data reuse: a larger cut-off results in more data reuse and makes the
       kernel less sensitive to cache misses.

       OpenMP parallelization is used to utilize multiple hardware threads
       within a compute node. In these benchmarks there is no interaction
       between threads, apart from starting and closing a single OpenMP
       parallel region per iteration. Additionally, threads interact through
       sharing and evicting data from shared caches. The number of threads
       to use is set with the -nt option. Thread affinity is important,
       especially with SMT and shared caches. Affinities can be set through
       the OpenMP library using the GOMP_CPU_AFFINITY environment variable.
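
       As a sketch of the affinity setup above (the thread count and the
       core list 0-7 are assumptions about the host topology, and the
       GOMP_CPU_AFFINITY syntax is specific to GNU OpenMP), a pinned
       8-thread run could look like this:

```shell
# Pin 8 OpenMP threads one-per-core; cores 0-7 are an assumed topology.
# With SMT, prefer one thread per physical core unless benchmarking SMT.
export OMP_NUM_THREADS=8
export GOMP_CPU_AFFINITY="0-7"

# Invoke the benchmark only if gmx is available on PATH.
if command -v gmx >/dev/null 2>&1; then
    gmx nonbonded-benchmark -nt 8
fi
```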

       The benchmark tool times one or more kernels by running them
       repeatedly for a number of iterations set by the -iter option. An
       initial kernel call is done to avoid additional initial cache misses.
       Times are recorded in cycles read from efficient, high-accuracy
       counters in the CPU. Note that these often do not correspond to
       actual clock cycles. For each kernel, the tool reports the total
       number of cycles, cycles per iteration, and (total and useful) pair
       interactions per cycle. Because a cluster pair list is used instead
       of an atom pair list, interactions are also computed for some atom
       pairs that are beyond the cut-off distance. These pairs are not
       useful (except for additional buffering, but that is not of interest
       here), only a side effect of the cluster-pair setup. The SIMD 2xMM
       kernel has a higher useful pair ratio than the 4xM kernel due to a
       smaller cluster size, but a lower total pair throughput. It is best
       to run this, or for that matter any, benchmark with locked CPU
       clocks, as thermal throttling can significantly affect performance.
       If that is not an option, the -warmup option can be used to run
       initial, untimed iterations to warm up the processor.

       The most relevant regime is between 0.1 and 1 millisecond per
       iteration. Thus it is useful to run with system sizes that cover both
       ends of this regime.
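
       For example, the regime can be covered by sweeping -size over a few
       orders of magnitude (the multipliers and output file names below are
       illustrative assumptions; -size scales a 3000-atom base system):

```shell
# Sweep -size; each unit of -size adds 3000 atoms to the benchmark system.
for size in 1 10 100; do
    echo "size multiplier ${size} -> $((size * 3000)) atoms"
    # Run only if gmx is available on PATH.
    if command -v gmx >/dev/null 2>&1; then
        gmx nonbonded-benchmark -size "${size}" -o "bench_size${size}.csv"
    fi
done
```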

       The -simd and -table options select different implementations to
       compute the same physics. The choice of these options should ideally
       be optimized for the target hardware. Historically, we only found
       tabulated Ewald correction to be useful on 2-wide SIMD or 4-wide SIMD
       without FMA support. As all modern architectures are wider and
       support FMA, we do not use tables by default. The only exceptions are
       kernels without SIMD, which only support tables. The options
       -coulomb, -combrule and -halflj depend on the force field and the
       composition of the simulated system. The optimization of computing
       Lennard-Jones interactions for only half of the atoms in a cluster is
       useful for water, as most water models do not use Lennard-Jones
       interactions on hydrogen atoms. In the MD engine, any cluster where
       at most half of the atoms have LJ interactions will automatically use
       this kernel. Finally, the -energy option selects the computation of
       energies, which are usually only needed infrequently.

OPTIONS
       Options to specify output files:

       -o [<.csv>] (nonbonded-benchmark.csv) (Optional)
              Also output results in csv format

       Other options:

       -size <int> (1)
              The system size is 3000 atoms times this value

       -nt <int> (1)
              The number of OpenMP threads to use

       -simd <enum> (auto)
              SIMD type, auto runs all supported SIMD setups or no SIMD when
              SIMD is not supported: auto, no, 4xm, 2xmm

       -coulomb <enum> (ewald)
              The functional form for the Coulomb interactions: ewald,
              reaction-field

       -[no]table (no)
              Use lookup table for Ewald correction instead of analytical

       -combrule <enum> (geometric)
              The LJ combination rule: geometric, lb, none

       -[no]halflj (no)
              Use optimization for LJ on half of the atoms

       -[no]energy (no)
              Compute energies in addition to forces

       -[no]all (no)
              Run all 12 combinations of options for coulomb, halflj,
              combrule

       -cutoff <real> (1)
              Pair-list and interaction cut-off distance

       -iter <int> (100)
              The number of iterations for each kernel

       -warmup <int> (0)
              The number of iterations for initial warmup

       -[no]cycles (no)
              Report cycles/pair instead of pairs/cycle

       -[no]time (no)
              Report micro-seconds instead of cycles
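
       Putting the options together, a survey of all kernel combinations
       might look like the following sketch (the thread count, warmup and
       iteration counts, and the output file name are illustrative
       assumptions):

```shell
# Run all 12 option combinations per supported SIMD setup (up to 24 total),
# with untimed warmup iterations to mitigate thermal throttling, writing
# results to CSV in addition to the terminal report.
if command -v gmx >/dev/null 2>&1; then
    gmx nonbonded-benchmark -all -nt 8 -warmup 500 -iter 1000 \
        -o allkernels.csv
else
    echo "gmx not found; skipping benchmark run"
fi
```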

SEE ALSO
       gmx(1)

       More information about GROMACS is available at
       <http://www.gromacs.org/>.

143
COPYRIGHT
       2022, GROMACS development team

2022.3                           Sep 02, 2022        GMX-NONBONDED-BENCHMARK(1)