xl-numa-placement(7)                  Xen                 xl-numa-placement(7)

NAME
       xl-numa-placement - Guest Automatic NUMA Placement in libxl and xl

DESCRIPTION
   Rationale
       NUMA (which stands for Non-Uniform Memory Access) means that the
       memory access times of a program running on a CPU depend on the
       relative distance between that CPU and that memory. In fact, most
       NUMA systems are built in such a way that each processor has its
       own local memory, on which it can operate very fast. On the other
       hand, getting and storing data from and to remote memory (that is,
       memory local to some other processor) is considerably slower. On
       these machines, a NUMA node is usually defined as a set of
       processor cores (typically a physical CPU package) and the memory
       directly attached to that set of cores.

       NUMA awareness becomes very important as soon as many domains
       start running memory-intensive workloads on a shared host. In
       fact, the cost of accessing non node-local memory locations is
       very high, and the performance degradation is likely to be
       noticeable.

       For more information, have a look at the Xen NUMA Introduction
       <http://wiki.xen.org/wiki/Xen_NUMA_Introduction> page on the Wiki.

   Xen and NUMA machines: the concept of node-affinity
       The Xen hypervisor deals with NUMA machines through the concept of
       node-affinity. The node-affinity of a domain is the set of NUMA
       nodes of the host from which the memory of the domain is allocated
       (mostly at domain creation time). This is, at least in principle,
       different from and unrelated to the vCPU (hard and soft, see
       below) scheduling affinity, which instead is the set of pCPUs on
       which a vCPU is allowed (or prefers) to run.

       Of course, despite the fact that they belong to and affect
       different subsystems, the domain node-affinity and the vCPUs'
       affinity are not completely independent. In fact, if the domain
       node-affinity is not explicitly specified by the user, via the
       proper libxl calls or xl config item, it will be computed based on
       the vCPUs' scheduling affinity.

       Notice that, even if the node-affinity of a domain may change
       on-line, it is very important to "place" the domain correctly when
       it is first created, as most of its memory is allocated at that
       time and cannot (for now) be moved easily.
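
       The node-affinity and the vCPU affinities of a running domain can
       be inspected from the control domain with xl. As a hedged
       illustration (the exact options and output columns vary between xl
       versions, and "guest1" is just a hypothetical domain name):

           # show the NUMA node affinity of all domains (recent xl
           # versions print it in a "NODE Affinity" column)
           xl list -n

           # show the per-vCPU hard and soft scheduling affinity
           xl vcpu-list guest1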

   Placing via pinning and cpupools
       The simplest way of placing a domain on a NUMA node is setting the
       hard scheduling affinity of the domain's vCPUs to the pCPUs of the
       node. This also goes under the name of vCPU pinning, and can be
       done through the "cpus=" option in the config file (more about
       this below). Another option is to pool together the pCPUs spanning
       the node and put the domain in such a cpupool with the "pool="
       config option (as documented in our Wiki
       <http://wiki.xen.org/wiki/Cpupools_Howto>).
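
       For instance, assuming (purely hypothetically) that pCPUs 0-3 make
       up one NUMA node, a minimal config-file sketch could look like the
       following, with "node0-pool" being a cpupool, created beforehand,
       that spans that node:

           vcpus  = 4
           memory = 4096
           # hard affinity (pinning): only run on the node's pCPUs
           cpus   = "0-3"
           # alternatively, place the domain in a cpupool spanning the node
           # pool = "node0-pool"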

       In both the above cases, the domain will not be able to execute
       outside the specified set of pCPUs for any reason, even if all
       those pCPUs are busy doing something else while other pCPUs sit
       idle.

       So, when doing this, local memory accesses are 100% guaranteed,
       but that may come at the cost of some load imbalance.

   NUMA aware scheduling
       If using the credit1 scheduler, and starting from Xen 4.3, the
       scheduler itself always tries to run the domain's vCPUs on one of
       the nodes in its node-affinity. Only if that turns out to be
       impossible will it pick any free pCPU. Locality of access is less
       guaranteed than in the pinning case, but that comes along with
       better chances to exploit all the host resources (e.g., the
       pCPUs).

       Starting from Xen 4.5, credit1 supports two forms of affinity:
       hard and soft, both on a per-vCPU basis. This means each vCPU can
       have its own soft affinity, stating on which pCPUs it prefers to
       run. This is less strict than what (also starting from 4.5) is
       called hard affinity, as the vCPU can potentially run anywhere; it
       just prefers some pCPUs rather than others. In Xen 4.5, therefore,
       NUMA-aware scheduling is achieved by matching the soft affinity of
       the vCPUs of a domain with its node-affinity.

       In fact, as was the case in 4.3, if all the pCPUs in a vCPU's soft
       affinity are busy, it is possible for the domain to run outside of
       it. The idea is that slower execution (due to remote memory
       accesses) is still better than no execution at all (as would
       happen with pinning). For this reason, NUMA aware scheduling has
       the potential of bringing substantial performance benefits,
       although this will depend on the workload.

       Notice that, for each vCPU, the following three scenarios are
       possible:

       ·   a vCPU is pinned to some pCPUs and does not have any soft
           affinity. In this case, the vCPU is always scheduled on one of
           the pCPUs to which it is pinned, without any specific
           preference among them.

       ·   a vCPU has its own soft affinity and is not pinned to any
           particular pCPU. In this case, the vCPU can run on every pCPU.
           Nevertheless, the scheduler will try to have it running on one
           of the pCPUs in its soft affinity;

       ·   a vCPU has its own soft affinity and is also pinned to some
           pCPUs. In this case, the vCPU is always scheduled on one of
           the pCPUs to which it is pinned, with, among them, a
           preference for the ones that also form its soft affinity. In
           case pinning and soft affinity form two disjoint sets of
           pCPUs, pinning "wins" and the soft affinity is just ignored.
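
       Hard and soft affinity can also be adjusted at run time with "xl
       vcpu-pin". As a hedged sketch (the soft affinity argument needs
       Xen 4.5 or later, and "guest1" is a hypothetical domain name), the
       following sets the hard affinity of all the vCPUs of the domain to
       all pCPUs (i.e., no pinning) and their soft affinity to pCPUs 0-3:

           xl vcpu-pin guest1 all all 0-3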

   Guest placement in xl
       If using xl for creating and managing guests, it is very easy to
       ask for either manual or automatic placement of them across the
       host's NUMA nodes.

       Note that xm/xend does a very similar thing, the only differences
       being the details of the heuristics adopted for automatic
       placement (see below), and the lack of support (in both xm/xend
       and the Xen versions where that was the default toolstack) for
       NUMA aware scheduling.

   Placing the guest manually
       Thanks to the "cpus=" option, it is possible to specify the pCPUs
       on which a domain should be created and scheduled, directly in its
       config file. This affects NUMA placement and memory accesses as,
       in this case, the hypervisor constructs the node-affinity of a VM
       based directly on its vCPU pinning when it is created.

       This is very simple and effective, but requires the user/system
       administrator to explicitly specify the pinning for each and every
       domain, or Xen won't be able to guarantee the locality of their
       memory accesses.

       That, of course, also means the vCPUs of the domain will only be
       able to execute on those same pCPUs.

       It is also possible to use the "cpus_soft=" option in the xl
       config file, to specify the soft affinity for all the vCPUs of the
       domain. This affects the NUMA placement in the following way (see
       the example below):

       ·   if only "cpus_soft=" is present, the VM's node-affinity will
           be equal to the nodes to which the pCPUs in the soft affinity
           mask belong;

       ·   if both "cpus_soft=" and "cpus=" are present, the VM's
           node-affinity will be equal to the nodes to which the pCPUs
           present in both the hard and soft affinity belong.
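
       For instance, assuming (hypothetically) that pCPUs 0-7 belong to
       node 0 and pCPUs 8-15 belong to node 1, the following sketch gives
       the domain a hard affinity spanning both nodes and a soft affinity
       on node 0 only; the pCPUs present in both masks all belong to node
       0, so that is what the node-affinity (and hence the memory
       allocation) will be:

           cpus      = "0-15"   # hard affinity: both nodes
           cpus_soft = "0-7"    # soft affinity: node 0 only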

   Placing the guest automatically
       If neither "cpus=" nor "cpus_soft=" are present in the config
       file, libxl tries to figure out on its own on which node(s) the
       domain could fit best. If it finds one (or some), the domain's
       node-affinity is set to it (them), and both memory allocations and
       NUMA aware scheduling (for the credit scheduler and starting from
       Xen 4.3) will comply with it. Starting from Xen 4.5, this also
       means that the mask resulting from this "fitting" procedure will
       become the soft affinity of all the vCPUs of the domain.

       It is worthwhile noting that optimally fitting a set of VMs on the
       NUMA nodes of a host is an incarnation of the Bin Packing Problem.
       In fact, the various VMs with different memory sizes are the items
       to be packed, and the host nodes are the bins. As this problem is
       known to be NP-hard, heuristics are used.

       The first thing to do is find the nodes or the sets of nodes (from
       now on referred to as 'candidates') that have enough free memory
       and enough physical CPUs for accommodating the new domain. The
       idea is to find a spot for the domain with at least as much free
       memory as it is configured to have, and as many pCPUs as it has
       vCPUs. After that, the actual decision on which candidate to pick
       happens according to the following heuristics:

       ·   candidates involving fewer nodes are considered better. In
           case two (or more) candidates span the same number of nodes,

       ·   candidates with a smaller number of vCPUs runnable on them
           (due to previous placement and/or plain vCPU pinning) are
           considered better. In case the same number of vCPUs can run on
           two (or more) candidates,

       ·   the candidate with the greatest amount of free memory is
           considered to be the best one.

       Giving preference to candidates with fewer nodes ensures better
       performance for the guest, as it avoids spreading its memory among
       different nodes. Favoring candidates with fewer vCPUs already
       runnable there ensures a good balance of the overall host load.
       Finally, if more candidates fulfil these criteria, prioritizing
       the nodes that have the largest amounts of free memory helps keep
       memory fragmentation small, and maximizes the probability of being
       able to put more domains there.
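
       As a purely illustrative example (all numbers are made up),
       consider a two-node host and a new guest configured with 4 vCPUs
       and 4 GB of RAM:

           node 0: 6 GB free, 4 pCPUs, 3 vCPUs of other domains runnable
           node 1: 8 GB free, 4 pCPUs, 5 vCPUs of other domains runnable

       The single-node candidates {node 0} and {node 1} both have enough
       free memory and enough pCPUs, and both span fewer nodes than the
       candidate {node 0, node 1}, so the latter is discarded. The tie
       between the two remaining candidates is broken by the second
       criterion: node 0 is chosen, because fewer vCPUs are already
       runnable there. The larger amount of free memory on node 1 is
       never considered, as the third criterion only applies when the
       first two result in a tie.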

   Guest placement in libxl
       xl achieves automatic NUMA placement because that is what libxl
       does by default. No API is provided (yet) for modifying the
       behaviour of the placement algorithm. However, if your program is
       calling libxl, it is possible to set the "numa_placement" build
       info key to "false" (it is "true" by default) with something like
       the below, to prevent any placement from happening:

           libxl_defbool_set(&domain_build_info->numa_placement, false);

       Also, if "numa_placement" is set to "true", the domain's vCPUs
       must not be pinned (i.e., "domain_build_info->cpumap" must have
       all its bits set, as it is by default), or domain creation will
       fail with "ERROR_INVAL".

       Starting from Xen 4.3, in case automatic placement happens (and is
       successful), it will affect the domain's node-affinity and not its
       vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
       pCPU on the host, but the memory of the domain will come from the
       selected node(s) and the NUMA aware scheduling (if the credit
       scheduler is in use) will try to keep the domain's vCPUs there as
       much as possible.

       Besides that, for looking at and/or tweaking the placement
       algorithm, search for "Automatic NUMA placement" in
       libxl_internal.h.

       Note this may change in future versions of Xen/libxl.

   Xen < 4.5
       The concept of vCPU soft affinity has been introduced for the
       first time in Xen 4.5. In 4.3, it is the domain's node-affinity
       that drives the NUMA-aware scheduler. The main difference is that
       soft affinity is per-vCPU, and so each vCPU can have its own mask
       of pCPUs, while node-affinity is per-domain, which is the
       equivalent of having all the vCPUs with the same soft affinity.

   Xen < 4.3
       As NUMA aware scheduling is a new feature of Xen 4.3, things are a
       little bit different for earlier versions of Xen. If no "cpus="
       option is specified and Xen 4.2 is in use, the automatic placement
       algorithm still runs, but the result is used to pin the vCPUs of
       the domain to the output node(s). This is consistent with what was
       happening with xm/xend.

       On a version of Xen earlier than 4.2, there is no automatic
       placement at all in xl or libxl, and hence no node-affinity, vCPU
       affinity or pinning is introduced/modified.

   Limitations
       Analyzing various possible placement solutions is what makes the
       algorithm flexible and quite effective. However, that also means
       it won't scale well to systems with an arbitrary number of nodes.
       For this reason, automatic placement is disabled (with a warning)
       if it is requested on a host with more than 16 NUMA nodes.



4.12.1                            2019-12-11              xl-numa-placement(7)