xl-numa-placement(7)                  Xen                 xl-numa-placement(7)


NAME
    xl-numa-placement - Guest Automatic NUMA Placement in libxl and xl

DESCRIPTION
Rationale
NUMA (which stands for Non-Uniform Memory Access) means that the memory access times of a program running on a CPU depend on the relative distance between that CPU and that memory. In fact, most NUMA systems are built in such a way that each processor has its own local memory, on which it can operate very quickly. On the other hand, reading and writing data from and to remote memory (that is, memory local to some other processor) is considerably more complex and slower. On these machines, a NUMA node is usually defined as a set of processor cores (typically a physical CPU package) and the memory directly attached to that set of cores.

NUMA awareness becomes very important as soon as many domains start running memory-intensive workloads on a shared host. In fact, the cost of accessing memory on a remote node is very high, and the performance degradation is likely to be noticeable.

For more information, have a look at the Xen NUMA Introduction page <https://wiki.xenproject.org/wiki/Xen_on_NUMA_Machines> on the Wiki.

Xen and NUMA machines: the concept of node-affinity
The Xen hypervisor deals with NUMA machines through the concept of node-affinity. The node-affinity of a domain is the set of NUMA nodes of the host from which the memory for the domain is allocated (mostly at domain creation time). This is, at least in principle, different from and unrelated to the vCPU (hard and soft, see below) scheduling affinity, which instead is the set of pCPUs on which a vCPU is allowed (or prefers) to run.

Of course, despite the fact that they belong to and affect different subsystems, the domain's node-affinity and its vCPUs' affinity are not completely independent. In fact, if the domain's node-affinity is not explicitly specified by the user, via the proper libxl calls or xl config item, it will be computed based on the vCPUs' scheduling affinity.

Notice that, even if the node-affinity of a domain may change on-line, it is very important to "place" the domain correctly when it is first created, as most of its memory is allocated at that time and cannot (for now) be moved easily.

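The node-affinity of the running domains can be inspected from the control domain. A minimal sketch, assuming an xl that supports the "-n" (NUMA) flag of "xl list"; the domain name, sizes and exact column layout are illustrative only:

    # xl list -n
    Name         ID   Mem VCPUs      State   Time(s)  NODE Affinity
    Domain-0      0  2048     4     r-----     100.0  all
    guest1        3  4096     4     -b----      42.0  1
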
Placing via pinning and cpupools
The simplest way of placing a domain on a NUMA node is setting the hard scheduling affinity of the domain's vCPUs to the pCPUs of the node. This also goes under the name of vCPU pinning, and can be done through the "cpus=" option in the config file (more about this below). Another option is to pool together the pCPUs spanning the node and put the domain in such a cpupool with the "pool=" config option (as documented in our Wiki <https://wiki.xenproject.org/wiki/Cpupools_Howto>).

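As a quick illustration, both approaches can be expressed in the domain's xl config file. The pCPU range and the cpupool name below are hypothetical (they assume pCPUs 4-7 form node 1, and that the pool has already been created, e.g. with "xl cpupool-create"):

    # Pin the domain's vCPUs to the pCPUs of node 1 (hard affinity):
    cpus = "4-7"

    # Alternatively, run the domain in a cpupool spanning node 1 only:
    # pool = "Pool-node1"
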
In both the above cases, the domain will not be able to execute outside the specified set of pCPUs for any reason, even if all those pCPUs are busy doing something else while other pCPUs sit idle.

So, when doing this, local memory accesses are 100% guaranteed, but that may come at the cost of some load imbalance.

NUMA aware scheduling
If using the credit1 scheduler, and starting from Xen 4.3, the scheduler itself always tries to run the domain's vCPUs on one of the nodes in its node-affinity. Only if that turns out to be impossible will it pick just any free pCPU. Locality of access is less guaranteed than in the pinning case, but that comes along with better chances to exploit all the host resources (e.g., the pCPUs).

Starting from Xen 4.5, credit1 supports two forms of affinity: hard and soft, both on a per-vCPU basis. This means each vCPU can have its own soft affinity, stating where that vCPU prefers to execute. This is less strict than what is (also starting from 4.5) called hard affinity, as the vCPU can potentially run everywhere: it just prefers some pCPUs over others. In Xen 4.5, therefore, NUMA-aware scheduling is achieved by matching the soft affinity of the vCPUs of a domain with its node-affinity.

In fact, as was the case in 4.3, if all the pCPUs in a vCPU's soft affinity are busy, it is possible for the domain to run outside of it. The idea is that slower execution (due to remote memory accesses) is still better than no execution at all (as would happen with pinning). For this reason, NUMA aware scheduling has the potential of bringing substantial performance benefits, although this will depend on the workload.

Notice that, for each vCPU, the following three scenarios are possible:

•  a vCPU is pinned to some pCPUs and does not have any soft affinity. In this case, the vCPU is always scheduled on one of the pCPUs to which it is pinned, without any specific preference among them;

•  a vCPU has its own soft affinity and is not pinned to any particular pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the scheduler will try to have it running on one of the pCPUs in its soft affinity;

•  a vCPU has its own soft affinity and is also pinned to some pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs to which it is pinned, with, among them, a preference for the ones that also form its soft affinity. In case pinning and soft affinity form two disjoint sets of pCPUs, pinning "wins", and the soft affinity is just ignored (see the example after this list).

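Hard and soft affinity can also be changed at run time with "xl vcpu-pin". The sketch below assumes a Xen 4.5 (or newer) xl, where the soft affinity can be given as a last, optional argument; the domain name and the pCPU ranges are purely illustrative:

    # Hard-pin all vCPUs of "guest1" to pCPUs 0-7, while expressing a
    # preference (soft affinity) for pCPUs 0-3:
    xl vcpu-pin guest1 all 0-7 0-3

    # Inspect the resulting affinities of each vCPU:
    xl vcpu-list guest1
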
Guest placement in xl
If using xl for creating and managing guests, it is very easy to ask for either manual or automatic placement of them across the host's NUMA nodes.

Note that xm/xend does a very similar thing, the only differences being the details of the heuristics adopted for automatic placement (see below), and the lack of support (in both xm/xend and the Xen versions where that was the default toolstack) for NUMA aware scheduling.

Placing the guest manually
Thanks to the "cpus=" option, it is possible to specify where a domain should be created and scheduled, directly in its config file. This affects NUMA placement and memory accesses as, in this case, the hypervisor constructs the node-affinity of a VM based directly on its vCPU pinning when it is created.

This is very simple and effective, but requires the user/system administrator to explicitly specify the pinning for each and every domain, or Xen won't be able to guarantee the locality of their memory accesses.

That, of course, also means the vCPUs of the domain will only be able to execute on those same pCPUs.

It is also possible to use the "cpus_soft=" option in the xl config file, to specify the soft affinity for all the vCPUs of the domain. This affects the NUMA placement in the following way (an example follows the list below):

•  if only "cpus_soft=" is present, the VM's node-affinity will be equal to the nodes to which the pCPUs in the soft affinity mask belong;

•  if both "cpus_soft=" and "cpus=" are present, the VM's node-affinity will be equal to the nodes to which the pCPUs present in both the hard and the soft affinity belong.

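For instance, assuming (purely for illustration) that pCPUs 0-3 belong to node 0 and pCPUs 4-7 to node 1, the following config file fragment would result in a node-affinity of node 0 only, since that is where the pCPUs present in both masks live:

    # Hard affinity: the vCPUs can run on pCPUs 0-7 (nodes 0 and 1):
    cpus      = "0-7"
    # Soft affinity: the vCPUs prefer pCPUs 0-3 (node 0):
    cpus_soft = "0-3"
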
Placing the guest automatically
If neither "cpus=" nor "cpus_soft=" is present in the config file, libxl tries to figure out on its own on which node(s) the domain could fit best. If it finds one (or more), the domain's node-affinity is set accordingly, and both memory allocations and NUMA aware scheduling (for the credit scheduler, and starting from Xen 4.3) will comply with it. Starting from Xen 4.5, this also means that the mask resulting from this "fitting" procedure will become the soft affinity of all the vCPUs of the domain.

It is worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of a host is an incarnation of the Bin Packing Problem. In fact, the various VMs, with their different memory sizes, are the items to be packed, and the host nodes are the bins. As this problem is known to be NP-hard, some heuristics are used.

The first thing to do is find the nodes, or the sets of nodes (from now on referred to as 'candidates'), that have enough free memory and enough physical CPUs to accommodate the new domain. The idea is to find a spot for the domain with at least as much free memory as it is configured to have, and as many pCPUs as it has vCPUs. After that, the actual decision on which candidate to pick happens according to the following heuristics:

•  candidates involving fewer nodes are considered better. In case two (or more) candidates span the same number of nodes,

•  candidates with a smaller number of vCPUs runnable on them (due to previous placement and/or plain vCPU pinning) are considered better. In case the same number of vCPUs can run on two (or more) candidates,

•  the candidate with the greatest amount of free memory is considered to be the best one.

Giving preference to candidates with fewer nodes ensures better performance for the guest, as it avoids spreading its memory among different nodes. Favoring candidates with fewer vCPUs already runnable there ensures a good balance of the overall host load. Finally, if multiple candidates fulfil these criteria, prioritizing the nodes that have the largest amounts of free memory helps keep memory fragmentation small, and maximizes the probability of being able to put more domains there.

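In other words, candidates are compared lexicographically on (number of nodes, number of runnable vCPUs, free memory). The following C fragment is only an illustrative sketch of that comparison, not libxl's actual code; the structure and field names are invented for the example:

    #include <stdint.h>

    /* Hypothetical summary of a placement candidate. */
    struct candidate {
        int      nr_nodes;   /* NUMA nodes spanned by the candidate   */
        int      nr_vcpus;   /* vCPUs already runnable on those nodes */
        uint64_t free_mem;   /* free memory on those nodes (bytes)    */
    };

    /* Return a negative value if a is better than b, positive if worse. */
    static int candidate_cmp(const struct candidate *a,
                             const struct candidate *b)
    {
        if (a->nr_nodes != b->nr_nodes)   /* fewer nodes is better */
            return a->nr_nodes - b->nr_nodes;
        if (a->nr_vcpus != b->nr_vcpus)   /* fewer runnable vCPUs is better */
            return a->nr_vcpus - b->nr_vcpus;
        /* ties broken by preferring the larger amount of free memory */
        return (a->free_mem > b->free_mem) ? -1 :
               (a->free_mem < b->free_mem) ?  1 : 0;
    }
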
Guest placement in libxl
xl achieves automatic NUMA placement because that is what libxl does by default. No API is provided (yet) for modifying the behaviour of the placement algorithm. However, if your program is calling libxl, it is possible to set the "numa_placement" build info key to "false" (it is "true" by default) with something like the below, to prevent any placement from happening:

    libxl_defbool_set(&domain_build_info->numa_placement, false);

Also, if "numa_placement" is set to "true", the domain's vCPUs must not be pinned (i.e., "domain_build_info->cpumap" must have all its bits set, as it is by default), or domain creation will fail with "ERROR_INVAL".

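Consequently, a libxl client that wants to do manual placement has to disable the automatic one and provide its own pinning through "cpumap". A minimal sketch, assuming "ctx" is an initialized libxl_ctx, that pinning to pCPUs 0-3 is what is wanted, and that the bitmap helpers from libxl_utils.h are available; error handling is omitted:

    /* Disable libxl's automatic NUMA placement for this domain... */
    libxl_defbool_set(&domain_build_info->numa_placement, false);

    /* ...and pin its vCPUs manually, e.g. to pCPUs 0-3. */
    libxl_cpu_bitmap_alloc(ctx, &domain_build_info->cpumap, 0);
    libxl_bitmap_set_none(&domain_build_info->cpumap);
    for (int cpu = 0; cpu < 4; cpu++)
        libxl_bitmap_set(&domain_build_info->cpumap, cpu);
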
Starting from Xen 4.3, in case automatic placement happens (and is successful), it will affect the domain's node-affinity and not its vCPU pinning. Namely, the domain's vCPUs will not be pinned to any pCPU on the host, but the memory for the domain will come from the selected node(s), and NUMA aware scheduling (if the credit scheduler is in use) will try to keep the domain's vCPUs there as much as possible.

Besides that, for looking at and/or tweaking the placement algorithm, search for "Automatic NUMA placement" in libxl_internal.h.

Note this may change in future versions of Xen/libxl.

Xen < 4.5
The concept of vCPU soft affinity has been introduced for the first time in Xen 4.5. In 4.3, it is the domain's node-affinity that drives the NUMA-aware scheduler. The main difference is that soft affinity is per-vCPU, and so each vCPU can have its own mask of pCPUs, while node-affinity is per-domain, which is the equivalent of having all the vCPUs with the same soft affinity.

Xen < 4.3
As NUMA aware scheduling is a new feature of Xen 4.3, things are a little bit different for earlier versions of Xen. If no "cpus=" option is specified and Xen 4.2 is in use, the automatic placement algorithm still runs, but the result is used to pin the vCPUs of the domain to the output node(s). This is consistent with what was happening with xm/xend.

On a version of Xen earlier than 4.2, there is no automatic placement at all in xl or libxl, and hence no node-affinity, vCPU affinity or pinning is introduced/modified.

Limitations
Analyzing various possible placement solutions is what makes the algorithm flexible and quite effective. However, that also means it won't scale well to systems with an arbitrary number of nodes. For this reason, automatic placement is disabled (with a warning) if it is requested on a host with more than 16 NUMA nodes.



4.15.1                            2021-11-23             xl-numa-placement(7)