xen-tscmode(7)                       Xen                       xen-tscmode(7)

As of Xen 4.0, a new config option called tsc_mode may be specified for
each domain. The default for tsc_mode handles the vast majority of
hardware and software environments. This document is intended for Xen
users and administrators who may need to select a non-default
tsc_mode.

Proper selection of tsc_mode depends on an understanding not only of
the guest operating system (OS), but also of every application that
will ever run on this guest OS. This is because tsc_mode applies
equally to both the OS and ALL apps that are running on this domain,
now or in the future.

Key questions to be answered for the OS and/or each application are:

· Does the OS/app use the rdtsc instruction at all? (We will explain
below how to determine this.)

· At what frequency is the rdtsc instruction executed by either the
OS or any running apps? If the sum exceeds about 10,000 rdtsc
instructions per second per processor, we call this a
"high-TSC-frequency" OS/app/environment. (This is relatively rare, and
developers of OSes and apps that are high-TSC-frequency are usually
aware of it.)

· If the OS/app does use rdtsc, will it behave incorrectly if "time
goes backwards" or if the frequency of the TSC suddenly changes?
If so, we call this a "TSC-sensitive" app or OS; otherwise it is
"TSC-resilient".

This last question is the crucial one, as it may be very difficult (or,
for legacy apps, even impossible) to predict all possible failure
cases. As a result, unless proven otherwise, any app that uses rdtsc
must be assumed to be TSC-sensitive and, as we will see, this is the
default starting in Xen 4.0.

Xen's new tsc_mode parameter determines the circumstances under which
the family of rdtsc instructions is executed "natively" vs emulated.
Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
may, under unpredictable circumstances, run incorrectly; emulated means
there is some performance degradation (unobservable in most cases), but
TSC-sensitive apps will always run correctly. Prior to Xen 4.0, all
rdtsc instructions were native: "fast but potentially incorrect."
Starting at Xen 4.0, the default is that all rdtsc instructions are
"correct but potentially slow". The tsc_mode parameter in 4.0 provides
an intelligent default but allows system administrators to adjust, per
domain, how rdtsc instructions are executed.

The non-default choices for tsc_mode are:

· tsc_mode=1 (always emulate).

All rdtsc instructions are emulated; this is the best choice when
TSC-sensitive apps are running and it is necessary to understand
worst-case performance degradation for a specific hardware
environment.

· tsc_mode=2 (never emulate).

This is the same as prior to Xen 4.0 and is the best choice if it
is certain that all apps running in this VM are TSC-resilient and
highest performance is required.

· tsc_mode=3 (PVRDTSCP).

High-TSC-frequency apps may be paravirtualized (modified) to obtain
both correctness and highest performance; any unmodified apps must
be TSC-resilient.

If tsc_mode is left unspecified (or set to tsc_mode=0), a hybrid
algorithm is used to ensure correctness while providing the best
performance possible given:

· the requirement of correctness,

· the underlying hardware, and

· whether or not the VM has been saved/restored/migrated.

The rest of this document explains this in more detail.

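As a rough illustration only, the guidance above can be condensed into
a small Python helper. The function name and its parameters are
hypothetical (they are not part of Xen or its tools), and the priority
given to each condition is one possible reading of the text rather than
an official rule.

```python
def suggest_tsc_mode(all_apps_tsc_resilient: bool,
                     need_highest_performance: bool,
                     sensitive_apps_pvrdtscp_modified: bool = False,
                     measuring_worst_case: bool = False) -> int:
    """Return a tsc_mode value (0-3) following the guidance in the text.

    All parameters are illustrative assumptions about what the
    administrator knows about the guest and its applications.
    """
    if measuring_worst_case:
        return 1  # always emulate: shows worst-case degradation
    if need_highest_performance and all_apps_tsc_resilient:
        return 2  # never emulate: pre-4.0 behaviour
    if need_highest_performance and sensitive_apps_pvrdtscp_modified:
        return 3  # PVRDTSCP: unmodified apps must be TSC-resilient
    return 0      # default hybrid algorithm


# An unknown legacy app must be assumed TSC-sensitive: keep the default.
print(suggest_tsc_mode(all_apps_tsc_resilient=False,
                       need_highest_performance=False))
```

When in doubt, the helper falls back to 0, mirroring the advice that
any app using rdtsc must be assumed TSC-sensitive unless proven
otherwise.
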
To determine the frequency of rdtsc instructions that are emulated, an
"xl" command can be used by a privileged user of domain0. The command:

    # xl debug-key s; xl dmesg | tail

provides information about TSC usage in each domain where TSC emulation
is currently enabled.

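Given two readings of a domain's emulated-rdtsc count taken some time
apart, the rate can be compared against the "high-TSC-frequency"
threshold of about 10,000 per second per processor. The Python sketch
below assumes the two counts were copied by hand; it does not parse the
command output, whose exact format is not described here. The function
names are hypothetical, and using the vcpu count as the "per processor"
divisor is an assumption.

```python
HIGH_TSC_THRESHOLD = 10_000  # rdtsc per second per processor (from the text)


def rdtsc_rate_per_cpu(count_start: int, count_end: int,
                       seconds: float, vcpus: int) -> float:
    """Average emulated rdtsc executions per second per vcpu."""
    if seconds <= 0 or vcpus <= 0:
        raise ValueError("seconds and vcpus must be positive")
    return (count_end - count_start) / seconds / vcpus


def is_high_tsc_frequency(rate: float) -> bool:
    """True if the rate exceeds the threshold named in the text."""
    return rate > HIGH_TSC_THRESHOLD


# Hypothetical readings: 3,000,000 emulated rdtsc in 60 s on 2 vcpus.
rate = rdtsc_rate_per_cpu(1_000_000, 4_000_000, seconds=60, vcpus=2)
print(rate, is_high_tsc_frequency(rate))
```

Note that only emulated rdtsc instructions can be counted this way, so
the domain must be running with TSC emulation enabled while the
readings are taken.
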
To understand tsc_mode completely, some background on TSC is required:

The x86 "timestamp counter", or TSC, is a 64-bit register on each
processor that increases monotonically. Historically, TSC incremented
every processor cycle, but on recent processors, it increases at a
constant rate even if the processor changes frequency (for example, to
reduce processor power usage). TSC is known by x86 programmers as the
fastest, highest-precision measurement of the passage of time so it is
often used as a foundation for performance monitoring. And since it is
guaranteed to be monotonically increasing and, at 64 bits, is
guaranteed not to wrap around within 10 years, it is sometimes used as
a random number or a unique sequence identifier, such as to stamp
transactions so they can be replayed in a specific order.

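The wraparound guarantee is easy to check with a little arithmetic.
The Python sketch below assumes a 3 GHz TSC, a figure chosen only for
illustration, and computes how long a 64-bit counter starting at zero
takes to overflow.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_until_wrap(tsc_hz: float, bits: int = 64) -> float:
    """Years for a counter of the given width to overflow from zero."""
    return (2 ** bits) / tsc_hz / SECONDS_PER_YEAR


# Assumed TSC frequency of 3 GHz (illustrative only).
years = years_until_wrap(3e9)
print(f"{years:.0f} years")
```

At that rate the counter lasts on the order of two centuries, far
beyond the 10-year guarantee; a faster TSC shortens the figure
proportionally.
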
On most older SMP and early multi-core machines, TSC was not
synchronized between processors. Thus if an application were to read
the TSC on one processor, then was moved by the OS to another
processor, then read TSC again, it might appear that "time went
backwards". This loss of monotonicity resulted in many obscure
application bugs when TSC-sensitive apps were ported from a
uniprocessor to an SMP environment; as a result, many applications --
especially in the Windows world -- removed their dependency on TSC and
replaced their timestamp needs with OS-specific functions, losing both
performance and precision. On some more recent generations of
multi-core machines, especially multi-socket multi-core machines, the
TSC was synchronized but if one processor were to enter certain
low-power states, its TSC would stop, destroying the synchrony and
again causing obscure bugs. This reinforced decisions to avoid use of
TSC altogether. On the most recent generations of multi-core machines,
however, synchronization is provided across all processors in all power
states, even on multi-socket machines, and a flag is provided that
indicates that TSC is synchronized and "invariant". Thus TSC is once
again useful for applications, and even newer operating systems are
using and depending upon TSC for critical timekeeping tasks when
running on these recent machines.

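The "time went backwards" effect can be reproduced with a toy model.
The Python sketch below is a simulation, not real TSC access: the class
and its per-processor offsets are hypothetical, and the offsets are
picked so that the second processor's counter lags the first.

```python
class UnsyncedMachine:
    """Toy model of processors whose TSCs started at different values."""

    def __init__(self, offsets):
        self.offsets = list(offsets)  # per-processor TSC offset
        self.cycles = 0               # cycles elapsed since "boot"

    def advance(self, cycles: int) -> None:
        self.cycles += cycles

    def rdtsc(self, cpu: int) -> int:
        return self.cycles + self.offsets[cpu]


machine = UnsyncedMachine(offsets=[5_000, 0])
first = machine.rdtsc(cpu=0)    # app reads TSC on processor 0
machine.advance(1_000)          # OS moves the app to processor 1
second = machine.rdtsc(cpu=1)   # app reads TSC again
went_backwards = second < first
print(first, second, went_backwards)
```

Although 1,000 cycles of real time passed between the two reads, the
second value is smaller than the first, which is exactly the case a
TSC-sensitive app does not expect.
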
We will refer to hardware that ensures TSC is both synchronized and
invariant as "TSC-safe" and any hardware on which TSC is not (or may
not remain) synchronized as "TSC-unsafe".

As a result of TSC's sordid history, two classes of applications use
TSC: old applications designed for single processors, and the most
recent enterprise applications which require high-frequency
high-precision timestamping.

We will refer to apps that might break if running on a TSC-unsafe
machine as "TSC-sensitive"; apps that don't use TSC, or do use TSC but
use it in a way that makes monotonicity and frequency invariance
unimportant, as "TSC-resilient".

The emergence of virtualization once again complicates the usage of
TSC. When features such as save/restore or live migration are
employed, a guest OS and all its currently running applications may be
invisibly transported to an entirely different physical machine. While
TSC may be "safe" on one machine, it is essentially impossible to
precisely synchronize TSC across a data center or even a pool of
machines. As a result, when run in a virtualized environment, rare and
obscure "time going backwards" problems might once again occur for
those TSC-sensitive applications. Worse, if a guest OS moves from, for
example, a 3GHz machine to a 1.5GHz machine, attempts by an OS/app to
measure time intervals with TSC may without notice be incorrect by a
factor of two.

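The factor-of-two error follows directly from dividing a tick count by
a stale frequency. The Python sketch below works through the 3GHz to
1.5GHz example, assuming a native (neither emulated nor scaled) TSC and
an app that calibrated itself once before the move; the variable names
are illustrative.

```python
def measured_seconds(tsc_delta: float, assumed_hz: float) -> float:
    """Interval an app computes from a TSC delta and its calibration."""
    return tsc_delta / assumed_hz


calibrated_hz = 3.0e9   # frequency the app measured before migration
actual_hz = 1.5e9       # frequency of the machine it now runs on
real_interval = 2.0     # seconds that really elapse

tsc_delta = real_interval * actual_hz
reported = measured_seconds(tsc_delta, calibrated_hz)
print(reported, real_interval / reported)
```

The app reports half the real interval and has no way of noticing,
since nothing in the rdtsc result indicates that the rate changed.
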
The rdtsc (read timestamp counter) instruction is used to read the TSC
register. The rdtscp instruction is a variant of rdtsc on recent
processors. We refer to these together as the rdtsc family of
instructions, or just "rdtsc". Instructions in the rdtsc family are
non-privileged, but privileged software may set a control bit (for
example, the TSD flag in CR4) to cause all rdtsc family instructions
executed by less-privileged code to trap. This trap can be detected by
Xen, which can then transparently "emulate" the results of the rdtsc
instruction and return control to the code following the rdtsc
instruction.

To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a
fixed rate, Xen provides rdtsc emulation whenever necessary or when
explicitly specified by a per-VM configuration option. TSC emulation
is relatively slow -- roughly 15-20 times slower than the rdtsc
instruction when executed natively. However, except when an OS or
application uses the rdtsc instruction at a high frequency (e.g. more
than about 10,000 times per second per processor), this performance
degradation is not noticeable (i.e. <0.3%). Moreover, TSC emulation is
nearly always faster than OS-provided alternatives (e.g. Linux's
gettimeofday). For environments where it is certain that all apps are
TSC-resilient (i.e. "TSC-safeness" is not necessary) and highest
performance is a requirement, TSC emulation may be entirely disabled
(tsc_mode==2).

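The "<0.3%" figure can be sanity-checked with a back-of-the-envelope
calculation. The Python sketch below takes the 20x slowdown and the
10,000 per second rate from the text, but the 15 ns cost of a native
rdtsc is purely an assumption for illustration; real costs vary by
processor.

```python
def emulation_overhead(rdtsc_per_sec: float, native_ns: float,
                       slowdown: float) -> float:
    """Fraction of one processor's time added by emulating rdtsc."""
    extra_ns = native_ns * (slowdown - 1)  # added cost per instruction
    return rdtsc_per_sec * extra_ns * 1e-9


# native_ns=15.0 is an assumed figure, not a measured one.
overhead = emulation_overhead(10_000, native_ns=15.0, slowdown=20.0)
print(f"{overhead:.3%}")
```

Under these assumptions the added cost stays just below 0.3% of one
processor, and it grows linearly with the rdtsc rate, which is why only
high-TSC-frequency workloads notice emulation.
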
The default mode (tsc_mode==0) checks TSC-safeness of the underlying
hardware on which the virtual machine is launched. If it is TSC-safe,
rdtsc will execute at hardware speed; if it is not, rdtsc will be
emulated. Once a virtual machine is saved/restored or migrated,
however, there are two possibilities: TSC remains native IF the source
physical machine and target physical machine have the same TSC
frequency (or, for HVM/PVH guests, if TSC scaling support is
available); else TSC is emulated. Note that, though emulated, the
"apparent" TSC frequency will be the TSC frequency of the initial
physical machine, even after migration.

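The default-mode rules above can be written as two small predicates.
This Python sketch is a reading of the paragraph, not Xen's actual
code: the function and parameter names are hypothetical, and the rule
that a domain already being emulated stays emulated after a move is an
assumption the text does not state explicitly.

```python
def default_mode_native_at_launch(host_tsc_safe: bool) -> bool:
    """tsc_mode==0 at launch: native only on TSC-safe hardware."""
    return host_tsc_safe


def default_mode_native_after_move(was_native: bool,
                                   same_tsc_frequency: bool,
                                   hvm_or_pvh: bool = False,
                                   tsc_scaling_available: bool = False) -> bool:
    """tsc_mode==0 after save/restore or migration."""
    if not was_native:
        return False  # assumption: emulation is never switched back off
    if same_tsc_frequency:
        return True
    # Different frequency: only HVM/PVH guests with TSC scaling stay native.
    return hvm_or_pvh and tsc_scaling_available


native = default_mode_native_at_launch(host_tsc_safe=True)
print(default_mode_native_after_move(native, same_tsc_frequency=False))
```

A PV guest moved to a host with a different TSC frequency therefore
ends up emulated, while an HVM guest on scaling-capable hardware keeps
native rdtsc.
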
For environments where both TSC-safeness AND highest performance even
across migration are required, application code can be specially
modified to use an algorithm explicitly designed into Xen for this
purpose. This mode (tsc_mode==3) is called PVRDTSCP, because it
requires app paravirtualization (awareness by the app that it may be
running on top of Xen), and utilizes a variation of the rdtsc
instruction called rdtscp that is available on most recent generation
processors. (The rdtscp instruction differs from the rdtsc instruction
in that it reads not only the TSC but an additional register set by
system software.) When a pvrdtscp-modified app is running on a
processor that is both TSC-safe and supports the rdtscp instruction,
information can be obtained about migration and TSC frequency/offset
adjustment to allow the vast majority of timestamps to be obtained at
top performance; when running on a TSC-unsafe processor or a processor
that doesn't support the rdtscp instruction, rdtscp is emulated.

PVRDTSCP (tsc_mode==3) has two limitations. First, it applies to all
apps running in this virtual machine. This means that all apps must
either be TSC-resilient or pvrdtscp-modified. Second, highest
performance is only obtained on TSC-safe machines that support the
rdtscp instruction; when running on older machines, rdtscp is emulated
and thus slower. For more information on PVRDTSCP, see below.

Finally, tsc_mode==1 always enables TSC emulation, regardless of the
underlying physical hardware. The "apparent" TSC frequency will be the
TSC frequency of the initial physical machine, even after migration.
This mode is useful to measure any performance degradation that might
be encountered by a tsc_mode==0 domain after migration occurs, or a
tsc_mode==3 domain when it is running on TSC-unsafe hardware.

Note that while Xen ensures that an emulated TSC is "safe" across
migration, it does not ensure that it continues to tick at the same
rate during the actual migration. As an oversimplified example, if TSC
is ticking once per second in a guest, and the guest is saved when the
TSC is 1000, then restored 30 seconds later, TSC is only guaranteed to
be greater than or equal to 1001, not precisely 1030. This has some OS
implications as will be seen in the next section.

Related to TSC emulation, the "TSC Invariant" bit is architecturally
defined in a cpuid bit on the most recent x86 processors. If set, TSC
invariance ensures that the TSC is "safe", that is, it will increment
at a constant rate regardless of power events, will be synchronized
across all processors, and was properly initialized to zero on all
processors at boot-time by system hardware/BIOS. As long as system
software never writes to TSC, TSC will be safe and continuously
incremented at a fixed rate and thus can be used as a system
"clocksource".

This bit is used by some OSes, and specifically by Linux starting with
version 2.6.30(?), to select TSC as a system clocksource. Once
selected, TSC remains the Linux system clocksource unless manually
overridden. In a virtualized environment, since it is not possible to
synchronize TSC across all the machines in a pool or data center, a
migration may "break" TSC as a usable clocksource; while time will not
go backwards, it may not track wallclock time well enough to avoid
certain time-sensitive consequences. As a result, Xen can only expose
the TSC Invariant bit to a guest OS if it is certain that the domain
will never migrate. As of Xen 4.0, the "no_migrate=1" VM configuration
option may be specified to disable migration. If no_migrate is
selected and the VM is running on a physical machine with "TSC
Invariant", Linux 2.6.30+ will safely use TSC as the system
clocksource. However, attempts to migrate or, once saved, restore this
domain will fail.

There is another cpuid-related complication: The x86 cpuid instruction
is non-privileged. HVM domains are configured to always trap this
instruction to Xen, where Xen can "filter" the result. In a PV OS, all
cpuid instructions have been replaced by a paravirtualized equivalent
of the cpuid instruction ("pvcpuid") and also trap to Xen. But apps in
a PV guest that use a cpuid instruction execute it directly, without a
trap to Xen. As a result, an app may directly examine the physical TSC
Invariant cpuid bit and make decisions based on that bit. This is
still an unsolved problem, though a workaround exists as part of the
PVRDTSCP tsc_mode for apps that can be modified.

Paravirtualized OSes use the "pvclock" algorithm to manage the passing
of time. This sophisticated algorithm obtains information from a
memory page shared between Xen and the OS and selects information from
this page based on the current virtual CPU (vcpu) in order to properly
adapt to TSC-unsafe systems and changes that occur across migration.
Neither this shared page nor the vcpu information is available to a
userland app so the pvclock algorithm cannot be directly used by an
app, at least without performance degradation roughly equal to the cost
of just emulating an rdtsc.

As a result, as of 4.0, Xen provides capabilities for a userland app to
obtain key time values similar to the information accessible to the PV
OS pvclock algorithm. The app uses the rdtscp instruction which is
defined in recent processors to obtain both the TSC and an auxiliary
value called TSC_AUX. Xen is responsible for setting TSC_AUX to the
same value on all vcpus running any domain with tsc_mode==3; further,
Xen tools are responsible for monotonically incrementing TSC_AUX
anytime the domain is restored/migrated (thus changing key time
values); and, when the domain is running on a physical machine that
either is not TSC-safe or does not support the rdtscp instruction, Xen
is responsible for emulating the rdtscp instruction and for setting
TSC_AUX to zero on all processors.

Xen also provides pvclock information via a "pvcpuid" instruction.
While this results in a slow trap, the information changes (and thus
must be reobtained via pvcpuid) ONLY when TSC_AUX has changed, which
should be very rare relative to a high frequency of rdtscp
instructions.

Finally, Xen provides additional time-related information via other
pvcpuid instructions. With these, an app can determine first whether
it is currently running on Xen, next the tsc_mode setting of the
domain in which it is running, and finally whether the underlying
hardware is TSC-safe and supports the rdtscp instruction.

As a result, a pvrdtscp-modified app has sufficient information to
compute the pvclock "elapsed nanoseconds" which can be used as a
timestamp. And this can be done nearly as fast as a native rdtsc
instruction, much faster than emulation, and also much faster than
nearly all OS-provided time mechanisms. While pvrdtscp is too complex
for most apps, certain enterprise TSC-sensitive high-TSC-frequency apps
may find it useful to obtain a significant performance gain.

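The caching idea behind a pvrdtscp-modified app can be modelled in a
few lines. The Python sketch below is a simulation only: rdtscp and
pvcpuid are replaced by plain callables, the ClockParams fields and the
scaling formula follow the general shape of pvclock as an assumption,
and none of the names are an interface defined by Xen.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ClockParams:
    """Hypothetical time values an app would obtain via pvcpuid."""
    tsc_timestamp: int       # TSC value when the parameters were taken
    system_time_ns: int      # elapsed nanoseconds at that moment
    tsc_to_system_mul: int   # 32.32 fixed-point ns-per-tick multiplier
    tsc_shift: int           # power-of-two pre-scaling of the delta


def elapsed_ns(tsc: int, p: ClockParams) -> int:
    """Convert a TSC reading to elapsed nanoseconds (pvclock-style)."""
    delta = tsc - p.tsc_timestamp
    delta = delta << p.tsc_shift if p.tsc_shift >= 0 else delta >> -p.tsc_shift
    return p.system_time_ns + ((delta * p.tsc_to_system_mul) >> 32)


class PvrdtscpClock:
    """Refetch the slow parameters only when TSC_AUX changes."""

    def __init__(self, rdtscp: Callable[[], Tuple[int, int]],
                 fetch_params: Callable[[], ClockParams]):
        self._rdtscp = rdtscp        # stand-in: returns (tsc, tsc_aux)
        self._fetch = fetch_params   # stand-in for the slow pvcpuid trap
        self._aux: Optional[int] = None
        self._params: Optional[ClockParams] = None
        self.refetches = 0

    def now_ns(self) -> int:
        tsc, aux = self._rdtscp()
        if self._params is None or aux != self._aux:
            self._params, self._aux = self._fetch(), aux
            self.refetches += 1
        return elapsed_ns(tsc, self._params)


# Fake sources: a 2 GHz TSC (0.5 ns per tick) and no migration.
readings = iter([(3_000, 1), (5_000, 1)])
clock = PvrdtscpClock(lambda: next(readings),
                      lambda: ClockParams(1_000, 5_000, 2 ** 31, 0))
t1, t2 = clock.now_ns(), clock.now_ns()
print(t1, t2, clock.refetches)
```

The two timestamps cost one slow fetch in total because TSC_AUX did not
change between them. A real implementation would also need to handle a
migration that occurs between the rdtscp and the refetch, which this
sketch ignores.
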
Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read by
guest rdtsc/rdtscp to increase at a frequency different from the host
TSC frequency.

If an HVM container in default TSC mode (tsc_mode=0) or PVRDTSCP mode
(tsc_mode=3) is created on a host that provides constant TSC, its guest
TSC frequency will be the same as the host. If it is later migrated to
another host that provides constant TSC and supports Intel VMX TSC
scaling/AMD SVM TSC ratio, its guest TSC frequency will be the same
before and after migration.

For the above HVM container in default TSC mode (tsc_mode=0), if both
hosts support rdtscp, both guest rdtsc and rdtscp instructions will be
executed natively before and after migration.

For the above HVM container in PVRDTSCP mode (tsc_mode=3), if the
destination host does not support rdtscp, the guest rdtscp instruction
will be emulated with the guest TSC frequency.

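Conceptually, hardware TSC scaling multiplies host ticks by a
fixed-point ratio so that the guest keeps seeing its original
frequency. The Python sketch below models that arithmetic only; the
48-bit fractional precision and the function names are assumptions for
illustration and are not taken from either vendor's register layout.

```python
FRAC_BITS = 48  # assumed fixed-point precision, for illustration only


def tsc_multiplier(guest_hz: int, host_hz: int,
                   frac_bits: int = FRAC_BITS) -> int:
    """Fixed-point ratio of guest TSC frequency to host TSC frequency."""
    return (guest_hz << frac_bits) // host_hz


def scale_tsc(host_ticks: int, multiplier: int,
              frac_bits: int = FRAC_BITS) -> int:
    """Guest-visible ticks for a given number of host ticks."""
    return (host_ticks * multiplier) >> frac_bits


# Guest created on a 3 GHz host, later migrated to a 1.5 GHz host.
mult = tsc_multiplier(3_000_000_000, 1_500_000_000)
guest_ticks = scale_tsc(1_500_000_000, mult)  # one second of host ticks
print(guest_ticks)
```

One second on the slower host still advances the guest TSC by three
billion ticks, so the guest TSC frequency is the same before and after
migration without any trapping.
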
Dan Magenheimer <dan.magenheimer@oracle.com>



4.11.1                           2018-11-29                    xen-tscmode(7)