xen-tscmode(7)                        Xen                       xen-tscmode(7)

OVERVIEW

       As of Xen 4.0, a new config option called tsc_mode may be specified for
       each domain.  The default for tsc_mode handles the vast majority of
       hardware and software environments.  This document is intended for Xen
       users and administrators who may need to select a non-default
       tsc_mode.

       Proper selection of tsc_mode depends on an understanding not only of
       the guest operating system (OS), but also of every application that
       will ever run on this guest OS.  This is because tsc_mode applies
       equally to both the OS and ALL apps that are running on this domain,
       now or in the future.

       Key questions to be answered for the OS and/or each application are:

       ·   Does the OS/app use the rdtsc instruction at all?  (We will explain
           below how to determine this.)

       ·   At what frequency is the rdtsc instruction executed by either the
           OS or any running apps?  If the sum exceeds about 10,000 rdtsc
           instructions per second per processor, we call this a
           "high-TSC-frequency" OS/app/environment.  (This is relatively rare,
           and developers of OS's and apps that are high-TSC-frequency are
           usually aware of it.)

       ·   If the OS/app does use rdtsc, will it behave incorrectly if "time
           goes backwards" or if the frequency of the TSC suddenly changes?
           If so, we call this a "TSC-sensitive" app or OS; otherwise it is
           "TSC-resilient".

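       The second question reduces to simple arithmetic once rdtsc counts are
       available.  The following is a minimal sketch only: the function name
       and the sample numbers are hypothetical, and the threshold is the
       approximate figure of 10,000 quoted above.

```python
# Approximate threshold quoted in this document (rdtsc/second/processor).
HIGH_TSC_FREQUENCY_THRESHOLD = 10_000


def is_high_tsc_frequency(rdtsc_count, seconds, processors,
                          threshold=HIGH_TSC_FREQUENCY_THRESHOLD):
    """Return True if the summed OS + app rdtsc rate exceeds the threshold.

    rdtsc_count is the total number of rdtsc instructions observed on all
    processors during the sampling window (a hypothetical input).
    """
    if seconds <= 0 or processors <= 0:
        raise ValueError("seconds and processors must be positive")
    rate_per_processor = rdtsc_count / (seconds * processors)
    return rate_per_processor > threshold


# Example: 1,200,000 rdtsc instructions in 10 seconds on 4 processors
# is 30,000 per second per processor, which exceeds the threshold.
print(is_high_tsc_frequency(1_200_000, 10, 4))
```

       The count must include the OS and every app in the domain, because
       tsc_mode applies to all of them together.
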
       This last is the US$64,000 question as it may be very difficult (or,
       for legacy apps, even impossible) to predict all possible failure
       cases.  As a result, unless proven otherwise, any app that uses rdtsc
       must be assumed to be TSC-sensitive and, as we will see, this is the
       default starting in Xen 4.0.

       Xen's new tsc_mode parameter determines the circumstances under which
       the rdtsc family of instructions is executed "natively" vs emulated.
       Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
       may, under unpredictable circumstances, run incorrectly; emulated means
       there is some performance degradation (unobservable in most cases), but
       TSC-sensitive apps will always run correctly.  Prior to Xen 4.0, all
       rdtsc instructions were native: "fast but potentially incorrect."
       Starting at Xen 4.0, the default is that all rdtsc instructions are
       "correct but potentially slow".  The tsc_mode parameter in 4.0 provides
       an intelligent default but allows system administrators to adjust, on
       a per-domain basis, how rdtsc instructions are executed.

       The non-default choices for tsc_mode are:

       ·   tsc_mode=1 (always emulate).

           All rdtsc instructions are emulated; this is the best choice when
           TSC-sensitive apps are running and it is necessary to understand
           worst-case performance degradation for a specific hardware
           environment.

       ·   tsc_mode=2 (never emulate).

           This is the same as prior to Xen 4.0 and is the best choice if it
           is certain that all apps running in this VM are TSC-resilient and
           highest performance is required.

       ·   tsc_mode=3 (PVRDTSCP).

           High-TSC-frequency apps may be paravirtualized (modified) to obtain
           both correctness and highest performance; any unmodified apps must
           be TSC-resilient.

       If tsc_mode is left unspecified (or set to tsc_mode=0), a hybrid
       algorithm is utilized to ensure correctness while providing the best
       performance possible given:

       ·   the requirement of correctness,

       ·   the underlying hardware, and

       ·   whether or not the VM has been saved/restored/migrated.

       The rest of this document explains these choices in more detail.

DETERMINING RDTSC FREQUENCY

       To determine the frequency of rdtsc instructions that are emulated, an
       "xl" command can be used by a privileged user of domain0.  The command:

           # xl debug-key s; xl dmesg | tail

       provides information about TSC usage in each domain where TSC emulation
       is currently enabled.

TSC HISTORY

       To understand tsc_mode completely, some background on TSC is required:

       The x86 "timestamp counter", or TSC, is a 64-bit register on each
       processor that increases monotonically.  Historically, TSC incremented
       every processor cycle, but on recent processors, it increases at a
       constant rate even if the processor changes frequency (for example, to
       reduce processor power usage).  TSC is known by x86 programmers as the
       fastest, highest-precision measurement of the passage of time so it is
       often used as a foundation for performance monitoring.  And since it is
       guaranteed to be monotonically increasing and, at 64 bits, is
       guaranteed not to wrap around within 10 years, it is sometimes used as
       a random number or a unique sequence identifier, such as to stamp
       transactions so they can be replayed in a specific order.

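       The wraparound bound follows from the register width.  As a rough
       check, assuming a counter that starts at zero and an illustrative rate
       of 3GHz (the rate is an assumption, not a figure from this document):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_until_wrap(tsc_hz, bits=64, seconds_per_year=SECONDS_PER_YEAR):
    """Years for a counter of the given width, starting at zero, to wrap."""
    return (2 ** bits) / tsc_hz / seconds_per_year


# 3 GHz is an assumed, illustrative tick rate.
print(f"{years_until_wrap(3e9):.0f} years")
```

       At that rate the result is well over a century, comfortably above the
       10-year bound stated above.
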
       On most older SMP and early multi-core machines, TSC was not
       synchronized between processors.  Thus if an application were to read
       the TSC on one processor, be moved by the OS to another processor,
       and then read TSC again, it might appear that "time went backwards".
       This loss of monotonicity resulted in many obscure application bugs
       when TSC-sensitive apps were ported from a uniprocessor to an SMP
       environment; as a result, many applications -- especially in the
       Windows world -- removed their dependency on TSC and replaced their
       timestamp needs with OS-specific functions, losing both performance
       and precision.  On some more recent generations of multi-core
       machines, especially multi-socket multi-core machines, the TSC was
       synchronized but if one processor were to enter certain low-power
       states, its TSC would stop, destroying the synchrony and again causing
       obscure bugs.  This reinforced decisions to avoid use of TSC
       altogether.  On the most recent generations of multi-core machines,
       however, synchronization is provided across all processors in all power
       states, even on multi-socket machines, and the hardware provides a
       flag that indicates that TSC is synchronized and "invariant".  Thus TSC
       is once again useful for applications, and even newer operating
       systems are using and depending upon TSC for critical timekeeping
       tasks when running on these recent machines.

       We will refer to hardware that ensures TSC is both synchronized and
       invariant as "TSC-safe" and any hardware on which TSC is not (or may
       not remain) synchronized as "TSC-unsafe".

       As a result of TSC's sordid history, two classes of applications use
       TSC: old applications designed for single processors, and the most
       recent enterprise applications which require high-frequency,
       high-precision timestamping.

       We will refer to apps that might break if running on a TSC-unsafe
       machine as "TSC-sensitive"; apps that don't use TSC, or that use it
       in a way for which monotonicity and frequency invariance are
       unimportant, are "TSC-resilient".

       The emergence of virtualization once again complicates the usage of
       TSC.  When features such as save/restore or live migration are
       employed, a guest OS and all its currently running applications may be
       invisibly transported to an entirely different physical machine.  While
       TSC may be "safe" on one machine, it is essentially impossible to
       precisely synchronize TSC across a data center or even a pool of
       machines.  As a result, when run in a virtualized environment, rare and
       obscure "time going backwards" problems might once again occur for
       those TSC-sensitive applications.  Worse, if a guest OS moves from, for
       example, a 3GHz machine to a 1.5GHz machine, attempts by an OS/app to
       measure time intervals with TSC may without notice be incorrect by a
       factor of two.

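       The factor-of-two error can be reproduced with a few lines of
       arithmetic.  This sketch assumes an app that calibrated its tick
       frequency once, on the original host, and keeps using that value after
       migration to a host whose native TSC runs at half the rate; the
       function name is hypothetical.

```python
def measured_interval_seconds(tsc_start, tsc_end, assumed_hz):
    """Interval an app computes from two TSC reads and a cached frequency."""
    return (tsc_end - tsc_start) / assumed_hz


ORIGINAL_HZ = 3_000_000_000   # frequency the app calibrated against
NEW_HOST_HZ = 1_500_000_000   # native TSC frequency after migration

# One real second passes on the new host, so a native TSC advances by
# NEW_HOST_HZ ticks, but the app still divides by ORIGINAL_HZ.
real_seconds = 1
ticks = NEW_HOST_HZ * real_seconds
print(measured_interval_seconds(0, ticks, ORIGINAL_HZ))  # 0.5
```

       The app reports half a second for an interval that actually lasted one
       second, and nothing in the TSC value itself signals the change.
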
       The rdtsc (read timestamp counter) instruction is used to read the TSC
       register.  The rdtscp instruction is a variant of rdtsc on recent
       processors.  We refer to these together as the rdtsc family of
       instructions, or just "rdtsc".  Instructions in the rdtsc family are
       non-privileged, but privileged software may set a control-register
       flag (CR4.TSD) to cause all rdtsc family instructions executed by
       non-privileged code to trap.  This trap can be detected by Xen, which
       can then transparently "emulate" the results of the rdtsc instruction
       and return control to the code following the rdtsc instruction.

       To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a
       fixed rate, Xen provides rdtsc emulation whenever necessary or when
       explicitly specified by a per-VM configuration option.  TSC emulation
       is relatively slow -- roughly 15-20 times slower than the rdtsc
       instruction when executed natively.  However, except when an OS or
       application uses the rdtsc instruction at a high frequency (e.g. more
       than about 10,000 times per second per processor), this performance
       degradation is not noticeable (i.e. <0.3%).  And, TSC emulation is
       nearly always faster than OS-provided alternatives (e.g. Linux's
       gettimeofday).  For environments where it is certain that all apps are
       TSC-resilient (i.e. "TSC-safeness" is not necessary) and highest
       performance is a requirement, TSC emulation may be entirely disabled
       (tsc_mode==2).

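       The relationship between rdtsc rate and overhead is a simple product.
       In the sketch below, the cost of 300 ns per emulated rdtsc is an
       assumption chosen so that the result at 10,000 per second matches the
       0.3% bound quoted above; it is not a measured value, and the function
       name is hypothetical.

```python
def emulation_overhead_fraction(rdtsc_per_second, emulated_cost_ns):
    """Fraction of each processor-second spent emulating rdtsc."""
    return rdtsc_per_second * emulated_cost_ns / 1e9


ASSUMED_COST_NS = 300  # assumption, see the text above

for rate in (10_000, 1_000_000):
    overhead = emulation_overhead_fraction(rate, ASSUMED_COST_NS)
    print(f"{rate} rdtsc/s per processor: {overhead:.1%}")
```

       Under the same assumption, an app issuing a million rdtsc instructions
       per second per processor would spend a large share of its time in
       emulation, which is why high-TSC-frequency workloads are singled out.
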
       The default mode (tsc_mode==0) checks TSC-safeness of the underlying
       hardware on which the virtual machine is launched.  If it is TSC-safe,
       rdtsc will execute at hardware speed; if it is not, rdtsc will be
       emulated.  Once a virtual machine is saved/restored or migrated,
       however, there are two possibilities: TSC remains native IF the source
       physical machine and target physical machine have the same TSC
       frequency (or, for HVM/PVH guests, if TSC scaling support is
       available); else TSC is emulated.  Note that, though emulated, the
       "apparent" TSC frequency will be the TSC frequency of the initial
       physical machine, even after migration.

       For environments where both TSC-safeness AND highest performance even
       across migration is a requirement, application code can be specially
       modified to use an algorithm explicitly designed into Xen for this
       purpose.  This mode (tsc_mode==3) is called PVRDTSCP, because it
       requires app paravirtualization (awareness by the app that it may be
       running on top of Xen), and utilizes a variation of the rdtsc
       instruction called rdtscp that is available on most recent generation
       processors.  (The rdtscp instruction differs from the rdtsc instruction
       in that it reads not only the TSC but an additional register set by
       system software.)  When a pvrdtscp-modified app is running on a
       processor that is both TSC-safe and supports the rdtscp instruction,
       information can be obtained about migration and TSC frequency/offset
       adjustment to allow the vast majority of timestamps to be obtained at
       top performance; when running on a TSC-unsafe processor or a processor
       that doesn't support the rdtscp instruction, rdtscp is emulated.

       PVRDTSCP (tsc_mode==3) has two limitations.  First, it applies to all
       apps running in this virtual machine.  This means that all apps must
       either be TSC-resilient or pvrdtscp-modified.  Second, highest
       performance is only obtained on TSC-safe machines that support the
       rdtscp instruction; when running on older machines, rdtscp is emulated
       and thus slower.  For more information on PVRDTSCP, see below.

       Finally, tsc_mode==1 always enables TSC emulation, regardless of the
       underlying physical hardware.  The "apparent" TSC frequency will be the
       TSC frequency of the initial physical machine, even after migration.
       This mode is useful to measure any performance degradation that might
       be encountered by a tsc_mode==0 domain after migration occurs, or a
       tsc_mode==3 domain when it is running on TSC-unsafe hardware.

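       The four modes described above can be summarized as a decision
       function.  This is only a reading of the prose in this section, not
       Xen's implementation: the function and parameter names are
       hypothetical, and the case of a tsc_mode==0 domain migrating to a
       TSC-unsafe host is not modeled because the text does not address it.

```python
def tsc_read_is_emulated(tsc_mode, *, launch_host_tsc_safe=True,
                         migrated=False, same_tsc_frequency=True,
                         tsc_scaling_available=False,
                         current_host_tsc_safe=True,
                         current_host_has_rdtscp=True):
    """Return True if the guest's TSC read is emulated, per the prose above.

    For tsc_mode 0, 1 and 2 this describes the rdtsc family; for tsc_mode 3
    it describes rdtscp as issued by a pvrdtscp-modified app.
    """
    if tsc_mode == 1:        # always emulate
        return True
    if tsc_mode == 2:        # never emulate
        return False
    if tsc_mode == 0:        # default hybrid
        if not launch_host_tsc_safe:
            return True
        if not migrated:
            return False
        # After save/restore or migration, TSC stays native only if the
        # frequencies match or (HVM/PVH) hardware TSC scaling is available.
        return not (same_tsc_frequency or tsc_scaling_available)
    if tsc_mode == 3:        # PVRDTSCP
        return not (current_host_tsc_safe and current_host_has_rdtscp)
    raise ValueError(f"unknown tsc_mode: {tsc_mode}")


# Default mode on a TSC-safe host, before and after a migration to a
# host with a different TSC frequency and no TSC scaling support.
print(tsc_read_is_emulated(0))
print(tsc_read_is_emulated(0, migrated=True, same_tsc_frequency=False))
```

       The first call reports native execution and the second reports
       emulation, which is the degradation that tsc_mode==1 lets an
       administrator measure in advance.
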
       Note that while Xen ensures that an emulated TSC is "safe" across
       migration, it does not ensure that it continues to tick at the same
       rate during the actual migration.  As an oversimplified example, if TSC
       is ticking once per second in a guest, and the guest is saved when the
       TSC is 1000, then restored 30 seconds later, TSC is only guaranteed to
       be greater than or equal to 1001, not precisely 1030.  This has some OS
       implications as will be seen in the next section.

TSC INVARIANT BIT and NO_MIGRATE

       Related to TSC emulation, the "TSC Invariant" bit is architecturally
       defined in a cpuid bit on the most recent x86 processors.  If set, TSC
       invariance ensures that the TSC is "safe"; that is, it will increment
       at a constant rate regardless of power events, will be synchronized
       across all processors, and was properly initialized to zero on all
       processors at boot-time by system hardware/BIOS.  As long as system
       software never writes to TSC, TSC will be safe and continuously
       incremented at a fixed rate and thus can be used as a system
       "clocksource".

       This bit is used by some OS's, and specifically by Linux starting with
       version 2.6.30(?), to select TSC as a system clocksource.  Once
       selected, TSC remains the Linux system clocksource unless manually
       overridden.  In a virtualized environment, since it is not possible to
       synchronize TSC across all the machines in a pool or data center, a
       migration may "break" TSC as a usable clocksource; while time will not
       go backwards, it may not track wallclock time well enough to avoid
       certain time-sensitive consequences.  As a result, Xen can only expose
       the TSC Invariant bit to a guest OS if it is certain that the domain
       will never migrate.  As of Xen 4.0, the "no_migrate=1" VM configuration
       option may be specified to disable migration.  If no_migrate is
       selected and the VM is running on a physical machine with "TSC
       Invariant", Linux 2.6.30+ will safely use TSC as the system
       clocksource.  However, attempts to migrate this domain or, once it has
       been saved, to restore it will fail.

       There is another cpuid-related complication: The x86 cpuid instruction
       is non-privileged.  HVM domains are configured to always trap this
       instruction to Xen, where Xen can "filter" the result.  In a PV OS, all
       cpuid instructions have been replaced by a paravirtualized equivalent
       of the cpuid instruction ("pvcpuid") and also trap to Xen.  But apps in
       a PV guest that use a cpuid instruction execute it directly, without a
       trap to Xen.  As a result, an app may directly examine the physical TSC
       Invariant cpuid bit and make decisions based on that bit.  This is
       still an unsolved problem, though a workaround exists as part of the
       PVRDTSCP tsc_mode for apps that can be modified.

MORE ON PVRDTSCP

       Paravirtualized OS's use the "pvclock" algorithm to manage the passing
       of time.  This sophisticated algorithm obtains information from a
       memory page shared between Xen and the OS and selects information from
       this page based on the current virtual CPU (vcpu) in order to properly
       adapt to TSC-unsafe systems and changes that occur across migration.
       Neither this shared page nor the vcpu information is available to a
       userland app so the pvclock algorithm cannot be directly used by an
       app, at least without performance degradation roughly equal to the cost
       of just emulating an rdtsc.

       As a result, as of 4.0, Xen provides capabilities for a userland app to
       obtain key time values similar to the information accessible to the PV
       OS pvclock algorithm.  The app uses the rdtscp instruction which is
       defined in recent processors to obtain both the TSC and an auxiliary
       value called TSC_AUX.  Xen is responsible for setting TSC_AUX to the
       same value on all vcpus running any domain with tsc_mode==3; further,
       Xen tools are responsible for monotonically incrementing TSC_AUX
       anytime the domain is restored/migrated (thus changing key time
       values); and, when the domain is running on a physical machine that
       either is not TSC-safe or does not support the rdtscp instruction, Xen
       is responsible for emulating the rdtscp instruction and for setting
       TSC_AUX to zero on all processors.

       Xen also provides pvclock information via a "pvcpuid" instruction.
       While this results in a slow trap, the information changes (and thus
       must be reobtained via pvcpuid) ONLY when TSC_AUX has changed, which
       should be very rare relative to a high frequency of rdtscp
       instructions.

       Finally, Xen provides additional time-related information via other
       pvcpuid instructions.  An app is capable of determining first whether
       it is currently running on Xen, next the tsc_mode setting of the
       domain in which it is running, and finally whether the underlying
       hardware is TSC-safe and supports the rdtscp instruction.

       As a result, a pvrdtscp-modified app has sufficient information to
       compute the pvclock "elapsed nanoseconds" which can be used as a
       timestamp.  And this can be done nearly as fast as a native rdtsc
       instruction, much faster than emulation, and also much faster than
       nearly all OS-provided time mechanisms.  While pvrdtscp is too complex
       for most apps, certain enterprise TSC-sensitive high-TSC-frequency apps
       may find it useful to obtain a significant performance gain.

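       The caching pattern described in this section can be sketched as
       follows.  Everything here is a simplified model under stated
       assumptions: the two callables stand in for the rdtscp instruction and
       the slow pvcpuid query, the parameter tuple (base_tsc, base_ns, tsc_hz)
       and the linear conversion are hypothetical stand-ins for the real
       pvclock data, and the race between reading TSC and refreshing the
       parameters is ignored.

```python
class PvrdtscpClock:
    """Caches time parameters and refreshes them when TSC_AUX changes."""

    def __init__(self, read_rdtscp, fetch_params):
        # read_rdtscp() -> (tsc, tsc_aux); models the rdtscp instruction.
        # fetch_params() -> (base_tsc, base_ns, tsc_hz); models the slow
        # pvcpuid query.  Both callables are hypothetical.
        self._read_rdtscp = read_rdtscp
        self._fetch_params = fetch_params
        self._cached_aux = None
        self._params = None
        self.slow_queries = 0

    def elapsed_ns(self):
        tsc, aux = self._read_rdtscp()
        if aux != self._cached_aux:
            # First call, or the domain was restored/migrated: refresh.
            self._params = self._fetch_params()
            self._cached_aux = aux
            self.slow_queries += 1
        base_tsc, base_ns, tsc_hz = self._params
        return base_ns + (tsc - base_tsc) * 1_000_000_000 // tsc_hz


# Simulated reads: two on the original host (TSC_AUX 1), then one after a
# migration to a slower host (TSC_AUX 2) with new parameters.
reads = iter([(3_000, 1), (6_000, 1), (1_500, 2)])
params = iter([(0, 0, 3_000_000_000), (0, 2_000, 1_500_000_000)])
clock = PvrdtscpClock(reads.__next__, params.__next__)
stamps = [clock.elapsed_ns(), clock.elapsed_ns(), clock.elapsed_ns()]
print(stamps, clock.slow_queries)  # [1000, 2000, 3000] 2
```

       Three timestamps cost only two slow queries, and the elapsed
       nanoseconds keep advancing across the simulated migration even though
       the raw TSC value dropped.
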

HARDWARE TSC SCALING

       Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read by
       the guest rdtsc/rdtscp instructions to increase at a frequency
       different from the host TSC frequency.

       If an HVM container in default TSC mode (tsc_mode=0) or PVRDTSCP mode
       (tsc_mode=3) is created on a host that provides constant TSC, its guest
       TSC frequency will be the same as the host's.  If it is later migrated
       to another host that provides constant TSC and supports Intel VMX TSC
       scaling/AMD SVM TSC ratio, its guest TSC frequency will be the same
       before and after migration.

       For the above HVM container in default TSC mode (tsc_mode=0), if the
       above hosts support rdtscp, both guest rdtsc and rdtscp instructions
       will be executed natively before and after migration.

       For the above HVM container in PVRDTSCP mode (tsc_mode=3), if the
       destination host does not support rdtscp, the guest rdtscp instruction
       will be emulated with the guest TSC frequency.

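       The effect of scaling can be illustrated with an idealized model in
       which the guest TSC is the host TSC multiplied by the ratio of guest
       to host frequency, plus an offset.  Real hardware applies a
       fixed-point multiplier whose format differs between vendors; exact
       integer arithmetic is used here for clarity, and the function name and
       frequencies are hypothetical.

```python
def scaled_guest_tsc(host_tsc, guest_hz, host_hz, offset=0):
    """Guest-visible TSC under an idealized scaling ratio plus offset."""
    return host_tsc * guest_hz // host_hz + offset


GUEST_HZ = 3_000_000_000     # frequency of the host the guest started on
NEW_HOST_HZ = 2_000_000_000  # frequency of the migration target

# One second on the new host advances its TSC by NEW_HOST_HZ ticks; the
# guest still observes GUEST_HZ ticks for that second.
print(scaled_guest_tsc(NEW_HOST_HZ, GUEST_HZ, NEW_HOST_HZ))
```

       Because the guest continues to see its original tick rate, rdtsc can
       stay native after migration instead of falling back to emulation.
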

AUTHORS

       Dan Magenheimer <dan.magenheimer@oracle.com>

4.11.1                            2018-11-29                    xen-tscmode(7)