float.h(0P)               POSIX Programmer's Manual               float.h(0P)

PROLOG
This manual page is part of the POSIX Programmer's Manual. The Linux
implementation of this interface may differ (consult the corresponding
Linux manual page for details of Linux behavior), or the interface may
not be implemented on Linux.

NAME
float.h - floating types

SYNOPSIS
#include <float.h>

DESCRIPTION
The functionality described on this reference page is aligned with the
ISO C standard. Any conflict between the requirements described here
and the ISO C standard is unintentional. This volume of POSIX.1-2017
defers to the ISO C standard.

The characteristics of floating types are defined in terms of a model
that describes a representation of floating-point numbers and values
that provide information about an implementation's floating-point
arithmetic.

The following parameters are used to define the model for each
floating-point type:

s     Sign (±1).

b     Base or radix of exponent representation (an integer >1).

e     Exponent (an integer between a minimum e_min and a maximum
      e_max).

p     Precision (the number of base-b digits in the significand).

f_k   Non-negative integers less than b (the significand digits).

A floating-point number x is defined by the following model:

    x = s b^e Σ_{k=1}^{p} f_k b^{−k},    e_min ≤ e ≤ e_max
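
As an illustration of the model, the following sketch splits a double
into the sign s, the exponent e, and the significand sum (here called f)
and then rebuilds the value. The names fp_model, fp_decompose, and
fp_compose are hypothetical and are not part of <float.h>. The sketch
assumes FLT_RADIX is 2, because frexp() always works in base 2, and it
expects a finite, non-zero argument.

```c
#include <float.h>
#include <math.h>

#if FLT_RADIX != 2
#error "this sketch assumes a binary floating-point format"
#endif

/* Hypothetical container for the model parameters of one value. */
struct fp_model {
    int    s;   /* sign: +1 or -1 */
    int    e;   /* exponent, so that x == s * f * 2^e */
    double f;   /* sum of f_k * 2^-k; frexp() yields 1/2 <= f < 1 */
};

/* Splits a finite, non-zero x into sign, exponent and significand. */
static struct fp_model fp_decompose(double x)
{
    struct fp_model m;

    m.s = signbit(x) ? -1 : 1;
    m.f = frexp(fabs(x), &m.e);
    return m;
}

/* Rebuilds the value s * f * 2^e from its parts. */
static double fp_compose(struct fp_model m)
{
    return m.s * ldexp(m.f, m.e);
}
```

For example, fp_decompose(-6.0) gives s = -1, e = 3, and f = 0.75, since
-6 = -1 × 0.75 × 2^3. Note that frexp() always returns a normalized
fraction, so for a subnormal argument the returned exponent is below the
model's e_min rather than equal to it.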

In addition to normalized floating-point numbers (f_1>0 if x≠0),
floating types may be able to contain other kinds of floating-point
numbers, such as subnormal floating-point numbers (x≠0, e=e_min, f_1=0)
and unnormalized floating-point numbers (x≠0, e>e_min, f_1=0), and
values that are not floating-point numbers, such as infinities and
NaNs. A NaN is an encoding signifying Not-a-Number. A quiet NaN
propagates through almost every arithmetic operation without raising a
floating-point exception; a signaling NaN generally raises a
floating-point exception when occurring as an arithmetic operand.
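
The kinds of values listed above can be told apart with the fpclassify()
macro from <math.h>. The following is a minimal sketch; the function name
fp_kind is hypothetical, and whether subnormal numbers, infinities, and
NaNs exist at all depends on the implementation.

```c
#include <float.h>
#include <math.h>

/* Returns a short description of the kind of value stored in x. */
static const char *fp_kind(double x)
{
    switch (fpclassify(x)) {
    case FP_NAN:       return "NaN";
    case FP_INFINITE:  return "infinity";
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "subnormal";
    case FP_NORMAL:    return "normalized";
    default:           return "implementation-defined";
    }
}
```

On an implementation with IEEE 754 style subnormal numbers, fp_kind(DBL_MIN)
reports "normalized" and fp_kind(DBL_MIN / 2) reports "subnormal".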

An implementation may give zero and non-numeric values, such as
infinities and NaNs, a sign, or may leave them unsigned. Wherever such
values are unsigned, any requirement in POSIX.1-2017 to retrieve the
sign shall produce an unspecified sign and any requirement to set the
sign shall be ignored.

The accuracy of the floating-point operations ('+', '-', '*', '/') and
of the functions in <math.h> and <complex.h> that return floating-point
results is implementation-defined, as is the accuracy of the conversion
between floating-point internal representations and string
representations performed by the functions in <stdio.h>, <stdlib.h>,
and <wchar.h>. The implementation may state that the accuracy is
unknown.

All integer values in the <float.h> header, except FLT_ROUNDS, shall be
constant expressions suitable for use in #if preprocessing directives;
all floating values shall be constant expressions. All except
DECIMAL_DIG, FLT_EVAL_METHOD, FLT_RADIX, and FLT_ROUNDS have separate
names for all three floating-point types. The floating-point model
representation is provided for all values except FLT_EVAL_METHOD and
FLT_ROUNDS.
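
The difference between the two kinds of constant expression can be
sketched as follows. The macro HAVE_BINARY64_PRECISION and the helper
functions are hypothetical names chosen for this illustration only.

```c
#include <float.h>

/* Integer values other than FLT_ROUNDS may be tested in #if. */
#if FLT_RADIX == 2 && DBL_MANT_DIG >= 53
#define HAVE_BINARY64_PRECISION 1
#else
#define HAVE_BINARY64_PRECISION 0
#endif

/* Floating values are constant expressions, so they may initialize an
 * object with static storage duration, but #if accepts only integer
 * constant expressions, so they cannot be tested there. */
static const double tolerance = 4 * DBL_EPSILON;

static int have_binary64_precision(void)
{
    return HAVE_BINARY64_PRECISION;
}

static double default_tolerance(void)
{
    return tolerance;
}
```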

The rounding mode for floating-point addition is characterized by the
implementation-defined value of FLT_ROUNDS:

-1    Indeterminable.

 0    Toward zero.

 1    To nearest.

 2    Toward positive infinity.

 3    Toward negative infinity.

All other values for FLT_ROUNDS characterize implementation-defined
rounding behavior.
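
The table above maps directly onto a switch statement. In this sketch
the function names rounding_mode_name and current_rounding_mode are
hypothetical. FLT_ROUNDS is read in ordinary code rather than in #if,
since it is the one integer value not required to be usable there.

```c
#include <float.h>

/* Translates a FLT_ROUNDS value into the description used above. */
static const char *rounding_mode_name(int mode)
{
    switch (mode) {
    case -1: return "indeterminable";
    case 0:  return "toward zero";
    case 1:  return "to nearest";
    case 2:  return "toward positive infinity";
    case 3:  return "toward negative infinity";
    default: return "implementation-defined";
    }
}

/* Describes the rounding mode in effect when the function is called. */
static const char *current_rounding_mode(void)
{
    return rounding_mode_name(FLT_ROUNDS);
}
```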

The values of operations with floating operands and values subject to
the usual arithmetic conversions and of floating constants are
evaluated to a format whose range and precision may be greater than
required by the type. The use of evaluation formats is characterized by
the implementation-defined value of FLT_EVAL_METHOD:

-1    Indeterminable.

 0    Evaluate all operations and constants just to the range and
      precision of the type.

 1    Evaluate operations and constants of type float and double to the
      range and precision of the double type; evaluate long double
      operations and constants to the range and precision of the long
      double type.

 2    Evaluate all operations and constants to the range and precision
      of the long double type.

All other negative values for FLT_EVAL_METHOD characterize
implementation-defined behavior.
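
A program that depends on intermediate results having exactly the
precision of their type can test FLT_EVAL_METHOD at compile time. The
following sketch uses a hypothetical function name,
may_use_excess_precision, and conservatively treats every value other
than 0 as possibly wider.

```c
#include <float.h>

/* Returns 1 when float or double expressions may be evaluated in a
 * format wider than their type, 0 when they are evaluated exactly to
 * the range and precision of the type. */
static int may_use_excess_precision(void)
{
#if FLT_EVAL_METHOD == 0
    return 0;
#else
    /* 1, 2, -1 (indeterminable), or an implementation-defined value */
    return 1;
#endif
}
```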

The <float.h> header shall define the following values as constant
expressions with implementation-defined values that are greater than or
equal in magnitude (absolute value) to those shown, with the same sign.

*  Radix of exponent representation, b.

   FLT_RADIX         2

*  Number of base-FLT_RADIX digits in the floating-point significand,
   p.

   FLT_MANT_DIG

   DBL_MANT_DIG

   LDBL_MANT_DIG

*  Number of decimal digits, n, such that any floating-point number in
   the widest supported floating type with p_max radix b digits can be
   rounded to a floating-point number with n decimal digits and back
   again without change to the value.

       p_max log_10 b            if b is a power of 10
       ⌈1 + p_max log_10 b⌉      otherwise

   DECIMAL_DIG       10

*  Number of decimal digits, q, such that any floating-point number
   with q decimal digits can be rounded into a floating-point number
   with p radix b digits and back again without change to the q
   decimal digits.

       p log_10 b                if b is a power of 10
       ⌊(p − 1) log_10 b⌋        otherwise

   FLT_DIG           6

   DBL_DIG           10

   LDBL_DIG          10

*  Minimum negative integer such that FLT_RADIX raised to one less than
   that power is a normalized floating-point number, e_min.

   FLT_MIN_EXP

   DBL_MIN_EXP

   LDBL_MIN_EXP

*  Minimum negative integer such that 10 raised to that power is in
   the range of normalized floating-point numbers.

       ⌈log_10 b^{e_min − 1}⌉

   FLT_MIN_10_EXP    -37

   DBL_MIN_10_EXP    -37

   LDBL_MIN_10_EXP   -37

*  Maximum integer such that FLT_RADIX raised to one less than that
   power is a representable finite floating-point number, e_max.

   FLT_MAX_EXP

   DBL_MAX_EXP

   LDBL_MAX_EXP

   Additionally, FLT_MAX_EXP shall be at least as large as
   FLT_MANT_DIG, DBL_MAX_EXP shall be at least as large as
   DBL_MANT_DIG, and LDBL_MAX_EXP shall be at least as large as
   LDBL_MANT_DIG; this has the effect that FLT_MAX, DBL_MAX, and
   LDBL_MAX are integral.

*  Maximum integer such that 10 raised to that power is in the range
   of representable finite floating-point numbers.

       ⌊log_10((1 − b^{−p}) b^{e_max})⌋

   FLT_MAX_10_EXP    +37

   DBL_MAX_10_EXP    +37

   LDBL_MAX_10_EXP   +37
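
The two round-trip guarantees in the list above, for DECIMAL_DIG and for
the *_DIG values, run in opposite directions. The following sketch
exercises both through the <stdio.h> and <stdlib.h> conversion
functions. The function names survives_decimal and survives_double are
hypothetical, and the sketch assumes that the implementation's string
conversions are correctly rounded, which this page does not require.

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Binary to decimal to binary: prints x with DECIMAL_DIG significant
 * digits and reads the text back. Returns 1 if the value is unchanged. */
static int survives_decimal(long double x)
{
    char buf[128];

    snprintf(buf, sizeof buf, "%.*Lg", DECIMAL_DIG, x);
    return strtold(buf, NULL) == x;
}

/* Decimal to binary to decimal: converts the text to double and prints
 * the result with DBL_DIG significant digits. Returns 1 if the original
 * text is reproduced. The text must already have the form that "%.*g"
 * produces (no trailing zeros, no leading '+'). */
static int survives_double(const char *decimal)
{
    char buf[128];

    snprintf(buf, sizeof buf, "%.*g", DBL_DIG, strtod(decimal, NULL));
    return strcmp(buf, decimal) == 0;
}
```

Where DBL_DIG is 15, as with IEEE 754 binary64, a string with 15
significant digits such as "0.123456789012345" is reproduced exactly,
whereas a string with 17 significant digits is not guaranteed to be.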

The <float.h> header shall define the following values as constant
expressions with implementation-defined values that are greater than or
equal to those shown:

*  Maximum representable finite floating-point number.

       (1 − b^{−p}) b^{e_max}

   FLT_MAX           1E+37

   DBL_MAX           1E+37

   LDBL_MAX          1E+37
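
The formula for the maximum value also shows why requiring e_max to be
at least p makes it integral: (1 − b^{−p}) b^{e_max} can be rewritten as
(b^p − 1) b^{e_max − p}, an integer multiplied by a non-negative power
of the radix. The following sketch evaluates both forms for double. The
function names are hypothetical, and the code assumes FLT_RADIX is 2 so
that ldexp() scales by exact powers of the radix.

```c
#include <float.h>
#include <math.h>

/* Compile-time check of the requirement e_max >= p for each type. */
#if FLT_MAX_EXP < FLT_MANT_DIG || DBL_MAX_EXP < DBL_MANT_DIG || \
    LDBL_MAX_EXP < LDBL_MANT_DIG
#error "exponent range smaller than precision: maximum is not integral"
#endif

/* (1 - b^-p) * b^e_max, assuming b == 2. The factor 1 - 2^-p has
 * exactly p significant bits, so it is representable without error. */
static double model_dbl_max(void)
{
    double fraction = 1.0 - ldexp(1.0, -DBL_MANT_DIG);

    return ldexp(fraction, DBL_MAX_EXP);
}

/* The same value as an integer times a power of the radix:
 * (b^p - 1) * b^(e_max - p), assuming b == 2. */
static double integral_dbl_max(void)
{
    double all_ones = ldexp(1.0, DBL_MANT_DIG) - 1.0;

    return ldexp(all_ones, DBL_MAX_EXP - DBL_MANT_DIG);
}
```

On a binary implementation both functions are expected to return a value
that compares equal to DBL_MAX.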

The <float.h> header shall define the following values as constant
expressions with implementation-defined (positive) values that are less
than or equal to those shown:

*  The difference between 1 and the least value greater than 1 that is
   representable in the given floating-point type, b^{1 − p}.

   FLT_EPSILON       1E-5

   DBL_EPSILON       1E-9

   LDBL_EPSILON      1E-9

*  Minimum normalized positive floating-point number, b^{e_min − 1}.

   FLT_MIN           1E-37

   DBL_MIN           1E-37

   LDBL_MIN          1E-37
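
Both formulas in the list above can be checked against the corresponding
macros. The following sketch does so for double. The function names are
hypothetical, and the code assumes FLT_RADIX is 2 so that ldexp()
computes exact powers of the radix.

```c
#include <float.h>
#include <math.h>

/* b^(1 - p): the gap between 1 and the next larger double (b == 2). */
static double model_dbl_epsilon(void)
{
    return ldexp(1.0, 1 - DBL_MANT_DIG);
}

/* b^(e_min - 1): the smallest normalized positive double (b == 2). */
static double model_dbl_min(void)
{
    return ldexp(1.0, DBL_MIN_EXP - 1);
}
```

On a binary implementation model_dbl_epsilon() is expected to compare
equal to DBL_EPSILON and model_dbl_min() to DBL_MIN.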

The following sections are informative.

APPLICATION USAGE
None.

RATIONALE
All known hardware floating-point formats satisfy the property that the
exponent range is larger than the number of mantissa digits. The ISO C
standard permits a floating-point format where this property is not
true, such that the largest finite value would not be integral;
however, it is unlikely that there will ever be hardware support for
such a floating-point format, and it introduces boundary cases that
portable programs should not have to be concerned with (for example, a
non-integral DBL_MAX means that ceil() would have to worry about
overflow). Therefore, this standard imposes an additional requirement
that the largest representable finite value is integral.

FUTURE DIRECTIONS
None.

SEE ALSO
<complex.h>, <math.h>, <stdio.h>, <stdlib.h>, <wchar.h>

COPYRIGHT
Portions of this text are reprinted and reproduced in electronic form
from IEEE Std 1003.1-2017, Standard for Information Technology --
Portable Operating System Interface (POSIX), The Open Group Base
Specifications Issue 7, 2018 Edition, Copyright (C) 2018 by the
Institute of Electrical and Electronics Engineers, Inc and The Open
Group. In the event of any discrepancy between this version and the
original IEEE and The Open Group Standard, the original IEEE and The
Open Group Standard is the referee document. The original Standard can
be obtained online at http://www.opengroup.org/unix/online.html .

Any typographical or formatting errors that appear in this page are
most likely to have been introduced during the conversion of the source
files to man page format. To report such errors, see
https://www.kernel.org/doc/man-pages/reporting_bugs.html .

IEEE/The Open Group                  2017                       float.h(0P)