float.h(0P)          POSIX Programmer's Manual          float.h(0P)

PROLOG
This manual page is part of the POSIX Programmer's Manual. The Linux
implementation of this interface may differ (consult the corresponding
Linux manual page for details of Linux behavior), or the interface may
not be implemented on Linux.

NAME
float.h - floating types

SYNOPSIS
#include <float.h>

DESCRIPTION
The functionality described on this reference page is aligned with the
ISO C standard. Any conflict between the requirements described here
and the ISO C standard is unintentional. This volume of POSIX.1-2008
defers to the ISO C standard.
23
The characteristics of floating types are defined in terms of a model
that describes a representation of floating-point numbers and values
that provide information about an implementation's floating-point
arithmetic.

The following parameters are used to define the model for each
floating-point type:

s     Sign (±1).

b     Base or radix of exponent representation (an integer >1).

e     Exponent (an integer between a minimum e_min and a maximum
      e_max).

p     Precision (the number of base-b digits in the significand).

f_k   Non-negative integers less than b (the significand digits).

A floating-point number x is defined by the following model:

x = s b^{e} \sum_{k=1}^{p} f_k b^{-k},    e_{\min} \le e \le e_{\max}
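As an illustration of the model, the sketch below recovers s, e, and the digits f_1..f_p of a double. It assumes FLT_RADIX is 2, in which case frexp() from <math.h> returns a fraction in [0.5, 1) and an exponent that line up with the model for normalized values. The helper name model_decompose is hypothetical and is not part of <float.h>.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: decompose a finite, nonzero x into the model's
 * sign s, exponent e, and significand digits f_1..f_p (digits[0] is f_1).
 * Assumes FLT_RADIX == 2, so frexp() yields |x| = frac * 2^e with
 * 0.5 <= frac < 1. For subnormal inputs frexp() still normalizes, so e
 * comes out below DBL_MIN_EXP instead of the model's e = e_min.
 * Returns 0 on success, -1 if the input is not handled. */
static int model_decompose(double x, int *s, int *e,
                           int digits[DBL_MANT_DIG])
{
    double frac;
    int k;

    if (FLT_RADIX != 2 || x == 0.0 || !isfinite(x))
        return -1;

    *s = signbit(x) ? -1 : 1;
    frac = frexp(fabs(x), e);

    for (k = 0; k < DBL_MANT_DIG; k++) {
        frac *= 2.0;               /* move the next digit above the point */
        digits[k] = (int)frac;     /* each digit is 0 or 1 when b == 2 */
        frac -= digits[k];
    }
    return 0;
}
```

For x = -6.0 this yields s = -1, e = 3, and digits 1, 1, 0, 0, ..., since 6 = 2^3 × (1/2 + 1/4).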
46
In addition to normalized floating-point numbers (f_1>0 if x≠0),
floating types may be able to contain other kinds of floating-point
numbers, such as subnormal floating-point numbers (x≠0, e=e_min, f_1=0)
and unnormalized floating-point numbers (x≠0, e>e_min, f_1=0), and
values that are not floating-point numbers, such as infinities and
NaNs. A NaN is an encoding signifying Not-a-Number. A quiet NaN
propagates through almost every arithmetic operation without raising a
floating-point exception; a signaling NaN generally raises a
floating-point exception when occurring as an arithmetic operand.
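The kinds of values named above can be told apart at run time. The following is a minimal sketch that uses the ISO C fpclassify() macro from <math.h>, which this page does not itself describe; the helper name kind_of is hypothetical.

```c
#include <math.h>

/* Hypothetical helper: map fpclassify() results to the kinds of values
 * discussed in the text. Unnormalized numbers have no standard category
 * and would fall under the default case, if the implementation reports
 * them separately at all. */
static const char *kind_of(double x)
{
    switch (fpclassify(x)) {
    case FP_NORMAL:    return "normalized";
    case FP_SUBNORMAL: return "subnormal";
    case FP_ZERO:      return "zero";
    case FP_INFINITE:  return "infinity";
    case FP_NAN:       return "NaN";
    default:           return "other (implementation-defined)";
    }
}
```

On an implementation that supports subnormal numbers, kind_of(DBL_MIN / 2) would be expected to report "subnormal".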
56
An implementation may give zero and non-numeric values, such as
infinities and NaNs, a sign, or may leave them unsigned. Wherever such
values are unsigned, any requirement in POSIX.1-2008 to retrieve the
sign shall produce an unspecified sign and any requirement to set the
sign shall be ignored.

The accuracy of the floating-point operations ('+', '-', '*', '/') and
of the functions in <math.h> and <complex.h> that return floating-point
results is implementation-defined, as is the accuracy of the conversion
between floating-point internal representations and string
representations performed by the functions in <stdio.h>, <stdlib.h>,
and <wchar.h>. The implementation may state that the accuracy is
unknown.

All integer values in the <float.h> header, except FLT_ROUNDS, shall be
constant expressions suitable for use in #if preprocessing directives;
all floating values shall be constant expressions. All except
DECIMAL_DIG, FLT_EVAL_METHOD, FLT_RADIX, and FLT_ROUNDS have separate
names for all three floating-point types. The floating-point model
representation is provided for all values except FLT_EVAL_METHOD and
FLT_ROUNDS.
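The distinction between integer and floating values matters in practice, as the short sketch below tries to show. The macro HAVE_BINARY64_PRECISION and the object tolerance are hypothetical names chosen for illustration.

```c
#include <float.h>

/* Integer-valued macros other than FLT_ROUNDS may be tested in #if. */
#if FLT_RADIX == 2 && DBL_MANT_DIG >= 53
#define HAVE_BINARY64_PRECISION 1   /* hypothetical feature macro */
#else
#define HAVE_BINARY64_PRECISION 0
#endif

/* Floating values such as DBL_EPSILON are constant expressions, so they
 * may initialize objects with static storage duration. They cannot be
 * tested in #if, which evaluates integer expressions only. */
static const double tolerance = 4 * DBL_EPSILON;   /* hypothetical name */
```

FLT_ROUNDS is deliberately left out of the #if test, since it is the one integer value the text exempts from being usable there.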
77
The rounding mode for floating-point addition is characterized by the
implementation-defined value of FLT_ROUNDS:

−1    Indeterminable.

 0    Toward zero.

 1    To nearest.

 2    Toward positive infinity.

 3    Toward negative infinity.

All other values for FLT_ROUNDS characterize implementation-defined
rounding behavior.
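The table above could be turned into a run-time description along the following lines. This is only a sketch; the helper name rounds_name is hypothetical.

```c
#include <float.h>

/* Hypothetical helper: describe a FLT_ROUNDS value using the table in
 * the text. Any value outside -1..3 is implementation-defined. */
static const char *rounds_name(int mode)
{
    switch (mode) {
    case -1: return "indeterminable";
    case 0:  return "toward zero";
    case 1:  return "to nearest";
    case 2:  return "toward positive infinity";
    case 3:  return "toward negative infinity";
    default: return "implementation-defined";
    }
}
```

A caller would write rounds_name(FLT_ROUNDS). The value is passed as an ordinary run-time argument because FLT_ROUNDS is not required to be usable in #if.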
93
The values of operations with floating operands and values subject to
the usual arithmetic conversions and of floating constants are
evaluated to a format whose range and precision may be greater than
required by the type. The use of evaluation formats is characterized by
the implementation-defined value of FLT_EVAL_METHOD:

−1    Indeterminable.

 0    Evaluate all operations and constants just to the range and
      precision of the type.

 1    Evaluate operations and constants of type float and double to the
      range and precision of the double type; evaluate long double
      operations and constants to the range and precision of the long
      double type.

 2    Evaluate all operations and constants to the range and precision
      of the long double type.

All other negative values for FLT_EVAL_METHOD characterize
implementation-defined behavior.
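To make the table concrete, the sketch below names the format in which an expression of type float would be evaluated for a given FLT_EVAL_METHOD value. The helper name float_eval_format is hypothetical.

```c
#include <float.h>

/* Hypothetical helper: name the evaluation format used for operands of
 * type float, following the table in the text. Negative values are
 * either indeterminable (-1) or implementation-defined. */
static const char *float_eval_format(int method)
{
    switch (method) {
    case 0:  return "float";
    case 1:  return "double";
    case 2:  return "long double";
    default: return "indeterminable or implementation-defined";
    }
}
```

A caller would write float_eval_format(FLT_EVAL_METHOD). Unlike FLT_ROUNDS, FLT_EVAL_METHOD is an integer constant expression, so the same decision could also be made with #if.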
115
The <float.h> header shall define the following values as constant
expressions with implementation-defined values that are greater than or
equal in magnitude (absolute value) to those shown, with the same sign.

*  Radix of exponent representation, b.

   FLT_RADIX          2

*  Number of base-FLT_RADIX digits in the floating-point significand,
   p.

   FLT_MANT_DIG
   DBL_MANT_DIG
   LDBL_MANT_DIG

*  Number of decimal digits, n, such that any floating-point number in
   the widest supported floating type with p_max radix b digits can be
   rounded to a floating-point number with n decimal digits and back
   again without change to the value.

   p_{\max} \log_{10} b                        if b is a power of 10
   \lceil 1 + p_{\max} \log_{10} b \rceil      otherwise

   DECIMAL_DIG        10

*  Number of decimal digits, q, such that any floating-point number
   with q decimal digits can be rounded into a floating-point number
   with p radix b digits and back again without change to the q
   decimal digits.

   p \log_{10} b                               if b is a power of 10
   \lfloor (p - 1) \log_{10} b \rfloor         otherwise

   FLT_DIG            6
   DBL_DIG            10
   LDBL_DIG           10

*  Minimum negative integer such that FLT_RADIX raised to one less than
   that power is a normalized floating-point number, e_min.

   FLT_MIN_EXP
   DBL_MIN_EXP
   LDBL_MIN_EXP

*  Minimum negative integer such that 10 raised to that power is in
   the range of normalized floating-point numbers.

   \lceil \log_{10} b^{e_{\min} - 1} \rceil

   FLT_MIN_10_EXP     −37
   DBL_MIN_10_EXP     −37
   LDBL_MIN_10_EXP    −37

*  Maximum integer such that FLT_RADIX raised to one less than that
   power is a representable finite floating-point number, e_max.

   FLT_MAX_EXP
   DBL_MAX_EXP
   LDBL_MAX_EXP

   Additionally, FLT_MAX_EXP shall be at least as large as
   FLT_MANT_DIG, DBL_MAX_EXP shall be at least as large as
   DBL_MANT_DIG, and LDBL_MAX_EXP shall be at least as large as
   LDBL_MANT_DIG; this has the effect that FLT_MAX, DBL_MAX, and
   LDBL_MAX are integral.

*  Maximum integer such that 10 raised to that power is in the range
   of representable finite floating-point numbers.

   \lfloor \log_{10} ((1 - b^{-p}) b^{e_{\max}}) \rfloor

   FLT_MAX_10_EXP     +37
   DBL_MAX_10_EXP     +37
   LDBL_MAX_10_EXP    +37
206
The <float.h> header shall define the following values as constant
expressions with implementation-defined values that are greater than or
equal to those shown:

*  Maximum representable finite floating-point number.

   (1 - b^{-p}) b^{e_{\max}}

   FLT_MAX            1E+37
   DBL_MAX            1E+37
   LDBL_MAX           1E+37
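As a worked check of the formula for the maximum value, the sketch below evaluates it for type float using double arithmetic. It assumes FLT_RADIX is 2, so that ldexp() scales by powers of b, and that double is wide enough to hold the float result exactly. The function name flt_max_from_model is hypothetical.

```c
#include <float.h>
#include <math.h>

/* Evaluate (1 - b^-p) * b^e_max for type float.
 * Assumes FLT_RADIX == 2 and that double has more range and precision
 * than float, so every intermediate value is exact. */
static double flt_max_from_model(void)
{
    double one_minus_ulp = 1.0 - ldexp(1.0, -FLT_MANT_DIG);  /* 1 - b^-p */

    return ldexp(one_minus_ulp, FLT_MAX_EXP);                /* * b^e_max */
}
```

Under those assumptions the result compares equal to (double)FLT_MAX.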
220
The <float.h> header shall define the following values as constant
expressions with implementation-defined (positive) values that are less
than or equal to those shown:

*  The difference between 1 and the least value greater than 1 that is
   representable in the given floating-point type, b^{1 - p}.

   FLT_EPSILON        1E−5
   DBL_EPSILON        1E−9
   LDBL_EPSILON       1E−9

*  Minimum normalized positive floating-point number, b^{e_{\min} - 1}.

   FLT_MIN            1E−37
   DBL_MIN            1E−37
   LDBL_MIN           1E−37
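Both definitions above can be reproduced with functions from <math.h>, as in the following sketch for type float. It assumes FLT_RADIX is 2 where ldexpf() is used; the function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Difference between 1 and the next representable float above 1,
 * found by stepping with nextafterf(). */
static float epsilon_by_stepping(void)
{
    return nextafterf(1.0f, 2.0f) - 1.0f;
}

/* b^(e_min - 1); assumes FLT_RADIX == 2 so that ldexpf() scales by
 * powers of b. */
static float min_normal_from_model(void)
{
    return ldexpf(1.0f, FLT_MIN_EXP - 1);
}
```

These would be expected to compare equal to FLT_EPSILON and FLT_MIN respectively; FLT_EPSILON can likewise be written as ldexpf(1.0f, 1 - FLT_MANT_DIG) under the same radix assumption.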
241
The following sections are informative.

APPLICATION USAGE
None.

RATIONALE
All known hardware floating-point formats satisfy the property that the
exponent range is larger than the number of mantissa digits. The ISO C
standard permits a floating-point format where this property is not
true, such that the largest finite value would not be integral;
however, it is unlikely that there will ever be hardware support for
such a floating-point format, and it introduces boundary cases that
portable programs should not have to be concerned with (for example, a
non-integral DBL_MAX means that ceil() would have to worry about
overflow). Therefore, this standard imposes an additional requirement
that the largest representable finite value is integral.
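The boundary case mentioned above can be probed directly. The sketch below is illustrative only; the function name dbl_max_is_integral is hypothetical.

```c
#include <float.h>
#include <math.h>

/* With DBL_MAX_EXP >= DBL_MANT_DIG, DBL_MAX has no fractional part, so
 * ceil() and floor() can return it unchanged and never need a larger,
 * unrepresentable result. Returns nonzero when that holds. */
static int dbl_max_is_integral(void)
{
    return DBL_MAX_EXP >= DBL_MANT_DIG
        && ceil(DBL_MAX) == DBL_MAX
        && floor(DBL_MAX) == DBL_MAX;
}
```

On an implementation that meets the additional requirement, dbl_max_is_integral() would be expected to return nonzero.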
258
FUTURE DIRECTIONS
None.

SEE ALSO
<complex.h>, <math.h>, <stdio.h>, <stdlib.h>, <wchar.h>

COPYRIGHT
Portions of this text are reprinted and reproduced in electronic form
from IEEE Std 1003.1, 2013 Edition, Standard for Information Technology
-- Portable Operating System Interface (POSIX), The Open Group Base
Specifications Issue 7, Copyright (C) 2013 by the Institute of
Electrical and Electronics Engineers, Inc and The Open Group. (This is
POSIX.1-2008 with the 2013 Technical Corrigendum 1 applied.) In the
event of any discrepancy between this version and the original IEEE and
The Open Group Standard, the original IEEE and The Open Group Standard
is the referee document. The original Standard can be obtained online
at http://www.unix.org/online.html .

Any typographical or formatting errors that appear in this page are
most likely to have been introduced during the conversion of the source
files to man page format. To report such errors, see
https://www.kernel.org/doc/man-pages/reporting_bugs.html .

IEEE/The Open Group               2013               float.h(0P)