float.h(0p)

float.h(0P)                POSIX Programmer's Manual               float.h(0P)

PROLOG

       This manual page is part of the POSIX Programmer's Manual. The Linux
       implementation of this interface may differ (consult the corresponding
       Linux manual page for details of Linux behavior), or the interface may
       not be implemented on Linux.

NAME

       float.h - floating types

SYNOPSIS

       #include <float.h>

DESCRIPTION

       The functionality described on this reference page is aligned with the
       ISO C standard. Any conflict between the requirements described here
       and the ISO C standard is unintentional. This volume of POSIX.1-2017
       defers to the ISO C standard.

       The characteristics of floating types are defined in terms of a model
       that describes a representation of floating-point numbers and values
       that provide information about an implementation's floating-point
       arithmetic.

       The following parameters are used to define the model for each
       floating-point type:

       s     Sign (±1).

       b     Base or radix of exponent representation (an integer >1).

       e     Exponent (an integer between a minimum e_min and a maximum
             e_max).

       p     Precision (the number of base-b digits in the significand).

       f_k   Non-negative integers less than b (the significand digits).

       A floating-point number x is defined by the following model:

       x = s b^e Σ_{k=1}^{p} f_k b^{−k},   e_min ≤ e ≤ e_max
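
       As an illustration of the model (not part of the standard text), the
       following sketch recovers s, e, and the digits f_k of a normalized
       double. The helper name model_decompose() is hypothetical, and the
       code assumes that FLT_RADIX is 2, so that the exponent returned by
       frexp() from <math.h> coincides with e.

```c
#include <float.h>
#include <math.h>

/*
 * model_decompose() is a hypothetical helper, not part of <float.h>.
 * It splits a normalized double into the model's sign s, exponent e,
 * and significand digits f_1..f_p (stored in f[0]..f[p-1]).
 * Assumption: FLT_RADIX == 2, so frexp() returns the model exponent.
 * Returns 0 on success, -1 if x is not normalized or the radix is not 2.
 */
int model_decompose(double x, int *s, int *e, int f[DBL_MANT_DIG])
{
    if (FLT_RADIX != 2 || !isnormal(x))
        return -1;

    *s = signbit(x) ? -1 : 1;

    /* frexp() gives |x| = frac * 2^e with frac in [0.5, 1). */
    double frac = frexp(fabs(x), e);

    for (int k = 0; k < DBL_MANT_DIG; k++) {
        frac *= 2.0;        /* exact: moves the next digit to the units place */
        f[k] = (int)frac;   /* 0 or 1 */
        frac -= f[k];
    }
    return 0;
}
```

       For x = 6.0 the sketch yields s = 1, e = 3, and digits 1, 1, 0, ...,
       since 6 = 2^3 × (1/2 + 1/4).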

       In addition to normalized floating-point numbers (f_1>0 if x≠0),
       floating types may be able to contain other kinds of floating-point
       numbers, such as subnormal floating-point numbers (x≠0, e=e_min, f_1=0)
       and unnormalized floating-point numbers (x≠0, e>e_min, f_1=0), and
       values that are not floating-point numbers, such as infinities and
       NaNs. A NaN is an encoding signifying Not-a-Number. A quiet NaN
       propagates through almost every arithmetic operation without raising a
       floating-point exception; a signaling NaN generally raises a
       floating-point exception when occurring as an arithmetic operand.
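
       The kinds of values listed above can be told apart at run time with
       the fpclassify() macro from <math.h>, which is defined by ISO C and
       not by this header. The sketch below is illustrative only, and the
       helper name value_kind() is hypothetical.

```c
#include <float.h>   /* DBL_MIN, used in the usage note below */
#include <math.h>    /* fpclassify() and the FP_* classification macros */

/* Hypothetical helper: names the kind of value held in x. */
const char *value_kind(double x)
{
    switch (fpclassify(x)) {
    case FP_NAN:       return "NaN";
    case FP_INFINITE:  return "infinity";
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "subnormal";
    case FP_NORMAL:    return "normalized";
    default:           return "other";  /* implementation-defined class */
    }
}
```

       On an implementation that supports subnormal numbers,
       value_kind(DBL_MIN / 2) is expected to report a subnormal value.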

       An implementation may give zero and non-numeric values, such as
       infinities and NaNs, a sign, or may leave them unsigned. Wherever such
       values are unsigned, any requirement in POSIX.1-2008 to retrieve the
       sign shall produce an unspecified sign and any requirement to set the
       sign shall be ignored.

       The accuracy of the floating-point operations ('+', '-', '*', '/') and
       of the functions in <math.h> and <complex.h> that return floating-point
       results is implementation-defined, as is the accuracy of the conversion
       between floating-point internal representations and string
       representations performed by the functions in <stdio.h>, <stdlib.h>,
       and <wchar.h>. The implementation may state that the accuracy is
       unknown.

       All integer values in the <float.h> header, except FLT_ROUNDS, shall be
       constant expressions suitable for use in #if preprocessing directives;
       all floating values shall be constant expressions. All except
       DECIMAL_DIG, FLT_EVAL_METHOD, FLT_RADIX, and FLT_ROUNDS have separate
       names for all three floating-point types. The floating-point model
       representation is provided for all values except FLT_EVAL_METHOD and
       FLT_ROUNDS.
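
       Because the integer-valued macros are usable in #if directives, a
       program can select code at compile time. The following is a minimal
       sketch; the macro DOUBLE_LOOKS_LIKE_BINARY64 and the function name are
       hypothetical, and matching these three parameters does not by itself
       prove that double is an IEEE 754 binary64 type.

```c
#include <float.h>

/* Compile-time test on integer-valued <float.h> macros. */
#if FLT_RADIX == 2 && DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024
#define DOUBLE_LOOKS_LIKE_BINARY64 1   /* hypothetical feature macro */
#else
#define DOUBLE_LOOKS_LIKE_BINARY64 0
#endif

/* Hypothetical accessor so the result can also be inspected at run time. */
int double_looks_like_binary64(void)
{
    return DOUBLE_LOOKS_LIKE_BINARY64;
}
```

       FLT_ROUNDS is deliberately left out of the #if condition, since it is
       the one integer value that need not be a constant expression.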

       The rounding mode for floating-point addition is characterized by the
       implementation-defined value of FLT_ROUNDS:

       -1    Indeterminable.

        0    Toward zero.

        1    To nearest.

        2    Toward positive infinity.

        3    Toward negative infinity.

       All other values for FLT_ROUNDS characterize implementation-defined
       rounding behavior.
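
       The table above can be turned into a small lookup, for example for
       diagnostic output. This is a sketch only; rounding_name() and
       current_rounding_name() are hypothetical helper names.

```c
#include <float.h>

/* Hypothetical helper: maps a FLT_ROUNDS value to the wording used above. */
const char *rounding_name(int mode)
{
    switch (mode) {
    case -1: return "indeterminable";
    case 0:  return "toward zero";
    case 1:  return "to nearest";
    case 2:  return "toward positive infinity";
    case 3:  return "toward negative infinity";
    default: return "implementation-defined";
    }
}

/* FLT_ROUNDS is evaluated when this function runs, not at preprocessing time. */
const char *current_rounding_name(void)
{
    return rounding_name(FLT_ROUNDS);
}
```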

       The values of operations with floating operands and values subject to
       the usual arithmetic conversions and of floating constants are
       evaluated to a format whose range and precision may be greater than
       required by the type. The use of evaluation formats is characterized
       by the implementation-defined value of FLT_EVAL_METHOD:

       -1    Indeterminable.

        0    Evaluate all operations and constants just to the range and
             precision of the type.

        1    Evaluate operations and constants of type float and double to the
             range and precision of the double type; evaluate long double
             operations and constants to the range and precision of the long
             double type.

        2    Evaluate all operations and constants to the range and precision
             of the long double type.

       All other negative values for FLT_EVAL_METHOD characterize
       implementation-defined behavior.
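
       One practical consequence is that intermediate results may carry more
       precision than the declared type. ISO C's <math.h> (not this header)
       provides the type double_t, which matches the evaluation format of
       double when FLT_EVAL_METHOD is 0, 1, or 2. The sketch below keeps a
       running sum in that type; the function name sum_doubles() is
       hypothetical.

```c
#include <float.h>
#include <math.h>    /* double_t */
#include <stddef.h>  /* size_t */

/*
 * Hypothetical helper: adds n doubles, keeping the accumulator in double_t
 * so that it is stored in the format the implementation evaluates double
 * arithmetic in (double for methods 0 and 1, long double for method 2).
 */
double sum_doubles(const double *v, size_t n)
{
    double_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += v[i];

    return (double)acc;   /* the cast rounds to the range and precision of double */
}
```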

       The <float.h> header shall define the following values as constant
       expressions with implementation-defined values that are greater than or
       equal in magnitude (absolute value) to those shown, with the same sign.

        *  Radix of exponent representation, b.

           FLT_RADIX     2

        *  Number of base-FLT_RADIX digits in the floating-point significand,
           p.

           FLT_MANT_DIG

           DBL_MANT_DIG

           LDBL_MANT_DIG

        *  Number of decimal digits, n, such that any floating-point number in
           the widest supported floating type with p_max radix b digits can be
           rounded to a floating-point number with n decimal digits and back
           again without change to the value.

           p_max log_10 b          if b is a power of 10
           ⌈1 + p_max log_10 b⌉    otherwise

           DECIMAL_DIG   10

        *  Number of decimal digits, q, such that any floating-point number
           with q decimal digits can be rounded into a floating-point number
           with p radix b digits and back again without change to the q
           decimal digits.

           p log_10 b              if b is a power of 10
           ⌊(p − 1) log_10 b⌋      otherwise

           FLT_DIG       6

           DBL_DIG       10

           LDBL_DIG      10

        *  Minimum negative integer such that FLT_RADIX raised to one less
           than that power is a normalized floating-point number, e_min.

           FLT_MIN_EXP

           DBL_MIN_EXP

           LDBL_MIN_EXP

        *  Minimum negative integer such that 10 raised to that power is in
           the range of normalized floating-point numbers.

           ⌈log_10 b^{e_min − 1}⌉

           FLT_MIN_10_EXP    -37

           DBL_MIN_10_EXP    -37

           LDBL_MIN_10_EXP   -37

        *  Maximum integer such that FLT_RADIX raised to one less than that
           power is a representable finite floating-point number, e_max.

           FLT_MAX_EXP

           DBL_MAX_EXP

           LDBL_MAX_EXP

           Additionally, FLT_MAX_EXP shall be at least as large as
           FLT_MANT_DIG, DBL_MAX_EXP shall be at least as large as
           DBL_MANT_DIG, and LDBL_MAX_EXP shall be at least as large as
           LDBL_MANT_DIG; which has the effect that FLT_MAX, DBL_MAX, and
           LDBL_MAX are integral.

        *  Maximum integer such that 10 raised to that power is in the range
           of representable finite floating-point numbers.

           ⌊log_10((1 − b^{−p}) b^{e_max})⌋

           FLT_MAX_10_EXP    +37

           DBL_MAX_10_EXP    +37

           LDBL_MAX_10_EXP   +37

       The <float.h> header shall define the following values as constant
       expressions with implementation-defined values that are greater than or
       equal to those shown:

        *  Maximum representable finite floating-point number.

           (1 − b^{−p}) b^{e_max}

           FLT_MAX       1E+37

           DBL_MAX       1E+37

           LDBL_MAX      1E+37
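
       The expression for the maximum value can be checked against DBL_MAX.
       The sketch below uses scalbn() from <math.h>, which scales by a power
       of FLT_RADIX, so that b^{e_max} itself (which would overflow) is never
       formed. The function name model_dbl_max() is hypothetical.

```c
#include <float.h>
#include <math.h>

/*
 * Hypothetical helper: evaluates (1 - b^-p) * b^e_max for type double.
 * scalbn(x, n) computes x * FLT_RADIX^n.
 */
double model_dbl_max(void)
{
    /* 1 - b^-p has p significand digits, all equal to b - 1, so it is exact. */
    double largest_significand = 1.0 - scalbn(1.0, -DBL_MANT_DIG);

    return scalbn(largest_significand, DBL_MAX_EXP);
}
```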

       The <float.h> header shall define the following values as constant
       expressions with implementation-defined (positive) values that are less
       than or equal to those shown:

        *  The difference between 1 and the least value greater than 1 that is
           representable in the given floating-point type, b^{1 − p}.

           FLT_EPSILON   1E-5

           DBL_EPSILON   1E-9

           LDBL_EPSILON  1E-9

        *  Minimum normalized positive floating-point number, b^{e_min − 1}.

           FLT_MIN       1E-37

           DBL_MIN       1E-37

           LDBL_MIN      1E-37
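
       Both expressions can be reproduced for double with scalbn() and
       nextafter() from <math.h>. This is an illustrative sketch, and the
       three function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helpers, not part of <float.h>. */

/* b^(1 - p): the model value of DBL_EPSILON. */
double model_dbl_epsilon(void)
{
    return scalbn(1.0, 1 - DBL_MANT_DIG);
}

/* b^(e_min - 1): the model value of DBL_MIN. */
double model_dbl_min(void)
{
    return scalbn(1.0, DBL_MIN_EXP - 1);
}

/* Distance from 1 to the next representable double above it. */
double gap_above_one(void)
{
    return nextafter(1.0, 2.0) - 1.0;
}
```

       All three results are expected to compare equal to the corresponding
       macro: the first and third to DBL_EPSILON, the second to DBL_MIN.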

       The following sections are informative.

APPLICATION USAGE

       None.

RATIONALE

       All known hardware floating-point formats satisfy the property that the
       exponent range is larger than the number of mantissa digits. The ISO C
       standard permits a floating-point format where this property is not
       true, such that the largest finite value would not be integral;
       however, it is unlikely that there will ever be hardware support for
       such a floating-point format, and it introduces boundary cases that
       portable programs should not have to be concerned with (for example, a
       non-integral DBL_MAX means that ceil() would have to worry about
       overflow). Therefore, this standard imposes an additional requirement
       that the largest representable finite value is integral.
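
       The requirement can be expressed both as a compile-time check on the
       exponent and significand widths and as a run-time check with ceil()
       and floor(). This is a sketch only; the function name
       dbl_max_is_integral() is hypothetical.

```c
#include <float.h>
#include <math.h>

/* The condition this page adds on top of ISO C, checked for double. */
#if DBL_MAX_EXP < DBL_MANT_DIG
#error "DBL_MAX would not be integral on this implementation"
#endif

/* Hypothetical helper: nonzero if rounding DBL_MAX to an integer changes nothing. */
int dbl_max_is_integral(void)
{
    return ceil(DBL_MAX) == DBL_MAX && floor(DBL_MAX) == DBL_MAX;
}
```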

FUTURE DIRECTIONS

       None.

COPYRIGHT

       Portions of this text are reprinted and reproduced in electronic form
       from IEEE Std 1003.1-2017, Standard for Information Technology --
       Portable Operating System Interface (POSIX), The Open Group Base
       Specifications Issue 7, 2018 Edition, Copyright (C) 2018 by the
       Institute of Electrical and Electronics Engineers, Inc and The Open
       Group. In the event of any discrepancy between this version and the
       original IEEE and The Open Group Standard, the original IEEE and The
       Open Group Standard is the referee document. The original Standard can
       be obtained online at http://www.opengroup.org/unix/online.html .

       Any typographical or formatting errors that appear in this page are
       most likely to have been introduced during the conversion of the source
       files to man page format. To report such errors, see
       https://www.kernel.org/doc/man-pages/reporting_bugs.html .

IEEE/The Open Group                  2017                          float.h(0P)