float.h(0p)

float.h(0P)                POSIX Programmer's Manual               float.h(0P)

PROLOG

       This manual page is part of the POSIX Programmer's Manual. The Linux
       implementation of this interface may differ (consult the corresponding
       Linux manual page for details of Linux behavior), or the interface may
       not be implemented on Linux.

NAME

       float.h - floating types

SYNOPSIS

       #include <float.h>

DESCRIPTION

       The functionality described on this reference page is aligned with the
       ISO C standard. Any conflict between the requirements described here
       and the ISO C standard is unintentional. This volume of POSIX.1‐2008
       defers to the ISO C standard.

       The characteristics of floating types are defined in terms of a model
       that describes a representation of floating-point numbers and values
       that provide information about an implementation's floating-point
       arithmetic.

       The following parameters are used to define the model for each
       floating-point type:

       s     Sign (±1).

       b     Base or radix of exponent representation (an integer >1).

       e     Exponent (an integer between a minimum e_min and a maximum
             e_max).

       p     Precision (the number of base-b digits in the significand).

       f_k   Non-negative integers less than b (the significand digits).

       A floating-point number x is defined by the following model:

       x = s b^e Σ_{k=1}^{p} f_k b^{−k},   e_min ≤ e ≤ e_max
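
       As an illustration only (it is not part of the standard text), the
       model can be evaluated directly from its parameters. The function name
       model_value() and its argument layout are hypothetical; digits[k-1] is
       assumed to hold f_k.

```c
#include <stddef.h>

/* Evaluate x = s * b^e * sum_{k=1..p} f_k * b^(-k).
 * Hypothetical helper: digits[k-1] holds f_k, each in the range [0, b). */
double model_value(int s, int b, int e, const int *digits, size_t p)
{
    double sum = 0.0;
    double weight = 1.0;                /* becomes b^(-k) inside the loop */

    for (size_t k = 0; k < p; k++) {
        weight /= b;
        sum += digits[k] * weight;
    }

    double scale = 1.0;                 /* b^e by repeated scaling */
    for (int i = 0; i < e; i++)
        scale *= b;
    for (int i = 0; i > e; i--)
        scale /= b;

    return s * scale * sum;
}
```

       With s = 1, b = 2, e = −2 and the digits 1, 0, 1 the sum is 0.625 and
       the result is 0.15625.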

       In addition to normalized floating-point numbers (f_1>0 if x≠0),
       floating types may be able to contain other kinds of floating-point
       numbers, such as subnormal floating-point numbers (x≠0, e=e_min, f_1=0)
       and unnormalized floating-point numbers (x≠0, e>e_min, f_1=0), and
       values that are not floating-point numbers, such as infinities and
       NaNs. A NaN is an encoding signifying Not-a-Number. A quiet NaN
       propagates through almost every arithmetic operation without raising a
       floating-point exception; a signaling NaN generally raises a
       floating-point exception when occurring as an arithmetic operand.
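
       The kinds of value listed above can be told apart at run time with the
       fpclassify() macro from <math.h>. The following is a minimal sketch;
       the function name float_kind() is hypothetical.

```c
#include <float.h>
#include <math.h>

/* Return a short name for the category that fpclassify() reports.
 * float_kind() is a hypothetical helper used only for illustration. */
const char *float_kind(float x)
{
    switch (fpclassify(x)) {
    case FP_NAN:       return "NaN";
    case FP_INFINITE:  return "infinite";
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "subnormal";
    case FP_NORMAL:    return "normal";
    default:           return "implementation-defined";
    }
}
```

       On an implementation that supports subnormal numbers,
       float_kind(FLT_MIN / 2) would be expected to report "subnormal"; an
       implementation that flushes such results to zero would report "zero".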

       An implementation may give zero and non-numeric values, such as
       infinities and NaNs, a sign, or may leave them unsigned. Wherever such
       values are unsigned, any requirement in POSIX.1‐2008 to retrieve the
       sign shall produce an unspecified sign and any requirement to set the
       sign shall be ignored.
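
       Whether zero carries a sign can be probed with the signbit() macro. The
       sketch below assumes nothing beyond <math.h>; the function name
       has_signed_zero() is hypothetical.

```c
#include <math.h>

/* Nonzero if a negated zero keeps a sign that signbit() can retrieve.
 * The two zeros compare equal, so the sign bit is inspected instead.
 * has_signed_zero() is a hypothetical helper. */
int has_signed_zero(void)
{
    double neg_zero = -0.0;

    return signbit(neg_zero) != 0 && neg_zero == 0.0;
}
```

       On implementations that follow IEC 60559 this is expected to return 1;
       where zeros are unsigned the result is unspecified.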

       The accuracy of the floating-point operations ('+', '-', '*', '/') and
       of the functions in <math.h> and <complex.h> that return floating-point
       results is implementation-defined, as is the accuracy of the conversion
       between floating-point internal representations and string
       representations performed by the functions in <stdio.h>, <stdlib.h>,
       and <wchar.h>. The implementation may state that the accuracy is
       unknown.

       All integer values in the <float.h> header, except FLT_ROUNDS, shall be
       constant expressions suitable for use in #if preprocessing directives;
       all floating values shall be constant expressions. All except
       DECIMAL_DIG, FLT_EVAL_METHOD, FLT_RADIX, and FLT_ROUNDS have separate
       names for all three floating-point types. The floating-point model
       representation is provided for all values except FLT_EVAL_METHOD and
       FLT_ROUNDS.
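
       A short sketch of what this permits: integer-valued macros can drive
       conditional compilation, while floating values can only be used as
       ordinary constant expressions. The macro and function names below
       (RADIX_IS_BINARY, DOUBLE_AT_LEAST_BINARY64, largest_double_value) are
       hypothetical.

```c
#include <float.h>

/* Integer-valued macros other than FLT_ROUNDS may appear in #if. */
#if FLT_RADIX == 2
#define RADIX_IS_BINARY 1
#else
#define RADIX_IS_BINARY 0
#endif

#if DBL_MANT_DIG >= 53 && DBL_MAX_EXP >= 1024
#define DOUBLE_AT_LEAST_BINARY64 1
#else
#define DOUBLE_AT_LEAST_BINARY64 0
#endif

/* Floating values are constant expressions, so they may initialize
 * objects with static storage duration, but they cannot be used in #if. */
static const double largest_double = DBL_MAX;

double largest_double_value(void)
{
    return largest_double;
}
```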

       The rounding mode for floating-point addition is characterized by the
       implementation-defined value of FLT_ROUNDS:

       −1    Indeterminable.

        0    Toward zero.

        1    To nearest.

        2    Toward positive infinity.

        3    Toward negative infinity.

       All other values for FLT_ROUNDS characterize implementation-defined
       rounding behavior.
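
       The table above can be mirrored in code as follows. This is a sketch
       only; rounds_name() and current_rounds_name() are hypothetical names.
       FLT_ROUNDS is read inside a function because it need not be usable in
       #if.

```c
#include <float.h>

/* Map a FLT_ROUNDS value to the description given in the table.
 * rounds_name() is a hypothetical helper. */
const char *rounds_name(int mode)
{
    switch (mode) {
    case -1: return "indeterminable";
    case 0:  return "toward zero";
    case 1:  return "to nearest";
    case 2:  return "toward positive infinity";
    case 3:  return "toward negative infinity";
    default: return "implementation-defined";
    }
}

/* Describe the rounding mode reported by this implementation. */
const char *current_rounds_name(void)
{
    return rounds_name(FLT_ROUNDS);
}
```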

       The values of operations with floating operands and values subject to
       the usual arithmetic conversions and of floating constants are
       evaluated to a format whose range and precision may be greater than
       required by the type. The use of evaluation formats is characterized by
       the implementation-defined value of FLT_EVAL_METHOD:

       −1    Indeterminable.

        0    Evaluate all operations and constants just to the range and
             precision of the type.

        1    Evaluate operations and constants of type float and double to the
             range and precision of the double type; evaluate long double
             operations and constants to the range and precision of the long
             double type.

        2    Evaluate all operations and constants to the range and precision
             of the long double type.

       All other negative values for FLT_EVAL_METHOD characterize
       implementation-defined behavior.
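
       As a sketch, a program can report which evaluation method is in effect.
       The helper names are hypothetical. The float_t type from <math.h> is
       defined by ISO C to be at least as wide as float and to match the
       evaluation format when FLT_EVAL_METHOD is 0, 1, or 2, so its size gives
       a second view of the same setting.

```c
#include <float.h>
#include <math.h>      /* float_t */
#include <stddef.h>

/* Describe a FLT_EVAL_METHOD value; eval_method_name() is hypothetical. */
const char *eval_method_name(int method)
{
    switch (method) {
    case -1: return "indeterminable";
    case 0:  return "each type evaluated in its own format";
    case 1:  return "float and double evaluated as double";
    case 2:  return "all types evaluated as long double";
    default: return "implementation-defined";
    }
}

/* Size of the format used to evaluate float expressions. */
size_t float_eval_size(void)
{
    return sizeof(float_t);
}
```

       A caller might print eval_method_name(FLT_EVAL_METHOD) alongside
       float_eval_size() when diagnosing results that differ between
       platforms.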

       The <float.h> header shall define the following values as constant
       expressions with implementation-defined values that are greater than or
       equal in magnitude (absolute value) to those shown, with the same sign.

        *  Radix of exponent representation, b.

           FLT_RADIX     2

        *  Number of base-FLT_RADIX digits in the floating-point significand,
           p.

           FLT_MANT_DIG

           DBL_MANT_DIG

           LDBL_MANT_DIG

        *  Number of decimal digits, n, such that any floating-point number in
           the widest supported floating type with p_max radix b digits can be
           rounded to a floating-point number with n decimal digits and back
           again without change to the value.

           p_max log_10 b            if b is a power of 10
           ⎡1 + p_max log_10 b⎤      otherwise

           DECIMAL_DIG   10

        *  Number of decimal digits, q, such that any floating-point number
           with q decimal digits can be rounded into a floating-point number
           with p radix b digits and back again without change to the q
           decimal digits.

           p log_10 b                if b is a power of 10
           ⎣(p − 1) log_10 b⎦        otherwise

           FLT_DIG       6

           DBL_DIG       10

           LDBL_DIG      10

        *  Minimum negative integer such that FLT_RADIX raised to one less
           than that power is a normalized floating-point number, e_min.

           FLT_MIN_EXP

           DBL_MIN_EXP

           LDBL_MIN_EXP

        *  Minimum negative integer such that 10 raised to that power is in
           the range of normalized floating-point numbers.

           ⎡log_10 b^{e_min − 1}⎤

           FLT_MIN_10_EXP    −37

           DBL_MIN_10_EXP    −37

           LDBL_MIN_10_EXP   −37

        *  Maximum integer such that FLT_RADIX raised to one less than that
           power is a representable finite floating-point number, e_max.

           FLT_MAX_EXP

           DBL_MAX_EXP

           LDBL_MAX_EXP

           Additionally, FLT_MAX_EXP shall be at least as large as
           FLT_MANT_DIG, DBL_MAX_EXP shall be at least as large as
           DBL_MANT_DIG, and LDBL_MAX_EXP shall be at least as large as
           LDBL_MANT_DIG, which has the effect that FLT_MAX, DBL_MAX, and
           LDBL_MAX are integral.

        *  Maximum integer such that 10 raised to that power is in the range
           of representable finite floating-point numbers.

           ⎣log_10((1 − b^{−p}) b^{e_max})⎦

           FLT_MAX_10_EXP    +37

           DBL_MAX_10_EXP    +37

           LDBL_MAX_10_EXP   +37
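
       Several entries in the list above can be probed at run time. The
       following sketch uses only <stdio.h> and <stdlib.h> conversions; all
       three function names are hypothetical, and the results depend on the
       implementation-defined accuracy of those conversions.

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

/* DECIMAL_DIG: print x with DECIMAL_DIG significant digits and parse it
 * back; returns 1 if the value survives unchanged. Hypothetical helper. */
int decimal_dig_roundtrip(long double x)
{
    char buf[128];

    snprintf(buf, sizeof buf, "%.*Le", DECIMAL_DIG - 1, x);
    return strtold(buf, NULL) == x;
}

/* FLT_DIG: convert a decimal string with at most FLT_DIG significant
 * digits to float, print it with FLT_DIG digits, and compare the two
 * decimal strings numerically. Hypothetical helper. */
int flt_dig_roundtrip(const char *decimal)
{
    char buf[64];
    float f = strtof(decimal, NULL);

    snprintf(buf, sizeof buf, "%.*e", FLT_DIG - 1, (double)f);
    return strtod(buf, NULL) == strtod(decimal, NULL);
}

/* FLT_MIN_10_EXP and FLT_MAX_10_EXP: returns 1 if 10^n lies between
 * FLT_MIN and FLT_MAX. Assumes double has at least the range of float.
 * Hypothetical helper. */
int pow10_in_float_range(int n)
{
    char buf[32];

    snprintf(buf, sizeof buf, "1e%d", n);
    double v = strtod(buf, NULL);
    return v >= FLT_MIN && v <= FLT_MAX;
}
```

       For instance, pow10_in_float_range(FLT_MAX_10_EXP) is expected to be 1
       and pow10_in_float_range(FLT_MAX_10_EXP + 1) to be 0.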

       The <float.h> header shall define the following values as constant
       expressions with implementation-defined values that are greater than or
       equal to those shown:

        *  Maximum representable finite floating-point number.

           (1 − b^{−p}) b^{e_max}

           FLT_MAX       1E+37

           DBL_MAX       1E+37

           LDBL_MAX      1E+37
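
       The formula above can be checked against FLT_MAX with scalbn(), which
       scales by powers of FLT_RADIX. This is a sketch that assumes double has
       more precision and range than float, so the intermediate values are
       exact; the function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Compute (1 - b^(-p)) * b^(e_max) for float, using double arithmetic.
 * Assumes double is strictly wider than float. Hypothetical helper. */
double flt_max_from_model(void)
{
    double one_minus_ulp = 1.0 - scalbn(1.0, -FLT_MANT_DIG);

    return one_minus_ulp * scalbn(1.0, FLT_MAX_EXP);
}

/* FLT_MAX_EXP >= FLT_MANT_DIG implies that FLT_MAX has no fractional
 * part; returns 1 if that holds. Hypothetical helper. */
int flt_max_is_integral(void)
{
    return floorf(FLT_MAX) == FLT_MAX;
}
```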

       The <float.h> header shall define the following values as constant
       expressions with implementation-defined (positive) values that are less
       than or equal to those shown:

        *  The difference between 1 and the least value greater than 1 that is
           representable in the given floating-point type, b^{1 − p}.

           FLT_EPSILON   1E−5

           DBL_EPSILON   1E−9

           LDBL_EPSILON  1E−9

        *  Minimum normalized positive floating-point number, b^{e_min − 1}.

           FLT_MIN       1E−37

           DBL_MIN       1E−37

           LDBL_MIN      1E−37
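
       Both formulas above can be reproduced with <math.h> functions. The
       sketch below uses scalbnf(), which scales by powers of FLT_RADIX, and
       nextafterf(); the function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* b^(1 - p): the gap between 1 and the next representable float. */
float flt_epsilon_from_model(void)
{
    return scalbnf(1.0f, 1 - FLT_MANT_DIG);
}

/* The same gap measured directly from the neighbor of 1 toward 2. */
float flt_epsilon_measured(void)
{
    return nextafterf(1.0f, 2.0f) - 1.0f;
}

/* b^(e_min - 1): the smallest normalized positive float. */
float flt_min_from_model(void)
{
    return scalbnf(1.0f, FLT_MIN_EXP - 1);
}
```

       Each function is expected to return a value equal to the corresponding
       macro, FLT_EPSILON or FLT_MIN.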

       The following sections are informative.

APPLICATION USAGE

       None.

RATIONALE

       All known hardware floating-point formats satisfy the property that the
       exponent range is larger than the number of mantissa digits. The ISO C
       standard permits a floating-point format where this property is not
       true, such that the largest finite value would not be integral;
       however, it is unlikely that there will ever be hardware support for
       such a floating-point format, and it introduces boundary cases that
       portable programs should not have to be concerned with (for example, a
       non-integral DBL_MAX means that ceil() would have to worry about
       overflow). Therefore, this standard imposes an additional requirement
       that the largest representable finite value is integral.

FUTURE DIRECTIONS

       None.

COPYRIGHT

       Portions of this text are reprinted and reproduced in electronic form
       from IEEE Std 1003.1, 2013 Edition, Standard for Information Technology
       -- Portable Operating System Interface (POSIX), The Open Group Base
       Specifications Issue 7, Copyright (C) 2013 by the Institute of
       Electrical and Electronics Engineers, Inc and The Open Group. (This is
       POSIX.1-2008 with the 2013 Technical Corrigendum 1 applied.) In the
       event of any discrepancy between this version and the original IEEE and
       The Open Group Standard, the original IEEE and The Open Group Standard
       is the referee document. The original Standard can be obtained online
       at http://www.unix.org/online.html .

       Any typographical or formatting errors that appear in this page are
       most likely to have been introduced during the conversion of the source
       files to man page format. To report such errors, see
       https://www.kernel.org/doc/man-pages/reporting_bugs.html .

IEEE/The Open Group                  2013                          float.h(0P)