IEEE 754 vol. 1: Representations of floating-point values
- Supported data types
- Representation of binary floating point values
- Kind of floating-point values
- Normalized and denormalized numbers
Supported data types
IEEE 754 is a technical standard for floating-point arithmetic. It defines representation of several data types of floating-point numbers. Mind that most FPUs on the current CPUs is based on that standard.
The most commonly used floating-point data types are those called binary in the standard (with data type width
in bits as a suffix). Those representations are optimized for calculations, quick comparison, etc.
| common name | std name | width | sign bit | exponent | bias | mantissa | valid digits | integer range | minimum denormalized number | minimum normalized number | maximum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| half-precision | binary16 | 16b | 1b | 5b | 15 | 10+1b | ~3.31 | ±2048 | 6 × 10⁻⁸ | 6.1 × 10⁻⁴ | 6.5 × 10⁴ |
| single-precision | binary32 | 32b | 1b | 8b | 127 | 23+1b | ~7.22 | ±1.6 × 10⁷ | 1.4 × 10⁻⁴⁵ | 1.1 × 10⁻³⁸ | 3.4 × 10³⁸ |
| double-precision | binary64 | 64b | 1b | 11b | 1023 | 53+1b | ~15.95 | ±9 × 10¹⁵ | 5 × 10⁻³²⁴ | 2.2 × 10⁻³⁰⁸ | 1.7 × 10³⁰⁸ |
| extended-precision | binary80 | 80b | 1b | 15b | 16383 | 64b | ~19.26 | ±1.8 × 10¹⁹ | 3.6 × 10⁻⁴⁹⁵¹ | 1.1 × 10⁴⁹³² | |
| quadruple-precision | binary128 | 128b | 1b | 15b | 16383 | 112+1b | ~34.01 | ±1 × 10³⁴ | 3.6 × 10⁻⁴⁹⁵¹ | 3.3 × 10⁻⁴⁹³² | 1.1 × 10⁴⁹³² |
| - | bfloat16 | 16b | 1b | 8b | - | 7+1b | ~2.40 | ±256 | 9.2 × 10⁻⁴¹ | 1.1 × 10⁻³⁸ | 3.3 × 10³⁸ |
Depending on your programming language, selection of supported data types is usually limited. Many programming
languages supports single-precision (binary32) and double-precision (binary64) data types.
Less frequently might be seen support for extended-precision (binary80) data type, which is native data type
of x87 FPU (or at least it was done that way on x86 architecture of company Intel). This data type is supported
e.g. by Object Pascal programming language used by Delphi where it is called Extended or by C language
since standard C99 where it is known long double.
There is one extraordinary point about binary80 I would like to mention. Mantissa does not include one extra
virtual/hidden/implicit bit for normalized numbers. All mantissa bits are explicitly stored in FPU register
80 bits wide.
In the last row of the previous table is type of bfloat16 that was included in one of the later revisions
of IEEE 754. It was designed to be used for AI, because it uses more bits for exponent (increasing overall range
of representable value) but also fewer bits for mantissa (decreasing precision and number of representable digits).
The standard also defines decimal types of different width. Those are stored using decimal format either BID (aka Binary Integer Decimal) encoding or DPD (aka Densely Packed Decimal) encoding. Decimal types are might be used e.g. for financial calculations as it supports also a different rounding algorithms etc. All those decimal types are listed in the following table:
| std name | width | sign bit | exponent range | mantissa digits |
|---|---|---|---|---|
| decimal32 | 32b | 1b | -95 … +96 | 7 |
| decimal64 | 64b | 1b | -383 … +384 | 16 |
| decimal128 | 128b | 1b | -6143 … +6144 | 34 |
Representation of binary floating point values
---------------------------
| s | e | m |
---------------------------
Where s stands for sign bit (yes, MSB is always a sign bit),
e stands for exponent (stored using Offset binary),
and m stands for mantissa (aka significand).
The pair of s and m might be seen as stored using
Sign-magnitude representation.
Kind of floating-point values
Floating-point value can be in any of the following kinds:
| Conditions | Kind of value |
|---|---|
| 0 < e < max(e) | normalized number |
| e == 0 && m != 0 | denormalized number |
| e == 0 && m == 0 && s == 0 | positive zero (+0) |
| e == 0 && m == 0 && s != 0 | negative zero (-0) |
| e == max(e) && m == 0 && s == 0 | positive infinity (+∞) |
| e == max(e) && m == 0 && s != 0 | negative infinity (-∞) |
| e = max(e) && m != 0 | not-a-number (NaN) |
The majority of representable values are so-called normalized numbers. Values close to zero are denormalized numbers. Floating-point types have a separate representation for positive and negative zeroes. Although, many programming languages might hide sign of zero from both developers and users. Positive and negative infinities might a result of certain calculations. And last but not least, are so-called NaNs, that are used to denote an invalid result of calculation.
The following axis present whole range of representable floating-point value kinds:
^
|
| not-a-numbers
|-----
| positive infinity
|-----
|
| positive normalized numbers
|
|-----
|
| positive denormalized numbers
|-----
| +0 (aka positive zero)
|-----
| -0 (aka negative zero)
|-----
|
| negative denormalized numbers
|-----
|
| negative normalized numbers
|
|-----
| negative infinity
|-----
| not-a-numbers
|
v
Normalized and denormalized numbers
Normalized number or simply normal floating-point value has no leading zeroes in mantissa. All the zeroes that might appear during calculation are removed by shifting the exponent. Normalized numbers are calculated using the following formula:
sign_bit (exponent - bias)
(-1) × 1.mantissa × 2
Denormalized number (sometimes called denormal) is for IEEE 754 always a subnormal number. Subnormal number is any non-zero number with magnitude smaller than the smallest positive normal number. Using other words, it is a floating-point value with leading zeroes in mantissa, so exponent cannot be shifted anymore to finish the process of value normalization. Denormalized numbers are calculated using the following formula:
sign_bit (1 - bias)
(-1) × 0.mantissa × 2