Supported data types

IEEE 754 is a technical standard for floating-point arithmetic. It defines the representation of several floating-point data types. Note that most FPUs in current CPUs are based on this standard.

The most commonly used floating-point data types are those called binary in the standard (with the data type's width in bits as a suffix). These representations are optimized for calculations, quick comparison, etc.

common name          std name   width  sign bit  exponent  bias   mantissa  valid digits  integer range  min. denormalized  min. normalized  maximum
half-precision       binary16   16b    1b        5b        15     10+1b     ~3.31         ±2048          6 × 10⁻⁸           6.1 × 10⁻⁵       6.5 × 10⁴
single-precision     binary32   32b    1b        8b        127    23+1b     ~7.22         ±1.6 × 10⁷     1.4 × 10⁻⁴⁵        1.1 × 10⁻³⁸      3.4 × 10³⁸
double-precision     binary64   64b    1b        11b       1023   52+1b     ~15.95        ±9 × 10¹⁵      5 × 10⁻³²⁴         2.2 × 10⁻³⁰⁸     1.7 × 10³⁰⁸
extended-precision   binary80   80b    1b        15b       16383  64b       ~19.26        ±1.8 × 10¹⁹    3.6 × 10⁻⁴⁹⁵¹      3.4 × 10⁻⁴⁹³²    1.1 × 10⁴⁹³²
quadruple-precision  binary128  128b   1b        15b       16383  112+1b    ~34.01        ±1 × 10³⁴      6.5 × 10⁻⁴⁹⁶⁶      3.3 × 10⁻⁴⁹³²    1.1 × 10⁴⁹³²
-                    bfloat16   16b    1b        8b        127    7+1b      ~2.40         ±256           9.2 × 10⁻⁴¹        1.1 × 10⁻³⁸      3.3 × 10³⁸

Depending on your programming language, the selection of supported data types is usually limited. Most programming languages support the single-precision (binary32) and double-precision (binary64) data types.

Support for the extended-precision (binary80) data type is seen less frequently. It is the native data type of the x87 FPU (at least it is implemented that way on Intel's x86 architecture). This data type is available e.g. in the Object Pascal language used by Delphi, where it is called Extended, and in the C language, where long double typically maps to it on x86 platforms.

There is one peculiarity about binary80 worth mentioning: its mantissa does not include the extra virtual/hidden/implicit bit for normalized numbers. All 64 mantissa bits are stored explicitly in the 80-bit wide FPU register.
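
To check what a particular C toolchain actually provides, a minimal sketch using the standard <float.h> macros prints the size and mantissa width of each binary type; on x86/x87 builds LDBL_MANT_DIG is typically 64, which reflects the explicitly stored leading bit (exact figures are platform-dependent):

    #include <stdio.h>
    #include <float.h>

    /* Print size and mantissa width of the binary types this C
     * implementation provides; figures are platform-dependent. */
    int main(void)
    {
        printf("float:       %zu bytes, %d mantissa bits, max %e\n",
               sizeof(float), FLT_MANT_DIG, FLT_MAX);
        printf("double:      %zu bytes, %d mantissa bits, max %e\n",
               sizeof(double), DBL_MANT_DIG, DBL_MAX);
        printf("long double: %zu bytes, %d mantissa bits, max %Le\n",
               sizeof(long double), LDBL_MANT_DIG, LDBL_MAX);
        return 0;
    }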

The last row of the previous table describes the bfloat16 type, which is not defined by the IEEE 754 standard itself but follows the same layout; it is essentially a binary32 with the lower 16 mantissa bits cut off. It was designed for AI workloads, because compared to binary16 it uses more bits for the exponent (increasing the overall range of representable values) at the price of fewer bits for the mantissa (decreasing precision and the number of representable digits).
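
Because bfloat16 keeps the binary32 sign and exponent layout, a bfloat16 value is essentially the upper 16 bits of a binary32 value. A minimal conversion sketch in C, using plain truncation (real converters usually round to nearest even instead):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* bfloat16 = upper half of binary32: same sign and exponent bits,
     * only 7 explicit mantissa bits remain. */
    static uint16_t float_to_bfloat16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* reinterpret the float's bits */
        return (uint16_t)(bits >> 16);    /* keep sign, exponent, top 7 mantissa bits */
    }

    static float bfloat16_to_float(uint16_t b)
    {
        uint32_t bits = (uint32_t)b << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void)
    {
        float x = 3.14159265f;
        uint16_t b = float_to_bfloat16(x);
        printf("%.8f -> 0x%04X -> %.8f\n", x, b, bfloat16_to_float(b));
        return 0;
    }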

The standard also defines decimal types of different widths. They are stored in a decimal format using either the BID (Binary Integer Decimal) encoding or the DPD (Densely Packed Decimal) encoding. Decimal types might be used e.g. for financial calculations, as they avoid binary rounding artifacts and support additional rounding rules; a short demonstration of the problem they avoid follows the table. All the decimal types are listed in the following table:

std name    width  sign bit  exponent range  mantissa digits
decimal32   32b    1b        -95 … +96       7
decimal64   64b    1b        -383 … +384     16
decimal128  128b   1b        -6143 … +6144   34
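
Even without compiler support for decimal types (GCC offers _Decimal32/_Decimal64/_Decimal128 as an extension on some targets), the motivation can be shown with plain binary doubles:

    #include <stdio.h>

    /* 0.1, 0.2 and 0.3 have no exact binary representation, so the sum
     * picks up a small error; decimal types avoid this class of problem. */
    int main(void)
    {
        double a = 0.1, b = 0.2;
        printf("0.1 + 0.2 = %.17f\n", a + b);              /* 0.30000000000000004 */
        printf("equal to 0.3? %s\n", (a + b == 0.3) ? "yes" : "no");
        return 0;
    }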

Representation of binary floating-point values

   ---------------------------
   | s |   e   |      m      |
   ---------------------------

Here s stands for the sign bit (yes, the MSB is always the sign bit), e stands for the exponent (stored using offset binary, i.e. with a constant bias added), and m stands for the mantissa (aka significand).

The pair of s and m might be seen as a value stored using sign-magnitude representation.
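
To see these fields on a concrete value, the following sketch (assuming binary32; the same idea works for binary64 with 11-bit exponent and 52-bit mantissa fields) reinterprets a float as an integer and masks out s, e and m:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Split a binary32 value into the s | e | m fields shown above. */
    int main(void)
    {
        float f = -6.25f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);      /* reinterpret the float's bits */

        unsigned s = bits >> 31;             /* 1 sign bit (the MSB)    */
        unsigned e = (bits >> 23) & 0xFF;    /* 8 exponent bits         */
        unsigned m = bits & 0x7FFFFF;        /* 23 stored mantissa bits */

        printf("value: %f\n", f);
        printf("s = %u, e = %u (unbiased %d), m = 0x%06X\n",
               s, e, (int)e - 127, m);
        return 0;
    }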


Kinds of floating-point values

A floating-point value can be of any of the following kinds:

conditions                       kind of value
0 < e < max(e)                   normalized number
e == 0 && m != 0                 denormalized number
e == 0 && m == 0 && s == 0       positive zero (+0)
e == 0 && m == 0 && s != 0       negative zero (-0)
e == max(e) && m == 0 && s == 0  positive infinity (+∞)
e == max(e) && m == 0 && s != 0  negative infinity (-∞)
e == max(e) && m != 0            not-a-number (NaN)

The majority of representable values are so-called normalized numbers. Values close to zero are denormalized numbers. Floating-point types have separate representations for positive and negative zero, although many programming languages hide the sign of zero from both developers and users. Positive and negative infinities might be the result of certain calculations. And last but not least, there are so-called NaNs, which are used to denote an invalid result of a calculation.
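
C exposes exactly this classification through the standard fpclassify() and signbit() macros from <math.h>. A short sketch labelling a few sample values according to the table above:

    #include <math.h>
    #include <stdio.h>

    /* Map the classification from the table onto the standard C macros. */
    static const char *kind(double x)
    {
        switch (fpclassify(x)) {
        case FP_NORMAL:    return "normalized number";
        case FP_SUBNORMAL: return "denormalized number";
        case FP_ZERO:      return signbit(x) ? "negative zero" : "positive zero";
        case FP_INFINITE:  return signbit(x) ? "negative infinity" : "positive infinity";
        default:           return "not-a-number";
        }
    }

    int main(void)
    {
        double samples[] = { 1.0, 5e-324, 0.0, -0.0, INFINITY, -INFINITY, NAN };
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
            printf("%-12g %s\n", samples[i], kind(samples[i]));
        return 0;
    }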

The following axis presents the whole range of representable floating-point value kinds:

  ^
  | 
  | not-a-numbers
  |-----
  | positive infinity
  |-----
  | 
  | positive normalized numbers
  |
  |-----
  |
  | positive denormalized numbers
  |-----
  | +0 (aka positive zero)
  |-----
  | -0 (aka negative zero)
  |-----
  |
  | negative denormalized numbers
  |-----
  |
  | negative normalized numbers
  |
  |-----
  | negative infinity
  |-----
  | not-a-numbers
  |
  v

Normalized and denormalized numbers

A normalized number, or simply a normal floating-point value, has no leading zeroes in its mantissa. Any leading zeroes that might appear during a calculation are removed by adjusting the exponent. Normalized numbers are decoded using the following formula:

     sign_bit                         (exponent - bias)
 (-1)           ×   1.mantissa   ×   2
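
Plugging in the fields of the -6.25f value decoded in the earlier layout example (s = 1, e = 129, m = 0x480000) reproduces the original value; a quick check in C:

    #include <math.h>
    #include <stdio.h>

    /* Re-apply the formula to the fields of -6.25f decoded earlier:
     * s = 1, e = 129 (biased), m = 0x480000. */
    int main(void)
    {
        int    s = 1;
        int    e = 129;
        double frac = 0x480000 / 8388608.0;   /* m / 2^23 = 0.5625 */

        double value = (s ? -1.0 : 1.0) * ldexp(1.0 + frac, e - 127);
        printf("%f\n", value);                /* prints -6.250000  */
        return 0;
    }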

A denormalized number (sometimes called a denormal) is, in IEEE 754, always a subnormal number. A subnormal number is any non-zero number whose magnitude is smaller than the smallest positive normal number. In other words, it is a floating-point value with leading zeroes in the mantissa whose exponent is already at its minimum, so it cannot be decreased any further to finish the normalization. Denormalized numbers are decoded using the following formula:

     sign_bit                         (1 - bias)
 (-1)           ×  0.mantissa   ×   2
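
Plugging the smallest positive binary32 subnormal (s = 0, e = 0, m = 1) into this second formula reproduces the 1.4 × 10⁻⁴⁵ figure from the first table; a quick check in C:

    #include <math.h>
    #include <stdio.h>

    /* Decode the smallest positive binary32 subnormal: s = 0, e = 0, m = 1.
     * With e == 0 the implicit leading bit is 0 and the exponent is 1 - bias. */
    int main(void)
    {
        int    s = 0;
        double frac = 1.0 / 8388608.0;        /* m / 2^23 */

        double value = (s ? -1.0 : 1.0) * ldexp(0.0 + frac, 1 - 127);
        printf("%e\n", value);                /* prints ~1.401298e-45 */
        return 0;
    }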
