Technical Computer Science
Computer Number Formats

Thorsten Thormählen
October 27, 2024
Part 2, Chapter 2

This is the print version of the slides.

Advance slides with the → key or
by clicking on the right border of the slide

Control Keys

→ move to next slide (also Enter or Spacebar).
← move to previous slide.
d enable/disable drawing on slides
p toggles between print and presentation view
CTRL + zoom in
CTRL - zoom out
CTRL 0 reset zoom

Slides can also be advanced by clicking on the left or right border of the slide.

Notation

Type	Font	Examples
Variables (scalars)	italics	$a, b, x, y$
Functions	upright	$\mathrm{f}, \mathrm{g}(x), \mathrm{max}(x)$
Vectors	bold, elements row-wise	$\mathbf{a}, \mathbf{b}= \begin{pmatrix}x\\y\end{pmatrix} = (x, y)^\top,$ $\mathbf{B}=(x, y, z)^\top$
Matrices	Typewriter	$\mathtt{A}, \mathtt{B}= \begin{bmatrix}a & b\\c & d\end{bmatrix}$
Sets	calligraphic	$\mathcal{A}, B=\{a, b\}, b \in \mathcal{B}$
Number systems, Coordinate spaces	double-struck	$\mathbb{N}, \mathbb{Z}, \mathbb{R}^2, \mathbb{R}^3$

Representation of unsigned integers
Representation of signed integer numbers
- Sign-magnitude representation
- Ones' complement
- Two's complement
Representation of signed rational numbers
- Fixed-point representation
- Floating-point representation

Representation of Unsigned Integers

Depending on how many bits are available to store an unsigned integer in binary, different integer ranges can be represented

Bytes	Bits	Range of values	Name of the data type (Microsoft Visual C++)
1	8	$[0;255]$	unsigned char
2	16	$[0;65535]$	unsigned short
4	32	$[0;4294967295]$	unsigned int
8	64	$[0;18446744073709551615]$	unsigned long long

Source: Data Type Ranges, Visual Studio 2013

Big-Endian and Little-Endian Storage

Little-Endian
- The byte with the lowest significance is stored first
- All x86-compatible computers use this ordering

Big-Endian
- The byte with the lowest significance is stored last
- MIPS, SPARC, PowerPC, Java Virtual Machine, etc.

Representation of Signed Integer Numbers

If positive and negative numbers have to be stored, there are several options:
- Sign-magnitude representation
- Ones' complement
- Two's complement
Every representation has advantages and disadvantages in terms of
- the symmetry of the representable range of values
- the one-to-one correspondence of the representation
- the execution of arithmetic operations

Sign-Magnitude Representation

The most significant bit is used for the sign:
- 0 = number positive; 1 = number negative
The remaining bits are used for the magnitude
In this way, a symmetrical range of values of $[-(2^{n-1}-1),+(2^{n-1}-1)]$ can be represented
Example for $n=4$:

Decimal	0	1	2	3	4	5	6	7
Binary	0000	0001	0010	0011	0100	0101	0110	0111

Decimal	-0	-1	-2	-3	-4	-5	-6	-7
Binary	1000	1001	1010	1011	1100	1101	1110	1111

Sign-Magnitude Representation

The representation is not a one-to-one mapping because zero is represented twice. This leads to problems when testing for equality: $-0 \ne 0$

Sign-Magnitude Representation

The addition/subtraction is more complicated because the sign bit has to be handled separately. Otherwise:
$ \begin{aligned} \begin{aligned} 5& \\ +-6& \\ \hline =-1&\\ \end{aligned} && \begin{aligned} 0101&&(5)\\ +1110&&(-6)\\ \hline =10011&&(-3)\\ \end{aligned} \end{aligned} $

Source:D. W. Hoffmann: Grundlagen der Technischen Informatik, 2. Auflage; Hanser 2009

Ones' Complement

A negative number $(-w)$ is represented in binary by the bitwise complement of the corresponding positive
number $w$
- Example: $(6)_{10} \hat{=} 0110 \rightarrow (-6)_{10} \hat{=} 1001$
In total, a symmetrical value range of $[-(2^{n-1}-1),+(2^{n-1}-1)]$ can be represented
Example for $n=4$:

Decimal	0	1	2	3	4	5	6	7
Binary	0000	0001	0010	0011	0100	0101	0110	0111

Decimal	-0	-1	-2	-3	-4	-5	-6	-7
Binary	1111	1110	1101	1100	1011	1010	1001	1000

Ones' Complement

The representation is not a one-to-one mapping because zero is represented twice

Ones' Complement

The addition/subtraction must be done in two steps
- 1. Normal binary addition
- 2. Extra addition of the carry (a carry occurring at the 2nd addition is ignored)
Example 1:
$ \begin{aligned} \begin{aligned} 5& \\ +-6& \\ \hline =-1&\\ \end{aligned} && \begin{aligned} 0101&&(5)\\ +1001&&(-6)\\ \hline =1110&&(-1)\\ \end{aligned} \end{aligned} $
Example 2:
$ \begin{aligned} \begin{aligned} 5& \\ +-3& \\ \hline =\ \ \ 2&\\ \end{aligned} && \begin{aligned} 0101&&(5)\\ +1100&&(-3)\\ \hline =1|0001&&(-14)\\ + 1&&(1) \\ \hline =\not{1}|0010&& (2) \end{aligned} \end{aligned} $

Source: D. W. Hoffmann: Grundlagen der Technischen Informatik, 2. Auflage; Hanser 2009

Two's Complement

The two's complement is formed by calculating the one's complement and adding 1
- Example:
  $(-6)_{10} \hat{=} \,\mbox{not}(0110) + 1 = \ 1001 + 1 = 1010$
The range of values is asymmetric $[-2^{n-1},+2^{n-1}-1]$
Example for $n=4$:

Decimal	0	1	2	3	4	5	6	7
Binary	0000	0001	0010	0011	0100	0101	0110	0111

Decimal	-1	-2	-3	-4	-5	-6	-7	-8
Binary	1111	1110	1101	1100	1011	1010	1001	1000

Two's Complement

The representation is a one-to-one mapping

Two's Complement

The normal unsigned integer binary addition can be used for addition/subtraction
Subtraction corresponds to negation and subsequent addition, e.g. $(5-6) = (5 + (-6))$
Example 1:
$ \begin{aligned} \begin{aligned} 5& \\ +-6& \\ \hline =-1&\\ \end{aligned} && \begin{aligned} 0101&&(5)\\ +1010&&(-6)\\ \hline =1111&&(-1)\\ \end{aligned} \end{aligned}$
Example 2:
$ \begin{aligned} \begin{aligned} 5& \\ +-3& \\ \hline =\ \ \ 2&\\ \end{aligned} && \begin{aligned} 0101&&(5)\\ +1101&&(-3)\\ \hline =\not{1}|0010&& (2) \end{aligned} \end{aligned} $

Two's Complement

Example 3:
$ \begin{aligned} \begin{aligned} 4& \\ +-3& \\ \hline =\ \ \ 1&\\ \end{aligned} && \begin{aligned} 0100&&(4)\\ +1101&&(-3)\\ \hline =\not{1}|0001&&(1)\\ \end{aligned} \end{aligned}$
Example 4:
$ \begin{aligned} \begin{aligned} -3& \\ +-4& \\ \hline =-7&\\ \end{aligned} && \begin{aligned} 1101&&(-3)\\ +1100&&(-4)\\ \hline =\not{1}|1001&& (-7) \end{aligned} \end{aligned} $
It appears that (unlike with the one's complement) we can simply omit the carry in the two's complement.
Is this always the case?

Two's Complement

Actually, yes, but be careful:
Example 5:
$ \begin{aligned} \begin{aligned} -4& \\ +-5& \\ \hline =-9&\\ \end{aligned} && \begin{aligned} 1100&&(-4)\\ +1011&&(-5)\\ \hline =\not{1}|0111&&(7)\\ \end{aligned} \end{aligned} $
Example 6:
$ \begin{aligned} \begin{aligned} 4& \\ +5& \\ \hline =9&\\ \end{aligned} && \begin{aligned} 0100&&(4)\\ +0101&&(5)\\ \hline =1001&&(-7)\\ \end{aligned} \end{aligned} $
In this case, the result is always incorrect. The carry can only be ignored if there is no overflow, i.e. the result does not go beyond the range of representable numbers.

Two's Complement

Example 5 again, now with $n=5$
$ \begin{aligned} \begin{aligned} -4& \\ +-5& \\ \hline =-9&\\ \end{aligned} && \begin{aligned} 11100&&(-4)\\ +11011&&(-5)\\ \hline =\not{1}|10111&&(-9)\\ \end{aligned} \end{aligned} $
Example 6 again, now with $n=5$:
$ \begin{aligned} \begin{aligned} 4& \\ +5& \\ \hline =9&\\ \end{aligned} && \begin{aligned} 00100&&(4)\\ +00101&&(5)\\ \hline =01001&&(9)\\ \end{aligned} \end{aligned} $
If there are enough binary digits (no overflow), the carry can be omitted

Two's Complement

For the two's complement, the numerical value $w$ is formally calculated as follows from the digits $z_{n-1}, \dots, z_1, z_0$ of a binary number:

$w = -z_{n-1} \cdot 2^{n-1} + z_{n-2} \cdot 2^{n-2} + \dots + \ z_0 \cdot 2^0$
- Example:
  $1101 = -1 \cdot 8 + 1 \cdot 4 + 0 \cdot 2 + 1 \cdot 1 = -3$

When does an overflow occur in two's complement?

Let us consider the addition of $w_a$ and $w_b$ with:
$\begin{align} w_a &= -a_{n-1} \cdot 2^{n-1} + a_{n-2} \dots 2^{n-2} + \dots + \ a_0 \cdot 2^0\\ w_b &= -b_{n-1} \cdot 2^{n-1} + b_{n-2} \dots 2^{n-2} + \dots + \ b_0 \cdot 2^0\end{align}$
Let $c_{n-1}$ be a possible carry resulting from the binary addition of the digits $a_{n-1}$ and $b_{n-1}$, and $c_{n-2}$ a possible carry for $a_{n-2}$ and $b_{n-2}$

Sign $w_a$	Sign $w_b$	Correct result when	Overflow (incorrect result) when
+	+	$c_{n-1}=0$ und $c_{n-2}=0$	$c_{n-1}=0$ und $c_{n-2}=1$
+	-	$c_{n-1}= c_{n-2}$	never
-	+	$c_{n-1}= c_{n-2}$	never
-	-	$c_{n-1}=1$ und $c_{n-2}=1$	$c_{n-1}=1$ und $c_{n-2}=0$

With two positive numbers, an overflow occurs when the carry in the most significant digit corrupts the sign bit of the sum
With two negative numbers, an overflow occurs if the inversion of the sign bit is not compensated by a carry in the most significant digit
Later we will learn that an overflow can be detected by an XOR, i.e. an overflow occurs when ($c_{n-1}$ xor $c_{n-2}$) = 1

Inverse of the Two's Complement

The inverse operation of the two's complement is simply a renewed application of the two's complement
- Example:
  $\begin{aligned}(6)_{10} &\hat{=} \ 0110 \\ (-6)_{10} &\hat{=} \,\mbox{not}(0110) + 1 = 1001 + 1 = 1010\\ (6)_{10} &\hat{=} \,\mbox{not}(1010) + 1 = (0101) + 1 = 0110\end{aligned}$
This means that every application of the two's complement negates the represented decimal number

Representation of Signed Integer Numbers

Summary

Representation	Range	One-to-one mapping	Math Operations
Sign-magnitude	symmetric	no	Special treatment
Ones' complement	symmetric	no	2-step addition
Two's complement	asymmetric	yes	Simple

The two's complement is the predominant method in practice for representing binary signed integer numbers (a lack of symmetry is tolerated)

Bytes	Bits	Range	Name of the data type (Microsoft Visual C++)
1	8	$[–128;127]$	char
2	16	$[–32768;32767]$	short
4	32	$[–2147483648;2147483647]$	int
8	64	$\begin{align} [-&9223372036854775808; \\ &9223372036854775807\ \ ] \end{align}$	long long

Source: Data Type Ranges, Visual Studio 2013

Representation of Signed Rational Numbers

We consider two methods for the binary representation of signed rational numbers:
- Fixed-point representation
- Floating-point representation

Fixed-Point Representation

The fixed-point representation is similar to the sign-magnitude representation of integer numbers
The most significant bit $s$ is used for the sign
The remaining $n-1$ bits are called the mantissa and are used to store the integer and fractional parts.
Example for $n=16$:
In the mantissa, the first $k \leq (n-1)$ digits are set as the digits before the decimal point. This results in $j = (n-1) - k$ digits after the decimal point
The numerical value $w$ is then calculated as follows:

$w = (-1)^{s}\left(\left(\sum\limits_{i=0}^{k-1} m_{j+i} \cdot 2^i\right) + \left(\sum\limits_{i=-j}^{-1} m_{j+i} \cdot 2^i\right)\right) = (-1)^{s}\left(\sum\limits_{i=-j}^{k-1} m_{j+i} \cdot 2^i\right)$

Fixed-Point Representation

Example: Conversion of $11011001$ for $n=8$, $k=4$ and $j=3$:

$\begin{align} w &= (-1)^{s}\left(\sum\limits_{i=-j}^{k-1} m_{j+i} \cdot 2^i\right) \\ &= (-1)^{1} \left(1 \cdot 2^{-3}+ 0 \cdot 2^{-2} + 0 \cdot 2^{-1} + 1 \cdot 2^{0} + 1 \cdot 2^{1} + 0 \cdot 2^{2} + 1 \cdot 2^{3}\right) \\ &= (-1) \left(0.125 + 1 + 2 + 8\right) = -11.125\end{align}$
Fixed-point representation is an equidistant number format because the numerical distance between two representable numbers is constant
If the range of numerical values used is very similar and known in advance, a fixed-point representation can thus guarantee a certain accuracy of the representation
The fixed-point representation is therefore used by special hardware, such as digital signal processors (DSPs)

Floating-Point Representation

In practice, numbers with different orders of magnitude have to be processed
As an example, physical constants can be examined:
- Speed of light: $2.99792458 \cdot 10^{8}$ m/s
- Electron mass: $9.10938291 \cdot 10^{−31}$ kg
- Elementary charge: $1.60217656 \cdot 10^{−19}$ C
In scientific notation, the decimal point is shifted by the exponent:
- Shift to the right for small numbers: $0.00041 = 4.1 \cdot 10^{−4}$
- Shift to the left for large numbers: $1200000.00041 \approx 1.2 \cdot 10^{6}$
The binary floating point representation copies the scientific notation and extends the fixed-point representation by a binary-coded exponent

Floating-Point Representation

Example: 1 bit sign, 5 bit exponent, 10 bit mantissa
The 5 bits of the exponent can represent $2^5=32$ different numbers
To represent negative exponents, instead of the exponent $e$, a "biased exponent" $c\ge0$ is stored in the 5 bits:
The conversion between the "biased exponent" and the exponent is done by subtracting a constant, e.g. $e = c - 15$
This means that the exponent can represent values from the interval $[-15,16]$

Floating-Point Representation

At first, the floating-point representation is not uniquely defined
$\begin{align}0.00111011 &= 000001.11011 \cdot 2^{-3}\\ &= 00000.111011 \cdot 2^{-2}\\ &= 0000.0111011 \cdot2^{-1}\\ &= 000.00111011 \cdot2^{0}\\ &= 00.000111011 \cdot2^{1}\\ &= 0.0000111011 \cdot2^{2}\\ &= \dots \end{align}$
In the normalized form, the exponent is chosen such that the first digit before the decimal point is unequal 0, in this case $000001.11011 \cdot 2^{-3}$
In the modified normalized form, the exponent is chosen so that the first digit after the decimal point is not equal to 0, in this case $00000.111011 \cdot 2^{-2}$
In a normalized form, the most significant bit is always 1. This "implicit leading one" can therefore be omitted in a so-called packed form

Floating-Point Representation

Example 1: $1001011010000000$ (modified normalized, not packed)
$\begin{align}s &= 1 \\ e &= c - (15)_{10} = (00101)_2 - (15)_{10} = (5)_{10} - (15)_{10} = -10 \\ m &= 0.1010000000 = 1 \cdot 2^{-1} + 1 \cdot 2^{-3} = 0.5 + 0.125 = 0.625 \end{align}$

Result: $-0.625 \cdot 2^{-10}$
Example 2: $1001011010000000$ (normalized, not packed)
$\begin{align}s &= 1 \\ e &= c - (15)_{10} = (00101)_2 - (15)_{10} = (5)_{10} - (15)_{10} = -10 \\ m &= 1.010000000 = 1 \cdot 2^{0} + 1 \cdot 2^{-2} = 1 + 0.25 = 1.25 \end{align}$

Result: $-1.25 \cdot 2^{-10}$

Floating-Point Representation

Example 3: $0101011010000000$ (modified normalized, packed)
$\begin{align}s &= 0 \\ e &= c - (15)_{10} = (10101)_2 - (15)_{10} = (21)_{10} - (15)_{10} = (6)_{10} \\ m &= 0.11010000000 = \\ &=1 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-4} = 0.5 + 0.25 + 0.0625= 0.8125 \end{align}$
Result: $0.8125 \cdot 2^{6}$
Example 4: $0101011010000000$ (normalized, packed)
$\begin{align}s &= 0 \\ e &= c - (15)_{10} = (10101)_2 - (15)_{10} = (21)_{10} - (15)_{10} = (6)_{10} \\ m &= 1.1010000000 = \\ &=1 \cdot 2^{0} + 1 \cdot 2^{-1} + 1 \cdot 2^{-3} = 1 + 0.5 + 0.125= 1.625 \end{align}$
Result: $1.625 \cdot 2^{6}$

Floating-Point Representation: IEEE 754 Formats

Biased exponent $c$	Mantissa $m$	32-/64-bit Precision
$000\dots0000$	arbitrary	32: $(-1)^s 0.m \cdot 2^{-126}$
$000\dots0000$	arbitrary	64: $(-1)^s 0.m \cdot 2^{-1022}$
$ 111\dots1111$	$=0$	$(-1)^s \cdot \infty$
$ 111\dots1111$	$\ne0$	Not a Number (NaN)
all other bit sequences (default)	arbitrary	32: $(-1)^s 1.m \cdot 2^{c-127}$
all other bit sequences (default)	arbitrary	64: $(-1)^s 1.m \cdot 2^{c-1023}$

Source:D. W. Hoffmann: Grundlagen der Technischen Informatik, 2. Auflage; Hanser 2009

Floating-Point Representation: IEEE 754 Formats

The IEEE 754 formats (from 1985) are supported by all standard microprocessors today
Normally, the packed normalized form is used
Some reserved bit patterns for the "biased exponent" $c$ have special meaning:
Null
- $c=000\dots0000$ switches to an unpacked modified normalized form
- Representation of zero by $m=0$
Infinity
- Biased exponent $c=1111\dots111$ and $m=0$
- The sign $s$ decides between $+\infty$ and $-\infty$
- This occurs, for example, when dividing by $0$
Not a Number (NaN)
- Biased exponent $c=1111\dots111$ und $m\ne0$
- This occurs during algebraic operations with undefined or unrepresentable results, such as $0/0$, $0\cdot \infty$, etc.

Floating-Point Representation: IEEE 754 Formats

For the representation of floating point numbers in source code, the notation "52.21e-2" or "52.21E-2" is typically used. The number after the "e" indicates the exponent of the power of ten
This notation is also common in many programming languages

Bytes	Bits	Range of values	Name of data type
4	32	± 1.4e-45 ... 3.403e38	float
8	64	± 4.94e-324 ... 1.798e308	double

Are there any questions?

Please notify me by e-mail if you have questions, suggestions for improvement, or found typos: Contact

Technical Computer ScienceComputer Number Formats

Control Keys

Notation

Contents

Representation of Unsigned Integers

Big-Endian and Little-Endian Storage

Representation of Signed Integer Numbers

Sign-Magnitude Representation

Sign-Magnitude Representation

Sign-Magnitude Representation

Ones' Complement

Ones' Complement

Ones' Complement

Two's Complement

Two's Complement

Two's Complement

Two's Complement

Two's Complement

Two's Complement

Two's Complement

When does an overflow occur in two's complement?

Inverse of the Two's Complement

Representation of Signed Integer Numbers

Representation of Signed Rational Numbers

Fixed-Point Representation

Fixed-Point Representation

Floating-Point Representation

Floating-Point Representation

Floating-Point Representation

Floating-Point Representation

Floating-Point Representation

Floating-Point Representation: IEEE 754 Formats

Floating-Point Representation: IEEE 754 Formats

Floating-Point Representation: IEEE 754 Formats

Are there any questions?

Technical Computer Science
Computer Number Formats