Calculating Floats in Binary

Overview

In a 32-bit floating point number, the first bit is the sign bit, the next eight bits are the exponent, stored as an unsigned integer, and the remaining 23 bits are the significand, whose bits have the fractional values 2^-1, 2^-2, 2^-3, etc. The decimal equivalents are 0.5, 0.25, 0.125, etc., or 1/2, 1/4, 1/8, etc. Because only finite sums of these powers of two can be stored, most real numbers cannot be represented exactly.
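
As a small illustration of that last point, here is a minimal Python sketch. The helper name to_float32 is made up for this example; it relies on the standard struct module to round a value to 32 bits and read it back.

```python
import struct

def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

# 17.125 is a finite sum of powers of two (16 + 1 + 1/8), so it is stored exactly.
exact = to_float32(17.125)

# 0.1 has no finite binary expansion, so the stored value is only close to 0.1.
rounded = to_float32(0.1)

print(exact == 17.125)   # the round trip preserves the value
print(rounded == 0.1)    # the round trip changes the value
print(rounded - 0.1)     # the size of the rounding error
```

The difference printed on the last line is the rounding error the quoted passage below refers to.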

From: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore, the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. The section Relative Error and Ulps describes how it is measured.

Sign	Exponent	Significand	Value
0	1 0 0 0 0 0 1 1	0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0	17.125

Sign Bit - Controls whether the number is positive or negative
Exponent - Controls the power to which the base, 2, is raised
Significand - Fractional value which is multiplied by 2 raised to the exponent
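
The three fields can be pulled out of a real value with a few lines of Python. This is only a sketch: split_fields is a hypothetical helper name, and the slicing assumes the 1 / 8 / 23 bit layout described above.

```python
import struct

def split_fields(value):
    """Return the (sign, exponent, significand) bit strings of value
    when it is stored as a 32-bit float."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{raw:032b}"                    # 32 characters, most significant bit first
    return bits[0], bits[1:9], bits[9:]     # 1 + 8 + 23 bits

sign, exponent, significand = split_fields(17.125)
print(sign, exponent, significand)
```

For 17.125 the three strings should match the row in the table above.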

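As with Signed Integers in Binary, the sign bit is 0 for a positive number and 1 for a negative number. In a float it changes only the sign; the exponent and significand are read the same way in both cases.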

The exponent is read as an ordinary unsigned integer and then offset by -127 (that is, 127 is subtracted from it). For example, 10000000 is 128 as an unsigned integer, so after the offset it is valued at 1. The result is then used as the exponent x in 2^x.
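
The offset can be sketched in Python as follows. The names BIAS and decode_exponent are chosen for this example only, and the sketch assumes an ordinary exponent field; it does not treat the all-zeros and all-ones patterns, which the format reserves for special values.

```python
BIAS = 127  # the offset applied to the stored exponent

def decode_exponent(bits):
    """Convert an 8-bit exponent field such as '10000011' to the power of two it encodes."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 8 binary digits")
    return int(bits, 2) - BIAS

print(decode_exponent("10000000"))  # stored 128
print(decode_exponent("10000011"))  # stored 131
```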

The significand contains a “hidden” bit which is not stored. For normalized numbers it is always on and has a value of 2⁰. This means that every such significand has a value of 1 + some fractional value.
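
To make the hidden bit concrete, the sketch below adds it to the stored fraction bits using exact fractions. The function name decode_significand is hypothetical, and the code assumes a normalized number, where the hidden bit is on.

```python
from fractions import Fraction

def decode_significand(bits):
    """Return 1 + the sum of 2^-i for every stored bit i that is set (i counts from 1)."""
    total = Fraction(1)  # the hidden bit, worth 2^0
    for position, bit in enumerate(bits, start=1):
        if bit == "1":
            total += Fraction(1, 2 ** position)
    return total

# The 23 stored bits from the 17.125 example: bits 4 and 7 are set.
print(decode_significand("0001001" + "0" * 16))
```

Using Fraction rather than float keeps every intermediate value exact, which makes it easier to compare against a hand calculation.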

Calculation

Calculating the value of 01000001 10001001 00000000 00000000 is a bit tedious, but not impossible to do by hand.

Section	Value	Calculations
Sign	0	Positive Number
Exponent	1 0 0 0 0 0 1 1	131 - 127 = 4
Significand	0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0	1 + 2^-4 + 2^-7 = 137/128

Note that 2^-4 can be written as 1/16 or 8/128 and 2^-7 is 1/128. Adding those fractions together with 128/128 gives us 137/128. We now have to multiply 2 raised to the exponent by the significand:

2^4\times\frac{137}{128} = \frac{2192}{128} = 17.125
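
The same arithmetic can be checked mechanically. The following is a minimal Python sketch rather than a definitive decoder: the name decode_float32 is made up for illustration, it handles only normalized numbers, and it uses the bit pattern whose fields appear in the table above.

```python
import struct
from fractions import Fraction

def decode_float32(bits):
    """Decode a 32-character bit string (spaces allowed) as a normalized 32-bit float."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127                      # remove the offset
    significand = 1 + Fraction(int(bits[9:], 2), 2 ** 23)   # hidden bit + stored fraction
    return sign * Fraction(2) ** exponent * significand

pattern = "01000001 10001001 00000000 00000000"
by_hand = decode_float32(pattern)

# Cross-check: let the struct module interpret the same four bytes.
raw_bytes = int(pattern.replace(" ", ""), 2).to_bytes(4, "big")
by_machine = struct.unpack(">f", raw_bytes)[0]

print(by_hand, float(by_hand), by_machine)
```

The exact result is the fraction 137/8, which is the reduced form of 2192/128.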

A second example, this time with the sign bit set, is 11000001 00010001 00010000 00000000:

Section	Value	Calculations
Sign	1	Negative Number
Exponent	1 0 0 0 0 0 1 0	130 - 127 = 3
Significand	0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0	1 + 2^-3 + 2^-7 + 2^-11 = 2321/2048

-1\times2^3\times\frac{2321}{2048} = -\frac{18568}{2048} = -9.06640625
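
The negative example can be reproduced step by step with exact fractions. This is a sketch of the hand calculation only; the variable names are chosen for this example. The final line uses Python's built-in float.hex(), which writes a float as a sign, a hexadecimal significand starting with the hidden 1, and a power of two.

```python
from fractions import Fraction

sign = -1                 # sign bit is 1
exponent = 130 - 127      # stored exponent minus the offset
# Hidden bit plus the three stored bits that are set (positions 3, 7 and 11).
significand = 1 + Fraction(1, 2 ** 3) + Fraction(1, 2 ** 7) + Fraction(1, 2 ** 11)

value = sign * 2 ** exponent * significand

print(significand)         # the fraction from the table
print(float(value))        # the decimal result
print(float(value).hex())  # sign, significand and exponent in one string
```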

