
Floating-Point Numbers

So far, we’ve looked at how computers store whole numbers (“unsigned integers”) and integers. Now it’s time to consider what happens if we want to store numbers that have fractional components. These kinds of numbers are important in real-world calculations, such as those that work with money or scientific data.

Real Numbers

In mathematics, a real number is a value on the continuous number line, which includes the integers and every point between them. Between any two integers, there are an infinite number of real numbers. For example, between 0 and 1, there is 0.1. There are also 0.01, 0.001, 0.0001, 0.0002, 0.5, 0.5040302, and so forth.

Whenever we write a real number using a decimal point, we normally end the number somewhere, even if that end isn’t precise. For example, we might write pi as 3.14, or 3.14159, or 3.14159265358979323846. Each of these three representations of pi demonstrates increasing precision, which is a measure of how close the number is to the actual value of the thing it represents. For some things, we need extremely precise numbers. Other things are “close enough for government work.”

If we had a way to list all the real numbers in existence, the number of numbers to be listed would be uncountably infinite. That means there is no way to itemize the list of real numbers using whole numbers to mark off everything on the list. Intuitively, this makes sense if you consider the numbers between 1 and 2. If you start counting at 1, then the second number (list item 2) might be 1.000001. However, there are an infinite number of real numbers between 1 and 1.000001! Therefore, you would have to go back and make, say, 1.0000001 list item 2. However, there are also an infinite number of real numbers between 1.0000001 and 1.000001, so you’d have to keep fixing the list index… forever.

Since computers are finite machines, developed by human beings with finite lifetimes on a planet with finite resources, we cannot design a computer that is capable of representing any arbitrary real number. Instead, we have to make some trade-offs. We need to fit the number into a finite (and relatively small) amount of memory if we’re going to be able to process it efficiently.

An approximation of a real number, albeit with limited precision and other constraints, is called a floating-point number in computer terminology. Given a fixed number of bits in which to store a floating-point number, we have to strike a balance between the level of precision that we can store and the range of values that we can store within a single number. To get to the point where we can analyze this trade-off, we first need to look at how real numbers work in math.
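Before we dive into the math, here is a quick peek at what that balance looks like in practice. This is a minimal illustrative sketch in Python, whose 64-bit float type (IEEE 754 double precision on essentially all modern hardware) exposes its limits through the standard library’s sys.float_info:

import sys

# The limits of Python's 64-bit float show the precision/range trade-off:
print(sys.float_info.max)      # largest finite value, about 1.8 * 10^308
print(sys.float_info.min)      # smallest positive normalized value, about 2.2 * 10^-308
print(sys.float_info.dig)      # decimal digits reliably preserved: 15
print(sys.float_info.epsilon)  # gap between 1.0 and the next float, about 2.2 * 10^-16

In other words, a 64-bit float covers an enormous range of magnitudes, but it carries only about 15 significant decimal digits anywhere in that range.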

The Radix Point

In math, you were taught that numbers like 32.5 and 2.718 contained a whole portion, which is the part of the number before the decimal point, and a fractional portion, which is the part of the number after the decimal point. The separator, or decimal point, is usually written as a dot (.) in English-speaking countries. In other countries, however, the role of dots and commas in numbers is reversed, and a comma is used to separate the whole part from the fraction. These two representations are therefore equivalent:

1.25
1,25

Both of the above values mean 1 and 1/4. We see a similar localization occur with currency. For example, in South Carolina, we would write $1,023.56. However, in parts of Europe, one might see €1.023,56. The numbers are the same: only the decorations and currencies (dollars versus Euros) are different.

As mentioned a few paragraphs ago, we call that final “.” or “,” the decimal point. However, it is more generically called the radix point. It becomes a decimal point in our customary usage only because we’re using base 10 numbers. In binary, we would call this mark the binary point. Each other number base would have its own term (octal point, anyone?), so in computing, it’s just easier to say “radix point.”

Decimal Numbers

Let’s first consider what numbers after the radix point in decimal mean. If I write the number (1.1)_10, it is equivalent to the expression 1 + 1/10. Similarly, (1.01)_10 would be the result of adding 1 + 1/100. Each place we move to the right after the radix point represents the next smaller power of 10: tenths, then hundredths, then thousandths, and so on.

Now if you remember the formulas we used for counting, you would know that counting whole numbers to the left of the radix point starts by raising the number base to the zero power (which is always 1), followed by the number base to the first power, then the second power, and so forth, moving from right to left. It turns out this same model works when moving to the right of the decimal point, only we view it now as counting down from left to right (which is the same thing as counting up from right to left).

Consider the value (531.24)_10. The 1 is in the position that counts quantities of 10^0, which we call the one’s place. Going left, as we did previously, the 3 counts quantities of 10^1, which is why the position is called the ten’s place. One more step to the left brings us to the 5, which counts quantities of 10^2, or hundreds. For this reason, we say that the part of the number to the left of the radix point is five hundred thirty-one. There are 5 hundreds, 3 tens, and 1 one.

Now when we move to the right in (531.24)_10, the 2 is located in the position that counts quantities of 10^-1. Recall that negative exponents can be written as fractions with positive exponents in the denominator, so 10^-1 is equal to 1/(10^1). Similarly, the 4 is in the position that counts quantities of 10^-2, or 1/(10^2), which simplifies to 1/100. If we wrote out the words representing the whole quantity, we would write five hundred thirty-one and twenty-four hundredths. Fortunately, we would really just write 531.24, since that’s a lot easier. We’d also normally say something like, “five thirty-one point two four.”
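
To make the place-value arithmetic concrete, here is a minimal Python sketch that evaluates (531.24)_10 position by position (the digit_positions name is just for illustration, not anything standard):

# Pair each digit of 531.24 with the exponent of its position.
digit_positions = [
    (5, 2),   # hundred's place:   10^2
    (3, 1),   # ten's place:       10^1
    (1, 0),   # one's place:       10^0
    (2, -1),  # tenth's place:     10^-1
    (4, -2),  # hundredth's place: 10^-2
]

value = sum(digit * 10**exponent for digit, exponent in digit_positions)
print(value)  # 531.24 (possibly with a tiny rounding tail; see Rounding below)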

Binary Numbers

Numbers with radix points in binary work exactly the same way as they do in decimal, only we’re working in base two instead of in base ten. Consider the number:

(101.01101)_2

We already know that the 101 part before the radix point (binary point, to be specific) is processed right to left as:

1*2^0 + 0*2^1 + 1*2^2 = 1*1 + 0*2 + 1*4 = 1 + 0 + 4 = 5

To the right of the radix point, we have negative exponents, and we work from left to right this time:

0*2^-1 + 1*2^-2 + 1*2^-3 + 0*2^-4 + 1*2^-5 =
0*(1/2) + 1*(1/4) + 1*(1/8) + 0*(1/16) + 1*(1/32) =
0 + 0.25 + 0.125 + 0 + 0.03125 = 0.40625

Thus, (101.01101)_2 = (5.40625)_10.

The principle behind the radix point is the same regardless of the base. What changes are the values of the positions on either side of the radix point: to the left, each position is a successively higher power of the base, ending at the one’s place; to the right, each position is a successively smaller fraction of the base.
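
This place-value rule is simple enough to automate. The following is an illustrative Python sketch (binary_to_decimal is a hypothetical helper written for this page, not a library function) that evaluates a binary numeral containing a binary point:

def binary_to_decimal(text):
    # Split the numeral at the binary point; partition returns empty
    # strings for missing parts, so "101" alone also works.
    whole, _, fraction = text.partition(".")
    value = 0.0
    # Left of the point: powers 2^0, 2^1, 2^2, ... moving right to left.
    for exponent, bit in enumerate(reversed(whole)):
        value += int(bit) * 2**exponent
    # Right of the point: powers 2^-1, 2^-2, ... moving left to right.
    for position, bit in enumerate(fraction, start=1):
        value += int(bit) * 2**-position
    return value

print(binary_to_decimal("101.01101"))  # 5.40625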

Rounding

A key issue with floating-point numbers is that there is always a limit to precision, which means that floating-point values must usually be rounded to the nearest value that the format can represent. Rounding errors can become problematic. Consider, for example, the binary representation of (0.1)_10:

(0.000110011001100110011…)_2

This representation results in a repeating pattern that cannot be represented precisely in any finite amount of storage. Therefore, there will necessarily be some error. If a small value, such as (0.01)_10 or (0.001)_10, is repeatedly added to a number, the rounding error may eventually accumulate to the point where the result becomes noticeably incorrect. Various software-based techniques can be used to mitigate such errors, and exceptional care must be taken when handling financial transactions (most of which involve small negative powers of ten for cents and fractions of cents).
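
You can watch this error accumulate in a few lines of Python. This is a minimal sketch; the exact trailing digits printed may vary, but the comparison result will not:

from decimal import Decimal

# Adding (0.1)_10 repeatedly with binary floating point: every addition
# rounds, and the tiny errors pile up.
total = 0.0
for _ in range(1000):
    total += 0.1
print(total)         # close to, but not exactly, 100.0
print(total == 100)  # False

# One common mitigation for money-like values: the standard library's
# decimal module performs exact base-10 arithmetic.
exact = sum(Decimal("0.1") for _ in range(1000))
print(exact)         # 100.0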
