Double Precision (Not)
From this list, the gist is that most languages can't process 9999999999999999.0 - 9999999999999998.0
Why do they output 2 when it should be 1? I bet most people who've never done any formal CS (a.k.a maths and information theory) are super surprised.
Before you read the rest, ask yourself this: if all you have are zeroes and ones, how do you handle infinity?
If we fire up an interpreter that outputs the value when it's typed (like the Swift REPL), we have the beginning of an explanation:
Welcome to Apple Swift version 4.2.1 (swiftlang-1000.11.42 clang-1000.11.45.1). Type :help for assistance.
1> 9999999999999999.0 - 9999999999999998.0
$R0: Double = 2
2> let a = 9999999999999999.0
a: Double = 10000000000000000
3> let b = 9999999999999998.0
b: Double = 9999999999999998
4> a-b
$R1: Double = 2
Whew, it's not that the languages can't handle a simple subtraction, it's just that a is typed as 9999999999999999 but stored as 10000000000000000.
If we used integers, we'd have:
5> 9999999999999999 - 9999999999999998
$R2: Int = 1
Are the decimal numbers broken? 😱
A detour through number representations
Let's look at a byte. This is the fundamental unit of data in a computer and is made of 8 bits, each of which can be 0 or 1. It ranges from 00000000 to 11111111 (0x00 to 0xff in hexadecimal, 0 to 255 in decimal, homework as to why and how it works like that due by Monday).
Put like that, I hope it's obvious that the question "yes, but how do I represent the integer 999 on a byte?" is meaningless. You can decide that 00000000 means 990 and count up from there, or you can associate arbitrary values to the 256 possible combinations and make 999 be one of them, but you can't have both the 0-255 range and 999. You have a finite number of possible values and that's it.
Of course, that's on 8 bits (hence the 256-color palette in old games). On 16, 32, 64 or wider memory blocks, you can store up to 2ⁿ different values, where n is the number of bits, and that's it.
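To make the "finite number of values" point concrete, here is a minimal Swift sketch using the standard UInt8 and UInt16 types. The variable names are only illustrative.

```swift
// A byte has exactly 2^8 = 256 possible values, no more.
let smallest = UInt8.min                              // 0
let largest = UInt8.max                               // 255
let valueCount = Int(largest) - Int(smallest) + 1     // 256

// 999 does not fit in a byte: the failable initializer gives back nil.
let candidate = 999
let tooBig = UInt8(exactly: candidate)                // nil

// A wider block has more values, but still a finite number of them.
let sixteenBitMax = UInt16.max                        // 65535, i.e. 2^16 - 1
```

Going wider only moves the limit, it never removes it.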
The problem with decimals
While it's relatively easy to grasp the concept of infinity by looking at "how high can I count?", it's less intuitive to notice that there are also infinitely many numbers between 0 and 1, at least as many as there are integers.
So, if we have a finite number of possible values, how do we decide which ones make the cut when talking decimal parts? The smallest? The most common? Again, as a stupid example, on 8 bits:
- maybe we need 0.01...0.99 because we're doing accounting stuff
- maybe we need 0.015, 0.025, ..., 0.995 for rounding reasons
- We'll just encode the numeric part on 8 bits (0-255), and the decimal part as above
But that's already 99+99 values taken up. That leaves us 58 possible values for the rest of infinity. And that's not even mentioning the totally arbitrary nature of the selection. This way of representing numbers is historically the first one and is called "fixed-point" representation. There are many ways of choosing how the decimal part behaves and a lot of headaches when coding how the simple operations work, not to mention the complex ones like square roots and powers and logs.
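As a rough illustration of the fixed-point idea, here is a small Swift sketch that stores a value as a whole number of hundredths. The Hundredths type and its members are hypothetical, made up for this example, and it only deals with non-negative amounts.

```swift
// Hypothetical fixed-point type: the value is a whole number of hundredths.
struct Hundredths {
    var raw: Int   // 1234 means 12.34

    // Addition is plain integer addition, so nothing gets rounded.
    static func + (lhs: Hundredths, rhs: Hundredths) -> Hundredths {
        return Hundredths(raw: lhs.raw + rhs.raw)
    }

    // Text form for non-negative values, always with two decimal digits.
    var text: String {
        let whole = raw / 100
        let fraction = raw % 100
        return "\(whole)." + (fraction < 10 ? "0" : "") + "\(fraction)"
    }
}

let price = Hundredths(raw: 1999)     // 19.99
let shipping = Hundredths(raw: 501)   // 5.01
let total = price + shipping          // raw is 2500, text is "25.00"
```

Addition is trivial here, but multiplication, division and anything finer than a hundredth each need their own rounding rules, which is where the headaches start.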
Floats (IEEE 754)
To make it simple for chips that perform the actual calculations, floating point numbers (that's their name) have been defined using two parameters:
- an integer n
- a power p (of base b)
Such that we can have n x bᵖ, for instance 15.3865 is 153865 x 10⁻⁴. The question is, how many bits can we use for the n and how many for the p.
The standard is to use 1 bit for the sign (+ or -), 23 bits for n and 8 for p, for 32 bits total (we like powers of two), in base 2, where n is actually read as 1.n. That gives us ~8 million values of n for each power, and powers of 2 from -126 to +127, because a few of the 256 exponent codes are reserved for special cases like infinity and NotANumber (NaN).
In theory, we have numbers whose magnitude goes from roughly 10⁻⁴⁵ to 10³⁸, but most numbers in that range can't be represented exactly in that form. For instance, if we look at the largest number smaller than 1, it's 0.9999999404. Anything between that and 1 has to be rounded. Again, infinity can't be represented by a finite number of bits.
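The sign / n / p layout described above can be inspected directly in Swift. This is a sketch using the standard bitPattern, exponent and nextDown properties of Float. The variable names and the choice of 15.3865 as the sample value are only for illustration.

```swift
let x: Float = 15.3865
let bits = x.bitPattern                    // the raw 32 bits, as a UInt32

let signBit = bits >> 31                   // 1 bit: 0 means positive
let exponentBits = (bits >> 23) & 0xFF     // 8 bits, stored with an offset of 127
let fractionBits = bits & 0x7FFFFF         // 23 bits: the "n" in 1.n

// Removing the offset gives the actual power of 2.
let power = Int(exponentBits) - 127        // 3, because 8 <= 15.3865 < 16

// The largest Float strictly below 1: it is exactly 1 - 2^-24.
let justBelowOne = Float(1).nextDown
```

The offset of 127 on the stored exponent is how negative powers fit in an unsigned 8-bit field.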
Doubles
Floats allow for "easy" calculations (for the computer at least) and are "good enough", with a precision of about 7.2 significant decimal digits on average. So when we needed more precision, someone said "hey, let's use 64 bits instead of 32!". The only thing that changes is that n now uses 52 bits and p 11 bits.
Incidentally, double means double size more than double precision, even though the number of significant digits does jump to 15.9 on average.
We have 2³² times more values to play with, and that does fill some annoying gaps in the infinity, but not all. Famously (and annoyingly), 0.1 can't be stored exactly at any precision because of the base 2. In a 32-bit float, it's stored as 0.100000001490116119384765625, like this:
(+1) x (2⁻⁴) x (1.600000023841858)
Similarly, after double size (aka doubles), we have quadruple size (aka quads), with 15 bits for p and 112 bits for n (plus the sign bit), for a total of 128 bits.
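The 0.1 case is easy to check in Swift. The sketch below only uses the standard exponent and significand properties. Widening a Float to a Double loses nothing, so it reveals the value that was actually stored.

```swift
let tenth: Float = 0.1

// Float -> Double is lossless, so this is the value really held in the 32 bits.
let exact = Double(tenth)            // 0.100000001490116119384765625

let scale = tenth.exponent           // -4, the power of 2
let mantissa = tenth.significand     // the 1.n part, the Float closest to 1.6

// Doubles have the same issue, just further down the digits.
let sumIsExact = (0.1 + 0.2 == 0.3)  // false
```

More bits push the error further down the digits, but 0.1 has no finite expansion in base 2, so it never disappears.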
Back to our problem
Our value is 9999999999999999.0. The closest possible value encodable in double size floating point is actually 10000000000000000, which should now make some kind of sense. It is confirmed by Swift when separating the two sides of the calculation, too:
2> let a = 9999999999999999.0
a: Double = 10000000000000000
Our big brain so good at maths knows that there is a difference between these two values, and so does the computer. It's just that using doubles, it can't store it. Using floats, a will be rounded to 10000000272564224 which isn't exactly better. Quads aren't used regularly yet, so no luck there.
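One way to see why the result is 2 is to ask Swift for the spacing between neighbouring doubles at that magnitude. This is a small sketch using the standard ulp and nextDown properties, with illustrative variable names.

```swift
let a = 9999999999999999.0                     // a Double literal

// Both literals end up on the very same Double.
let roundedUp = (a == 10000000000000000.0)     // true

// Around 10^16, neighbouring Doubles are 2 apart.
let gap = a.ulp                                // 2.0
let below = a.nextDown                         // 9999999999999998.0

// Doubles can hold every integer exactly only up to 2^53.
let limit = 9007199254740992.0                 // 2^53
let collides = (limit + 1 == limit)            // true
```

Since 9999999999999999 sits exactly between two representable values, it is rounded to one of them, and the subtraction then honestly returns the gap between them.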
It's funny because this is an operation that we puny humans can do very easily, even those humans who say they suck at maths, and yet those touted computers with their billions of math operations per second can't work it out. Fair enough.
The kicker is, there is a literal infinity of examples such as this one, because trying to represent infinity in a finite number of digits is impossible.