From this list, the gist is that most languages can't process `9999999999999999.0 - 9999999999999998.0`

Why do they output 2 when it should be 1? I bet most people who've never done any formal CS (a.k.a. maths and information theory) are super surprised.

Before you read the rest, ask yourself this: if all you have are zeroes and ones, how do you handle *infinity*?

If we fire up an interpreter that outputs the value when it's typed (like the Swift REPL), we have the beginning of an explanation:

```
Welcome to Apple Swift version 4.2.1 (swiftlang-1000.11.42 clang-1000.11.45.1). Type :help for assistance.
1> 9999999999999999.0 - 9999999999999998.0
$R0: Double = 2
2> let a = 9999999999999999.0
a: Double = 10000000000000000
3> let b = 9999999999999998.0
b: Double = 9999999999999998
4> a-b
$R1: Double = 2
```

Whew, it's not that the languages can't handle a simple subtraction, it's just that `a` is *typed* as `9999999999999999` but *stored* as `10000000000000000`.

If we used integers, we'd have:

```
5> 9999999999999999 - 9999999999999998
$R2: Int = 1
```

Are the decimal numbers broken? 😱
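The same behavior is easy to reproduce outside Swift. Here's a quick sketch in Python (whose `float` is also a 64-bit double), showing that the literal is rounded before the subtraction even happens:

```python
a = 9999999999999999.0
b = 9999999999999998.0

print(a)      # 1e+16 — the literal was already rounded when parsed
print(a - b)  # 2.0

# Python's integers are arbitrary precision, so the integer version is exact:
print(9999999999999999 - 9999999999999998)  # 1
```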

##### A detour through number representations

Let's look at a byte. This is the fundamental unit of data in a computer and is made of 8 bits, all of which can be 0 or 1. It ranges from `00000000` to `11111111` (`0x00` to `0xff` in hexadecimal, `0` to `255` in decimal, homework as to why and how it works like that due by Monday).

Put like that, I hope it's obvious that the question "yes, but how do I represent the integer `999` on a byte?" is meaningless. You *can* decide that `00000000` means `990` and count up from there, or you can associate arbitrary values to the 256 possible combinations and *make* `999` be one of them, but you can't have both the `0`-`255` range **and** `999`. You have a finite number of possible values and that's it.

Of course, that's on 8 bits (hence the 256-color palette on old games). On 16, 32, 64 or bigger width memory blocks, you can store up to `2ⁿ` different values, and that's it.
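A quick sketch of that "finite palette" idea (not tied to any particular language's types):

```python
# Number of distinct values representable on n bits: 2**n, no more.
for bits in (8, 16, 32, 64):
    print(f"{bits:>2} bits -> {2 ** bits} values")

# On 8 bits, going one past 255 wraps around instead of reaching 256:
print((255 + 1) % 256)  # 0
```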

##### The problem with decimals

While it's relatively easy to grasp the concept of infinity by looking at "how high can I count?", it's less intuitive to notice that there are *at least as many numbers between 0 and 1* as there are integers.

So, if we have a finite number of possible values, how do we decide which ones make the cut when talking decimal parts? The smallest? The most common? Again, as a stupid example, on 8 bits:

- maybe we need `0.01`...`0.99` because we're doing accounting stuff
- maybe we need `0.015`, `0.025`, ..., `0.995` for rounding reasons
- we'll just encode the numeric part on 8 bits (`0`-`255`), and the decimal part as above

But that's already 99+99 values taken up. That leaves us 58 possible values for *the rest of infinity.* And that's not even mentioning the totally arbitrary nature of the selection. This way of representing numbers is historically the first one and is called "fixed" representation. There are many ways of choosing how the decimal part behaves and a lot of headache when coding how the simple operations work, not to mention the complex ones like square roots and powers and logs.
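To make the fixed idea concrete, here's a minimal toy sketch (my own example, not the historical format): store an amount as an integer count of hundredths, so the `0.01`...`0.99` steps are exact and addition is plain integer addition.

```python
def to_fixed(x: float) -> int:
    """Scale to hundredths and round to the nearest integer."""
    return round(x * 100)

def from_fixed(f: int) -> float:
    return f / 100

# 0.1 + 0.2 is famously not 0.3 in binary floating point...
print(0.1 + 0.2)                                  # 0.30000000000000004
# ...but in our hundredths-based fixed representation it is exact:
print(from_fixed(to_fixed(0.1) + to_fixed(0.2)))  # 0.3
```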

##### Floats (IEEE 754)

To make it simple for chips that perform the actual calculations, floating point numbers (that's their name) have been defined using two parameters:

- an integer `n`
- a power `p` (of base `b`)

Such that we can have `n x bᵖ`, for instance `15.3865` is `153865 x 10⁻⁴`. The question is, how many bits can we use for the `n` and how many for the `p`.

The standard is to use 1 bit for the sign (+ or -), 23 bits for `n`, and 8 for `p`, which uses 32 bits total (we like powers of two), in base 2, with `n` actually stored as `1.n`. That gives us a range of ~8 million values for `n`, and powers of 2 from -126 to +127 due to some special cases like infinity and NotANumber (NaN).

`(-1 or 1)(2^[-126...127])(1.[one of the 8 million values])`
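That layout can be inspected directly. The sketch below uses Python's `struct` module (big-endian packing) to pull the sign, power and `1.n` fields out of a number forced into 32 bits:

```python
import struct

def decode_float32(x: float):
    """Split a number into the (sign, power, 1.n) fields of a 32-bit float."""
    bits, = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                          # 1 bit
    power = ((bits >> 23) & 0xFF) - 127        # 8 bits, biased by 127
    fraction = 1 + (bits & 0x7FFFFF) / 2**23   # 23 bits, implicit leading 1
    return sign, power, fraction

print(decode_float32(0.1))  # sign 0, power -4, fraction close to 1.6
```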

In theory, we have magnitudes from roughly 10⁻⁴⁵ up to 10³⁸, but some numbers can't be represented in that form. For instance, if we look at the largest float smaller than 1, it's `0.9999999404`. Anything between that and 1 has to be rounded. Again, infinity can't be represented by a finite number of bits.
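Python exposes these gaps at runtime (for its native 64-bit doubles): `math.nextafter` (Python 3.9+) walks to the neighbouring representable value, so we can see exactly where the reals get skipped.

```python
import math

below_one = math.nextafter(1.0, 0.0)  # largest double strictly below 1.0
above_one = math.nextafter(1.0, 2.0)  # smallest double strictly above 1.0

print(below_one)        # 0.9999999999999999
print(above_one - 1.0)  # 2.220446049250313e-16, the machine epsilon
```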

##### Doubles

The floats allow for "easy" calculus (by the computer at least) and are "good enough" with a precision of 7.2 decimal places on average. So when we needed more precision, someone said "hey, let's use 64 bits instead of 32!". The only thing that changes is that `n` now uses 52 bits and `p` 11 bits.

Incidentally, "double" refers more to double *size* than double *precision*, even though the number of decimal places does jump to 15.9 on average.
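Python reports those double-precision parameters at runtime via `sys.float_info`:

```python
import sys

print(sys.float_info.mant_dig)  # 53: the 52 stored bits of n plus the implicit 1
print(sys.float_info.dig)       # 15: decimal digits guaranteed to survive a round-trip
print(sys.float_info.max)       # about 1.8e308, the double-precision ceiling
```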

We still have 2³² times more values to play with, and that does fill some annoying gaps in the infinity, but not all. Famously (and annoyingly), 0.1 doesn't work at any precision size because of the base 2. In 32-bit floats, it's stored as 0.100000001490116119384765625, like this:

`(1)(2⁻⁴)(1.600000023841858)`
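The 64-bit version of that story is easy to see from Python, where `decimal.Decimal` prints the exact value a float actually stores:

```python
from decimal import Decimal

# The exact double stored for the literal 0.1:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Or as the exact fraction n / 2^p it really is:
print((0.1).as_integer_ratio())  # (3602879701896397, 36028797018963968)
```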

Similarly, after double size (a.k.a. doubles), we have quadruple size (a.k.a. quads), with 112 bits for `n` and 15 for `p`, for a total of 128 bits.

##### Back to our problem

Our value is `9999999999999999.0`. The closest possible value encodable in double-size floating point is actually `10000000000000000`, which should now make some kind of sense. It is confirmed by Swift when separating the two sides of the calculus, too:

```
2> let a = 9999999999999999.0
a: Double = 10000000000000000
```

Our big brain so good at maths knows that there is a difference between these two values, and so does the computer. It's just that using doubles, it can't store it. Using floats, `a` would be rounded to `10000000272564224`, which isn't exactly better. Quads aren't used regularly yet, so no luck there.
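That float rounding can be checked by forcing the value through 32 bits (again with Python's `struct` as a stand-in for an actual 32-bit `Float` variable):

```python
import struct

# Round-trip the value through a 32-bit float and see what survives:
as_float32 = struct.unpack(">f", struct.pack(">f", 9999999999999999.0))[0]
print(int(as_float32))  # 10000000272564224
```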

It's funny because this is an operation that we puny humans can do very easily, even those humans who say they suck at maths, and yet those touted computers with their billions of math operations per second can't work it out. Fair enough.

The kicker is, there is a literal *infinity* of examples such as this one, because trying to represent infinity in a finite number of digits is impossible.