This is the last part of a three-part series. In part 1, I tried to make sense of how it works and what we are trying to achieve, and in part 2, we set up the training loop.

#### Model Predictions

We have a trained model. Now what?

Remember, a model is a series of giant matrices that takes an input like the ones you trained it on, and spits out a list of probabilities over the outputs you trained it on. So all you have to do is feed it a new input and see what it tells you:

let input = [1.0, 179.0, 115.0]
let unlabeled : Tensor<Float> = Tensor<Float>(shape: [1, 3], scalars: input)
let predictions = model(unlabeled)
let logits = predictions[0]
let classIdx = logits.argmax().scalar! // we take only the best guess
print(classIdx)
17

Cool.

Cool, cool.

What?

Models deal with numbers. I am the one who assigned numbers to words to train the model on, so I need a translation layer. That's why I kept my contents structure around: I need it for its vocabulary map.
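Before the real code, here is what that translation layer amounts to, as a hypothetical, self-contained miniature (the article's version goes through the contents structure and word hashes; this sketch keys the dictionary on the words themselves):

```swift
// The model only deals in numbers, so we keep both directions around.
let vocabulary = ["the", "of", "and", "flocks", "settlement"]   // index -> word
let wordToIndex = Dictionary(uniqueKeysWithValues:
    vocabulary.enumerated().map { ($0.element, $0.offset) })    // word -> index

// word -> number: what the model is fed
let input = ["of", "flocks", "settlement"].map { Float(wordToIndex[$0]!) }
// input == [1.0, 3.0, 4.0]

// number -> word: how a predicted class index is read back
let classIdx = 3
let predicted = vocabulary[classIdx]    // "flocks"
```

The real vocabulary is sorted by word frequency, but the round trip is the same idea.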

The real code:

let w1 = "on"
let w2 = "flocks"
let w3 = "settlement"

var indices = [w1, w2, w3].map { Float(contents.indexHelper[$0.hash] ?? 0) }

let step2 = rnn2.callAsFunction(step1).differentiableMap({ $0.cell })
let last = withoutDerivative(at: step2[0])
let red = step2.differentiableReduce(last, { (p, e) -> Tensor<Float> in return e })
return red
}

@differentiable
func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
let step2out = runThrough(input)
let step3 = matmul(step2out, correction)
return step3
}
}

The RNN/LSTM layers have been talked about, but what are these two functions? callAsFunction is the only one that is actually needed. I have just decided to split the algorithm in two: the part where I "just" pass through layers, and the part where I format the output. Everything in runThrough could be written at the top of callAsFunction. We follow the steps outlined previously; it all seems logical, even if the details aren't quite clear yet.

What is it with the @noDerivative and @differentiable annotations? Because we are dealing with a structure (model, layer, etc.) that not only should but will be adjusted over time, they are a way to tell the system which parts are important to that adjustment:

• all properties except those marked @noDerivative will potentially be nudged, so it makes sense to mark the number of inputs as immutable, and the rest as "nudgeable"
• all the functions that calculate something used in the "nudging" need specific mathematical properties that make the change non-random. We need to know where we are and where we were going. We need a position and a speed; we need a value and its derivative.

Ugh, maths. Yeah. I am obviously oversimplifying everything to avoid scaring everyone away from the get-go, but the idea should make sense if you look at it this way:

• Let's take a blind archer trying to shoot an arrow at a target
• You ask them to shoot, and then you'll correct them based on where the arrow lands
• It hits the far left of the target
• You tell them to nudge the aim to the right
• The problem is that "more right" isn't enough information...
You need to tell them "to the right, a little" (a new position and some information useful for later, you'll see)
• The arrow lands slightly to the right of the center
• You tell the archer to aim to the left, but less than the movement they just made to the right

Two pieces of information: one relative to a direction, and one relative to the rate of change. The other name for the rate of change is the derivative. Standard derivatives relate speed to position (we are here, now we are there, and finally we are there, and the rate of change slowed, so the next position won't be as far from this one as this one was from the previous one), or acceleration to speed (when moving, if your speed goes up and up and up, you have a positive rate of change: you accelerate).

That's why passing through a layer should preserve both: the actual values, and the speed at which we are changing them. Hence the @differentiable annotation.

(NB for all you specialists in the field reading this piece... yes, I know. I'm trying to make things more palatable.)

"But wait", say the most eagle-eyed among you, "I can see a withoutDerivative in that code!"

Yes. RNN is peculiar in that it doesn't try to coerce the dimensions of its results: it spits out all the possible variants it has calculated. But in practice, we need only the last one. Taking one possible outcome out of many cancels the @differentiable nature of the function, because we actually lose some information. This is why we only partially count on the RNN's hidden parameters to give us a "good enough" result, and need to incorporate extra weights and biases that are differentiable.
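The whole nudge-with-a-derivative idea fits in a few lines of plain Swift. Here is a hypothetical one-dimensional archer (target, aim and learningRate are made-up names and values, not the article's model):

```swift
// A hypothetical one-dimensional archer: `aim` is the value being nudged,
// and the derivative of the squared error gives the direction and the rate.
let target = 3.0
var aim = -5.0
let learningRate = 0.1

for _ in 0..<100 {
    let derivative = 2 * (aim - target)   // d/d(aim) of (aim - target)^2
    aim -= learningRate * derivative      // nudge against the slope, a little
}
// aim has crept to (very nearly) 3.0
```

Notice that the size of the correction shrinks as the error shrinks: that is exactly the "less than the movement you just made" instruction from the metaphor.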
The two parts of the correction matrix will retain the nudge speed, as well as reshape the output matrix to match the labels. Matrix addition and multiplication are a bit beyond the scope here as well (and quite frankly a bit boring), but that last step ( step3 in the code ) basically transforms a 512x512x<number of labels> matrix into a 2x<number of labels> one: one column to give us the final probabilities, one for each possible label.

If you've made it this far, congratulations, you've been through the hardest part.

#### Model Training

OK, we have the model we want to use to represent the various orders in the data. How do we train it?

To continue with the blind archer metaphor, we need the piece of code that acts as the "corrector". In ML, it's called the optimizer. We need to give it what the archer is trying to do, a way to measure how far off the mark the archer is, and a sense of how stern it should be (do we do a lot of small corrections, or fewer large ones?)

The measure of the distance is called the "cost" function, or the "accuracy" function. Depending on how we look at it, we want to make the cost (or error) as low as possible, and the accuracy as high as possible. They are obviously linked, but can be expressed in different units ("you are 3 centimeters off" and "you are closer by 1%"). Generally, loss has little to no meaning outside of the context of the layers (is 6 far? close? because words aren't sorted in any meaningful way, "we are 6.2 words away from the ideal word" doesn't mean much), while accuracy is more like a satisfaction percentage (we are 93% satisfied with the result, whatever that means).
func accuracy(predictions: Tensor<Int32>, truths: Tensor<Int32>) -> Float {
return Tensor<Float>(predictions .== truths).mean().scalarized()
}

let predictions = model(aBunchOfFeatures)
print("Accuracy: \(accuracy(predictions: predictions.argmax(squeezingAxis: 1), truths: aBunchOfLabels))")

Accuracy: 0.10143079

and the loss:

let predictions = model(aBunchOfFeatures)
let loss = softmaxCrossEntropy(logits: predictions, labels: aBunchOfLabels)
print("Loss test: \(loss)")

Loss test: 6.8377414

In more human terms, the best prediction we have is 10% satisfying, because the result is 6.8 words away from the right one. 😬

Now that we know how to measure how far off the mark we are (in two different ways), we need to make a decision about 3 things:

• Which kind of optimizer we want to use (we'll use Adam, a good algorithm for our problem, but others exist. In our archer metaphor, it's a gentle but firm voice on the corrections, rather than a barking one that might progress rapidly at first, then annoy the hell out of the archer)
• What learning rate we want to use (do we correct a lot of times in tiny increments, or in bigger increments that take less time overall, but might overcorrect?)
• How many tries we give the system to get as close as possible

You can obviously see why the last two parameters are hugely important, and very hard to figure out. For some problems, it might be better to use big steps in case we find ourselves stuck; for others, it might be better to always get closer to the target, but by smaller and smaller increments. It's an art, honestly. Here, I've used a learning rate of 0.001 (tiny) and 500 tries (medium), because if there is no way to figure out the structure of the text, I want to know it fast (fewer steps), but I do NOT want to overshoot (small learning rate).
Let's set up the model and the correction matrices:

var weights = Tensor<Float>(randomNormal: [512, contents.vocabulary.count]) // random probabilities
var biases = Tensor<Float>(randomNormal: [contents.vocabulary.count]) // random bias
var model = TextModel(input: 3, hidden: 512, output: contents.vocabulary.count, weights: weights, biases: biases)

Now let's set up the training loop and run it:

let epochCount = 500
var trainAccuracyResults: [Float] = []
var trainLossResults: [Float] = []
var randomSampleSize = contents.original.count/15
var randomSampleCount = contents.original.count / randomSampleSize
print("Doing \(randomSampleCount) samples per epoch")
for epoch in 1...epochCount {
var epochLoss: Float = 0
var epochAccuracy: Float = 0
var batchCount: Int = 0
for training in contents.randomSample(splits: randomSampleCount) {
let (sampleFeatures, sampleLabels) = training
let (loss, grad) = model.valueWithGradient { (model: TextModel) -> Tensor<Float> in
let logits = model(sampleFeatures)
return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
}
optimizer.update(&model, along: grad)
let logits = model(sampleFeatures)
epochAccuracy += accuracy(predictions: logits.argmax(squeezingAxis: 1), truths: sampleLabels)
epochLoss += loss.scalarized()
batchCount += 1
}
epochAccuracy /= Float(batchCount)
epochLoss /= Float(batchCount)
trainAccuracyResults.append(epochAccuracy)
trainLossResults.append(epochLoss)
if epoch % 10 == 0 {
print("avg time per epoch: \(t.averageDeltaHumanReadable)")
print("Epoch \(epoch): Loss: \(epochLoss), Accuracy: \(epochAccuracy)")
}
}

A little bit of explanation:

• We will try 500 times ( epochCount )
• At each epoch, I want to test and nudge on 15 different combinations of trigrams. Why? Because it avoids the trap of optimizing for some specific turns of phrase
• We apply the model to the sample, calculate the loss and the gradient, and update the model with where we calculate we should go next

What does that give us?
Doing 15 samples per epoch
Epoch 10: Loss: 6.8377414, Accuracy: 0.10143079
Epoch 20: Loss: 6.569199, Accuracy: 0.10564535
Epoch 30: Loss: 6.412607, Accuracy: 0.10802801
Epoch 40: Loss: 6.2550464, Accuracy: 0.10751916
Epoch 50: Loss: 6.0366735, Accuracy: 0.11123683
...
Epoch 490: Loss: 1.1177399, Accuracy: 0.73812264
Epoch 500: Loss: 0.5172857, Accuracy: 0.86724746

We like to keep these values in an array to graph them. What does it look like? We can see that despite the dips and spikes, because we change the samples often and don't try any radical movement, we trend toward better and better results. We don't get stuck in a ditch.

Next part, we'll see how to use the model. Here's a little spoiler: I asked it to generate some random text:

on flocks settlement or their enter the earth; their only hope in their arrows, which for want of it, with a thorn. and distinction of their nature, that in the same yoke are also chosen their chiefs or rulers, such as administer justice in their villages and by superstitious awe in times of old.

It's definitely gibberish when you look closely, but from a distance it looks kind of okayish for a program that learned to speak entirely from scratch, based on a 10k-word essay written by fricking Tacitus.

First part of doing RNN text prediction with TensorFlow, in Swift

#### Broad Strokes

For all intents and purposes, it's about statistics. The question we are trying to solve is either something along the lines of "given an input X, what is the most probable Y?", or along the lines of "given an input X, what is the probability of having Y?"

Of course, simple probability problems have somewhat simple solutions: if you take a game of chess and ask for a next move based on the current board, you can do all the possible moves and sort them based on the probability of having a piece taken off the board, for instance.
If you are designing an autopilot of some kind, you have an "ideal" attitude (a collection of yaw, pitch and roll angles), and you calculate the movements of the stick and pedals that will most likely get you closer to that objective. If your last shot went left of the target, chances are you should go right. Etc, etc, etc.

But the most interesting problems don't have obvious causality. If you have pasta, tomatoes and ground meat in your shopping basket, maybe your next item will be onions, because you're making some kind of bolognese; maybe it will be soap, because that's what you need; maybe it will be milk, because that's the order of the shelves you are passing by.

Machine learning is about taking a whole bunch of hopefully consistent data (even if you don't know for sure that it's consistent), and using it to say "based on this past data, the probabilities for onions, soap and milk are X, Y, and Z, and therefore the most probable is onions".

The data you are basing your predictive model on is really, really important. Maybe your next item is consistent with the layout of the shop. Maybe it is consistent with what other customers got. Maybe it's consistent with your particular habits. Maybe it's consistent with people who are in the same "category" as you (your friends, your socio-economic peers, your cultural peers, ... pick one or more). So you want a lot of data to have a good prediction, but not all the data, because noise (random data) neither reveals a bias (people in the shop tend to visit the shelves in that order) nor exploits one (people following recipes want those ingredients together).

Oh yes, there are biases. Lots of them. Because ML uses past data to predict the future, if the data we use was based on bad practices, recommendations won't be a lot better. There is a branch of machine learning that starts ex nihilo and generates data based on a tournament rather than on actual facts, but it is beyond the scope of this introduction.
Its principle is roughly the same, though. So, to recap:

• We start with a model with random probabilities, and a series of "truths" ( X leads to Y )
• We try with a Z, see what the model predicts
• We compare it to a truth, and fix the probabilities a little bit so that it matches
• Repeat with as many Zs as possible to be fairly confident the prediction isn't too far off

If you didn't before, now you know why ML takes a lot of resources. Depending on the number of possible Xs and Ys and the number of truths, the probability matrix is potentially humongous. Operations (like fixing the probabilities to match reality) on such enormous structures aren't cheap.

If you want a more detail-oriented explanation, with maths and diagrams, you can read my other attempt at explaining how it works.

#### Swift TensorFlow

There are a few contenders in the field of "ML SDK", but one of the best known is TensorFlow, backed by Google. It also happens to have a Swift variant (almost every other ML environment out there is either Python or R). And of course, the documentation is really... lacking, making this article more useful than average along the way.

In their "Why Swift?" motivation piece, the architects make a good case, if a little bit technical, as to why Swift makes a good candidate for ML. The two major takeaways you have to know going in are:

• It's a different build of Swift. You cannot use the one that shipped with Xcode (yet)
• It uses a lot of Python interoperability to work, so some ways of doing things will be a bit alien

The performance is rather good, comparable to or better than the regular Python TensorFlow for the tasks I threw at it, so there's that. But the documentation... My oh my. Let's take an example: Tensor is, as the name of the framework implies, the central feature of the system. Its documentation is here: https://www.tensorflow.org/swift/api_docs/Structs/Tensor

Sometimes, that page greets me in Greek... But hey, why not.
There is little to no way to navigate the hierarchy, other than going to the left side and opening the section (good luck if you don't already know whether it's a class, a protocol or a struct you're looking for), and if you use the search field, it will return pages about... the Python equivalents of the functions you're looking for.

Clearly, this is early in the game, and you are assumed to know how regular TensorFlow works before attempting to do anything with STF. But fret not! I will hold your hand so that you don't need to look at the doc too much.

The tutorials are well written, but don't go very far at all. Oh, and if you use that triple Dense layer setup on more than a toy problem (flower classification based on numbers), your RAM will fill so fast that your computer will have to be rebooted. More on that later.

And, because the center of ML is that "nudge" towards a better probability matrix (also called a Tensor), there is the whole @differentiable thing. We will talk about it later.

A good thing is that Python examples (there are thousands of ML tutorials in Python) work almost out of the box, thanks to the interop.

#### Data Preparation

Which Learning will my Machine do?

I have always thought that text generation was such a funny toy example (if a little bit scary when you think about some of the applications): teach the machine to speak like Shakespeare, and watch it spit some play at you. It's also easy for us to evaluate in terms of what it does and how successful it is. And the model makes sense, which helps when writing a piece on how ML works.

A usual way of doing that is using trigrams. We all know that predicting the next word after a single word is super hard, yet our brains tend to be able to predict the last word of a sentence with ease. So, a common way of teaching the machine is to have it look at 3 words to predict a 4th.
I am hungry -> because, the birds flew -> away, etc.

Of course, for more accurate results, you can extend the number of words in the input, but it means you must have a lot more varied sentence examples.

What we need to do here is assign numbers to these words (because everything is numbers in computers), so that we have a problem like "guess the function f if f(231,444,12)->123, f(111,2,671)->222", which neural networks are pretty good at. So we need data (a corpus), and we need to split it into (trigram)->result pairs.

Now, because we are ultimately dealing with probabilities and rounding, we need the input to be in Float, so that the operations can wriggle the matrices by fractions, and we need the result to be an Int, because we don't want something like "the result is the word between 'hungry' and 'dumb'".

The features (also called inputs) and the labels (also called outputs) have to be stored in two tensors (also called matrices), matching the data we want to train our model on. That's where RAM and processing time enter the arena: the size of the matrices is going to be huge:

• Let's say the book I chose to teach it English has 11148 words in it (it's Tacitus' Germany); that's 11148*3-2 trigrams (33442 lines in my matrices, 4 columns total)
• The way neural networks function, you basically have a function parameter per neuron that gets nudged at each iteration. In this example, I use two 512-parameter layers for somewhat decent results. That means 2 additional matrices of size 33442*512.
• And operations regularly duplicate these matrices, if only for a short period of time. So yes, that's a lot of RAM and processing power.
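The trigram split itself needs nothing from TensorFlow. Here is a hypothetical toy version on a hardcoded sentence, before the word-to-number step:

```swift
// Hypothetical toy version of the (trigram) -> next-word split.
let words = ["I", "am", "hungry", "because", "the", "birds", "flew", "away"]

var features = [[String]]()
var labels = [String]()
for i in 0..<(words.count - 3) {
    features.append([words[i], words[i + 1], words[i + 2]])
    labels.append(words[i + 3])
}
// features[0] == ["I", "am", "hungry"], labels[0] == "because"
```

Every word (except the first three) ends up as a label exactly once, which is where the "roughly three samples per word minus the edges" line count comes from.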
Here is the function that downloads a piece of text and separates it into words:

func loadData(_ url: URL) -> [String] {
let sem = DispatchSemaphore(value: 0)
var result = [String]()
let session = URLSession(configuration: URLSessionConfiguration.default)
// let set = CharacterSet.punctuationCharacters.union(CharacterSet.whitespacesAndNewlines)
let set = CharacterSet.whitespacesAndNewlines
session.dataTask(with: url, completionHandler: { data, response, error in
if let data = data, let text = String(data: data, encoding: .utf8) {
let comps = text.components(separatedBy: set).compactMap { (w) -> String? in
// separate punctuation from the rest
if w.count == 0 { return nil }
else { return w }
}
result += comps
}
sem.signal()
}).resume()
sem.wait()
return result
}

Please note two things: I make it synchronous (I want to wait for the result), and I chose to keep words and punctuation together. You can keep only the words by switching the commented lines, but I find that the output is more interesting with punctuation than without.

Now, we need to set up the word->int and int->word transformations. Because we don't want to look through the whole array of words every time we search for one, there is a dictionary based on the hashes of the words that deals with the former, and because the most common words have better chances to pop up, the array for the vocabulary is sorted. It's probably not optimal, but it helps make things clear, and is fast enough.

func loadVocabulary(_ text: [String]) -> [String] {
var counts = [String:Int]()
for w in text {
let c = counts[w] ?? 0
counts[w] = c + 1
}
let count = counts.sorted(by: { (arg1, arg2) -> Bool in
let (_, value1) = arg1
let (_, value2) = arg2
return value1 > value2
})
return count.map { (arg0) -> String in
let (key, _) = arg0
return key
}
}

func makeHelper(_ vocabulary: [String]) -> [Int:Int] {
var result : [Int:Int] = [:]
vocabulary.enumerated().forEach { (arg0) in
let (offset, element) = arg0
result[element.hash] = offset
}
return result
}

Why not hashValue instead of hash? It turns out that on Linux, which this baby is going to run on, the values are more stable with the latter than the former, according to my tests.

The data we will work on therefore is:

struct TextBatch {
let original: [String]
let vocabulary: [String]
let indexHelper: [Int:Int]
let features : Tensor<Float> // 3 words
let labels : Tensor<Int32> // followed by 1 word
}

We need a way to initialize that struct, and a couple of helper functions to extract some random samples to train our model on, and we're good to go:

extension TextBatch {
public init(from: [String]) {
let v = loadVocabulary(from)
let h = makeHelper(v)
var f : [[Float]] = []
var l : [Int32] = []
for i in 0..<(from.count-3) {
if let w1 = h[from[i].hash],
let w2 = h[from[i+1].hash],
let w3 = h[from[i+2].hash],
let w4 = h[from[i+3].hash] {
f.append([Float(w1), Float(w2), Float(w3)])
l.append(Int32(w4))
}
}
let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 })
let labelsT = Tensor<Int32>(l)

self.init(
original: from,
vocabulary: v,
indexHelper: h,
features: featuresT,
labels: labelsT
)
}

func randomSample(of size: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
var f : [[Float]] = []
var l : [Int32] = []
for i in 0..<(original.count-3) {
if let w1 = indexHelper[original[i].hash],
let w2 = indexHelper[original[i+1].hash],
let w3 = indexHelper[original[i+2].hash],
let w4 = indexHelper[original[i+3].hash] {
f.append([Float(w1), Float(w2), Float(w3)])
l.append(Int32(w4))
}
}

var rf : [[Float]] = []
var rl : [Int32] = []
if size >= l.count || size <= 0 {
let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 })
let labelsT = Tensor<Int32>(l)
return (featuresT, labelsT)
}
var alreadyPicked = Set<Int>()
while alreadyPicked.count < size {
let idx = Int.random(in: 0..<l.count)
if !alreadyPicked.contains(idx) {
rf.append(f[idx])
rl.append(l[idx])
alreadyPicked.update(with: idx)
}
}
let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
let labelsT = Tensor<Int32>(rl)
return (featuresT, labelsT)
}

func randomSample(splits: Int) -> [(features: Tensor<Float>, labels: Tensor<Int32>)] {
var res = [(features: Tensor<Float>, labels: Tensor<Int32>)]()
let size = Int(floor(Double(original.count)/Double(splits)))
var f : [[Float]] = []
var l : [Int32] = []
for i in 0..<(original.count-3) {
if let w1 = indexHelper[original[i].hash],
let w2 = indexHelper[original[i+1].hash],
let w3 = indexHelper[original[i+2].hash],
let w4 = indexHelper[original[i+3].hash] {
f.append([Float(w1), Float(w2), Float(w3)])
l.append(Int32(w4))
}
}

for part in 1...splits {
var rf : [[Float]] = []
var rl : [Int32] = []
var alreadyPicked = Set<Int>()
if size >= l.count || size <= 0 {
let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 })
let labelsT = Tensor<Int32>(l)
return [(featuresT, labelsT)]
}
while alreadyPicked.count < size {
let idx = Int.random(in: 0..<l.count)
if !alreadyPicked.contains(idx) {
rf.append(f[idx])
rl.append(l[idx])
alreadyPicked.update(with: idx)
}
}
let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
let labelsT = Tensor<Int32>(rl)

res.append((featuresT,labelsT))
}
return res
}
}

In the next part, we will see how to set the model up, and train it.

So, I'd like to take x elements out of an array of n elements. I know from combinatorics that there is a precise number of such combinations:

$$\frac{n!}{x!\,(n-x)!}$$

But what are they? There are a few classic algorithms to do that, but most of them are recursive and use an accumulator, which is not exactly Swift's forte, especially if you don't know the right keyword. Cue inout.

So here's the rough idea: every time we select an item in the array, we have only n-1 items left to pick from, and x-1 items left to pick. So it's recursive in nature. But we have an array, which means that we can make every possible combination using the first item in the array first, then start again after removing that first item, and we should never repeat a combination.

Here's an example: let's take 3 elements out of an array of 4

John Paul Ringo George

----------------------

John Paul Ringo
John Paul       George
John      Ringo George
Paul Ringo George

It definitely looks like loops: We do all of John's first, then when we're done we do Paul's, and then we're done because we don't have enough people to start with Ringo.

Another way of looking at it is "I've selected John, now start again with the rest of the list and pick 2", then "I've selected Paul, now start again with the rest of the list and pick 1", then "I've selected Ringo, now start again with the rest of the list and pick 1". When we're done with the lists starting with John, we remove him, start with Paul, and there's only one choice left.

In Swift, because of the extension mechanism, it's easy to generalize to every array, but we still need that recursion, which needs both where we are and what's left in order to work. Then all we need to manage are the special edge cases:

• there is no element in the array (because it's easy to deal with)
• there are fewer elements in the array than we want (ditto)
• there are exactly as many elements in the array as we want (well, duh, there's only one possibility)

So here is the algorithm, with accumulators passed with inout (to modify them in the callee and the caller):

extension Array { // combinatory
fileprivate func recArrangements(len: Int, start: Int, cur: inout [Element], result : inout [[Element]]) {
if len == 0 {
result.append([Element](cur))
} else {
var i = start
while i <= (self.count-len) {
cur[cur.count - len] = self[i]
recArrangements(len: len-1, start: i+1, cur: &cur, result: &result)
i += 1
}
}
}

func arrangements(of number: Int) -> [[Element]]? {
if self.count == 0 { return nil }
else if number > self.count { return nil }
else if number == self.count { return [self] }
var buffer = [Element](repeating: self[0], count: number)
var result = [[Element]]()
recArrangements(len: number, start: 0, cur: &buffer, result: &result)
return result
}
}

Proofs that it works:

> ["John", "Paul", "Ringo", "George"].arrangements(of: 3)
$R0: [[String]]? = 4 values {
[0] = 3 values { [0] = "John" [1] = "Paul" [2] = "Ringo" }
[1] = 3 values { [0] = "John" [1] = "Paul" [2] = "George" }
[2] = 3 values { [0] = "John" [1] = "Ringo" [2] = "George" }
[3] = 3 values { [0] = "Paul" [1] = "Ringo" [2] = "George" }
}

> ["Potassium", "Calcium", "Scandium", "Titanium", "Vanadium", "Chromium", "Manganese", "Iron", "Cobalt", "Nickel", "Copper", "Zinc", "Gallium", "Germanium", "Arsenic", "Selenium", "Bromine", "Krypton"].arrangements(of: 8)?.count
$R1: Int? = 43758

Which fits $$\frac{18!}{8!\,10!}=43758$$

Unfortunately, as with most recursive algorithms, its complexity is fairly horrendous... For x = 3, it is equivalent to 3 nested for loops (the way to write that code is left as an exercise), which means a whopping n³... Then again, combinatorics has a way of being exponential anyway. I wonder if there is a way to be more efficient.
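If all you need is the count (to sanity-check the listing, for instance), nothing has to be materialized at all. Here is a possible helper using the multiplicative form of the formula above (combinationCount is a made-up name, not part of the code shown earlier):

```swift
// Hypothetical helper: count the combinations without materializing them,
// using the multiplicative form of n! / (x! * (n - x)!).
func combinationCount(_ n: Int, choose x: Int) -> Int {
    guard x >= 0 && x <= n else { return 0 }
    var result = 1
    for i in 0..<min(x, n - x) {
        result = result * (n - i) / (i + 1)   // stays an integer at every step
    }
    return result
}
// combinationCount(4, choose: 3) == 4, combinationCount(18, choose: 8) == 43758
```

Multiplying before dividing keeps every intermediate value an integer, because at each step the running product is itself a binomial coefficient.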

From this list, the gist is that most languages can't process 9999999999999999.0 - 9999999999999998.0.

Why do they output 2 when it should be 1? I bet most people who've never done any formal CS (a.k.a. maths and information theory) are super surprised.

Before you read the rest, ask yourself this: if all you have are zeroes and ones, how do you handle infinity?

If we fire up an interpreter that outputs the value when it's typed (like the Swift REPL), we have the beginning of an explanation:

Welcome to Apple Swift version 4.2.1 (swiftlang-1000.11.42 clang-1000.11.45.1). Type :help for assistance.
1> 9999999999999999.0 - 9999999999999998.0
$R0: Double = 2
2> let a = 9999999999999999.0
a: Double = 10000000000000000
3> let b = 9999999999999998.0
b: Double = 9999999999999998
4> a-b
$R1: Double = 2

Whew, it's not that the languages can't handle a simple subtraction; it's just that a is typed as 9999999999999999 but stored as 10000000000000000.

If we used integers, we'd have:

  5> 9999999999999999 - 9999999999999998
$R2: Int = 1

Are the decimal numbers broken? 😱

##### A detour through number representations

Let's look at a byte. This is the fundamental unit of data in a computer and is made of 8 bits, each of which can be 0 or 1. It ranges from 00000000 to 11111111 ( 0x00 to 0xff in hexadecimal, 0 to 255 in decimal; homework as to why and how it works like that due by Monday).

Put like that, I hope it's obvious that the question "yes, but how do I represent the integer 999 on a byte?" is meaningless. You can decide that 00000000 means 990 and count up from there, or you can associate arbitrary values to the 256 possible combinations and make 999 be one of them, but you can't have both the 0 - 255 range and 999. You have a finite number of possible values and that's it.
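Swift's fixed-width integer types make the point directly; a quick illustration:

```swift
// A byte is just 256 possible values; UInt8 makes the limits explicit.
let smallest = UInt8.min            // 0, i.e. 0b00000000
let largest = UInt8.max             // 255, i.e. 0b11111111
// UInt8(999) would trap at runtime: 999 has no representation on 8 bits.
let clamped = UInt8(clamping: 999)  // 255 is the closest we can get
```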

Of course, that's on 8 bits (hence the 256-color palette on old games). On 16, 32, 64 or bigger width memory blocks, you can store up to 2ⁿ different values, and that's it.

##### The problem with decimals

While it's relatively easy to grasp the concept of infinity by looking at "how high can I count?", it's less intuitive to notice that there are even more numbers between 0 and 1 than there are integers.

So, if we have a finite number of possible values, how do we decide which ones make the cut when talking about decimal parts? The smallest? The most common? Again, as a stupid example, on 8 bits:

• maybe we need 0.01 ... 0.99 because we're doing accounting stuff
• maybe we need 0.015, 0.025,..., 0.995 for rounding reasons
• We'll just encode the numeric part on 8 bits ( 0 - 255 ), and the decimal part as above

But that's already 99+99 values taken up. That leaves us 57 possible values for the rest of infinity. And that's not even mentioning the totally arbitrary nature of the selection. This way of representing numbers is historically the first one and is called "fixed" representation. There are many ways of choosing how the decimal part behaves, and a lot of headaches when coding how the simple operations work, not to mention complex ones like square roots, powers and logs.
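A fixed representation is easy to sketch, and it's essentially what accounting software still does: store hundredths in an integer. Fixed100 below is a made-up type for illustration, not a standard one:

```swift
// Hypothetical fixed-point sketch: store hundredths in an integer,
// so 0.01 ... 0.99 are exact and arithmetic is plain integer maths.
struct Fixed100 {
    var hundredths: Int                // the only stored state: counts of 0.01
    init(hundredths: Int) { self.hundredths = hundredths }
    init(_ value: Double) { self.init(hundredths: Int((value * 100).rounded())) }
    var asDouble: Double { return Double(hundredths) / 100 }
    static func + (a: Fixed100, b: Fixed100) -> Fixed100 {
        return Fixed100(hundredths: a.hundredths + b.hundredths)  // exact
    }
}

let total = Fixed100(0.10) + Fixed100(0.20)
// total.hundredths == 30: no 0.30000000000000004 surprise
```

The price, as described above, is that everything outside the chosen grid simply doesn't exist.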

##### Floats (IEEE 754)

To make it simple for the chips that perform the actual calculations, floating-point numbers (that's their name) have been defined using two parameters:

• an integer n
• a power (of base b) p

Such that we can have n x bᵖ; for instance, 15.3865 is 153865 x 10⁻⁴. The question is how many bits we can use for the n and how many for the p.

The standard is to use 1 bit for the sign (+ or -), 23 bits for n, and 8 for p, which uses 32 bits total (we like powers of two), with base 2, and n actually standing for 1.n. That gives us a range of ~8 million values, and powers of 2 from -126 to +127, due to some special cases like infinity and NotANumber (NaN).

$$(-1~or~1)(2^{[-126...127]})(1.[one~of~the~8~million~values])$$
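Swift lets us take a Float apart along exactly that 1 + 8 + 23 layout, through the standard bitPattern property; a small illustration:

```swift
// Taking a Float apart along the 1 + 8 + 23 bit layout described above.
let x: Float = 0.1
let bits = x.bitPattern                         // the raw 32 bits, as a UInt32
let sign = bits >> 31                           // 1 bit of sign (0 means positive)
let biasedExponent = Int((bits >> 23) & 0xFF)   // 8 bits, stored with a +127 bias
let mantissa = bits & 0x7FFFFF                  // 23 bits: the ".n" of the implicit "1.n"

// The standard library also exposes the decoded pieces directly:
// x.exponent == -4, and x.significand is the "1.6000000..." part.
```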

In theory, we have numbers from roughly 10⁻⁴⁵ to 10³⁸ in magnitude, but some numbers can't be represented in that form. For instance, if we look at the largest float smaller than 1, it's 0.9999999404. Anything between that and 1 has to be rounded. Again, infinity can't be represented by a finite number of bits.
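The standard library can show that edge directly; a quick illustration (the variable names are mine):

```swift
// The float neighbourhood of 1.0, straight from the standard library.
let belowOne = Float(1.0).nextDown   // the largest float smaller than 1 (~0.99999994)
let gap = Float(1.0).ulp             // the size of the hole just above 1 (~1.19e-07)
let rounded = Float(0.999999999)     // no float exists there, so it rounds to exactly 1.0
```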

##### Doubles

The floats allow for "easy" calculus (by the computer at least) and are "good enough", with a precision of 7.2 decimal places on average. So when we needed more precision, someone said "hey, let's use 64 bits instead of 32!". The only thing that changes is that n now uses 52 bits and p 11 bits.

Incidentally, double carries more a meaning of double size than of double precision, even though the number of decimal places does jump to 15.9 on average.

We now have 2³² times as many values to play with, and that does fill some annoying gaps in the infinity, but not all. Famously (and annoyingly), 0.1 doesn't work at any precision size because of the base 2. In 32-bit float, it's stored as 0.100000001490116119384765625, like this:

(1)(2⁻⁴)(1.600000023841858)
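A quick way to see it from Swift, using only the standard types:

```swift
// 0.1 cannot be hit exactly in base 2, at any precision.
let f: Float = 0.1
let d: Double = 0.1
// Promoting the float shows the value that was actually stored,
// not the value that was typed.
print(Double(f))        // ~0.10000000149011612
print(Double(f) == d)   // false: the two precisions round 1/10 differently
```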

And beyond double size (aka doubles), we have quadruple size (aka quads), with 15 and 112 bits, for a total of 128 bits.

##### Back to our problem

Our value is 9999999999999999.0. The closest value encodable in double-size floating point is actually 10000000000000000, which should now make some kind of sense. It is confirmed by Swift when separating the two sides of the calculation, too:

2> let a = 9999999999999999.0
a: Double = 10000000000000000

Our big brain, so good at maths, knows that there is a difference between these two values, and so does the computer. It's just that using doubles, it can't store it. Using floats, a would be rounded to 10000000272564224, which isn't exactly better. Quads aren't used regularly yet, so no luck there.
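The whole head-scratcher can be reproduced in four lines of Swift:

```swift
// The whole head-scratcher, reproduced with the standard types.
let a = 9999999999999999.0               // inferred Double, stored as 10000000000000000
let b = 9999999999999998.0               // representable exactly, stored as typed
print(a - b)                             // 2.0, because a was already off by 1
print(9999999999999999 - 9999999999999998)   // 1: integers in this range don't round
```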

It's funny, because this is an operation that we puny humans can do very easily, even those humans who say they suck at maths, and yet those touted computers with their billions of math operations per second can't work it out. Fair enough.

The kicker is, there is a literal infinity of examples such as this one, because trying to represent infinity in a finite number of digits is impossible.