#### Back in the olden days...

Before the (oh so annoying) chatbots, before conversational machine-learning, before all of that, there was... ELIZA.

It is a weird little piece of computer history that nerds like me enjoy immensely, but that is fairly unknown to the public.

If I ask random people when they think chatting with a bot became a Thing, they tend to respond "the 90s" or later (usually roughly ten years after they were born, for weird psychological reasons).

But back in the 60s, the Turing Test was a big thing indeed. Of course, nowadays, we know that this test, as it was envisioned, isn't that difficult to pass, but back then it was total fiction.

Enter Joseph Weizenbaum, working at MIT in the mid 60s, who decided to simplify the problem of random conversation by using a Jedi mind trick: the program would be a stern doctor, not trying to ingratiate itself with the user. We talk to that kind of terse, no-nonsense person often enough that it could reasonably be assumed that it wouldn't faze a normal person.

It's not exactly amicable, but it was convincing enough at the time for people to project some personality onto it. It became a real Frankenstein story: Weizenbaum was trying to show how stupid it was, and how flawed the very concept of man-machine conversation was, but users kept talking to it, sometimes even confiding in it as they would in a doctor. And the more Weizenbaum tried to show that it was a useless piece of junk with the same amount of intelligence as your toaster, the more people became convinced this was going to revolutionize the world of psychiatry.

Weizenbaum even felt compelled to write a book about the limitations of computing, and the capacity of the human brain to anthropomorphise the things it interacts with, as if to say that to most people, everything is partly human-like or has human-analogue intentions.

He is considered to be one of the fathers of artificial intelligence, despite his attempts at explaining to everyone who would listen that the term was something of a contradiction.

#### Design

ELIZA was written in SLIP, a language that worked as an extension of Fortran and later ALGOL, and was designed to facilitate the use of nested lists (for instance (x1,x2,(y1,y2,y3),x3,x4)), which was something of a hard-ish thing to do back in the day.

By modern standards, the program itself is fairly simplistic:

• the user types an input
• the input is parsed for "keywords" that ELIZA knows about (e.g. I am, computer, I believe I, etc), which are ranked more or less arbitrarily
• depending on that "keyphrase", a variety of options are available like I don't understand that or Do computers frighten you?

Where ELIZA goes further than a standard decision tree is that it has access to references: it tries to take parts of the input and mix them into its answer, for example I am X -> Why are you X?

It does that through what would later become regular expression groups, transforming certain words or expressions into their respective counterparts.

For instance, something like I am like my father would be matched to ("I am ", "like my father"), then the response would be ("Why are you X?", "like my father"), then transformed to ("Why are you X?", "like your father"), then finally assembled into Why are you like your father?

Individually, both these steps are simple decompositions and substitutions. Using sed and regular expressions, we would write something like:

```shell
$ echo "I am like my father" | sed -n "s/I am \(.*\)/Why are you \1?/p"
Why are you like my father?
$ echo "I am like my father" | sed -n "s/I am \(.*\)/Why are you \1?/p" | sed -n "s/my/your/p"
Why are you like your father?
```

Of course, ELIZA has a long list of my/your, me/you, ..., transformations, and multiple possibilities for each keyword, which, with a dash of randomness, allows the program to respond differently if you say the same thing twice.
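To make the pipeline concrete, here is a minimal JavaScript sketch of that decompose/reflect/reassemble dance. The keyword pattern and the reflection table are invented for illustration; they are not Weizenbaum's actual script.

```javascript
// Toy ELIZA-style rewrite: decompose on a keyword, reflect pronouns, reassemble.
// The keyword and reflection tables here are illustrative, not ELIZA's real data.
const reflections = { my: "your", your: "my", i: "you", am: "are", me: "you" };

function reflect(fragment) {
  return fragment
    .split(/\s+/)
    .map((w) => reflections[w.toLowerCase()] ?? w)
    .join(" ");
}

function respond(input) {
  // One hardcoded keyword rule: "I am X" -> "Why are you X?"
  const match = input.match(/^I am (.*)$/i);
  if (match) return `Why are you ${reflect(match[1])}?`;
  return "Please go on."; // fallback, like ELIZA's contentless prompts
}

console.log(respond("I am like my father")); // Why are you like your father?
```

The real program keeps a ranked list of such rules and picks among several canned reassembly templates, but each individual step is no more complicated than this.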

But all in all, that's it. ELIZA is a very very simple program, from which emerges a complex behavior that a lot of people back then found spookily humanoid.

#### Taking a detour through (gasp) JS

One of the available "modern" implementations of ELIZA is in JavaScript, as are most things. Now, those who know me figure out fairly quickly that I have very little love for that language. But having a distaste for it doesn't mean I don't need to write code in it every now and again, and I had heard so much about the bafflement people feel when using regular expressions in JS that I had to try it myself. After all, two birds, one stone, etc... learn a feature of JS I do not know, and resurrect an old friend.

As I said before, regular expressions (or regexs, or regexps) are relatively easy to understand, but a lot of people find them difficult to write. I'll just give you a couple of simple examples to get in the mood:

```
[A-Za-z]+;[A-Za-z]+
```

This will match any text that has two words (whatever the case of the letters) separated by a semicolon. Note the distinction between uppercase and lowercase.
Basically, it says that I want to find a series of letters of length at least 1 (+), followed by ;, followed by another series of letters of length at least 1.

```
.*ish
```

The dot (.) is a special character that means "any character", and * means "0 or more", so here I want to find anything ending in "ish"

Now, when you do search and replace (as is the case with ELIZA), or at least search and extract, you might want to know what is in this .* or [A-Za-z]+. To do that, you use groups:

```
(.*)ish
```

This will match the same strings of letters, but by putting it in parenthesiseseseseseseseseses (parenthesiiiiiiiiiiiii? damn. anyway), you instruct the program to remember it. It is then stored in variables with the very imaginative names of \1, \2, etc...

So in the above case, if I apply that regexp to "easyish", \1 will contain "easy"

Now, because you have all these special characters like the dot and parentheses and whatnot, you need to differentiate between the actual "." and "any character". We escape those special characters with \.

```
([A-Za-z]+)\.([A-Za-z]+)
```

This will match any two words with upper and lower case letters joined by a dot (and not any character, as would be the case if I didn't use \), and remember them in \1 and \2

Of course, there are a lot of crazy special cases and special characters, so, yes, regexps can be really hard to build. For reference, the Internet found me a regexp that looks for email addresses:

```
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
```

Yea... Moving on.

Now, let's talk about JavaScript's implementation of regular expressions. Spoiler alert: it's weird if you have used regexps in any language other than Perl. That's right, JS uses the Perl semantics.

In most languages, regular expressions are represented by strings. It is a tradeoff: you can manipulate them like any other string (get their length, replace portions of them, build them out of string variables, etc), but it makes escaping nightmarish:

```
"^\\s*\\*\\s*(\\S)"
```

Because \ escapes the character that follows, you need to escape the escaper to keep it around: if you want \. as part of your regexp, more often than not, you need to type "\\." in your code. It's quite a drag, but the upside is that they work like any other string.

Now, in JS (and Perl), regexps are a totally different type. They are not between quotes, but between slashes, e.g.:

```javascript
/^(([^<>()\\.,;:\s@"]+(\.[^<>()\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/
```

On one hand, you don't have to escape the slashes anymore and they more closely resemble the actual regexp; on the other hand, they are harder to compose or build programmatically. As I said, it's a different tradeoff, and to each their own.

Where it gets bonkers is how you use them. Because the class system is... what it is, and because there is no operator overloading, you can't really get the syntactic elegance of Perl, so it's kind of a bastard system where you might type something like:

```javascript
var myRe = /d(b+)d/;
var isOK = "cdbbdbsbz".match(myRe); // not null, because "dbbd" is in the string
```

match and matchAll aren't too bad, in the sense that they return the list of matching substrings (here, only one), or null, so the result does have some kind of meaning. The problem arises when you need to use the dreaded exec function in order to use the regexp groups, or when you use the g flag in your regexp.

The returned thing (I refuse to call it an object) is both an array and a hashmap/object at the same time. In result[0] you have the matched substring (here it would be "dbbd"), and in result[X] you have the \X equivalents (here \1 would be "bb", so that's what you find in result[1]). So far so not too bad. But this array also behaves like an object: result.index gives you the index of "the match", which is probably the first one. Not to mention that you use string.match(regex) but regex.exec(string):

```javascript
const text = 'cdbbdbsbz';
const regex = /d(b+)d/g;
const found = regex.exec(text);

console.log(found);          // Array ["dbbd", "bb"]
console.log(found.index);    // 1
console.log(found["index"]); // 1
```

So, the result is a nullable array that sometimes works as an object. I'll let that sink in for a bit.
#### This is the end

Once I got the equivalence down pat, it was just a matter of copying the data and rewriting a few functions, and ELIZA was back, as a library, so that I could use it in CLI tools, iOS apps, or MacOS apps. When I'm done fixing the edge cases and tinkering with the ranking system, I might even publish it. In the meantime, ELIZA and I are rekindling an old friendship on my phone!

I know and have fun as often as I can with Florent Pillet, another member of the tribe of "dinosaurs" still kicking around. I really like one of his projects that contributed to his notoriety: NSLogger. Logging has always been a pain in the neck, and this tool provided us all with a way to get it done efficiently and properly. The first commit on the GitHub repo is from 2010, and I have a strong suspicion it's been in production since before that in one form or another.

Anyhoo, I like Florent, I like NSLogger, but I hate what CocoaPods (and to a lesser extent Carthage) do to my projects. It's too brittle, and I strongly dislike things that mess around with the extremely complicated XML that is a pbxproj. They do, however, serve an admirable purpose: managing dependencies in a way that doesn't require me to use git submodules in every one of my projects. So, I rarely use NSLogger. SHAME! SHAME! <insert your own meme here>

With the advent of (and subsequent needed updates to) Swift Package Manager, we now have an official way of managing and supporting dependencies, but it has its own quirks that apparently make it hard to "SPM" older projects. Let's see what we can do about NSLogger.

##### Step 1: The Project Structure

SPM can't mix Obj-C code and Swift code.
It's always been pretty hacky anyways, with the bridging headers and the weird steps hidden by the toolchain, so we need to make it explicit:

• One target for the Objective-C code (imaginatively named NSLoggerLibObjC)
• One target for the Swift code (NSLogger) that depends on NSLoggerLibObjC
• One product that builds the Swift target

One of the problems is that all that code is mixed in the folders, because Xcode doesn't care about file placement. SPM, on the other hand, does. So, let's use and abuse the path and sources parameters of the target. The first one provides the root where we look for files to compile, and the second one lists the files to be compiled:

• LoggerClient.m for NSLoggerLibObjC
• NSLogger.swift for NSLogger

Done. Right? Not quite.

##### Step 2: Compilation Quirks

The Obj-C lib requires ARC to be disabled. Easy to do in Xcode, a bit harder in SPM. We need to pass the -fno-objc-arc flag to the compiler. SPM doesn't make it easy or obvious to do that, for a variety of reasons, but I guess mostly because you shouldn't pass compiler flags at all in an ideal world. But (especially in 2020), looking at the world, ideal it ain't.

We have to use the (not so aptly named) cSettings option of the target, and use the very scary CSetting.unsafeFlags parameter for that option. Why is it unsafe, you might ask? Weeeeeeeeell. It's companies' usual way of telling you "you're on your own with this one". I'm fine with that.

Another compilation quirk is that Obj-C code relies (like its ancestor, C) on the use of header files to make your code usable as a dependency. Again, because Xcode and SPM treat the file structure very differently, just saying that every header should be included in the resulting library is a bad idea: the search is recursive and, in this particular case, would result in having specific iOS or MacOS (yes, capitalized, because sod that change) test headers exposed as well.
In the end, I had to make the difficult choice of doing something super ugly:

• move the public headers into their own directory
• use symlinks to their old place so as not to break the other parts of the project

If anyone has a better option that's not heavily more disruptive to the organization of the project, I'm all ears.

##### Step 3: Final Assembly

So we have the Swift target that depends on the Obj-C one. Fine. But how do we use that dependency? "Easy", some will exclaim (a bit too rapidly), "you just import the lib in the swift file!" Yes, but then it breaks the other projects, which, again, we don't want to do. Minimal impact changes. Legacy. Friend.

So we need a preprocessing macro, like, say, SPMBuild, which would indicate we're building with SPM rather than Xcode. Sadly, this doesn't exist, and given the rate of change of the toolchain, I don't want to rely too heavily on the badly documented Xcode preprocessor macros that would allow me to detect a build through the IDE.

Thankfully, in the same vein as cSettings, we have a swiftSettings parameter to our target, which supports SwiftSetting.define options. Great, so I'll define a macro, and test its existence in the swift file before importing the Obj-C part of the project.

One last thing I stumbled upon and used despite its shady nature: there is an undocumented decorator for import named @_exported which seems extraneous here, but has some interesting properties: it kinda sorta exposes what you import as part of the current module, flattening the dependency graph. To be honest, I didn't know about it, it amused me, so I included it.

##### Wrap Up

In order to make it work directly from the repo, rather than locally, I also had to provide a version number. I chose to go with the next patch number instead of aggrandizing myself with a minor or even a major version.
Hopefully, these changes don't impact the current project at all, allow me to use it in a way I like better (and that is officially supported), and I hope Florent will not murder me for all of that. He might even decide to accept my pull request. We'll see. In the meantime, you can find all the changes above and a usable SPM package in my fork.

###### TL;DR: Grab it here: Github repo

#### Unit testing is painful amirite?

Writing good tests for your code very often means spending twice as much time coding them as on the things you test themselves. It is good practice, though, to verify as much as possible that the code you write is valid, especially if that code is going to be public or included in someone else's work.

In my workflow, I insist on the notion of ownership. The bottom line for me is this: if there are several people on a project, I want clearly defined ownership. It's not that I won't fix a bug in someone else's code, just that they own it and therefore have to have a reliable way of testing that my fix works.

Tests solve part of that problem. My code, my tests. If you fix my code, run my tests, I'm fairly confident that you didn't wreck the whole thing. And that I won't have to spend a couple of hours figuring out what it is that you did.

This is a very very very light constraint when you compare it to methodologies like TDD, but it's a required minimum for me. Plus, it's not that painful, except...

#### Testing every case

In my personal opinion, the tests that are hardest to do right are the ones that have a very large input range, with a few failure/continuity points.
If, for instance, and completely randomly, of course, you had an application where the tilt of the phone changes the state of the app (locked/unlocked, depending on whether the phone is lying flat-ish on the table or not):

• from -20º to 20º the app is locked
• from 160º to 200º the app is locked
• the rest of the time it's not locked
• all of that modulo 360, of course

So you have a function that takes the current pitch angle, and returns whether we should lock or not:

```swift
func pitchLock(_ angle: Double) -> Bool {
    // ...
}
```

Does it work? Does it work modulo 360? What would a unit test for that function even look like? A for loop?

I have been looking for a way to do that kind of test for a while, which is why I published HoledRange (now Domains 😇) a while back, as part of my hacks. What I wanted is to write my tests kind of like this (invalid code on so many levels):

```swift
for x in [-1000.0...1000.0].randomSelection {
    let unitCircleAngle = x%360.0
    if unitCircleAngle >= 340 || unitCircleAngle <= 20 {
        XCTAssert(pitchLock(x))
    } else if unitCircleAngle >= 160 && unitCircleAngle <= 200 {
        XCTAssert(pitchLock(x))
    } else {
        XCTAssertFalse(pitchLock(x))
    }
}
```

This way of testing, while vaguely valid, leaves so many things flaky:

• how many elements in the random selection?
• how can we make certain values untestable (because we address them somewhere else, for instance)?
• what a lot of boilerplate if I have multiple functions to test on the same range of values
• I can't reuse the same value for multiple tests to check function chains

#### Function builders

I have been fascinated with @_functionBuilder ever since it was announced. While I don't feel enthusiastic about SwiftUI (in french), that way to build elements out of blocks is something I have wanted for years. Making them is a harrowing experience the first time, but in the end it works!
What I wanted to use as syntax is something like this:

```swift
func myPlus(_ a: Int, _ b: Int) -> Int

DomainTests<Int> {
    Domain(-10000...10000)
    1000000
    Test { (a: Int) in
        XCTAssert(myPlus(a, 1) == a+1, "Problem with value \(a)")
        XCTAssert(myPlus(1, a) == a+1, "Problem with value \(a)")
    }
    Test { (a: Int) in
        let random = Int.random(in: -10000...10000)
        XCTAssert(myPlus(a, random) == a+random, "Problem with value \(a)")
        XCTAssert(myPlus(random, a) == a+random, "Problem with value \(a)")
    }
}.random()
```

This particular DomainTests runs 1000000 times over D = [-10000; 10000] in a random fashion. Note the Test builder that takes a function with a parameter that will be in the domain, and the definition that lets you set both the test domain (mandatory) and the number of random iterations (optional).

If you want to test every single value in a domain, the bounds need to be Strideable, i.e. usable in a for-loop:

```swift
DomainTests<Int> {
    Domain(-10000...10000)
    Test { (a: Int) in
        XCTAssert(myPlus(a, 1) == a+1, "Problem with value \(a)")
        XCTAssert(myPlus(1, a) == a+1, "Problem with value \(a)")
    }
    Test { (a: Int) in
        let random = Int.random(in: -10000...10000)
        XCTAssert(myPlus(a, random) == a+random, "Problem with value \(a)")
        XCTAssert(myPlus(random, a) == a+random, "Problem with value \(a)")
    }
}.full()
```

#### Conclusion

A couple of hard-working days plus a healthy dose of using that framework personally means this should be ready-ish for production. If you are a maths-oriented dev and shiver at the idea of untested domains, this is for you 😬

This is the last part of a 3-part series. In part 1, I tried to make sense of how it works and what we are trying to achieve, and in part 2, we set up the training loop.

#### Model Predictions

We have a trained model. Now what? Remember, a model is a series of giant matrices that takes an input like you trained it on, and spits out the list of probabilities associated with the outputs you trained it on.
So all you have to do is feed it a new input and see what it tells you:

```swift
let input = [1.0, 179.0, 115.0]
let unlabeled : Tensor<Float> = Tensor<Float>(shape: [1, 3], scalars: input)
let predictions = model(unlabeled)
let logits = predictions[0]
let classIdx = logits.argmax().scalar! // we take only the best guess
print(classIdx)
```

17

Cool. Cool, cool. What?

Models deal with numbers. I am the one who assigned numbers to words to train the model on, so I need a translation layer. That's why I kept my contents structure around: I need it for its vocabulary map. The real code:

```swift
let w1 = "on"
let w2 = "flocks"
let w3 = "settlement"
var indices = [w1, w2, w3].map { Float(contents.indexHelper[$0.hash] ?? 0) }

var wordsToPredict = 50
var sentence = "\(w1) \(w2) \(w3)"

while wordsToPredict >= 0 {
    let unlabeled : Tensor<Float> = Tensor<Float>(shape: [1, 3], scalars: indices)
    let predictions = model(unlabeled)
    for i in 0..<predictions.shape[0] {
        let logits = predictions[i]
        let classIdx = logits.argmax().scalar!
        let word = contents.vocabulary[Int(classIdx)]
        sentence += " \(word)"

        indices.append(Float(classIdx))
        indices.remove(at: 0)
        wordsToPredict -= 1
    }
}

print(sentence)
```

on flocks settlement or their enter the earth; their only hope in their arrows, which for want of it, with a thorn. and distinction of their nature, that in the same yoke are also chosen their chiefs or rulers, such as administer justice in their villages and by superstitious awe in times of old.

Notice how I remove the first input and add the one the model predicted at the end to keep the loop running.

Seeing that, it kind of makes you think about the suggestions game when you send text messages eh? 😁

#### Model Serialization

Training a model takes a long time. You don't want a multi-hour launch time on your program every time you want a prediction, and maybe you even want to keep updating the model every now and then. So we need a way to store it and load it.

Thankfully, tensors are just matrices, so it's easy to store an array of arrays of floats; we've been doing that forever. They are even Codable out of the box.

In my particular case, the model itself needs to remember a few things to be recreated:

• the number of inputs and hidden nodes, in order to recreate the Reshape and LSTMCell layers
• the internal probability matrices of both RNNs
• the weights and biases correction matrices

Because they are Codable, any regular Swift encoder will work, but I know some of you will want to see the actual matrices, so I use JSON. It is not the most time- or space-efficient option, it does not come with a way to validate it, and JSON is an all-around awful storage format, but it makes a few things easy.

```swift
extension TextModel { // serialization
    struct TextModelParams : Codable {
        var inputs : Int
        var hidden : Int
        var rnn1w : Tensor<Float>
        var rnn1b : Tensor<Float>
        var rnn2w : Tensor<Float>
        var rnn2b : Tensor<Float>
        var weights : Tensor<Float>
        var biases : Tensor<Float>
    }

    func serializedParameters() throws -> Data {
        return try JSONEncoder().encode(TextModelParams(
            inputs: self.inputs,
            hidden: self.hidden,
            rnn1w: self.rnn1.cell.fusedWeight,
            rnn1b: self.rnn1.cell.fusedBias,
            rnn2w: self.rnn2.cell.fusedWeight,
            rnn2b: self.rnn2.cell.fusedBias, // rnn2, not rnn1
            weights: self.weightsOut,
            biases: self.biasesOut))
    }

    struct TextModelSerializationError : Error { }

    init(_ serialized: Data) throws {
        guard let params = try? JSONDecoder().decode(TextModelParams.self, from: serialized) else { throw TextModelSerializationError() }

        inputs = params.inputs
        hidden = params.hidden
        reshape = Reshape<Float>([-1, inputs])

        var lstm1 = LSTMCell<Float>(inputSize: 1, hiddenSize: hidden)
        lstm1.fusedWeight = params.rnn1w
        lstm1.fusedBias = params.rnn1b
        var lstm2 = LSTMCell<Float>(inputSize: hidden, hiddenSize: hidden)
        lstm2.fusedWeight = params.rnn2w
        lstm2.fusedBias = params.rnn2b

        rnn1 = RNN(lstm1)
        rnn2 = RNN(lstm2)

        weightsOut = params.weights
        biasesOut = params.biases
        correction = weightsOut+biasesOut
    }
}
```

My resulting JSON file is around 70MB (25 when bzipped), so not too bad.

When you serialize your model, remember to serialize the vocabulary mappings as well! Otherwise, you will lose the word <-> int translation layer.

#### That's all, folks!

This was a quick and dirty intro to TensorFlow for some, Swift for others, and Swift for TensorFlow for most.

It definitely is a highly specialized and quite brittle piece of software, but it's a good conversation piece next time you hear that ML is going to take over the world.

Feel free to drop me comments or questions or corrections on Twitter!

This is the second part of a series. If you haven't, you should read part 1...

#### Model Preparation

The text I trained the model on is available on Project Gutenberg. Why this one? Why not?

It has a fairly varied vocabulary and a consistency of grammar and phrase structures that should trigger the model. One of the main problems of picking the wrong corpus is that it leads to cycles in the prediction with the most common words, e.g. "and the most and the most and the most and the" because it's the pattern that you see most in the text. Tacitus, at least, should not have such repetitive turns of phrase. And it's interesting in and of itself, even though it's a bit racist, or more accurately, elitist. 😂

One of the difficult decisions is choosing the type of network we will be trying to train. I tend to have fairly decent results with RNNs on that category of problems, so that's what I'll use. The types and sizes of these matrices are wayyyyy beyond the scope of this piece, but RNNs tend to be decent generalists. Two RNN/LSTM layers of 512 hidden nodes will give me enough flexibility for the task and good accuracy.

What are those and how do they work? You can do a deep dive on LSTM and RNN on Wikipedia, but the short version is, they work well with sequences because the order of the input is, in and of itself, one of the features they deal with. They are recommended for handwriting recognition, speech recognition, or pattern analysis.

Why two layers? The way you "nudge" parameters in the training phase means that you should have as many layers as you think there are orders of things in your dataset. In the case of text pattern recognition, you can say that what matters is the first order of recognition (say, purely statistical "if this word then this word") or you can add a second order where you try to identify words that tend to have similar roles in the structure (e.g. subject verb object) and take that into account as well. Higher orders than that, in this particular instance, have very little meaning unless you are dealing with, say, a multilingual analysis.

That's totally malarkey when you look at the actual equations, but it helps to see it that way. Remember that you deal with probabilities, and that the reasoning the machine will learn is completely alien to us. By incorporating orders in the model, you make a suggestion to the algorithm, but you can't guarantee that it will take that route. It makes me feel better, so I use it.

Speaking of layers, it is another one of these metaphors that help us get a handle on things, by organizing our code and the way the algorithm treats the data.

You have an input, it will go through a first layer of probabilities, then a second layer will take the output of the first one, and apply its probabilities, and then you have an output.

Let's look at the actual contents of these things:

• Input is a list of trigrams associated with a word ((borrowing a warrant) -> from, (his father Laertes) -> added, etc.)
• The first layer has a single input (the trigram), and a function with 512 tweakable parameters to output the label
• The second layer is trickier: it takes the 512 parameters of the first layer, and has 512 tweakable parameters of its own, to deal with the "higher order" of the data

It sounds weird, but it works, trust me for now, you'll experiment later.

The very first step is "reshaping" the trigrams so that LSTM can deal with it. We basically turn the matrices around and chunk them so that they are fed to the model as single inputs, 3 of them, in this order. It is actually a layer of its own called Reshape.

And finally, we need to write that using this model requires these steps:

• reshape
• rnn1
• rnn2
• get something usable out of it

```swift
struct TextModel : Layer {
    @noDerivative var inputs : Int
    @noDerivative var hidden : Int
    var reshape : Reshape<Float>

    var rnn1 : RNN<LSTMCell<Float>>
    var rnn2 : RNN<LSTMCell<Float>>

    var weightsOut : Tensor<Float> {
        didSet { correction = weightsOut+biasesOut }
    }
    var biasesOut : Tensor<Float> {
        didSet { correction = weightsOut+biasesOut }
    }
    fileprivate var correction: Tensor<Float>

    init(input: Int, hidden: Int, output: Int, weights: Tensor<Float>, biases: Tensor<Float>) {
        inputs = input
        self.hidden = hidden
        reshape = Reshape<Float>([-1, input])

        let lstm1 = LSTMCell<Float>(inputSize: 1, hiddenSize: hidden)
        let lstm2 = LSTMCell<Float>(inputSize: hidden, hiddenSize: hidden)
        rnn1 = RNN(lstm1)
        rnn2 = RNN(lstm2)

        weightsOut = weights
        biasesOut = biases
        correction = weights+biases
    }

    @differentiable
    func runThrough(_ input: Tensor<Float>) -> Tensor<Float> {
        let reshaped = reshape.callAsFunction(input).split(count: inputs, alongAxis: 1)
        let step1 = rnn1.callAsFunction(reshaped).differentiableMap({ $0.cell })
        let step2 = rnn2.callAsFunction(step1).differentiableMap({ $0.cell })
        let last = withoutDerivative(at: step2[0])
        let red = step2.differentiableReduce(last, { (p,e) -> Tensor<Float> in return e })
        return red
    }

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let step2out = runThrough(input)
        let step3 = matmul(step2out, correction)
        return step3
    }
}
```

The RNN/LSTM layers have been talked about, but what are these two functions?

callAsFunction is the only one needed. I have just decided to split the algorithm in two: the part where I "just" pass through layers, and the part where I format the output. Everything in runThrough could be written at the top of callAsFunction.

We follow the steps outlined previously, it all seems logical, even if the details aren't quite clear yet.

What is it with the @noDerivative and @differentiable annotations?

Because we are dealing with a structure (model, layer, etc...) that not only should but will be adjusted over time, it is a way to tell the system which parts are important to that adjustment:

• all properties except those marked as non-derivative will potentially be nudged, so it makes sense to mark the number of inputs as immutable, and the rest as "nudgeable"
• all the functions that calculate something that will be used in the "nudging" need to have specific maths properties that make the change non-random. We need to know where we are and where we were going. We need a position, and a speed, we need a value and its derivative

Ugh, maths.

Yeah.

I am obviously oversimplifying everything to avoid scaring away everyone from the get go, but the idea should make sense if you look at it this way:

• Let's take a blind man trying to shoot an arrow at a target
• You ask them to shoot and then you'll correct them based on where the arrow lands
• It hits the far left of the target
• You tell them to nudge the aim to the right
• The problem is that "more right" isn't enough information... You need to tell them to aim to the right a little (a new position, and some information useful for later, you'll see)
• The arrow lands slightly to the right of the center
• You tell the archer to aim to the left, but less than the movement they just made to the right

Two pieces of information: one relative to a direction, and one relative to the rate of change. Another name for the rate of change is the derivative.

Standard examples of derivatives are speed relative to position (we are here, now we are there, and finally we are over there; the rate of change slowed, so the next position won't be as far from this one as this one was from the previous one), or acceleration relative to speed (when moving, if your speed goes up and up and up, you have a positive rate of change: you are accelerating).

That's why passing through a layer should preserve the two: the actual values, and the speed at which we are changing them. Hence the @differentiable annotation.
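The archer loop above can be sketched in a few lines of plain Swift. This is a toy one-dimensional version (names and numbers are mine, not the article's): the "target" is the minimum of f(x) = (x − 3)², and the derivative tells us both which way to nudge and how hard.

```swift
// f measures how far off the mark we are; df is its derivative,
// i.e. the slope telling us which direction makes things worse.
func f(_ x: Double) -> Double { (x - 3) * (x - 3) }
func df(_ x: Double) -> Double { 2 * (x - 3) }

var aim = 0.0            // the archer starts way off to the left
let learningRate = 0.1   // how gently we correct at each step

for _ in 1...100 {
    // nudge opposite to the slope: big slope, big correction;
    // as we near the target the slope shrinks, and so do the nudges
    aim -= learningRate * df(aim)
}
print(aim)   // ends up very close to 3.0
```

Each step uses exactly the two pieces of information from the metaphor: where we are (the value) and how fast we're off (the derivative).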

(NB for all you specialists in the field reading that piece... yes I know. I'm trying to make things more palatable)

"But wait", say the most eagle-eyed among you, "I can see a withoutDerivative in that code!"

Yes. RNN is peculiar in that it doesn't try to coerce the dimensions of its results: it spits out all the possible variants it has calculated. But in practice, we need only the last one. Taking one possible outcome out of many cancels out the @differentiable nature of the function, because we actually lose some information.

This is why we only partially count on the RNN's hidden parameters to give us a "good enough" result, and need to incorporate extra weights and biases that are derivable.

The two parts of the correction matrix will retain the nudge speed, as well as reshape the output matrix to match the labels. Matrix addition and multiplication are a bit beyond the scope here as well (and quite frankly a bit boring), but that last step ( step3 in the code ) basically transforms a 512x512x<number of labels> matrix into a 2x<number of labels> one: one column to give us the final probabilities, one for each possible label.

If you've made it this far, congratulations: you've been through the hardest part.

#### Model Training

OK, we have the model we want to use to represent the various orders in the data, how do we train it?

To continue with the blind archer metaphor, we need the piece of code that acts as the "corrector". In ML, it's called the optimizer. We need to give it what the archer is trying to do, a way to measure how far off the mark the archer is, and a sense of how stern it should be (do we make a lot of small corrections, or fewer large ones?)

The measure of the distance is called the "cost" function (or the "accuracy" function, depending on how we look at it): we want to make the cost (or error) as low as possible, and the accuracy as high as possible. They are obviously linked, but can be expressed in different units ("you are 3 centimeters off" versus "you are 1% closer"). Generally, loss has little to no meaning outside of the context of the layers (is 6 far? close? because words aren't sorted in any meaningful way, "we are 6.2 words away from the ideal word" doesn't mean much), while accuracy is more like a satisfaction percentage (we are 93% satisfied with the result, whatever that means).

```swift
func accuracy(predictions: Tensor<Int32>, truths: Tensor<Int32>) -> Float {
    return Tensor<Float>(predictions .== truths).mean().scalarized()
}
```

```swift
let predictions = model(aBunchOfFeatures)
print("Accuracy: \(accuracy(predictions: predictions.argmax(squeezingAxis: 1), truths: aBunchOfLabels))")
```

```
Accuracy: 0.10143079
```

and the loss:

```swift
let predictions = model(aBunchOfFeatures)
let loss = softmaxCrossEntropy(logits: predictions, labels: aBunchOfLabels)
print("Loss test: \(loss)")
```

```
Loss test: 6.8377414
```

In more human terms, the best prediction we have is 10% satisfying, because the result is 6.8 words away from the right one. 😬

Now that we know how to measure how far off the mark we are (in two different ways), we need to make a decision about 3 things:

• Which kind of optimizer we want to use (we'll use Adam, it's a good algorithm for our problem, but other ones exist. For our archer metaphor, it's a gentle but firm voice on the corrections, rather than a barking one that might progress rapidly at first then annoy the hell out of the archer)
• What learning rate we want to use (do we correct a lot of times in tiny increments, or in bigger increments that take overall less time, but might overcorrect)
• How many tries we give the system to get as close as possible

You can obviously see why the two last parameters are hugely important, and very hard to figure out. For some problems, it might be better to use big steps in case we find ourselves stuck, for others it might be better to always get closer to the target but by smaller and smaller increments. It's an art, honestly.

Here, I've used a learning rate of 0.001 (tiny) and a number of tries of 500 (medium), because if there is no way to figure out the structure of the text, I want to know it fast (fewer steps), but I do NOT want to overshoot (small learning rate).

Let's set up the model, the correction matrices, and the training loop:

```swift
var weights = Tensor<Float>(randomNormal: [512, contents.vocabulary.count]) // random probabilities
var biases = Tensor<Float>(randomNormal: [contents.vocabulary.count]) // random bias
var model = TextModel(input: 3, hidden: 512, output: contents.vocabulary.count, weights: weights, biases: biases)
let optimizer = Adam(for: model, learningRate: 0.001) // the gentle-but-firm corrector, with our tiny learning rate
```

Now let's set up the training loop and run it:

```swift
let epochCount = 500
var trainAccuracyResults: [Float] = []
var trainLossResults: [Float] = []

var randomSampleSize = contents.original.count / 15
var randomSampleCount = contents.original.count / randomSampleSize

print("Doing \(randomSampleCount) samples per epoch")
for epoch in 1...epochCount {
    var epochLoss: Float = 0
    var epochAccuracy: Float = 0
    var batchCount: Int = 0

    for training in contents.randomSample(splits: randomSampleCount) {
        let (sampleFeatures, sampleLabels) = training
        // compute the loss and its derivative with respect to the model...
        let (loss, grad) = valueWithGradient(at: model) { (model: TextModel) -> Tensor<Float> in
            let logits = model(sampleFeatures)
            return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
        }
        // ...and let the optimizer nudge the model in that direction
        optimizer.update(&model, along: grad)

        let logits = model(sampleFeatures)
        epochAccuracy += accuracy(predictions: logits.argmax(squeezingAxis: 1), truths: sampleLabels)
        epochLoss += loss.scalarized()
        batchCount += 1
    }
    epochAccuracy /= Float(batchCount)
    epochLoss /= Float(batchCount)
    trainAccuracyResults.append(epochAccuracy)
    trainLossResults.append(epochLoss)
    if epoch % 10 == 0 {
        print("Epoch \(epoch): Loss: \(epochLoss), Accuracy: \(epochAccuracy)")
    }
}
```

A little bit of explanation:

• We will try 500 times ( epochCount )
• At each epoch, I want to test and nudge for 15 different combinations of trigrams. Why? because it avoids the trap of optimizing for some specific turns of phrase
• We apply the model to the sample, calculate the loss, and the derivative, and update the model with where we calculate we should go next

What does that give us?

```
Doing 15 samples per epoch
Epoch 10: Loss: 6.8377414, Accuracy: 0.10143079
Epoch 20: Loss: 6.569199, Accuracy: 0.10564535
Epoch 30: Loss: 6.412607, Accuracy: 0.10802801
Epoch 40: Loss: 6.2550464, Accuracy: 0.10751916
Epoch 50: Loss: 6.0366735, Accuracy: 0.11123683
...
Epoch 490: Loss: 1.1177399, Accuracy: 0.73812264
Epoch 500: Loss: 0.5172857, Accuracy: 0.86724746
```

We like to keep these values in an array to graph them. What does it look like?

We can see that despite the dips and spikes, because we change the samples often and don't attempt any radical movement, we tend toward better and better results. We don't get stuck in a ditch.

Next part, we'll see how to use the model. Here's a little spoiler: I asked it to generate some random text:

> on flocks settlement or their enter the earth; their only hope in their arrows, which for want of it, with a thorn. and distinction of their nature, that in the same yoke are also chosen their chiefs or rulers, such as administer justice in their villages and by superstitious awe in times of old.

It's definitely gibberish when you look closely, but from a distance it looks kind of okayish for a program that learned to speak entirely from scratch, based on a 10k-word essay written by fricking Tacitus.