First part of doing RNN text prediction with TensorFlow, in Swift

#### Broad Strokes

For all intents and purposes, it's about statistics. The question we are trying to solve is either something along the lines of "given an input `X`, what is the most probable `Y`?", or along the lines of "given an input `X`, what is the probability of having `Y`?"

Of course, simple probability problems have somewhat simple solutions: if you take a game of chess and ask for a next move based on the current board, you can do all the possible moves and sort them based on the probability of having a piece taken off the board, for instance. If you are designing an autopilot of some kind, you have an "ideal" attitude (collection of yaw, pitch and roll angles), and you calculate the movements of the stick and pedals that will most likely get you closer to that objective. If your last shot went left of the target, chances are, you should go right. Etc etc etc.

But the most interesting problems don't have obvious causality. If you have pasta, tomatoes and ground meat in your shopping basket, maybe your next item will be onions, because you're making some kind of bolognese, maybe it will be soap, because that's what you need, maybe it will be milk, because that's the order of the shelves you are passing by.

Machine learning is about taking a whole bunch of hopefully consistent data (even if you don't know *for sure* that it's consistent), and using it to say "based on this past data, the probabilities for onions, soap and milk are `X`, `Y`, and `Z`, and therefore *the most probable* is onions."

The data you are basing your predictive model on is really **really** important. Maybe your next item is consistent with the layout of the shop. Maybe it is consistent with what other customers got. Maybe it's consistent with *your* particular habits. Maybe it's consistent with people who are in the same "category" as you (your friends, your socio-economic peers, your cultural peers, ... pick one or more).

So you want a *lot* of data to have a good prediction, but not *all* the data, because noise (random data) neither reveals a bias (people in the shop tend to visit the shelves in that order) nor exploits one (people following recipes want those ingredients together).

Oh yes, there are biases. Lots of them. Because ML uses past data to predict the future, if the data we use was based on bad practices, recommendations won't be a lot better.

There is a branch of machine learning that starts *ex nihilo* and generates its data through a tournament rather than from actual facts, but it is beyond the scope of this introduction. Its principle is roughly the same, though.

So, to recap:

- We start with a model with random probabilities, and a series of "truths" (`X` leads to `Y`)
- We try with a `Z`, and see what the model predicts
- We compare it to a truth, and fix the probabilities a little bit so that it matches
- Repeat with as many `Z`s as possible to be fairly confident the prediction isn't too far off
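That nudge loop can be sketched in a few lines of plain Swift, with a toy single-parameter model and no TensorFlow involved (the hidden `y = 3x` rule and the nudge factor are made up for illustration):

```
// "Truths": we pretend the hidden rule is y = 3x, and the model must discover it.
let truths: [(x: Double, y: Double)] = [(1, 3), (2, 6), (3, 9), (4, 12)]

var w = Double.random(in: 0..<1)   // start with a random "probability"
let nudge = 0.01                   // how hard we fix the model at each step

for _ in 0..<1000 {
    for (x, y) in truths {
        let prediction = w * x      // what the model says
        let error = prediction - y  // compare it to the truth
        w -= nudge * error * x      // fix it a little bit so it matches
    }
}
print(w) // ends up very close to 3
```

Real models juggle millions of `w`s at once instead of one, but the shape of the loop is the same.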

If you didn't before, now you know why ML takes a **lot** of resources. Depending on the number of possible `X`s, `Y`s and the number of truths, the probability matrix is potentially humongous. Operations (like fixing the probabilities to match reality) on such enormous structures aren't cheap.

If you want a more detailed explanation, with maths and diagrams, you can read my other attempt at explaining how it works.

#### Swift TensorFlow

There are a few contenders in the field of "ML SDK", but one of the best known is TensorFlow, backed by Google. It also happens to have a Swift variant (almost every other ML environment out there is either Python or R).

And of course, the documentation is really... lacking, making this article more useful than average along the way.

In their "Why Swift?" motivation piece, the architects make a good case, if a little bit technical, as to why Swift makes a good candidate for ML.

The two major takeaways you have to know going in are:

- It's a *different* build of Swift. You cannot use the one that shipped with Xcode (yet)
- It uses a lot of Python interoperability to work, so some ways of doing things will be a bit alien

The performance is rather good, comparable to or better than the regular Python TensorFlow for the tasks I threw at it, so there's that.

But the documentation... My oh my.

Let's take an example: Tensor is, as the name of the framework implies, the central feature of the system. Its documentation is here: https://www.tensorflow.org/swift/api_docs/Structs/Tensor

Sometimes, that page greets me in Greek... But hey, why not. There is little to no way to navigate the hierarchy, other than going on the left side, opening the section (good luck if you don't already know if it's a class, a protocol or a struct you're looking for), and if you use the search field, it will return pages about... the Python equivalents of the functions you're looking for.

Clearly, this is early in the game, and you are assumed to know how regular TensorFlow works before attempting to do anything with STF.

But fret not! I will hold your hand so that you don't need to look at the doc too much.

The tutorials are well written, but don't go very far at all. Oh, and if you use that triple `Dense` layer setup on more than a toy problem (flower classification based on numbers), your RAM will fill so fast that your computer will have to be rebooted. More on that later.

And, because the center of ML is that "nudge" towards a better probability matrix (also called a Tensor), there is the whole `@differentiable` thing. We will talk about it later.

A good thing is that Python examples (there are thousands of ML tutorials in Python) work almost out of the box, thanks to the interop.

#### Data Preparation

Which Learning will my Machine do?

I have always thought that text generation was such a funny toy example (if a little bit scary when you think about some of the applications): teach the machine to speak like Shakespeare, and watch it spit some play at you. It's also easy for us to evaluate in terms of what it does and how successful it is. And the model makes sense, which helps when writing a piece on how ML works.

A usual way of doing that is using *trigrams*. We all know that predicting the next word after a single word is super hard. Yet our brains tend to be able to predict the last word of a sentence with ease. So, a common way of teaching the machine is to have it look at 3 words to predict a 4th.

`I am hungry` -> `because`, `the birds flew` -> `away`, etc.

Of course, for more accurate results, you can extend the number of words in the input, but it means you must have a lot more varied sentence examples.

What we need to do here is assign numbers to these words (because everything is numbers in computers) so that we have a problem like "guess the function f if f(231,444,12)->123, f(111,2,671)->222", which neural networks are pretty good at.
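On a miniature, made-up corpus, the numbering looks like this (indices assigned in order of first appearance, just for illustration):

```
let words = ["the", "birds", "flew", "away", "because"]

// assign a number to each distinct word
var index: [String: Int] = [:]
for w in words where index[w] == nil { index[w] = index.count }

// every window of 3 words is a feature, the 4th word is the label
var features: [[Float]] = []
var labels: [Int] = []
for i in 0..<(words.count - 3) {
    features.append((0..<3).map { Float(index[words[i + $0]]!) })
    labels.append(index[words[i + 3]]!)
}
print(features) // [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]]
print(labels)   // [3, 4]
```

The real corpus does exactly this, only with thousands of words instead of five.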

So we need data (a *corpus*), and we need to split it into `(trigram)->result` pairs. Now, because we are ultimately dealing with probabilities and rounding, we need the input to be `Float`s, so that the operations can wriggle the matrices by fractions, and we need the result to be an `Int`, because we don't want something like "the result is the word between 'hungry' and 'dumb'".

The features (also called inputs) and the labels (also called outputs) have to be stored in two tensors (also called matrices), matching the data we want to train our model on.

That's where RAM and processing time enter the arena: the *size* of the matrix is going to be huge:

- Let's say the book I chose to teach it English has 11148 words in it (it's Tacitus' *Germania*). Sliding a 4-word window over it yields `11148-3`, or 11145, `(trigram, next word)` samples: 11145 lines in my matrices, 4 columns total
- The way neural networks function, you basically have a function parameter per neuron that gets nudged at each iteration. In this example, I use two layers of 512 neurons for somewhat decent results. That means two additional matrices of size `11145*512`
- And operations regularly duplicate these matrices, if only for a short period of time, so yes, that's a lot of RAM and processing power
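As a back-of-the-envelope check (my own estimate: `Float`s at 4 bytes, one sample per sliding window, and counting only one `rows x 512` matrix per hidden layer, which glosses over the weight matrices themselves):

```
let rows = 11148 - 3    // one (trigram, next word) sample per sliding window
let hiddenUnits = 512
let bytesPerFloat = 4

let featureBytes = rows * 3 * bytesPerFloat
let hiddenBytes = 2 * rows * hiddenUnits * bytesPerFloat

print(featureBytes / 1024, "KB of features")                // ~130 KB
print(hiddenBytes / 1_048_576, "MB of hidden activations")  // ~43 MB
```

Tens of megabytes sounds manageable, until intermediate copies and gradients start multiplying it.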

Here is the function that downloads a piece of text, and separates it into words:

```
func loadData(_ url: URL) -> [String] {
    let sem = DispatchSemaphore(value: 0)
    var result = [String]()
    let session = URLSession(configuration: URLSessionConfiguration.default)
    // swap the commented line in to separate punctuation from the words
    // let set = CharacterSet.punctuationCharacters.union(CharacterSet.whitespacesAndNewlines)
    let set = CharacterSet.whitespacesAndNewlines
    session.dataTask(with: url, completionHandler: { data, response, error in
        if let data = data, let text = String(data: data, encoding: .utf8) {
            let comps = text.components(separatedBy: set).compactMap { (w) -> String? in
                // drop the empty strings left over by consecutive separators
                if w.count == 0 { return nil }
                else { return w }
            }
            result += comps
        }
        sem.signal()
    }).resume()
    sem.wait()
    return result
}
```

Please note two things: I make it synchronous (I want to wait for the result), and I chose to keep `word` and `word,` as separate entries. You can keep only the bare words by switching the commented lines, but I find that the output is more interesting with punctuation than without.
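To see what the two character sets do differently, here is the same sentence (a sample of my own) split both ways:

```
import Foundation

let sample = "I am hungry, because I skipped lunch."

// keep punctuation glued to the words (the version used above)
let withPunctuation = sample
    .components(separatedBy: .whitespacesAndNewlines)
    .filter { !$0.isEmpty }
print(withPunctuation) // ["I", "am", "hungry,", "because", "I", "skipped", "lunch."]

// strip punctuation as well (the commented-out version)
let splitters = CharacterSet.punctuationCharacters.union(.whitespacesAndNewlines)
let withoutPunctuation = sample
    .components(separatedBy: splitters)
    .filter { !$0.isEmpty }
print(withoutPunctuation) // ["I", "am", "hungry", "because", "I", "skipped", "lunch"]
```

With the first variant, `hungry,` and `hungry` end up as two distinct entries in the vocabulary.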

Now, we need to set up the `word->int` and `int->word` transformations. Because we don't want to scan the whole array of words every time we search for one, a dictionary based on the hashing of the words deals with the former; and because the most common words have better chances to pop up, the array for the vocabulary is sorted by frequency. It's probably not optimal, but it helps make things clear, and is fast enough.

```
func loadVocabulary(_ text: [String]) -> [String] {
    // count the occurrences of each word
    var counts = [String: Int]()
    for w in text {
        counts[w] = (counts[w] ?? 0) + 1
    }
    // most frequent words first
    let count = counts.sorted { $0.value > $1.value }
    return count.map { $0.key }
}

func makeHelper(_ vocabulary: [String]) -> [Int: Int] {
    // hash of the word -> index in the vocabulary
    var result: [Int: Int] = [:]
    for (offset, element) in vocabulary.enumerated() {
        result[element.hash] = offset
    }
    return result
}
```

Why not `hashValue` instead of `hash`? It turns out that on Linux, which this baby is going to run on, the values are more stable with the latter, according to my tests.

The data we will work on therefore is:

```
struct TextBatch {
    let original: [String]
    let vocabulary: [String]
    let indexHelper: [Int: Int]
    let features: Tensor<Float> // 3 words
    let labels: Tensor<Int32>   // followed by 1 word
}
```

We need a way to initialize that struct, and a couple of helper functions to extract some random samples to train our model on, and we're good to go:

```
extension TextBatch {
    public init(from: [String]) {
        let v = loadVocabulary(from)
        let h = makeHelper(v)
        var f: [[Float]] = []
        var l: [Int32] = []
        for i in 0..<(from.count - 3) {
            if let w1 = h[from[i].hash],
               let w2 = h[from[i + 1].hash],
               let w3 = h[from[i + 2].hash],
               let w4 = h[from[i + 3].hash] {
                f.append([Float(w1), Float(w2), Float(w3)])
                l.append(Int32(w4))
            }
        }
        let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 })
        let labelsT = Tensor<Int32>(l)
        self.init(
            original: from,
            vocabulary: v,
            indexHelper: h,
            features: featuresT,
            labels: labelsT
        )
    }

    func randomSample(of size: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
        var f: [[Float]] = []
        var l: [Int32] = []
        for i in 0..<(original.count - 3) {
            if let w1 = indexHelper[original[i].hash],
               let w2 = indexHelper[original[i + 1].hash],
               let w3 = indexHelper[original[i + 2].hash],
               let w4 = indexHelper[original[i + 3].hash] {
                f.append([Float(w1), Float(w2), Float(w3)])
                l.append(Int32(w4))
            }
        }
        // if the requested size doesn't make sense, return everything
        if size >= l.count || size <= 0 {
            let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 })
            let labelsT = Tensor<Int32>(l)
            return (featuresT, labelsT)
        }
        var rf: [[Float]] = []
        var rl: [Int32] = []
        var alreadyPicked = Set<Int>()
        while alreadyPicked.count < size {
            let idx = Int.random(in: 0..<l.count)
            if !alreadyPicked.contains(idx) {
                rf.append(f[idx])
                rl.append(l[idx])
                alreadyPicked.update(with: idx)
            }
        }
        // build the tensors from the sampled rows, not the whole data
        let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
        let labelsT = Tensor<Int32>(rl)
        return (featuresT, labelsT)
    }

    func randomSample(splits: Int) -> [(features: Tensor<Float>, labels: Tensor<Int32>)] {
        var res = [(features: Tensor<Float>, labels: Tensor<Int32>)]()
        var alreadyPicked = Set<Int>()
        let size = Int(floor(Double(original.count) / Double(splits)))
        var f: [[Float]] = []
        var l: [Int32] = []
        for i in 0..<(original.count - 3) {
            if let w1 = indexHelper[original[i].hash],
               let w2 = indexHelper[original[i + 1].hash],
               let w3 = indexHelper[original[i + 2].hash],
               let w4 = indexHelper[original[i + 3].hash] {
                f.append([Float(w1), Float(w2), Float(w3)])
                l.append(Int32(w4))
            }
        }
        if size >= l.count || size <= 0 {
            let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 })
            let labelsT = Tensor<Int32>(l)
            return [(featuresT, labelsT)]
        }
        for part in 1...splits {
            var rf: [[Float]] = []
            var rl: [Int32] = []
            // each split picks `size` new rows, never reusing one from a previous split
            while alreadyPicked.count < min(size * part, l.count) {
                let idx = Int.random(in: 0..<l.count)
                if !alreadyPicked.contains(idx) {
                    rf.append(f[idx])
                    rl.append(l[idx])
                    alreadyPicked.update(with: idx)
                }
            }
            let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
            let labelsT = Tensor<Int32>(rl)
            res.append((featuresT, labelsT))
        }
        return res
    }
}
```

In the next part, we will see how to set the model up, and train it.