swift - Jack Of Spades In Wonderland (Page 2)

11 Apr 2020 on swift, coding, CS, opinion, oss

[Dev Diary] Vanilla Is The Best Flavor

I have a weird thing with the multiplication of command-line tools and gizmos: I forget them.

Do I want to run supercool gitlab commands? Hell yea! Do I need to install 12 utilities (or code a new one) to archive every project older than a year? I hope not...

The setup

I am a sucker for well documented fully linted code. But the thing is, all the gizmos that help me do that have to be installed in the system or in my ~/bin and I have to remember to update them, and I have to install them on my CD machine, and on every new environment I setup, and make sure they are still compatible with the toolchain, and it freaks me out, ok?

Plus,watching the students try to do it is painful.

So, given a 100% vanilla swift-capable environment, can I manage to run documentation and linting?

The idea

We have Swift Package Manager, which is now a first-class citizen in XCode, but it can't run shell script phases without some nasty hacks.

What if some targets were (wait for it) built to do the documentation and the linting?

Linting

One of the most popular linters out there is swiftlint, and it supports SPM. It can also build a library instead of an executable, which means one of my targets could just run the linting and output it in the terminal.

In the Package.swift file, all I needed to do was add the right dependency, and the right product and voila!

let package = Package(
	name: "WonderfulPackage",
    products: [
    	// ...
         .executable(name: "Lint", targets: ["Lint"])
   	],
    dependencies: [
        // Dependencies declare other packages that this package depends on.
        // .package(url: /* package url */, from: "1.0.0"),
		// ... normal dependencies
        .package(url: "https://github.com/realm/SwiftLint", from: "0.39.0")
    ],
    targets: [
    	// ... normal targets
        .target(
            name: "Lint",
            dependencies: ["SwiftLintFramework"]),
	]
)

Package.swift

Now, SPM is very strict with paths, so I had to put a file named main.swift in the Sources/<target>/ directory, in this case Sources/Lint.

Running the linter is fairly straightforward, and goes in the main.swift file:

// Lint command main
// runs SourceDocs
import Foundation
import SwiftLintFramework

let config = Configuration(path: FileManager.default.currentDirectoryPath+"/.swiftlint.yml",
                           rootPath: FileManager.default.currentDirectoryPath,
                           optional: true,
                           quiet: true,
                           enableAllRules: false,
                           cachePath: nil,
                           customRulesIdentifiers: [])

for lintable in config.lintableFiles(inPath: FileManager.default.currentDirectoryPath, forceExclude: false) {
    let linter = Linter(file: lintable, configuration: config)
    let storage = RuleStorage()
    let collected = linter.collect(into: storage)
    let violations = collected.styleViolations(using: storage)
    if !violations.isEmpty {
        print(EmojiReporter.generateReport(violations))
    }
}

print("🎉 All done!")

Sources/Lint/main.swift

Setup the .swiftlint file as usual, and run the command via swift run Lint

Sources/WonderfulPackage/main.swift
⛔️ Line 15: Variable name should be between 3 and 40 characters long: 'f'
⚠️ Line 13: Arguments can be omitted when matching enums with associated types if they are not used.
⚠️ Line 12: Line should be 120 characters or less: currently 143 characters

Documentation

Documentation is actually trickier, because most documentation tools out there aren't built in swift, or compatible with SPM. Doxygen and jazzy are great, but they don't fit my needs.

I found a project that was extremely promising called SourceDocs by Eneko Alonso, but it isn't a library, so I had to fork it and make it into one (while providing a second target to generate the executable if needed). One weird issue is that SPM doesn't like subtargets to bear the same name so I had to rename a couple of them to avoid conflict with Swift Argument Parser (long story).

I finally found myself in the same spot than with the linter. All I needed to do was create another target, and Bob's you're uncle. Well actually he was mine. I digress.

let package = Package(
	name: "WonderfulPackage",
    products: [
    	// ...
         .executable(name: "Docs", targets: ["Docs"])
   	],
    dependencies: [
        // Dependencies declare other packages that this package depends on.
        // .package(url: /* package url */, from: "1.0.0"),
		// ... normal dependencies
        .package(url: "https://github.com/krugazor/SourceDocs", from: "0.7.0")
    ],
    targets: [
    	// ... normal targets
        .target(
            name: "Docs",
            dependencies: ["sourcedocslib"])
	]
)

Package.swift

Another well-placed main file:

// Docs command main
// runs SourceDocs
import Foundation
import SourceDocs

do {
    switch try SourceDocs().runOnSPM(moduleName: "WonderfulPackage",
                                     outputDirectory: FileManager.default.currentDirectoryPath+"/Documentation") {
    case .success:
        print("Successful run of the documentation phase")
    case .failure(let failure):
        print(failure.localizedDescription)
    }
} catch {
    print(error.localizedDescription)
}

Sources/Docs/main.swift

Now, the command swift run Docs generates the markdown documentation in the Documentation directory.

Parsing main.swift (1/1)
Removing reference documentation at 'WonderfulPackage/Documentation/KituraStarter'... ✔
Generating Markdown documentation...
  Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/structs/WonderfulPackage.md ✔
  Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/README.md ✔
Done 🎉
Successful run of the documentation phase

Conclusion

✅ Vanilla swift environment
✅ No install needed
✅ Works on Linux and MacOS
✅ Integrated into SPM
⚠️ When running in XCode, the current directory is always wonky for packages

30 Mar 2020 on swift, CS, coding

[Utilities] Time Tracking Structure

Every now and again (especially when training a model), I need to have a guesstimate as to how long a "step" takes, and how long the process will take, so I wrote myself a little piece of code that does that. Because I've had the question multiple times (and because I think everyone codes their own after a while), here's mine. Feel free to use it

/// Structure that keeps track of the time it takes to complete steps, to average or estimate the remaining time
public struct TimeRecord {
    /// The number of steps to keep for averaging. 5 is a decent default, increase or decrease as needed
    /// Minimum for average is 2, obvioulsy
    public var smoothing: Int = 5 {
        didSet {
            smoothing = max(smoothing, 2) // minimum 2 values
        }
    }
    /// dates for the steps
    private var dates : [Date] = []
    /// formatter for debug print and/or display
    private var formatter = DateComponentsFormatter()
    public var formatterStyle : DateComponentsFormatter.UnitsStyle {
        didSet {
            formatter.allowedUnits = [.hour, .minute, .second, .nanosecond]
            formatter.unitsStyle = formatterStyle
        }
    }
    
    public init(smoothing s: Int = 5, style fs: DateComponentsFormatter.UnitsStyle = .positional) {
        smoothing = max(s, 2)
        formatterStyle = fs
        formatter = DateComponentsFormatter()
        // not available everywhere
        // formatter.allowedUnits = [.hour, .minute, .second, .nanosecond]
        formatter.allowedUnits = [.hour, .minute, .second]
        formatter.zeroFormattingBehavior = .pad
        formatter.unitsStyle = fs
    }
    
    /// adds the record for a step
    /// - param d: the date of the step. If unspecified, current date is taken
    mutating func addRecord(_ d: Date? = nil) {
        if let d = d { dates.append(d) }
        else { dates.append(Date()) }
        while(dates.count > smoothing) { dates.remove(at: 0) }
    }
    
    /// gives the average delta between two steps (in seconds)
    var averageDelta : Double {
        if dates.count <= 1 { return 0.0 }
        var totalTime = 0.0
        for i in 1..<dates.count {
            totalTime += dates[i].timeIntervalSince(dates[i-1])
        }
        
        return totalTime/Double(dates.count)
    }
    
    /// gives the average delta between two steps in human readable form
    /// - see formatterStyle for options, default is "02:46:40"
    var averageDeltaHumanReadable : String {
        let delta = averageDelta
        return formatter.string(from: delta) ?? ""
    }
    
    /// given a number of remaining steps, gives an estimate of the time left on the process (in s)
    func estimatedTimeRemaining(_ steps: Int) -> Double {
        return Double(steps) * averageDelta
    }
    
    /// given a number of remaining steps, gives an estimate of the time left on the process in human readable form
    /// - see formatterStyle for options, default is "02:46:40"
    func estimatedTimeRemainingHumanReadable(_ steps: Int) -> String {
        let delta = estimatedTimeRemaining(steps)
         return formatter.string(from: delta) ?? ""
    }
}

When I train a model, I tend to use it that way:

// prepare model
var tt = TimeRecord()
tt.addRecord()

while currentEpoch < maxEpochs {
  // train the model
  tt.addRecord()
  if currentEpoch > 0 && currentEpoch % 5 == 0 {
  	print(tt.averageDeltaHumanReadable + " per epoch, " 
    	+ tt.(estimatedTimeRemainingHumanReadable(maxEpochs - currentEpoch) + " remaining"
    )
  }
}

10 Feb 2020 on swift, algorithms, holedrange, oss

[HoledRange] Sequences and Transformations

HoledRange has reached v 0.1.3 with two additions:

Iterating over ranges (contributed by Hugo a.k.a. Zyigh)
Transforming ranges into other ranges

Transformations

Simple idea, full of tricks: how do I apply a function f to transform a domain?

It has to result to a possible bound value for the domain (ie Hashable & Comparable)
We need to check the ordering of the new values in case the transformation flips the sign (eg [0;1] -> [-1;0])
We need to optimize the storage as we go in case the transformation collapses holes

24 Jan 2020 on machine learning, CS, swift, programming

[ML] Swift TensorFlow (Part 3)

This is the last part of a 3-parts series. In part 1, I tried to make sense of how it works and what we are trying to achieve, and in part 2, we set up the training loop.

Model Predictions

We have a trained model. Now what?

Remember, a model is a series of giant matrices that take an input like you trained it on, and spits out the list of probabilities associated with the outputs you trained it on. So all you have to do is feed it a new input and see what it tells you:

let input = [1.0, 179.0, 115.0]
let unlabeled : Tensor<Float> = Tensor<Float>(shape: [1, 3], scalars: input)
let predictions = model(unlabeled)
let logits = predictions[0]
let classIdx = logits.argmax().scalar! // we take only the best guess
print(classIdx)

Cool.

Cool, cool.

What?

Models deal with numbers. I am the one who assigned numbers to words to train the model on, so I need a translation layer. That's why I kept my contents structure around: I need it for its vocabulary map.

The real code:

let w1 = "on"
let w2 = "flocks"
let w3 = "settlement"

var indices = [w1, w2, w3].map {
    Float(contents.indexHelper[$0.hash] ?? 0)
}

var wordsToPredict = 50
var sentence = "\(w1) \(w2) \(w3)"

while wordsToPredict >= 0 {
    let unlabeled : Tensor<Float> = Tensor<Float>(shape: [1, 3], scalars: indices)
    let predictions = model(unlabeled)
    for i in 0..<predictions.shape[0] {
        let logits = predictions[i]
        let classIdx = logits.argmax().scalar!
        let word = contents.vocabulary[Int(classIdx)]
        sentence += " \(word)"
        
        indices.append(Float(classIdx))
        indices.remove(at: 0)
        wordsToPredict -= 1
    }
}

print(sentence)

on flocks settlement or their enter the earth; their only hope in their arrows, which for want of it, with a thorn. and distinction of their nature, that in the same yoke are also chosen their chiefs or rulers, such as administer justice in their villages and by superstitious awe in times of old.

Notice how I remove the first input and add the one the model predicted at the end to keep the loop running.

Seeing that, it kind of makes you think about the suggestions game when you send text messages eh? 😁

Model Serialization

Training a model takes a long time. You don't want a multi-hour launch time on your program every time you want a prediction, and maybe you even want to keep updating the model every now and then. So we need a way to store it and load it.

Thankfully, tensors are just matrices, so it's easy to store an array of arrays of floats, we've been doing that forever. They are even Codable out of the box.

In my particular case, the model itself needs to remember a few things to be recreated:

the number of inputs and hidden nodes, in order to recreate the Reshape and LSTMCell layers
the internal probability matrices of both RNNs
the weigths and biases correction matrices

Because they are codable, any regular swift encoder will work, but I know some of you will want to see the actual matrices, so I use JSON. It is not the most time or space efficient, it does not come with a way to validate it, and JSON is an all-around awful storage format, but it makes a few things easy.

extension TextModel { // serialization
    struct TextModelParams : Codable {
        var inputs : Int
        var hidden : Int
        var rnn1w : Tensor<Float>
        var rnn1b : Tensor<Float>
        var rnn2w : Tensor<Float>
        var rnn2b : Tensor<Float>
        var weights : Tensor<Float>
        var biases : Tensor<Float>
    }
    func serializedParameters() throws -> Data {
        return try JSONEncoder().encode(TextModelParams(
        inputs: self.inputs,
        hidden: self.hidden,
        rnn1w: self.rnn1.cell.fusedWeight,
        rnn1b: self.rnn1.cell.fusedBias,
        rnn2w: self.rnn2.cell.fusedWeight,
        rnn2b: self.rnn1.cell.fusedBias,
        weights: self.weightsOut,
        biases: self.biasesOut))
    }
    
    struct TextModelSerializationError : Error { }
    init(_ serialized: Data) throws {
        guard let params = try? JSONDecoder().decode(TextModelParams.self, from: serialized) else { throw TextModelSerializationError() }
        
        inputs = params.inputs
        hidden = params.hidden
        reshape = Reshape<Float>([-1, inputs])
        
        var lstm1 = LSTMCell<Float>(inputSize: 1, hiddenSize: hidden)
        lstm1.fusedWeight = params.rnn1w
        lstm1.fusedBias = params.rnn1b
        var lstm2 = LSTMCell<Float>(inputSize: hidden, hiddenSize: hidden)
        lstm2.fusedWeight = params.rnn2w
        lstm2.fusedBias = params.rnn2b
        
        rnn1 = RNN(lstm1)
        rnn2 = RNN(lstm2)
        
        weightsOut = params.weights
        biasesOut = params.biases
        correction = weightsOut+biasesOut
   }
}

My resulting JSON file is around 70MB (25 when bzipped), so not too bad.

When you serialize your model, remember to serialize the vocabulary mappings as well! Otherwise, you will lose the word <-> int translation layer.

That's all , folks!

This was a quick and dirty intro to TensorFlow for some, Swift for others, and SwiftTensorflow for most.

It definitely is a highly specialized and quite brittle piece of software, but it's a good conversation piece next time you hear that ML is going to take over the world.

Feel free to drop me comments or questions or corrections on Twitter!

23 Jan 2020 on machine learning, CS, swift, programming

[ML] Swift TensorFlow (Part 2)

This is the second part of a series. If you haven't, you should read part 1...

Model Preparation

The text I trained the model on is available on the Gutenberg Project. Why this one? Why not?

It has a fairly varied vocabulary and a consistency of grammar and phrase structures that should trigger the model. One of the main problems of picking the wrong corpus is that it leads to cycles in the prediction with the most common words, e.g. "and the most and the most and the most and the" because it's the pattern that you see most in the text. Tacitus, at least, should not have such repetitive turns of phrase. And it's interesting in and of itself, even though it's a bit racist, or more accurately, elitist. 😂

One of the difficult decisions is choosing the type of network we will be trying to train. I tend to have fairly decent results with RNNs on that category of problems so that's what I'll use. The types and sizes of these matrices is wayyyyy beyond the scope of this piece, but RNNs tend to be decent generalists. Two RNN/LSTM layers of 512 hidden nodes will give me enough flexibility for the task and good accuracy.

What are those and how do they work? You can do a deep dive on LSTM and RNN on Wikipedia, but the short version is, they work well with sequences because the order of the input is in and of itself one of the features it deals with. Recommended for handwriting recognition, speech recognition, or pattern analysis.

Why two layers? The way you "nudge" parameters in the training phase means that you should have as many layers as you think there are orders of things in your dataset. In the case of text pattern recognition, you can say that what matters is the first order of recognition (say, purely statistical "if this word then this word") or you can add a second order where you try to identify words that tend to have similar roles in the structure (e.g. subject verb object) and take that into account as well. Higher orders than that, in this particular instance, have very little meaning unless you are dealing with, say, a multilingual analysis.

That's totally malarkey when you look at the actual equations, but it helps to see it that way. Remember that you deal with probabilities, and that the reasoning the machine will learn is completely alien to us. By incorporating orders in the model, you make a suggestion to the algorithm, but you can't guarantee that it will take that route. It makes me feel better, so I use it.

Speaking of layers, it is another one of these metaphors that help us get a handle of things, by organizing our code and the way the algorithm treats the data.

You have an input, it will go through a first layer of probabilities, then a second layer will take the output of the first one, and apply its probabilities, and then you have an output.

Let's look at the actual contents of these things:

Input is a list of trigrams associated with a word ( (borrowing a warrant) -> from, (his father Laertes) -> added, etc
The first layer has a single input (the trigram), and a function with 512 tweakable parameters to output the label
The second layer is trickier: it takes the 512 parameters of the first layer, and has 512 tweakable parameters of its own, to deal with the "higher order" of the data

It sounds weird, but it works, trust me for now, you'll experiment later.

The very first step is "reshaping" the trigrams so that LSTM can deal with it. We basically turn the matrices around and chunk them so that they are fed to the model as single inputs, 3 of them, in this order. It is actually a layer of its own called Reshape.

And finally, we need to write that using this model requires these steps:

reshape
rnn1
rnn2
get something usable out of it

The code, then the comments:

struct TextModel : Layer {
    @noDerivative var inputs : Int
    @noDerivative var hidden : Int
    var reshape : Reshape<Float>
    
    var rnn1 : RNN<LSTMCell<Float>>
    var rnn2 : RNN<LSTMCell<Float>>
    
    var weightsOut : Tensor<Float> {
        didSet { correction = weightsOut+biasesOut }
    }
    var biasesOut : Tensor<Float> {
        didSet { correction = weightsOut+biasesOut }
    }
    fileprivate var correction: Tensor<Float>
    
    init(input: Int, hidden: Int, output: Int, weights: Tensor<Float>, biases: Tensor<Float>) {
        inputs = input
        self.hidden = hidden
        reshape = Reshape<Float>([-1, input])
        
        let lstm1 = LSTMCell<Float>(inputSize: 1, hiddenSize: hidden)
        let lstm2 = LSTMCell<Float>(inputSize: hidden, hiddenSize: hidden)
        rnn1 = RNN(lstm1)
        rnn2 = RNN(lstm2)
        
        weightsOut = weights
        biasesOut = biases
        correction = weights+biases
    }
    
    @differentiable
    func runThrough(_ input: Tensor<Float>) -> Tensor<Float> {
        let reshaped = reshape.callAsFunction(input).split(count: inputs, alongAxis: 1)
        let step1 = rnn1.callAsFunction(reshaped).differentiableMap({ $0.cell })
        let step2 = rnn2.callAsFunction(step1).differentiableMap({ $0.cell })
        let last = withoutDerivative(at:step2[0])
        let red = step2.differentiableReduce(last, { (p,e) -> Tensor<Float> in return e })
        return red
    }
    
    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let step2out = runThrough(input)
        let step3 = matmul(step2out, correction)
        return step3
    }
}

The RNN/LTSM have been talked about, but what are these two functions?

callAsFunction is the only one needed. I have just decided to split the algorithm in two: the part where I "just" pass through layers, and the part where I format the output. Everything in runThrough could be written at the top of callAsFunction.

We follow the steps outlined previously, it all seems logical, even if the details aren't quite clear yet.

What is it with the @noDerivative and @differentiable annotations?

Because we are dealing with a structure (model, layer, etc...) that not only should but will be adjusted over time, it is a way to tell the system which parts are important to that adjustment:

all properties except those maked as not derivative will be nudged potentially, so it makes sense to mark the number of inputs as immutable, and the rest as "nudgeable"
all the functions that calculate something that will be used in the "nudging" need to have specific maths properties that make the change non-random. We need to know where we are and where we were going. We need a position, and a speed, we need a value and its derivative

Ugh, maths.

Yeah.

I am obviously oversimplifying everything to avoid scaring away everyone from the get go, but the idea should make sense if you look at it this way:

Let's take a blind man trying to shoot an arrow at a target
You ask them to shoot and then you'll correct them based on where the arrow lands
It hits the far left of the target
You tell them to nudge the aim to the right
The problem is that "more right" isn't enough information... You need to tell them to the right a little (new position and some information useful for later, you'll see)
The arrow lands slightly to the right of the center
You tell the archer to aim to the left but less than their movement they just made to the right.

Two pieces of information: one relative to a direction, and one relative to the rate of change. The other name of the rate of change is the derivative.

Standard derivatives are speed to position (we are here, now we are there, and finally we are there, and the rate of change slowed, so the next position won't be as far from this one as the one was to the previous one), or acceleration to speed (when moving, if your speed goes up and up and up, you have a positive rate of change, you accelerate).

That's why passing through a layer should preserve the two: the actual values, and the speed at which we are changing them. Hence the @differentiable annotation.

(NB for all you specialists in the field reading that piece... yes I know. I'm trying to make things more palatable)

"But wait", say the most eagle-eyed among you, "I can see a withoutDerivative in that code!"

Yes. RNN is peculiar in the way that it doesn't try to coerce the dimensions of the results. It spits out all the possible variants it has calculated. But in practice, we need only the last one. Taking one possible outcome out of many cancels out the @differentiable nature of the function, because we actually lose some information.

This is why we only partially count on the RNN's hidden parameters to give us a "good enough" result, and need to incorporate extra weights and biases that are derivable.

The two parts of the correction matrix, will retain the nudge speed, as well as reshape the output matrix to match the labels: matrix addition and multiplications are a bit beyond the scope here as well (and quite frankly a bit boring), but that last step ( step3 in the code ) basically transform a 512x512x<number of labels> matrix, into a 2x<numbers of labels> : one column to give us the final probabilities, one for each possible label.

If you've made it this far, congratulations, you've been through the hardest.

Model Training

OK, we have the model we want to use to represent the various orders in the data, how do we train it?

To continue with the blind archer metaphor, we need the piece of code that acts as the "corrector". In ML, it's called the optimizer. We need to give it what the archer is trying to do, and a way to measure how far off the mark the archer is, and a sense of how stern it should be (do we do a lot of small corrections, or fewer large ones?)

The measure of the distance is called the "cost" function, or the "accuracy" function. Depending on how we look at it we want to make the cost (or error) as low as possible, and the accuracy as high as possible. They are obviously linked, but can be expressed in different units ("you are 3 centimeters off" and "you are closer by 1%"). Generally, loss has little to no meaning outside of the context of the layers ( is 6 far? close? because words aren't sorted in any meaningful way, we are 6.2 words away from the ideal word doesn't mean much), while accuracy is more like a satisfaction percentage (we are 93% satisfied with the result, whatever that means).

func accuracy(predictions: Tensor<Int32>, truths: Tensor<Int32>) -> Float {
    return Tensor<Float>(predictions .== truths).mean().scalarized()
}

let predictions = model(aBunchOfFeatures)
print("Accuracy: \(accuracy(predictions: predictions.argmax(squeezingAxis: 1), truths: aBunchOfLabels))")

Accuracy: 0.10143079

and the loss:

let predictions = model(aBunchOfFeatures)
let loss = softmaxCrossEntropy(logits: predictions, labels: aBunchOfLabels)
print("Loss test: \(loss)")

Loss test: 6.8377414

In more human terms, the best prediction we have is 10% satisfying, because the result is 6.8 words away from the right one. 😬

Now that we know how to measure how far off the mark we are (in two different ways), we need to make a decision about 3 things:

Which kind of optimizer we want to use (we'll use Adam, it's a good algorithm for our problem, but other ones exist. For our archer metaphor, it's a gentle but firm voice on the corrections, rather than a barking one that might progress rapidly at first then annoy the hell out of the archer)
What learning rate we want to use (do we correct a lot of times in tiny increments, or in bigger increments that take overall less time, but might overcorrect)
How many tries we give the system to get as close as possible

You can obviously see why the two last parameters are hugely important, and very hard to figure out. For some problems, it might be better to use big steps in case we find ourselves stuck, for others it might be better to always get closer to the target but by smaller and smaller increments. It's an art, honestly.

Here, I've used a learning rate of 0.001 (tiny) and a number of tries of 500 (medium), because if there is no way to figure out the structure of the text, I want to know it fast (fewer steps), but I do NOT want to overshoot(small learning rate).

Let's setup the model, the correction matrices, and the training loop:

var weigths = Tensor<Float>(randomNormal: [512, contents.vocabulary.count]) // random probabilities
var biases = Tensor<Float>(randomNormal: [contents.vocabulary.count]) // random bias
var model = TextModel(input:3, hidden: 512, output: contents.vocabulary.count, weights: weigths, biases: biases)

Now let's setup the training loop and run it:

let epochCount = 500
var trainAccuracyResults: [Float] = []
var trainLossResults: [Float] = []

var randomSampleSize = contents.original.count/15
var randomSampleCount = contents.original.count / randomSampleSize

print("Doing \(randomSampleCount) samples per epoch")
for epoch in 1...epochCount {
    var epochLoss: Float = 0
    var epochAccuracy: Float = 0
    var batchCount: Int = 0

    for training in contents.randomSample(splits: randomSampleCount) {
        let (sampleFeatures,sampleLabels) = training
        let (loss, grad) = model.valueWithGradient { (model: TextModel) -> Tensor<Float> in
            let logits = model(sampleFeatures)
            return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
        }
        optimizer.update(&model, along: grad)
        
        let logits = model(sampleFeatures)
        epochAccuracy += accuracy(predictions: logits.argmax(squeezingAxis: 1), truths: sampleLabels)
        epochLoss += loss.scalarized()
        batchCount += 1
    }
    epochAccuracy /= Float(batchCount)
    epochLoss /= Float(batchCount)
    trainAccuracyResults.append(epochAccuracy)
    trainLossResults.append(epochLoss)
    if epoch % 10 == 0 {
       print("avg time per epoch: \(t.averageDeltaHumanReadable)")
       print("Epoch \(epoch): Loss: \(epochLoss), Accuracy: \(epochAccuracy)")
    }
}

A little bit of explanation:

We will try 500 times ( epochCount )
At each epoch, I want to test and nudge for 15 different combinations of trigrams. Why? because it avoids the trap of optimizing for some specific turns of phrase
We apply the model to the sample, calculate the loss, and the derivative, and update the model with where we calculate we should go next

What does that give us?

Doing 15 samples per epoch
Epoch 10: Loss: 6.8377414, Accuracy: 0.10143079
Epoch 20: Loss: 6.569199, Accuracy: 0.10564535
Epoch 30: Loss: 6.412607, Accuracy: 0.10802801
Epoch 40: Loss: 6.2550464, Accuracy: 0.10751916
Epoch 50: Loss: 6.0366735, Accuracy: 0.11123683
...
Epoch 490: Loss: 1.1177399, Accuracy: 0.73812264
Epoch 500: Loss: 0.5172857, Accuracy: 0.86724746

We like to keep these values in an array to graph them. What does it look like?

We can see that despite the dips and spikes, because we change the samples often and don't try any radical movement, we tend to better and better results. We don't get stuck in a ditch.

Next part, we'll see how to use the model. Here's a little spoiler: I asked it to generate some random text:

on flocks settlement or their enter the earth; their only hope in their arrows, which for want of it, with a thorn. and distinction of their nature, that in the same yoke are also chosen their chiefs or rulers, such as administer justice in their villages and by superstitious awe in times of old.

It's definitely gibberish when you look closely, but from a distance it looks kind of okayish for a program that learned to speak entirely from scratch, based on a 10k words essay written by fricking Tacitus.