[ML] Swift TensorFlow (Part 4)

With 0.8 dropping, a few things from my previous posts changed, thankfully not much. I also ran into a huge RAM issue while trying to train bigger models, so I'll share how I dealt with it in a few paragraphs.

Changes for 0.8

valueWithGradient is now a global module function, and you have to call it through TensorFlow like this:

let (loss, grad) = TensorFlow.valueWithGradient(at: model) { (model: TextModel) -> Tensor<Float> in
                    let logits = model(sampleFeatures)
                    return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
}

They also revamped the serialization mechanics: you can now get serializable data through

try model.serializedParameters()
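Assuming serializedParameters() hands back a Foundation Data blob (I haven't dug into the new checkpoint format in depth), persisting a trained model becomes a one-liner. A sketch, with model being the TextModel from the earlier posts:

```swift
import Foundation

// Sketch: write the serialized parameters to disk.
// `model` is assumed to be the trained TextModel from the previous posts,
// and serializedParameters() is assumed to return Data.
let url = URL(fileURLWithPath: "textmodel.params")
try model.serializedParameters().write(to: url)
```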

RAM issues

It so happens that someone told me to try character trigrams instead of word trigrams. I have no idea yet whether the results are better or worse, because the generated dataset is huge: 4*<number of chars>, and a fairly simple text file gave way to a magnificent 96GB of RAM usage.

Of course, this means the program can't really run. It also meant I had to find an alternative, and the simplest one I could implement quickly was storing all the trigrams in a big database and extracting random samples from it, rather than doing everything in memory. That took me from 96GB of RAM usage down to 4GB.

The setup

I do Kitura stuff, and I ♥️ PostgreSQL, so I went for a simple ORM+Kuery setup.

The table stores trigrams, and I went for generics for the stored structure:

struct StorableTrigram<FScalar, TScalar> : Codable where FScalar : TensorFlowScalar, FScalar : Codable, TScalar : TensorFlowScalar, TScalar : Codable {
    var random_id : Int64
    var t1 : FScalar
    var t2 : FScalar
    var t3 : FScalar
    var r : TScalar
}

extension StorableTrigram : Model {
    static var tableName: String {
        get {
            return ("StorableTrigram"+String(describing: FScalar.self)+String(describing: TScalar.self)).replacingOccurrences(of: " ", with: "_")
        }
    }
}

The random_id will be used to shuffle the rows into multiple partitions later, and the tableName override avoids having < and > in the table name.
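For reference, the override flattens the generic parameters into the name, so PostgreSQL never sees < or >. A pure-Swift check of the same expression (outside the generic context, with the scalar types spelled out):

```swift
import Foundation

// Same expression as the tableName override, with Float/Int32 spelled out
let name = ("StorableTrigram"
    + String(describing: Float.self)
    + String(describing: Int32.self))
    .replacingOccurrences(of: " ", with: "_")
print(name) // StorableTrigramFloatInt32
```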

The partitioning

One of the key things needed to avoid saturating the RAM is to partition the data. As the rest of the training loop expects an array, I decided to go with a custom Collection that can fit in a for loop and load only the current partition:

struct RandomAccessPartition : Collection {
    let numberOfPartitions: Int
    let db : ConnectionPool
    
    typealias Index = Int
    var startIndex: Int { return 0 }
    var endIndex: Int { return numberOfPartitions } // endIndex is one past the last valid position
    
    func index(after i: Int) -> Int {
        return i+1
    }

    subscript(position: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
        let partitionSize = Int64.max / Int64(numberOfPartitions)
        let start_rid = partitionSize * Int64(position)
        let end_rid = partitionSize * Int64(position + 1)
        var rf : [[Float]] = []
        var rl : [Int32] = []

        let lsem = DispatchSemaphore(value: 0)
        db.getConnection() { conn, err in
             if conn == nil {
                 lsem.signal()
                 return
             }
             
             conn!.execute("SELECT * FROM \"\(StorableTrigram<Float,Int32>.tableName)\" WHERE random_id >= \(start_rid) AND random_id < \(end_rid)") { resultSet in
                 resultSet.asRows { rows,error in
                     guard let rows = rows else {
                         lsem.signal()
                         return
                     }
                     for row in rows {
                         if let t1 = row["t1"] as? Float,
                            let t2 = row["t2"] as? Float,
                            let t3 = row["t3"] as? Float,
                            let r = row["r"] as? Int32 {
                             rf.append([t1,t2,t3])
                             rl.append(r)
                         }
                     }
                     lsem.signal()
                 }
             }
         }

        
        lsem.wait()
        let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
        let labelsT = Tensor<Int32>(rl)
        return (featuresT, labelsT)
    }
}

Relying on random_id for the partitions is a bit iffy, but thankfully PostgreSQL can re-randomize those ids reasonably fast, which works well enough for my use case.

The TextBatch replacement

The three key features of that batch-holding struct were:

  • initialization
  • random sample (once)
  • random partitions (once every epoch)

So here's the relevant code, with breaks for explanations:

struct RandomAccessStringStorage {
    var db : ConnectionPool
    var tableCreated : Bool = false
    
    let original: [String]
    let vocabulary: [String]
    let indexHelper: [Int:Int]
    
    init(db database: ConnectionPool, original o: [String], terminator: String? = nil, fromScratch: Bool) {
        db = database
        Database.default = Database(database) // shady, but hey
        
        original = o
        let f : [[Float]]
        let l : [Int32]
        let v : [String]
        let h : [Int:Int]
        if let term = terminator {
            (f,l,v,h) = RandomAccessStringStorage.makeArrays(original, terminator: term)
        } else {
            (f,l,v,h) = RandomAccessStringStorage.makeArrays(original)
        }
        
        vocabulary = v
        indexHelper = h
        if fromScratch {
            deleteAll()
            for i in 0..<f.count {
                insertTrigram(t1: f[i][0], t2: f[i][1], t3: f[i][2], r: l[i])
            }
        } 
    }
    
    mutating func deleteAll() {
        let _ = try? StorableTrigram<Float,Int32>.dropTableSync()
        tableCreated = false
    }
    
    mutating func insertTrigram(t1: Float, t2: Float, t3: Float, r: Int32) {
        if !tableCreated {
            let _ = try? StorableTrigram<Float,Int32>.createTableSync()
            tableCreated = true
        }
        let trig = StorableTrigram(random_id: Int64.random(in: Int64(0)...Int64.max), t1: t1, t2: t2, t3: t3, r: r)
        let lsem = DispatchSemaphore(value: 0)
        trig.save { st, error in
            lsem.signal()
        }
        lsem.wait()
    }
// ...
}

The two makeArrays are copied and pasted from the in-memory TextBatch; the only other thing the initialization relies on is the insertion into the DB.
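For completeness, here's roughly what a character-trigram makeArrays can look like. This is a hypothetical reconstruction (the name makeCharacterTrigramArrays and the exact return shape are mine, not the TextBatch original): three character indices as features, the fourth as the label.

```swift
import Foundation

// Hypothetical sketch of a character-trigram makeArrays.
// Three consecutive character indices form the features,
// the fourth character's index is the label.
func makeCharacterTrigramArrays(_ texts: [String])
    -> (features: [[Float]], labels: [Int32], vocabulary: [String]) {
    // Tokenize into single characters and build a sorted vocabulary
    let tokens = texts.flatMap { $0.map(String.init) }
    let vocabulary = Array(Set(tokens)).sorted()
    var index: [String: Int] = [:]
    for (i, t) in vocabulary.enumerated() { index[t] = i }

    var features: [[Float]] = []
    var labels: [Int32] = []
    if tokens.count >= 4 {
        // Slide a window of 4: three feature characters predicting the fourth
        for i in 0...(tokens.count - 4) {
            features.append([Float(index[tokens[i]]!),
                             Float(index[tokens[i + 1]]!),
                             Float(index[tokens[i + 2]]!)])
            labels.append(Int32(index[tokens[i + 3]]!))
        }
    }
    return (features, labels, vocabulary)
}
```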

There are two ways of drawing random items: a one-off sample, and partitioning the data into random chunks:

func randomSample(of size: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
    var rf : [[Float]] = []
    var rl : [Int32] = []

    let lsem = DispatchSemaphore(value: 0)
    db.getConnection() { conn, err in
        if conn == nil {
            lsem.signal()
            return
        }
        
        conn!.execute("SELECT * FROM \"\(StorableTrigram<Float,Int32>.tableName)\" ORDER BY random() LIMIT \(size)") { resultSet in
            resultSet.asRows { rows,error in
                guard let rows = rows else {
                    lsem.signal()
                    return
                }
                for row in rows {
                    if let t1 = row["t1"] as? Float,
                       let t2 = row["t2"] as? Float,
                       let t3 = row["t3"] as? Float,
                       let r = row["r"] as? Int32 {
                        rf.append([t1,t2,t3])
                        rl.append(r)
                    }
                }
                lsem.signal()
            }
        }
    }
    
    lsem.wait()
    let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
    let labelsT = Tensor<Int32>(rl)
    return (featuresT, labelsT)
}

Random selection in Pg actually works pretty well, but it can't be repeated, which is why we have to rely on random_id for the partitions:

func randomSample(splits: Int) -> RandomAccessPartition<Float,Int32> {
    // reshuffle (will take a while)
    // update "StorableTrigramFloatInt32" SET random_id = cast(9223372036854775807 * random() as bigint);
    let lsem = DispatchSemaphore(value: 0)
    db.getConnection() { conn, err in
        if conn == nil {
            lsem.signal()
            return
        }
        
        conn!.execute("UPDATE \"\(StorableTrigram<Float,Int32>.tableName)\" SET random_id = cast(9223372036854775807 * random() as bigint)") { resultSet in
            lsem.signal()
        }
    }
    lsem.wait()
    return RandomAccessPartition<Float,Int32>(numberOfPartitions: splits, db: self.db)
}

The update will re-randomize the ids, paving the way for the RandomAccessPartition.
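Putting it together, an epoch can then iterate over DB-backed partitions instead of an in-memory array. A sketch, assuming the TextModel, optimizer, and storage names from earlier in this series (this is not my actual training loop verbatim):

```swift
// Sketch of one epoch over DB-backed partitions.
// `storage`, `model` (TextModel) and `optimizer` are assumed from earlier posts.
let partitions = storage.randomSample(splits: 100) // re-randomizes the ids, then partitions
for (features, labels) in partitions {
    // each subscript access loads only one partition into RAM
    let (loss, grad) = TensorFlow.valueWithGradient(at: model) { (model: TextModel) -> Tensor<Float> in
        softmaxCrossEntropy(logits: model(features), labels: labels)
    }
    optimizer.update(&model, along: grad)
    print("partition loss: \(loss)")
}
```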

Of course the performance tradeoff is rather big, especially in the initialization phase, but hey, more RAM to do other things while the model is training!

[Utilities] Time Tracking Structure

Every now and again (especially when training a model), I need a guesstimate of how long a "step" takes, and how long the whole process will take, so I wrote myself a little piece of code that does that. Because I've been asked the question multiple times (and because I think everyone ends up coding their own after a while), here's mine. Feel free to use it.

/// Structure that keeps track of the time it takes to complete steps, to average or estimate the remaining time
public struct TimeRecord {
    /// The number of steps to keep for averaging. 5 is a decent default, increase or decrease as needed
    /// Minimum for average is 2, obviously
    public var smoothing: Int = 5 {
        didSet {
            smoothing = max(smoothing, 2) // minimum 2 values
        }
    }
    /// dates for the steps
    private var dates : [Date] = []
    /// formatter for debug print and/or display
    private var formatter = DateComponentsFormatter()
    public var formatterStyle : DateComponentsFormatter.UnitsStyle {
        didSet {
            formatter.allowedUnits = [.hour, .minute, .second, .nanosecond]
            formatter.unitsStyle = formatterStyle
        }
    }
    
    public init(smoothing s: Int = 5, style fs: DateComponentsFormatter.UnitsStyle = .positional) {
        smoothing = max(s, 2)
        formatterStyle = fs
        formatter = DateComponentsFormatter()
        // not available everywhere
        // formatter.allowedUnits = [.hour, .minute, .second, .nanosecond]
        formatter.allowedUnits = [.hour, .minute, .second]
        formatter.zeroFormattingBehavior = .pad
        formatter.unitsStyle = fs
    }
    
    /// adds the record for a step
    /// - param d: the date of the step. If unspecified, current date is taken
    mutating func addRecord(_ d: Date? = nil) {
        if let d = d { dates.append(d) }
        else { dates.append(Date()) }
        while(dates.count > smoothing) { dates.remove(at: 0) }
    }
    
    /// gives the average delta between two steps (in seconds)
    var averageDelta : Double {
        if dates.count <= 1 { return 0.0 }
        var totalTime = 0.0
        for i in 1..<dates.count {
            totalTime += dates[i].timeIntervalSince(dates[i-1])
        }
        
        return totalTime/Double(dates.count - 1) // N dates yield N-1 intervals
    }
    
    /// gives the average delta between two steps in human readable form
    /// - see formatterStyle for options, default is "02:46:40"
    var averageDeltaHumanReadable : String {
        let delta = averageDelta
        return formatter.string(from: delta) ?? ""
    }
    
    /// given a number of remaining steps, gives an estimate of the time left on the process (in s)
    func estimatedTimeRemaining(_ steps: Int) -> Double {
        return Double(steps) * averageDelta
    }
    
    /// given a number of remaining steps, gives an estimate of the time left on the process in human readable form
    /// - see formatterStyle for options, default is "02:46:40"
    func estimatedTimeRemainingHumanReadable(_ steps: Int) -> String {
        let delta = estimatedTimeRemaining(steps)
        return formatter.string(from: delta) ?? ""
    }
}

When I train a model, I tend to use it that way:

// prepare model
var tt = TimeRecord()
tt.addRecord()

while currentEpoch < maxEpochs {
  // train the model
  tt.addRecord()
  if currentEpoch > 0 && currentEpoch % 5 == 0 {
    print(tt.averageDeltaHumanReadable + " per epoch, "
      + tt.estimatedTimeRemainingHumanReadable(maxEpochs - currentEpoch) + " remaining")
  }
}

[Confinement] Week 2

As everyone settles down in the new mode of operations, the number of small tasks has increased and the number of big projects has decreased.

The plagiarism tool is in testing among some of the teachers at school, and the funny reaction of my team of developers asking for an API (to avoid going through the web front end that I crafted - probably badly - in React) made me smile.

What fascinates me overall is the inability of "the web" to cope with the sudden influx of having a ton more people working from home. "They" said the web would replace everything, that it was just a matter of scaling up.

Azure seems to be full, GCloud has some issues with data traffic, and AWS is holding, but the status page keeps showing outages...

Don't get me wrong, I've been working remotely for close to 20 years, so I'm not saying office work is better. But I have worked on projects with people who said poor performance didn't matter, because they'd just order a bigger server or two.

That inability to take into account the physical constraints of our world is one of the things that grinds my gears the most: I've been working on embedded software and high-performance backend stuff for a long time, and betting on someone else compensating for poor code hygiene is never a good bet. It ends up with rewriting the code again, and again, and again.

When it's not RAM issues (looking at you, Electron), it's server constraints (oh, the surprise when your instance autoscales up), or bandwidth issues (our government is thinking of restricting the use of Netflix and the like 🙄).

This situation will hopefully shift attention away from the people who can talk and present the best, and back onto more objective metrics (aka "does it work under load?").

I don't rent a small server because I'm cheap. I do it because I can't release any software that doesn't work "correctly" on the bare minimum of specs I have decided the users will have. Then again, when it explodes, it's a great opportunity to learn new things about optimization and constraints 🧐

Since the situation will last a while longer, I hope it reminds everyone that what we do isn't magic. It's science, and we can't wave the problems and constraints away.

Random Wednesday

I had absolutely no idea that /dev/random was so controversial

> That's all good and nice, but even the man page for /dev/(u)random contradicts you! Does anyone who knows about this stuff actually agree with you?

No, it really doesn't. It seems to imply that /dev/urandom is insecure for cryptographic use, unless you really understand all that cryptographic jargon.

Sick burn

From Myths about /dev/urandom

[Confinement] Week 1

Busy busy week, although not really that hard for people who are used to freelance - and working from home.

Confinement is currently imposed by our government, and even though you hear here and there that it's an affront and whatever, make no mistake: we are testing our social bond by doing this. I am not at risk. Most of my friends are not at risk. We catch COVID-19, we run a fever, and we heal.

Near me, people at risk include my grandfather, whose health is fluctuating, and my niece, who will soon turn 3. We do it for them, not for ourselves. Now, if they caught it, a decently stocked hospital could save them. But here's the kicker: there are only so many beds in hospitals, and I don't want to put the people working there in the position of deciding whether my grandpa or my niece deserves to live, compared to someone else's grandpa or niece.

So we confine ourselves to slow the infection rate, so that hospitals can cope if my loved ones catch the virus, and so that the people there don't have to choose who lives or dies. It's that simple, really.

Now, it does present some challenges: all the classes at school must be remote-taught and you know how well that goes. Despite the fact that they save a ton of transit time, students don't work more. I'm not going to go on a rant about who students should be working for (hint: themselves), but I will let you imagine how much more energy is needed to keep them focused on learning rather than, say, watching videos online or playing games.

Of course, that means working on new means of coercion 😇

And so, I decided to put my money where my mouth is: since I teach some of them how to develop a backend system with Kitura, I built a plagiarism detector for the work they hand in. Everyone who knows me will tell you how much I hate web front-end development, but hey, confined means more time to learn new tricks.

Technologies used:

  • REDACTED: for plagiarism detection, CLI tool
  • Kitura: for backend development, resource management and rendering
  • PostgreSQL: for session management, data storage and retrieval, in conjunction with Kuery
  • React: for front end development

Things learned from REDACTED

Plagiarism detection is hard. It is absolutely not about diffing files, but is more about turning the documents into a tree-like structure and comparing the branches and the leaves.

It's slow and inefficient, and it's probably not going to work against smart plagiarists. But hey, we'll see how many of my students read my blog 😜

Things learned from Kitura

I actually know Kitura quite well, having worked with it (and taught it) pretty much since its inception. I also know some of its "competitors" (if such a term can be applied to open-source frameworks) quite well, and this one is the cleanest I know. It has very few dependencies, idiosyncrasies and pitfalls.

Even though IBM has decided to stop contributing in an official capacity, I still think it's the best fit for me, and this somewhat large project has comforted me in that belief.

Things learned from PostgreSQL

It's the best open-source database software, bar none. Don't change postgres, I've been with you for 20 years, and I still love you very much.

Things learned from React

Where to start? I don't like the way modern frontend dev works. To my old eyes, they seem to be reinventing stuff we always had, in a worse way.

The component-based approach is definitely better than doing plain HTML/CSS/JS, but it's so... heavy. There are hundreds if not thousands of dependencies, none of which I can audit for performance or security in a finite amount of time, and it all runs on an engine that's millions if not billions of lines of code. It's just too big to grasp, and too hard to debug.

It works, don't get me wrong, and when you do manage to get it to display things the way you want, it can do marvelous things... I am not dissing the web, per se, just lamenting that it emphasized the two things that I ask my students to avoid doing at all costs:

  • copying stuff you don't understand
  • relying too much on third party dependencies

But hey, I got drag-and-drop upload, session history, and visual code comparison working, so I guess there's that. It was also the #1 time sink of this project.

Infrastructure

I decided to host and deploy that thing on my own server to test my chops as a sysop. Way back when, I contributed to open-source OSes, and I've always fancied myself above average when it comes to server management. That being said, I've had no formal training in the matter and I know I do some... ad-hoc stuff.

Managing a multi-site server has come a long way with Docker and docker-compose, but it's still not 100% easy. Especially if you need to add SSL certificates, which I do, because I'd like the contents of the files to have a modicum of security in transit.

If you need to have a docker + nginx proxy + let's encrypt certificates, I strongly suggest reading this documentation which will help you tons.

Can I Haz Ze Softwarez?

Only if you ask nicely. This is not going open-source for now, but I can offer it to other teachers/schools who face the same situation as me. Reach out and let's talk.