[Dev Diary] Vanilla Is The Best Flavor

I have a weird relationship with the proliferation of command-line tools and gizmos: I forget them.

Do I want to run supercool GitLab commands? Hell yea! Do I need to install 12 utilities (or code a new one) to archive every project older than a year? I hope not...

The setup

I am a sucker for well-documented, fully linted code. But the thing is, all the gizmos that help me do that have to be installed in the system or in my ~/bin, I have to remember to update them, I have to install them on my CD machine and on every new environment I set up, and I have to make sure they are still compatible with the toolchain, and it freaks me out, ok?

Plus, watching the students try to do it is painful.

So, given a 100% vanilla Swift-capable environment, can I manage to run documentation and linting?

The idea

We have Swift Package Manager, which is now a first-class citizen in Xcode, but it can't run shell-script phases without some nasty hacks.

What if some targets were (wait for it) built to do the documentation and the linting?

Linting

One of the most popular linters out there is SwiftLint, and it supports SPM. It also ships as a library (SwiftLintFramework) in addition to the executable, which means one of my targets could just run the linting and print the results in the terminal.

In the Package.swift file, all I needed to do was add the right dependency and the right product, et voilà!

let package = Package(
    name: "WonderfulPackage",
    products: [
        // ...
        .executable(name: "Lint", targets: ["Lint"])
    ],
    dependencies: [
        // Dependencies declare other packages that this package depends on.
        // .package(url: /* package url */, from: "1.0.0"),
        // ... normal dependencies
        .package(url: "https://github.com/realm/SwiftLint", from: "0.39.0")
    ],
    targets: [
        // ... normal targets
        .target(
            name: "Lint",
            dependencies: ["SwiftLintFramework"]),
    ]
)
Package.swift

Now, SPM is very strict with paths, so I had to put a file named main.swift in the Sources/<target>/ directory, in this case Sources/Lint.
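
For reference, the layout ends up looking something like this (the package's own sources elided):

WonderfulPackage
├── Package.swift
├── .swiftlint.yml
└── Sources
    ├── WonderfulPackage
    │   └── main.swift
    └── Lint
        └── main.swift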

Running the linter is fairly straightforward, and goes in the main.swift file:

// Lint command main
// runs SwiftLint
import Foundation
import SwiftLintFramework

let config = Configuration(path: FileManager.default.currentDirectoryPath+"/.swiftlint.yml",
                           rootPath: FileManager.default.currentDirectoryPath,
                           optional: true,
                           quiet: true,
                           enableAllRules: false,
                           cachePath: nil,
                           customRulesIdentifiers: [])

for lintable in config.lintableFiles(inPath: FileManager.default.currentDirectoryPath, forceExclude: false) {
    let linter = Linter(file: lintable, configuration: config)
    let storage = RuleStorage()
    let collected = linter.collect(into: storage)
    let violations = collected.styleViolations(using: storage)
    if !violations.isEmpty {
        print(EmojiReporter.generateReport(violations))
    }
}

print("🎉 All done!")
Sources/Lint/main.swift

Set up the .swiftlint.yml file as usual, and run the command via swift run Lint

Sources/WonderfulPackage/main.swift
⛔️ Line 15: Variable name should be between 3 and 40 characters long: 'f'
⚠️ Line 13: Arguments can be omitted when matching enums with associated types if they are not used.
⚠️ Line 12: Line should be 120 characters or less: currently 143 characters
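
One caveat: as written, the target always exits with 0, so a CI job will consider the lint phase successful no matter what. Here is a minimal sketch of a fix, assuming you want the build to fail on error-level violations (the severity filter is my assumption, double-check it against SwiftLintFramework's StyleViolation API):

// Hypothetical addition to Sources/Lint/main.swift
var hasErrors = false
// inside the loop, after computing `violations`:
//     hasErrors = hasErrors || violations.contains { $0.severity == .error }
// and after the loop, instead of just celebrating:
if hasErrors {
    print("⛔️ Lint failed")
    exit(1) // makes `swift run Lint` fail the CI build
}
print("🎉 All done!")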

Documentation

Documentation is actually trickier, because most documentation tools out there aren't built in Swift, or aren't compatible with SPM. Doxygen and jazzy are great, but they don't fit my needs.

I found an extremely promising project called SourceDocs by Eneko Alonso, but it isn't a library, so I had to fork it and turn it into one (while providing a second target to generate the executable if needed). One weird issue: SPM doesn't like subtargets bearing the same name, so I had to rename a couple of them to avoid a conflict with Swift Argument Parser (long story).

I finally found myself in the same spot as with the linter. All I needed to do was create another target, and Bob's your uncle. Well, actually, he was mine. I digress.

let package = Package(
    name: "WonderfulPackage",
    products: [
        // ...
        .executable(name: "Docs", targets: ["Docs"])
    ],
    dependencies: [
        // Dependencies declare other packages that this package depends on.
        // .package(url: /* package url */, from: "1.0.0"),
        // ... normal dependencies
        .package(url: "https://github.com/krugazor/SourceDocs", from: "0.7.0")
    ],
    targets: [
        // ... normal targets
        .target(
            name: "Docs",
            dependencies: ["sourcedocslib"])
    ]
)
Package.swift

Another well-placed main file:

// Docs command main
// runs SourceDocs
import Foundation
import SourceDocs

do {
    switch try SourceDocs().runOnSPM(moduleName: "WonderfulPackage",
                                     outputDirectory: FileManager.default.currentDirectoryPath+"/Documentation") {
    case .success:
        print("Successful run of the documentation phase")
    case .failure(let failure):
        print(failure.localizedDescription)
    }
} catch {
    print(error.localizedDescription)
}
Sources/Docs/main.swift

Now, the command swift run Docs generates the markdown documentation in the Documentation directory.

Parsing main.swift (1/1)
Removing reference documentation at 'WonderfulPackage/Documentation/KituraStarter'... ✔
Generating Markdown documentation...
  Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/structs/WonderfulPackage.md ✔
  Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/README.md ✔
Done 🎉
Successful run of the documentation phase

Conclusion

✅ Vanilla Swift environment
✅ No install needed
✅ Works on Linux and macOS
✅ Integrated into SPM
⚠️ When running in Xcode, the current directory is always wonky for packages


[ML] Swift TensorFlow (Part 4)

With 0.8 dropping, a few things in my previous posts changed, thankfully not much. And by trying to train bigger models, I ran into a huge RAM issue, so I'll share what I did in the next few paragraphs.

Changes for 0.8

valueWithGradient is now a global module function, and you have to call it through the TensorFlow module, like this:

let (loss, grad) = TensorFlow.valueWithGradient(at: model) { (model: TextModel) -> Tensor<Float> in
    let logits = model(sampleFeatures)
    return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
}

Also, they revamped the serialization mechanics; you can now get serializable data through:

try model.serializedParameters()

RAM issues

It so happens that someone told me to try character trigrams instead of word trigrams. I have no idea whether the results are better or worse yet, because the generated dataset is huge (4 × the number of characters in the text), and a somewhat simple text file gave way to a magnificent 96GB of RAM usage.

Of course, this means that the program can't really run. It also meant I had to find an alternative, and the simplest one I could implement quickly was storing all the trigrams in a big database and extracting random samples from it, rather than doing everything in memory. This meant going from 96GB of RAM usage down to 4GB.

The setup

I do Kitura stuff, and I ♥️ PostgreSQL, so I went for a simple ORM+Kuery setup.
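
For the curious, the connection pool setup is the standard Swift-Kuery-PostgreSQL dance. A minimal sketch, where the host, credentials, and capacities are placeholders rather than my real setup:

import SwiftKuery
import SwiftKueryPostgreSQL

// hypothetical values: point this at your own PostgreSQL instance
let pool = PostgreSQLConnection.createPool(
    host: "localhost", port: 5432,
    options: [.databaseName("trigrams"), .userName("swift"), .password("swift")],
    poolOptions: ConnectionPoolOptions(initialCapacity: 2, maxCapacity: 5))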

The table stores trigrams, and I went for generics for the stored structure:

struct StorableTrigram<FScalar, TScalar> : Codable where FScalar : TensorFlowScalar, FScalar : Codable, TScalar : TensorFlowScalar, TScalar : Codable {
    var random_id : Int64
    var t1 : FScalar
    var t2 : FScalar
    var t3 : FScalar
    var r : TScalar
}

extension StorableTrigram : Model {
    static var tableName: String {
        get {
            return ("StorableTrigram"+String(describing: FScalar.self)+String(describing: TScalar.self)).replacingOccurrences(of: " ", with: "_")
        }
    }
}

The random_id will be used to shuffle the rows into multiple partitions later, and the tableName override keeps < and > out of the table name: the default name for StorableTrigram<Float, Int32> would contain angle brackets and a space, while the override yields StorableTrigramFloatInt32.

The partitioning

One of the key things needed to avoid saturating the RAM is to partition the data. As the rest of the training loop expects an array, I decided to go with a custom Collection that can fit in a for loop and load only the current partition:

struct RandomAccessPartition<FScalar, TScalar> : Collection
    where FScalar : TensorFlowScalar & Codable, TScalar : TensorFlowScalar & Codable {
    let numberOfPartitions: Int
    let db : ConnectionPool

    typealias Index = Int
    var startIndex: Int { return 0 }
    // Collection's endIndex is "one past the last index";
    // returning numberOfPartitions-1 here would skip the last partition
    var endIndex: Int { return numberOfPartitions }

    func index(after i: Int) -> Int {
        return i+1
    }

    subscript(position: Int) -> (features: Tensor<FScalar>, labels: Tensor<TScalar>) {
        // each partition covers a contiguous slice of the random_id space
        let partitionSize = Int64.max / Int64(numberOfPartitions)
        let start_rid = partitionSize * Int64(position)
        let end_rid = partitionSize * Int64(position + 1)
        var rf : [[FScalar]] = []
        var rl : [TScalar] = []

        let lsem = DispatchSemaphore(value: 0)
        db.getConnection() { conn, err in
            if conn == nil {
                lsem.signal()
                return
            }

            conn!.execute("SELECT * FROM \"\(StorableTrigram<FScalar,TScalar>.tableName)\" WHERE random_id >= \(start_rid) AND random_id < \(end_rid)") { resultSet in
                resultSet.asRows { rows,error in
                    guard let rows = rows else {
                        lsem.signal()
                        return
                    }
                    for row in rows {
                        if let t1 = row["t1"] as? FScalar,
                           let t2 = row["t2"] as? FScalar,
                           let t3 = row["t3"] as? FScalar,
                           let r = row["r"] as? TScalar {
                            rf.append([t1,t2,t3])
                            rl.append(r)
                        }
                    }
                    lsem.signal()
                }
            }
        }

        lsem.wait()
        let featuresT = Tensor<FScalar>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
        let labelsT = Tensor<TScalar>(rl)
        return (featuresT, labelsT)
    }
}

Relying on random_id for the partitions is a bit iffy, but thankfully PostgreSQL can re-randomize those ids reasonably fast, and it works well enough for my use.
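
Since it conforms to Collection, iterating is just a for loop, and each pass only ever holds one partition's tensors in memory. A rough sketch (pool being whatever ConnectionPool you configured):

// Sketch: walk the dataset in 100 chunks, one in RAM at a time
let partitions = RandomAccessPartition<Float, Int32>(numberOfPartitions: 100, db: pool)
for (features, labels) in partitions {
    print("loaded a partition of \(labels.shape[0]) trigrams")
}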

The TextBatch replacement

The three key features of that batch-holding struct were:

  • initialization
  • random sample (once)
  • random partitions (once every epoch)

So here's the relevant code, with breaks for explanations:

struct RandomAccessStringStorage {
    var db : ConnectionPool
    var tableCreated : Bool = false
    
    let original: [String]
    let vocabulary: [String]
    let indexHelper: [Int:Int]
    
    init(db database: ConnectionPool, original o: [String], terminator: String? = nil, fromScratch: Bool) {
        db = database
        Database.default = Database(database) // shady, but hey
        
        original = o
        let f : [[Float]]
        let l : [Int32]
        let v : [String]
        let h : [Int:Int]
        if let term = terminator {
            (f,l,v,h) = RandomAccessStringStorage.makeArrays(original, terminator: term)
        } else {
            (f,l,v,h) = RandomAccessStringStorage.makeArrays(original)
        }
        
        vocabulary = v
        indexHelper = h
        if fromScratch {
            deleteAll()
            for i in 0..<f.count {
                insertTrigram(t1: f[i][0], t2: f[i][1], t3: f[i][2], r: l[i])
            }
        } 
    }
    
    mutating func deleteAll() {
        let _ = try? StorableTrigram<Float,Int32>.dropTableSync()
        tableCreated = false
    }
    
    mutating func insertTrigram(t1: Float, t2: Float, t3: Float, r: Int32) {
        if !tableCreated {
            let _ = try? StorableTrigram<Float,Int32>.createTableSync()
            tableCreated = true
        }
        let trig = StorableTrigram(random_id: Int64.random(in: Int64(0)...Int64.max), t1: t1, t2: t2, t3: t3, r: r)
        let lsem = DispatchSemaphore(value: 0)
        trig.save { st, error in
            lsem.signal()
        }
        lsem.wait()
    }
// ...
}

The two makeArrays functions are copied and pasted from the in-memory TextBatch, and the only other thing the initialization relies on is the insertion into the DB system.

There are two ways of drawing random items: a one-off sample, and partitioning the data into random chunks:

func randomSample(of size: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
    var rf : [[Float]] = []
    var rl : [Int32] = []

    let lsem = DispatchSemaphore(value: 0)
    db.getConnection() { conn, err in
        if conn == nil {
            lsem.signal()
            return
        }
        
        conn!.execute("SELECT * FROM \"\(StorableTrigram<Float,Int32>.tableName)\" ORDER BY random() LIMIT \(size)") { resultSet in
            resultSet.asRows { rows,error in
                guard let rows = rows else {
                    lsem.signal()
                    return
                }
                for row in rows {
                    if let t1 = row["t1"] as? Float,
                       let t2 = row["t2"] as? Float,
                       let t3 = row["t3"] as? Float,
                       let r = row["r"] as? Int32 {
                        rf.append([t1,t2,t3])
                        rl.append(r)
                    }
                }
                lsem.signal()
            }
        }
    }
    
    lsem.wait()
    let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
    let labelsT = Tensor<Int32>(rl)
    return (featuresT, labelsT)
}
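
For a sense of how the one-off sample slots into training, here's a minimal sketch reusing the 0.8 snippet from above (model, optimizer, and TextModel are assumed from the previous parts of this series):

let (sampleFeatures, sampleLabels) = storage.randomSample(of: 4096)
let (loss, grad) = TensorFlow.valueWithGradient(at: model) { (model: TextModel) -> Tensor<Float> in
    let logits = model(sampleFeatures)
    return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
}
optimizer.update(&model, along: grad)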

Random selection in PostgreSQL actually works pretty well, but it can't be repeated, which is why we have to rely on random_id to partition:

func randomSample(splits: Int) -> RandomAccessPartition<Float,Int32> {
    // reshuffle (will take a while)
    // update "StorableTrigramFloatInt32" SET random_id = cast(9223372036854775807 * random() as bigint);
    let lsem = DispatchSemaphore(value: 0)
    db.getConnection() { conn, err in
        if conn == nil {
            lsem.signal()
            return
        }
        
        conn!.execute("UPDATE \"\(StorableTrigram<Float,Int32>.tableName)\" SET random_id = cast(9223372036854775807 * random() as bigint)") { resultSet in
            lsem.signal()
        }
    }
    lsem.wait()
    return RandomAccessPartition<Float,Int32>(numberOfPartitions: splits, db: self.db)
}

The update will re-randomize the ids, paving the way for the RandomAccessPartition.
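
Per epoch, the whole thing ends up looking roughly like this (a sketch; the gradient step is the same as in the one-off sample above):

for _ in 0..<maxEpochs {
    // re-randomize the ids, then iterate over the fresh partitions
    let partitions = storage.randomSample(splits: 100)
    for (features, labels) in partitions {
        // one training step per partition
    }
}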

Of course, the performance tradeoff is rather big, especially in the initialization phase, but hey, more RAM to do other things while the model is training!


[Utilities] Time Tracking Structure

Every now and again (especially when training a model), I need a guesstimate of how long a "step" takes, and how long the whole process will take, so I wrote myself a little piece of code that does just that. Because I've had the question multiple times (and because I think everyone codes their own after a while), here's mine. Feel free to use it.

/// Structure that keeps track of the time it takes to complete steps, to average or estimate the remaining time
public struct TimeRecord {
    /// The number of steps to keep for averaging. 5 is a decent default, increase or decrease as needed
    /// Minimum for average is 2, obviously
    public var smoothing: Int = 5 {
        didSet {
            smoothing = max(smoothing, 2) // minimum 2 values
        }
    }
    /// dates for the steps
    private var dates : [Date] = []
    /// formatter for debug print and/or display
    private var formatter = DateComponentsFormatter()
    public var formatterStyle : DateComponentsFormatter.UnitsStyle {
        didSet {
            formatter.allowedUnits = [.hour, .minute, .second] // .nanosecond isn't available everywhere
            formatter.unitsStyle = formatterStyle
        }
    }
    
    public init(smoothing s: Int = 5, style fs: DateComponentsFormatter.UnitsStyle = .positional) {
        smoothing = max(s, 2)
        formatterStyle = fs
        formatter = DateComponentsFormatter()
        // not available everywhere
        // formatter.allowedUnits = [.hour, .minute, .second, .nanosecond]
        formatter.allowedUnits = [.hour, .minute, .second]
        formatter.zeroFormattingBehavior = .pad
        formatter.unitsStyle = fs
    }
    
    /// adds the record for a step
    /// - param d: the date of the step. If unspecified, current date is taken
    mutating func addRecord(_ d: Date? = nil) {
        if let d = d { dates.append(d) }
        else { dates.append(Date()) }
        while(dates.count > smoothing) { dates.remove(at: 0) }
    }
    
    /// gives the average delta between two steps (in seconds)
    var averageDelta : Double {
        if dates.count <= 1 { return 0.0 }
        var totalTime = 0.0
        for i in 1..<dates.count {
            totalTime += dates[i].timeIntervalSince(dates[i-1])
        }
        
        return totalTime/Double(dates.count - 1) // n dates yield n-1 deltas
    }
    
    /// gives the average delta between two steps in human readable form
    /// - see formatterStyle for options, default is "02:46:40"
    var averageDeltaHumanReadable : String {
        let delta = averageDelta
        return formatter.string(from: delta) ?? ""
    }
    
    /// given a number of remaining steps, gives an estimate of the time left on the process (in s)
    func estimatedTimeRemaining(_ steps: Int) -> Double {
        return Double(steps) * averageDelta
    }
    
    /// given a number of remaining steps, gives an estimate of the time left on the process in human readable form
    /// - see formatterStyle for options, default is "02:46:40"
    func estimatedTimeRemainingHumanReadable(_ steps: Int) -> String {
        let delta = estimatedTimeRemaining(steps)
        return formatter.string(from: delta) ?? ""
    }
}

When I train a model, I tend to use it that way:

// prepare model
var tt = TimeRecord()
tt.addRecord()

while currentEpoch < maxEpochs {
    // train the model
    currentEpoch += 1
    tt.addRecord()
    if currentEpoch > 0 && currentEpoch % 5 == 0 {
        print(tt.averageDeltaHumanReadable + " per epoch, "
            + tt.estimatedTimeRemainingHumanReadable(maxEpochs - currentEpoch) + " remaining")
    }
}


[Confinement] Week 2

As everyone settles into the new mode of operation, the number of small tasks has increased and the number of big projects has decreased.

The plagiarism tool is in testing among some of the teachers at school, and the funny reaction of my team of developers asking for an API (to avoid going through the web front end that I crafted - probably badly - in React) made me smile.

What fascinates me overall is the inability of "the web" to cope with the sudden influx of people working from home. "They" said the web would replace everything, that it was just a matter of scaling up.

Azure seems to be full, GCloud has some issues with the data traffic, and AWS is holding, but the status page keeps showing outages...

Don't get me wrong, I've been working remotely for close to 20 years, so I'm not saying office work is better. But I have been working on projects with people who said it didn't matter if the performance was poor, because they'd just order a bigger server or two.

That inability to take into account the physical constraints of our world is one of the things that grind my gears the most: I've been working on embedded software and high-performance backend stuff for a long time, and betting on poor code hygiene to be compensated by someone else is never a good bet. It ends up with re-writing the code again, and again, and again.

When it's not RAM issues (looking at you, Electron), it's server constraints (oh, the surprise when your instance autoscales up), or bandwidth issues (our government is thinking of restricting the use of Netflix and the like 🙄).

This situation will hopefully shift the attention away from the people who can talk and present the best, and back onto more objective metrics (aka "does it work under load?").

I don't rent a small server because I'm cheap. I do it because I can't release any software that doesn't work "correctly" on the bare minimum of specs I have decided the users will have. Then again, when it explodes, it's a great opportunity to learn new things about optimization and constraints 🧐

Since the situation will last a while longer, I hope it reminds everyone that what we do isn't magic. It's science, and we can't wave the problems and constraints away.


Random Wednesday

I had absolutely no idea that /dev/random was so controversial.

> That's all good and nice, but even the man page for /dev/(u)random contradicts you! Does anyone who knows about this stuff actually agree with you?

No, it really doesn't. It seems to imply that /dev/urandom is insecure for cryptographic use, unless you really understand all that cryptographic jargon.

Sick burn

From Myths about /dev/urandom