[Dev Diaries] NSLogger is merged
The changes I made to make NSLogger SPM compatible are now in the master branch of the official repo. Update your dependencies ☺️
I know Florent Pillet and have fun with him as often as I can; he's another member of the tribe of "dinosaurs" still kicking around.
I really like one of the projects that contributed to his notoriety: NSLogger. Logging has always been a pain in the neck, and this tool provided us all with a way to get it done efficiently and properly. The first commit on the GitHub repo is from 2010, and I have a strong suspicion it had been in production in one form or another before that.
Anyhoo, I like Florent, I like NSLogger, but I hate what CocoaPods (and to a lesser extent Carthage) does to my projects. It's too brittle, and I strongly dislike things that mess around with the extremely complicated format that is a pbxproj. They do however serve an admirable purpose: managing dependencies in a way that doesn't require me to use git submodules in every one of my projects.
So, I rarely use NSLogger. SHAME! SHAME! <insert your own meme here>
With the advent of (and subsequent needed updates to) Swift Package Manager, we now have an official way of managing and supporting dependencies, but it has its own quirks that apparently make it hard to "SPM" older projects.
Let's see what we can do about NSLogger.
SPM can't mix Obj-C code and Swift code. It's always been pretty hacky anyways, with the bridging headers and the weird steps hidden by the toolchain, so we need to make it explicit:
We need two targets: an Obj-C one (NSLoggerLibObjC) and a Swift one (NSLogger) that depends on NSLoggerLibObjC. One of the problems is that all that code is mixed together in the folders, because Xcode doesn't care about file placement. SPM, on the other hand, does.
So, let's use and abuse the path and sources parameters of the target. The first one is to provide the root where we look for files to compile, and the second one lists the files to be compiled.
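A sketch of what that can look like in the manifest (the folder and file names here are illustrative assumptions, not necessarily the ones in the final Package.swift):

```swift
// Package.swift (excerpt) — paths and source lists are illustrative
.target(
    name: "NSLoggerLibObjC",
    path: "Client/iOS",            // root where SPM looks for files
    sources: ["LoggerClient.m"]    // only these files get compiled
),
.target(
    name: "NSLogger",
    dependencies: ["NSLoggerLibObjC"],
    path: "Client/iOS",
    sources: ["NSLogger.swift"]
)
```

Everything outside the `sources` lists is simply invisible to the build, which is what lets the two targets share the same messy folder.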
So we have NSLoggerLibObjC and NSLogger. Done. Right?
Not quite.
The Obj-C lib requires ARC to be disabled. Easy to do in Xcode, a bit harder in SPM.
We need to pass the -fno-objc-arc flag to the compiler. SPM doesn't make it easy or obvious to do that, for a variety of reasons, but I guess mostly because you shouldn't pass compiler flags at all in an ideal world.
But (especially in 2020), looking at the world, ideal it ain't.
We have to use the (not so aptly named) cSetting option of the target, and use the very scary CSetting.unsafeFlags parameter for that option. Why is it unsafe, you might ask? Weeeeeeeeell. It's companies' usual way of telling you "you're on your own with this one". I'm fine with that.
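In the manifest, that looks something like this (the path is an assumption for illustration):

```swift
// Package.swift (excerpt) — disabling ARC for the Obj-C target
.target(
    name: "NSLoggerLibObjC",
    path: "Client/iOS",                  // illustrative path
    cSettings: [
        .unsafeFlags(["-fno-objc-arc"])  // "unsafe": you're on your own
    ]
)
```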
Another compilation quirk is that Obj-C code relies (like its ancestor, C) on the use of header files to make your code usable as a dependency.
Again, because Xcode and SPM treat the file structure very differently, just saying that every header should be included in the resulting library is a bad idea: the search is recursive and in this particular case, would result in having specific iOS or MacOS (yes, capitalized, because sod that change) test headers exposed as well.
In the end, I had to make the difficult choice of doing something super ugly:
If anyone has a better option that's not heavily more disruptive to the organization of the project, I'm all ears.
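For reference, the mechanism SPM offers is `publicHeadersPath`: a single directory (searched recursively) whose headers are exposed to dependents. A sketch, assuming a hypothetical layout where the public headers were gathered in a dedicated `include` folder:

```swift
// Package.swift (excerpt) — only headers under "include" become public
.target(
    name: "NSLoggerLibObjC",
    path: "Client/iOS",           // illustrative
    publicHeadersPath: "include"  // relative to `path`
)
```

The recursive search is exactly why a stray test header in that directory ends up exposed too.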
So we have the Swift target that depends on the Obj-C one. Fine. But how do we use that dependency?
"Easy" some will exclaim (a bit too rapidly) "you just import the lib in the swift file!"
Yes, but then it breaks the other projects, which, again, we don't want to do. Minimal impact changes. Legacy. Friend.
So we need a preprocessor macro, like, say, SPMBuild, which would indicate we're building with SPM rather than Xcode. Sadly, this doesn't exist, and given the rate of change of the toolchain, I don't want to rely too heavily on the badly documented Xcode preprocessor macros that would allow me to detect a build through the IDE.
Thankfully, in the same vein as cSettings, we have a swiftSettings parameter on our target, which supports SwiftSetting.define options. Great, so I'll define a macro and test its existence in the Swift file before importing the Obj-C part of the project.
One last thing I stumbled upon and used despite its shady nature: there is an undocumented decorator for import named @_exported which seems extraneous here, but has some interesting properties: it kinda sorta exposes what you import as part of the current module, flattening the dependency graph.
To be honest, I didn't know about it, it amused me, so I included it.
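Put together, with `swiftSettings: [.define("SPMBuild")]` on the Swift target in Package.swift, the source side looks something like this (the macro name SPMBuild is my choice, not a toolchain convention):

```swift
// NSLogger.swift (excerpt)
#if SPMBuild
// Building through SPM: pull in the Obj-C target, and re-export its
// symbols so clients only need `import NSLogger`.
@_exported import NSLoggerLibObjC
#endif
```

Xcode builds never define the macro, so the legacy projects compile exactly as before.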
In order to make it work directly from the repo, rather than locally, I also had to provide a version number. I chose to go with the next patch number instead of aggrandizing myself with a minor or even a major version.
Hopefully, these changes don't impact the current project at all and allow me to use it in a way I like better (and that is officially supported). I hope Florent will not murder me for all of that. He might even decide to accept my pull request. We'll see.
In the meantime, you can find all the changes above and a usable SPM package in my fork.
Writing good tests for your code very often means spending twice as much time coding them as coding the things you test themselves.
It is good practice though to verify as much as possible that the code you write is valid, especially if that code is going to be public or included in someone else's work.
In my workflow, I insist on the notion of ownership:
The bottomline for me is this: if there are several people on a project, I want clearly defined ownership. It's not that I won't fix a bug in someone else's code, just that they own it and therefore have to have a reliable way of testing that my fix works.
Tests solve part of that problem. My code, my tests. If you fix my code, run my tests, I'm fairly confident that you didn't wreck the whole thing. And that I won't have to spend a couple of hours figuring out what it is that you did.
This is a very very very light constraint when you compare it to methodologies like TDD, but it's a required minimum for me.
Plus, it's not that painful, except...
In my personal opinion, the tests that are hardest to do right are the ones that have a very large input range, with a few failure/continuity points.
If, for instance, and completely randomly, of course, you had an application where the tilt of the phone changes the state of the app (locked/unlocked, depending on whether the phone is lying flat-ish on the table or not):
So you have a function that takes the current pitch angle, and returns if we should lock or not:
func pitchLock(_ angle: Double) -> Bool {
// ...
}
Does it work? Does it work modulo 360? What would a unit test for that function even look like? A for loop?
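To make the problem concrete, here is a hypothetical implementation (the 20° thresholds around flat are assumptions for illustration) and the kind of spot checks you end up writing by hand. Note that Swift's Double has no `%` operator, so negative angles need explicit folding:

```swift
// Hypothetical implementation: "flat-ish" means within 20° of 0° or 180°
func pitchLock(_ angle: Double) -> Bool {
    // Double has no `%` in Swift; fold into [0, 360) by hand
    var a = angle.truncatingRemainder(dividingBy: 360.0)
    if a < 0 { a += 360.0 }
    return a >= 340 || a <= 20 || (a >= 160 && a <= 200)
}

// Hand-picked spot checks — exactly the tedium I'd like to avoid
assert(pitchLock(0))     // flat on the table
assert(pitchLock(180))   // flat, face down
assert(pitchLock(-10))   // -10° wraps to 350°, still flat-ish
assert(pitchLock(370))   // wraps to 10°
assert(!pitchLock(90))   // upright
assert(!pitchLock(-90))  // wraps to 270°, upright the other way
print("spot checks passed")
```

Six asserts, and the input space is barely scratched.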
I have been looking for a way to do that kind of test for a while, which is why I published HoledRange (now Domains 😇) a while back, as part of my hacks.
What I wanted is to write my tests kind of like this (invalid code on so many levels):
for x in [-1000.0...1000.0].randomSelection {
let unitCircleAngle = x%360.0
if unitCircleAngle >= 340 || unitCircleAngle <= 20 {
XCTAssert(pitchLock(x))
} else if unitCircleAngle >= 160 && unitCircleAngle <= 200 {
XCTAssert(pitchLock(x))
} else {
XCTAssertFalse(pitchLock(x))
}
}
This way of testing, while vaguely valid, leaves so many things flaky:
I have been fascinated with @_functionBuilder ever since it was announced. While I don't feel enthusiastic about SwiftUI (in French), that way of building elements out of blocks is something I have wanted for years.
Making them is a harrowing experience the first time, but in the end it works!
What I wanted to use as syntax is something like this:
func myPlus(_ a: Int, _ b: Int) -> Int
DomainTests<Int> {
Domain(-10000...10000)
1000000
Test { (a: Int) in
XCTAssert(myPlus(a, 1) == a+1, "Problem with value\(a)")
XCTAssert(myPlus(1, a) == a+1, "Problem with value\(a)")
}
Test { (a: Int) in
let random = Int.random(in: -10000...10000)
XCTAssert(myPlus(a, random) == a+random, "Problem with value\(a)")
XCTAssert(myPlus(random, a) == a+random, "Problem with value\(a)")
}
}.random()
This particular DomainTests runs 1000000 times over $$D=[-10000;10000]$$ in a random fashion.
Note the Test builder that takes a function whose parameter will be in the domain, and the definition that allows defining both the test domain (mandatory) and the number of random iterations (optional).
If you want to test every single value in a domain, the bounds need to be Strideable, i.e. usable in a for loop.
DomainTests<Int> {
Domain(-10000...10000)
Test { (a: Int) in
XCTAssert(myPlus(a, 1) == a+1, "Problem with value\(a)")
XCTAssert(myPlus(1, a) == a+1, "Problem with value\(a)")
}
Test { (a: Int) in
let random = Int.random(in: -10000...10000)
XCTAssert(myPlus(a, random) == a+random, "Problem with value\(a)")
XCTAssert(myPlus(random, a) == a+random, "Problem with value\(a)")
}
}.full()
A couple of days of hard work, plus a healthy dose of using the framework personally, means this should be ready-ish for production.
If you are a maths-oriented dev and shiver at the idea of untested domains, this is for you 😬
I have a weird thing with the multiplication of command-line tools and gizmos: I forget them.
Do I want to run supercool gitlab commands? Hell yea! Do I need to install 12 utilities (or code a new one) to archive every project older than a year? I hope not...
I am a sucker for well-documented, fully-linted code. But the thing is, all the gizmos that help me do that have to be installed in the system or in my ~/bin, I have to remember to update them, I have to install them on my CD machine and on every new environment I set up, I have to make sure they are still compatible with the toolchain, and it freaks me out, ok?
Plus, watching the students try to do it is painful.
So, given a 100% vanilla swift-capable environment, can I manage to run documentation and linting?
We have Swift Package Manager, which is now a first-class citizen in Xcode, but it can't run shell script phases without some nasty hacks.
What if some targets were (wait for it) built to do the documentation and the linting?
One of the most popular linters out there is swiftlint, and it supports SPM. It can also build a library instead of an executable, which means one of my targets could just run the linting and output it in the terminal.
In the Package.swift file, all I needed to do was add the right dependency, and the right product and voila!
let package = Package(
name: "WonderfulPackage",
products: [
// ...
.executable(name: "Lint", targets: ["Lint"])
],
dependencies: [
// Dependencies declare other packages that this package depends on.
// .package(url: /* package url */, from: "1.0.0"),
// ... normal dependencies
.package(url: "https://github.com/realm/SwiftLint", from: "0.39.0")
],
targets: [
// ... normal targets
.target(
name: "Lint",
dependencies: ["SwiftLintFramework"]),
]
)
Now, SPM is very strict with paths, so I had to put a file named main.swift in the Sources/<target>/ directory, in this case Sources/Lint.
Running the linter is fairly straightforward, and goes in the main.swift file:
// Lint command main
// runs SourceDocs
import Foundation
import SwiftLintFramework
let config = Configuration(path: FileManager.default.currentDirectoryPath+"/.swiftlint.yml",
rootPath: FileManager.default.currentDirectoryPath,
optional: true,
quiet: true,
enableAllRules: false,
cachePath: nil,
customRulesIdentifiers: [])
for lintable in config.lintableFiles(inPath: FileManager.default.currentDirectoryPath, forceExclude: false) {
let linter = Linter(file: lintable, configuration: config)
let storage = RuleStorage()
let collected = linter.collect(into: storage)
let violations = collected.styleViolations(using: storage)
if !violations.isEmpty {
print(EmojiReporter.generateReport(violations))
}
}
print("🎉 All done!")
Set up the .swiftlint.yml file as usual, and run the command via swift run Lint
Sources/WonderfulPackage/main.swift
⛔️ Line 15: Variable name should be between 3 and 40 characters long: 'f'
⚠️ Line 13: Arguments can be omitted when matching enums with associated types if they are not used.
⚠️ Line 12: Line should be 120 characters or less: currently 143 characters
Documentation is actually trickier, because most documentation tools out there aren't built in Swift or compatible with SPM. Doxygen and jazzy are great, but they don't fit my needs.
I found a project that was extremely promising called SourceDocs by Eneko Alonso, but it isn't a library, so I had to fork it and make it into one (while providing a second target to generate the executable if needed). One weird issue is that SPM doesn't like subtargets to bear the same name so I had to rename a couple of them to avoid conflict with Swift Argument Parser (long story).
I finally found myself in the same spot as with the linter. All I needed to do was create another target, and Bob's your uncle. Well, actually, he was mine. I digress.
let package = Package(
name: "WonderfulPackage",
products: [
// ...
.executable(name: "Docs", targets: ["Docs"])
],
dependencies: [
// Dependencies declare other packages that this package depends on.
// .package(url: /* package url */, from: "1.0.0"),
// ... normal dependencies
.package(url: "https://github.com/krugazor/SourceDocs", from: "0.7.0")
],
targets: [
// ... normal targets
.target(
name: "Docs",
dependencies: ["sourcedocslib"])
]
)
Another well-placed main file:
// Docs command main
// runs SourceDocs
import Foundation
import SourceDocs
do {
switch try SourceDocs().runOnSPM(moduleName: "WonderfulPackage",
outputDirectory: FileManager.default.currentDirectoryPath+"/Documentation") {
case .success:
print("Successful run of the documentation phase")
case .failure(let failure):
print(failure.localizedDescription)
}
} catch {
print(error.localizedDescription)
}
Now, the command swift run Docs generates the markdown documentation in the Documentation directory.
Parsing main.swift (1/1)
Removing reference documentation at 'WonderfulPackage/Documentation/KituraStarter'... ✔
Generating Markdown documentation...
Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/structs/WonderfulPackage.md ✔
Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/README.md ✔
Done 🎉
Successful run of the documentation phase
✅ Vanilla swift environment
✅ No install needed
✅ Works on Linux and MacOS
✅ Integrated into SPM
⚠️ When running in Xcode, the current directory is always wonky for packages
With 0.8 dropping, a few things from my previous posts changed, thankfully not much. And in trying to train bigger models, I ran into a huge RAM issue, so I'll share what I did in a few paragraphs.
valueWithGradient is now a global module function, and you have to call it through TensorFlow like this:
let (loss, grad) = TensorFlow.valueWithGradient(at: model) { (model: TextModel) -> Tensor<Float> in
let logits = model(sampleFeatures)
return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
}
Also, they revamped the serialization mechanics; you can now get serializable data through:
try model.serializedParameters()
It so happens that someone told me to try character trigrams instead of word trigrams. I have no idea if the results are better or worse yet, because the generated dataset is huge: 4*<number of chars>, and a somewhat simple text file gave way to a magnificent 96GB of RAM usage.
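A minimal sketch (not the actual training code) of character-trigram extraction shows why the dataset balloons: every character position yields a sample of four values, three features plus a label.

```swift
// Each position produces (t1, t2, t3) features plus the following
// character as the label — roughly 4 values per character of input.
let text = "hello swift"
let chars = Array(text)
var samples: [(Character, Character, Character, Character)] = []
for i in 0..<(chars.count - 3) {
    samples.append((chars[i], chars[i + 1], chars[i + 2], chars[i + 3]))
}
print(samples.count)  // one sample per position: chars.count - 3
```

Scale that up to a real corpus and the in-memory tensors get enormous fast.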
Of course, this means that the program can't really run. It also meant I had to find an alternative way, and the simplest one I could implement quickly was storing all the trigrams in a big database and extracting random samples from it, rather than doing it all in memory. This meant going from 96GB of RAM usage down to 4GB.
I do Kitura stuff, and I ♥️ PostgreSQL, so I went for a simple ORM+Kuery setup.
The table stores trigrams, and I went for generics for the stored structure:
struct StorableTrigram<FScalar, TScalar> : Codable where FScalar : TensorFlowScalar, FScalar : Codable, TScalar : TensorFlowScalar, TScalar : Codable {
var random_id : Int64
var t1 : FScalar
var t2 : FScalar
var t3 : FScalar
var r : TScalar
}
extension StorableTrigram : Model {
static var tableName: String {
get {
return ("StorableTrigram"+String(describing: FScalar.self)+String(describing: TScalar.self)).replacingOccurrences(of: " ", with: "_")
}
}
}
The random_id will be used to shuffle the lines into multiple partitions later, and the tableName override keeps < and > out of the table name.
One of the key things needed to avoid saturating the RAM is to partition the data. As the rest of the training loop expects an array, I decided to go with a custom Collection that can fit in a for loop and load only the current partition:
struct RandomAccessPartition<FScalar, TScalar> : Collection {
let numberOfPartitions: Int
let db : ConnectionPool
typealias Index = Int
var startIndex: Int { return 0 }
var endIndex: Int { return numberOfPartitions } // Collection's endIndex is one past the last valid index
func index(after i: Int) -> Int {
return i+1
}
subscript(position: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
let partitionSize = Int64.max / Int64(numberOfPartitions)
let start_rid = partitionSize * Int64(position)
let end_rid = partitionSize * Int64(position + 1)
var rf : [[Float]] = []
var rl : [Int32] = []
let lsem = DispatchSemaphore(value: 0)
db.getConnection() { conn, err in
if conn == nil {
lsem.signal()
return
}
conn!.execute("SELECT * FROM \"\(StorableTrigram<Float,Int32>.tableName)\" WHERE random_id >= \(start_rid) AND random_id < \(end_rid)") { resultSet in
resultSet.asRows { rows,error in
guard let rows = rows else {
lsem.signal()
return
}
for row in rows {
if let t1 = row["t1"] as? Float,
let t2 = row["t2"] as? Float,
let t3 = row["t3"] as? Float,
let r = row["r"] as? Int32 {
rf.append([t1,t2,t3])
rl.append(r)
}
}
lsem.signal()
}
}
}
lsem.wait()
let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
let labelsT = Tensor<Int32>(rl)
return (featuresT, labelsT)
}
}
Relying on random_id for the partitions is a bit iffy, but thankfully PostgreSQL can re-randomize those ids reasonably fast, which works well enough for my use.
The three key features of that batch-holding struct were:
So here's the relevant code, with breaks for explanations:
struct RandomAccessStringStorage {
var db : ConnectionPool
var tableCreated : Bool = false
let original: [String]
let vocabulary: [String]
let indexHelper: [Int:Int]
init(db database: ConnectionPool, original o: [String], terminator: String? = nil, fromScratch: Bool) {
db = database
Database.default = Database(database) // shady, but hey
original = o
let f : [[Float]]
let l : [Int32]
let v : [String]
let h : [Int:Int]
if let term = terminator {
(f,l,v,h) = RandomAccessStringStorage.makeArrays(original, terminator: term)
} else {
(f,l,v,h) = RandomAccessStringStorage.makeArrays(original)
}
vocabulary = v
indexHelper = h
if fromScratch {
deleteAll()
for i in 0..<f.count {
insertTrigram(t1: f[i][0], t2: f[i][1], t3: f[i][2], r: l[i])
}
}
}
mutating func deleteAll() {
let _ = try? StorableTrigram<Float,Int32>.dropTableSync()
tableCreated = false
}
mutating func insertTrigram(t1: Float, t2: Float, t3: Float, r: Int32) {
if !tableCreated {
let _ = try? StorableTrigram<Float,Int32>.createTableSync()
tableCreated = true
}
let trig = StorableTrigram(random_id: Int64.random(in: Int64(0)...Int64.max), t1: t1, t2: t2, t3: t3, r: r)
let lsem = DispatchSemaphore(value: 0)
trig.save { st, error in
lsem.signal()
}
lsem.wait()
}
// ...
}
The two makeArrays functions are copied and pasted from the in-memory TextBatch, and the only other thing the initialization relies on is the insertion into the DB system.
There are two ways of drawing random items: a one-off sample, and partitioning the data into random chunks:
func randomSample(of size: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
var rf : [[Float]] = []
var rl : [Int32] = []
let lsem = DispatchSemaphore(value: 0)
db.getConnection() { conn, err in
if conn == nil {
lsem.signal()
return
}
conn!.execute("SELECT * FROM \"\(StorableTrigram<Float,Int32>.tableName)\" ORDER BY random() LIMIT \(size)") { resultSet in
resultSet.asRows { rows,error in
guard let rows = rows else {
lsem.signal()
return
}
for row in rows {
if let t1 = row["t1"] as? Float,
let t2 = row["t2"] as? Float,
let t3 = row["t3"] as? Float,
let r = row["r"] as? Int32 {
rf.append([t1,t2,t3])
rl.append(r)
}
}
lsem.signal()
}
}
}
lsem.wait()
let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
let labelsT = Tensor<Int32>(rl)
return (featuresT, labelsT)
}
Random selection in Pg actually works pretty well, but can't be repeated, which is why we have to rely on random_id to partition:
func randomSample(splits: Int) -> RandomAccessPartition<Float,Int32> {
// reshuffle (will take a while)
// update "StorableTrigramFloatInt32" SET random_id = cast(9223372036854775807 * random() as bigint);
let lsem = DispatchSemaphore(value: 0)
db.getConnection() { conn, err in
if conn == nil {
lsem.signal()
return
}
conn!.execute("UPDATE \"\(StorableTrigram<Float,Int32>.tableName)\" SET random_id = cast(9223372036854775807 * random() as bigint)") { resultSet in
lsem.signal()
}
}
lsem.wait()
return RandomAccessPartition<Float,Int32>(numberOfPartitions: splits, db: self.db)
}
The update will re-randomize the ids, paving the way for the RandomAccessPartition.
Of course the tradeoff in terms of performance is rather big, especially in the initialization phase, but hey, more RAM to do other things while the model is training!