[Dev Diaries] NSLogger is merged
The changes I made to make NSLogger SPM compatible are now in the master branch of the official repo. Update your dependencies ☺️
I know Florent Pillet and have fun with him as often as I can; he's another member of the tribe of "dinosaurs" still kicking around.
I really like one of the projects that contributed to his notoriety: NSLogger. Logging has always been a pain in the neck, and this tool provided us all with a way to get it done efficiently and properly. The first commit on the GitHub repo is from 2010, and I have a strong suspicion it had been in production in one form or another before that.
Anyhoo, I like Florent, I like NSLogger, but I hate what CocoaPods (and to a lesser extent Carthage) does to my projects. It's too brittle, and I strongly dislike things that mess around with the extremely complicated format that is a pbxproj. They do however serve an admirable purpose: managing dependencies in a way that doesn't require me to use git submodules in every one of my projects.
So, I rarely use NSLogger. SHAME! SHAME! <insert your own meme here>
With the advent of (and subsequent needed updates to) Swift Package Manager, we now have an official way of managing and supporting dependencies, but it has its own quirks that apparently make it hard to "SPM" older projects.
Let's see what we can do about NSLogger.
SPM can't mix Obj-C code and Swift code. It's always been pretty hacky anyways, with the bridging headers and the weird steps hidden by the toolchain, so we need to make it explicit:
We need two targets: an Obj-C one (NSLoggerLibObjC) and a Swift one (NSLogger) that depends on NSLoggerLibObjC. One of the problems is that all that code is mixed together in the folders, because Xcode doesn't care about file placement. SPM, on the other hand, does.
So, let's use and abuse the path and sources parameters of the target. The first one is to provide the root where we look for files to compile, and the second one lists the files to be compiled.
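A sketch of what that can look like in the manifest (the folder and file names here are illustrative assumptions, not necessarily the ones in the final Package.swift):

```swift
// Package.swift (excerpt) — paths and source lists are illustrative
.target(
    name: "NSLoggerLibObjC",
    path: "Client/iOS",            // root where SPM looks for files
    sources: ["LoggerClient.m"]    // only these files get compiled
),
.target(
    name: "NSLogger",
    dependencies: ["NSLoggerLibObjC"],
    path: "Client/iOS",
    sources: ["NSLogger.swift"]
)
```

Everything outside the `sources` lists is simply invisible to the build, which is what lets the two targets share the same messy folder.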
So we have NSLoggerLibObjC and NSLogger. Done. Right?
Not quite.
The Obj-C lib requires ARC to be disabled. Easy to do in Xcode, a bit harder in SPM.
We need to pass the -fno-objc-arc flag to the compiler. SPM doesn't make it easy or obvious to do that, for a variety of reasons, but I guess mostly because you shouldn't pass compiler flags at all in an ideal world.
But (especially in 2020), looking at the world, ideal it ain't.
We have to use the (not so aptly named) cSetting option of the target, and use the very scary CSetting.unsafeFlags parameter for that option. Why is it unsafe, you might ask? Weeeeeeeeell. It's companies' usual way of telling you "you're on your own with this one". I'm fine with that.
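In the manifest, that looks something like this (the path is an assumption for illustration):

```swift
// Package.swift (excerpt) — disabling ARC for the Obj-C target
.target(
    name: "NSLoggerLibObjC",
    path: "Client/iOS",                  // illustrative path
    cSettings: [
        .unsafeFlags(["-fno-objc-arc"])  // "unsafe": you're on your own
    ]
)
```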
Another compilation quirk is that Obj-C code relies (like its ancestor, C) on the use of header files to make your code usable as a dependency.
Again, because Xcode and SPM treat the file structure very differently, just saying that every header should be included in the resulting library is a bad idea: the search is recursive and in this particular case, would result in having specific iOS or MacOS (yes, capitalized, because sod that change) test headers exposed as well.
In the end, I had to make the difficult choice of doing something super ugly:
If anyone has a better option that's not heavily more disruptive to the organization of the project, I'm all ears.
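For reference, the mechanism SPM offers is `publicHeadersPath`: a single directory (searched recursively) whose headers are exposed to dependents. A sketch, assuming a hypothetical layout where the public headers were gathered in a dedicated `include` folder:

```swift
// Package.swift (excerpt) — only headers under "include" become public
.target(
    name: "NSLoggerLibObjC",
    path: "Client/iOS",           // illustrative
    publicHeadersPath: "include"  // relative to `path`
)
```

The recursive search is exactly why a stray test header in that directory ends up exposed too.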
So we have the Swift target that depends on the Obj-C one. Fine. But how do we use that dependency?
"Easy" some will exclaim (a bit too rapidly) "you just import the lib in the swift file!"
Yes, but then it breaks the other projects, which, again, we don't want to do. Minimal impact changes. Legacy. Friend.
So we need a preprocessor macro, like, say, SPMBuild, which would indicate we're building with SPM rather than Xcode. Sadly, this doesn't exist, and given the rate of change of the toolchain, I don't want to rely too heavily on the badly documented Xcode preprocessor macros that would allow me to detect a build through the IDE.
Thankfully, in the same vein as cSettings, we have a swiftSettings parameter on our target, which supports SwiftSetting.define options. Great, so I'll define a macro and test its existence in the Swift file before importing the Obj-C part of the project.
One last thing I stumbled upon and used despite its shady nature: there is an undocumented decorator for import named @_exported which seems extraneous here, but has some interesting properties: it kinda sorta exposes what you import as part of the current module, flattening the dependency graph.
To be honest, I didn't know about it, it amused me, so I included it.
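Put together, with `swiftSettings: [.define("SPMBuild")]` on the Swift target in Package.swift, the source side looks something like this (the macro name SPMBuild is my choice, not a toolchain convention):

```swift
// NSLogger.swift (excerpt)
#if SPMBuild
// Building through SPM: pull in the Obj-C target, and re-export its
// symbols so clients only need `import NSLogger`.
@_exported import NSLoggerLibObjC
#endif
```

Xcode builds never define the macro, so the legacy projects compile exactly as before.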
In order to make it work directly from the repo, rather than locally, I also had to provide a version number. I chose to go with the next patch number instead of aggrandizing myself with a minor or even a major version.
Hopefully, these changes don't impact the current project at all and allow me to use it in a way I like better (and that is officially supported). I hope Florent will not murder me for all of that. He might even decide to accept my pull request. We'll see.
In the meantime, you can find all the changes above and a usable SPM package in my fork.
Writing good tests for your code very often means spending twice as much time coding them as coding the things you test themselves.
It is good practice though to verify as much as possible that the code you write is valid, especially if that code is going to be public or included in someone else's work.
In my workflow, I insist on the notion of ownership:
The bottomline for me is this: if there are several people on a project, I want clearly defined ownership. It's not that I won't fix a bug in someone else's code, just that they own it and therefore have to have a reliable way of testing that my fix works.
Tests solve part of that problem. My code, my tests. If you fix my code, run my tests, I'm fairly confident that you didn't wreck the whole thing. And that I won't have to spend a couple of hours figuring out what it is that you did.
This is a very very very light constraint when you compare it to methodologies like TDD, but it's a required minimum for me.
Plus, it's not that painful, except...
In my personal opinion, the tests that are hardest to do right are the ones that have a very large input range, with a few failure/continuity points.
If, for instance, and completely randomly, of course, you had an application where the tilt of the phone changes the state of the app (locked/unlocked, depending on whether the phone is lying flat-ish on the table or not):
So you have a function that takes the current pitch angle, and returns if we should lock or not:
func pitchLock(_ angle: Double) -> Bool {
// ...
}
Does it work? Does it work modulo 360? What would a unit test for that function even look like? A for loop?
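To make the problem concrete, here is a hypothetical implementation (the 20° thresholds around flat are assumptions for illustration) and the kind of spot checks you end up writing by hand. Note that Swift's Double has no `%` operator, so negative angles need explicit folding:

```swift
// Hypothetical implementation: "flat-ish" means within 20° of 0° or 180°
func pitchLock(_ angle: Double) -> Bool {
    // Double has no `%` in Swift; fold into [0, 360) by hand
    var a = angle.truncatingRemainder(dividingBy: 360.0)
    if a < 0 { a += 360.0 }
    return a >= 340 || a <= 20 || (a >= 160 && a <= 200)
}

// Hand-picked spot checks — exactly the tedium I'd like to avoid
assert(pitchLock(0))     // flat on the table
assert(pitchLock(180))   // flat, face down
assert(pitchLock(-10))   // -10° wraps to 350°, still flat-ish
assert(pitchLock(370))   // wraps to 10°
assert(!pitchLock(90))   // upright
assert(!pitchLock(-90))  // wraps to 270°, upright the other way
print("spot checks passed")
```

Six asserts, and the input space is barely scratched.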
I have been looking for a way to do that kind of test for a while, which is why I published HoledRange (now Domains 😇) a while back, as part of my hacks.
What I wanted is to write my tests kind of like this (invalid code on so many levels):
for x in [-1000.0...1000.0].randomSelection {
let unitCircleAngle = x%360.0
if unitCircleAngle >= 340 || unitCircleAngle <= 20 {
XCTAssert(pitchLock(x))
} else if unitCircleAngle >= 160 && unitCircleAngle <= 200 {
XCTAssert(pitchLock(x))
} else {
XCTAssertFalse(pitchLock(x))
}
}
This way of testing, while vaguely valid, leaves so many things flaky:
I have been fascinated with @_functionBuilder ever since it was announced. While I don't feel enthusiastic about SwiftUI (in French), that way of building elements out of blocks is something I have wanted for years.
Making them is a harrowing experience the first time, but in the end it works!
What I wanted to use as syntax is something like this:
func myPlus(_ a: Int, _ b: Int) -> Int
DomainTests<Int> {
Domain(-10000...10000)
1000000
Test { (a: Int) in
XCTAssert(myPlus(a, 1) == a+1, "Problem with value\(a)")
XCTAssert(myPlus(1, a) == a+1, "Problem with value\(a)")
}
Test { (a: Int) in
let random = Int.random(in: -10000...10000)
XCTAssert(myPlus(a, random) == a+random, "Problem with value\(a)")
XCTAssert(myPlus(random, a) == a+random, "Problem with value\(a)")
}
}.random()
This particular DomainTests runs 1000000 times over $$D=[-10000;10000]$$ in a random fashion.
Note the Test builder that takes a function whose parameter will be in the domain, and the definition that allows defining both the test domain (mandatory) and the number of random iterations (optional).
If you want to test every single value in a domain, the bounds need to be Strideable, i.e. usable in a for loop.
DomainTests<Int> {
Domain(-10000...10000)
Test { (a: Int) in
XCTAssert(myPlus(a, 1) == a+1, "Problem with value\(a)")
XCTAssert(myPlus(1, a) == a+1, "Problem with value\(a)")
}
Test { (a: Int) in
let random = Int.random(in: -10000...10000)
XCTAssert(myPlus(a, random) == a+random, "Problem with value\(a)")
XCTAssert(myPlus(random, a) == a+random, "Problem with value\(a)")
}
}.full()
A couple of days of hard work, plus a healthy dose of using the framework personally, means this should be ready-ish for production.
If you are a maths-oriented dev and shiver at the idea of untested domains, this is for you 😬
I have a weird thing with the multiplication of command-line tools and gizmos: I forget them.
Do I want to run supercool gitlab commands? Hell yea! Do I need to install 12 utilities (or code a new one) to archive every project older than a year? I hope not...
I am a sucker for well-documented, fully-linted code. But the thing is, all the gizmos that help me do that have to be installed in the system or in my ~/bin, I have to remember to update them, I have to install them on my CD machine and on every new environment I set up, I have to make sure they are still compatible with the toolchain, and it freaks me out, ok?
Plus, watching the students try to do it is painful.
So, given a 100% vanilla swift-capable environment, can I manage to run documentation and linting?
We have Swift Package Manager, which is now a first-class citizen in Xcode, but it can't run shell script phases without some nasty hacks.
What if some targets were (wait for it) built to do the documentation and the linting?
One of the most popular linters out there is swiftlint, and it supports SPM. It can also build a library instead of an executable, which means one of my targets could just run the linting and output it in the terminal.
In the Package.swift file, all I needed to do was add the right dependency, and the right product and voila!
let package = Package(
name: "WonderfulPackage",
products: [
// ...
.executable(name: "Lint", targets: ["Lint"])
],
dependencies: [
// Dependencies declare other packages that this package depends on.
// .package(url: /* package url */, from: "1.0.0"),
// ... normal dependencies
.package(url: "https://github.com/realm/SwiftLint", from: "0.39.0")
],
targets: [
// ... normal targets
.target(
name: "Lint",
dependencies: ["SwiftLintFramework"]),
]
)
Now, SPM is very strict with paths, so I had to put a file named main.swift in the Sources/<target>/ directory, in this case Sources/Lint.
Running the linter is fairly straightforward, and goes in the main.swift file:
// Lint command main
// runs SourceDocs
import Foundation
import SwiftLintFramework
let config = Configuration(path: FileManager.default.currentDirectoryPath+"/.swiftlint.yml",
rootPath: FileManager.default.currentDirectoryPath,
optional: true,
quiet: true,
enableAllRules: false,
cachePath: nil,
customRulesIdentifiers: [])
for lintable in config.lintableFiles(inPath: FileManager.default.currentDirectoryPath, forceExclude: false) {
let linter = Linter(file: lintable, configuration: config)
let storage = RuleStorage()
let collected = linter.collect(into: storage)
let violations = collected.styleViolations(using: storage)
if !violations.isEmpty {
print(EmojiReporter.generateReport(violations))
}
}
print("🎉 All done!")
Set up the .swiftlint.yml file as usual, and run the command via swift run Lint
Sources/WonderfulPackage/main.swift
⛔️ Line 15: Variable name should be between 3 and 40 characters long: 'f'
⚠️ Line 13: Arguments can be omitted when matching enums with associated types if they are not used.
⚠️ Line 12: Line should be 120 characters or less: currently 143 characters
Documentation is actually trickier, because most documentation tools out there aren't built in Swift or compatible with SPM. Doxygen and jazzy are great, but they don't fit my needs.
I found a project that was extremely promising called SourceDocs by Eneko Alonso, but it isn't a library, so I had to fork it and make it into one (while providing a second target to generate the executable if needed). One weird issue is that SPM doesn't like subtargets to bear the same name so I had to rename a couple of them to avoid conflict with Swift Argument Parser (long story).
I finally found myself in the same spot as with the linter. All I needed to do was create another target, and Bob's your uncle. Well, actually, he was mine. I digress.
let package = Package(
name: "WonderfulPackage",
products: [
// ...
.executable(name: "Docs", targets: ["Docs"])
],
dependencies: [
// Dependencies declare other packages that this package depends on.
// .package(url: /* package url */, from: "1.0.0"),
// ... normal dependencies
.package(url: "https://github.com/krugazor/SourceDocs", from: "0.7.0")
],
targets: [
// ... normal targets
.target(
name: "Docs",
dependencies: ["sourcedocslib"])
]
)
Another well-placed main file:
// Docs command main
// runs SourceDocs
import Foundation
import SourceDocs
do {
switch try SourceDocs().runOnSPM(moduleName: "WonderfulPackage",
outputDirectory: FileManager.default.currentDirectoryPath+"/Documentation") {
case .success:
print("Successful run of the documentation phase")
case .failure(let failure):
print(failure.localizedDescription)
}
} catch {
print(error.localizedDescription)
}
Now, the command swift run Docs generates the markdown documentation in the Documentation directory.
Parsing main.swift (1/1)
Removing reference documentation at 'WonderfulPackage/Documentation/KituraStarter'... ✔
Generating Markdown documentation...
Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/structs/WonderfulPackage.md ✔
Writing documentation file: WonderfulPackage/Documentation/WonderfulPackage/README.md ✔
Done 🎉
Successful run of the documentation phase
✅ Vanilla swift environment
✅ No install needed
✅ Works on Linux and MacOS
✅ Integrated into SPM
⚠️ When running in Xcode, the current directory is always wonky for packages
With 0.8 dropping, a few things from my previous posts changed, thankfully not much. And in trying to train bigger models, I ran into a huge RAM issue, so I'll share what I did in a few paragraphs.
valueWithGradient is now a global module function, and you have to call it through TensorFlow like this:
let (loss, grad) = TensorFlow.valueWithGradient(at: model) { (model: TextModel) -> Tensor<Float> in
let logits = model(sampleFeatures)
return softmaxCrossEntropy(logits: logits, labels: sampleLabels)
}
Also, they revamped the serialization mechanics; you can now get serializable data through:
try model.serializedParameters()
It so happens that someone told me to try character trigrams instead of word trigrams. I have no idea if the results are better or worse yet, because the generated dataset is huge: 4*<number of chars>, and a somewhat simple text file gave way to a magnificent 96GB of RAM usage.
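A minimal sketch (not the actual training code) of character-trigram extraction shows why the dataset balloons: every character position yields a sample of four values, three features plus a label.

```swift
// Each position produces (t1, t2, t3) features plus the following
// character as the label — roughly 4 values per character of input.
let text = "hello swift"
let chars = Array(text)
var samples: [(Character, Character, Character, Character)] = []
for i in 0..<(chars.count - 3) {
    samples.append((chars[i], chars[i + 1], chars[i + 2], chars[i + 3]))
}
print(samples.count)  // one sample per position: chars.count - 3
```

Scale that up to a real corpus and the in-memory tensors get enormous fast.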
Of course, this means that the program can't really run. It also meant I had to find an alternative way, and the simplest one I could implement quickly was storing all the trigrams in a big database and extracting random samples from it, rather than doing it all in memory. This meant going from 96GB of RAM usage down to 4GB.
I do Kitura stuff, and I ♥️ PostgreSQL, so I went for a simple ORM+Kuery setup.
The table stores trigrams, and I went for generics for the stored structure:
struct StorableTrigram<FScalar, TScalar> : Codable where FScalar : TensorFlowScalar, FScalar : Codable, TScalar : TensorFlowScalar, TScalar : Codable {
var random_id : Int64
var t1 : FScalar
var t2 : FScalar
var t3 : FScalar
var r : TScalar
}
extension StorableTrigram : Model {
static var tableName: String {
get {
return ("StorableTrigram"+String(describing: FScalar.self)+String(describing: TScalar.self)).replacingOccurrences(of: " ", with: "_")
}
}
}
The random_id will be used to shuffle the lines into multiple partitions later, and the tableName override keeps < and > out of the table name.
One of the key things needed to avoid saturating the RAM is to partition the data. As the rest of the training loop expects an array, I decided to go with a custom Collection that can fit in a for loop and load only the current partition:
struct RandomAccessPartition<FScalar, TScalar> : Collection {
let numberOfPartitions: Int
let db : ConnectionPool
typealias Index = Int
var startIndex: Int { return 0 }
var endIndex: Int { return numberOfPartitions } // Collection's endIndex is one past the last valid index
func index(after i: Int) -> Int {
return i+1
}
subscript(position: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
let partitionSize = Int64.max / Int64(numberOfPartitions)
let start_rid = partitionSize * Int64(position)
let end_rid = partitionSize * Int64(position + 1)
var rf : [[Float]] = []
var rl : [Int32] = []
let lsem = DispatchSemaphore(value: 0)
db.getConnection() { conn, err in
if conn == nil {
lsem.signal()
return
}
conn!.execute("SELECT * FROM \"\(StorableTrigram<Float,Int32>.tableName)\" WHERE random_id >= \(start_rid) AND random_id < \(end_rid)") { resultSet in
resultSet.asRows { rows,error in
guard let rows = rows else {
lsem.signal()
return
}
for row in rows {
if let t1 = row["t1"] as? Float,
let t2 = row["t2"] as? Float,
let t3 = row["t3"] as? Float,
let r = row["r"] as? Int32 {
rf.append([t1,t2,t3])
rl.append(r)
}
}
lsem.signal()
}
}
}
lsem.wait()
let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
let labelsT = Tensor<Int32>(rl)
return (featuresT, labelsT)
}
}
Relying on random_id for the partitions is a bit iffy, but thankfully PostgreSQL can re-randomize those ids reasonably fast, which works well enough for my use.
The three key features of that batch-holding struct were:
So here's the relevant code, with breaks for explanations:
struct RandomAccessStringStorage {
var db : ConnectionPool
var tableCreated : Bool = false
let original: [String]
let vocabulary: [String]
let indexHelper: [Int:Int]
init(db database: ConnectionPool, original o: [String], terminator: String? = nil, fromScratch: Bool) {
db = database
Database.default = Database(database) // shady, but hey
original = o
let f : [[Float]]
let l : [Int32]
let v : [String]
let h : [Int:Int]
if let term = terminator {
(f,l,v,h) = RandomAccessStringStorage.makeArrays(original, terminator: term)
} else {
(f,l,v,h) = RandomAccessStringStorage.makeArrays(original)
}
vocabulary = v
indexHelper = h
if fromScratch {
deleteAll()
for i in 0..<f.count {
insertTrigram(t1: f[i][0], t2: f[i][1], t3: f[i][2], r: l[i])
}
}
}
mutating func deleteAll() {
let _ = try? StorableTrigram<Float,Int32>.dropTableSync()
tableCreated = false
}
mutating func insertTrigram(t1: Float, t2: Float, t3: Float, r: Int32) {
if !tableCreated {
let _ = try? StorableTrigram<Float,Int32>.createTableSync()
tableCreated = true
}
let trig = StorableTrigram(random_id: Int64.random(in: Int64(0)...Int64.max), t1: t1, t2: t2, t3: t3, r: r)
let lsem = DispatchSemaphore(value: 0)
trig.save { st, error in
lsem.signal()
}
lsem.wait()
}
// ...
}
The two makeArrays functions are copied and pasted from the in-memory TextBatch, and the only other thing the initialization relies on is the insertion into the DB system.
There are two ways of drawing random items: a one-off sample, and partitioning the data into random chunks:
func randomSample(of size: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) {
var rf : [[Float]] = []
var rl : [Int32] = []
let lsem = DispatchSemaphore(value: 0)
db.getConnection() { conn, err in
if conn == nil {
lsem.signal()
return
}
conn!.execute("SELECT * FROM \"\(StorableTrigram<Float,Int32>.tableName)\" ORDER BY random() LIMIT \(size)") { resultSet in
resultSet.asRows { rows,error in
guard let rows = rows else {
lsem.signal()
return
}
for row in rows {
if let t1 = row["t1"] as? Float,
let t2 = row["t2"] as? Float,
let t3 = row["t3"] as? Float,
let r = row["r"] as? Int32 {
rf.append([t1,t2,t3])
rl.append(r)
}
}
lsem.signal()
}
}
}
lsem.wait()
let featuresT = Tensor<Float>(shape: [rf.count, 3], scalars: rf.flatMap { $0 })
let labelsT = Tensor<Int32>(rl)
return (featuresT, labelsT)
}
Random selection in Pg actually works pretty well, but can't be repeated, which is why we have to rely on random_id to partition:
func randomSample(splits: Int) -> RandomAccessPartition<Float,Int32> {
// reshuffle (will take a while)
// update "StorableTrigramFloatInt32" SET random_id = cast(9223372036854775807 * random() as bigint);
let lsem = DispatchSemaphore(value: 0)
db.getConnection() { conn, err in
if conn == nil {
lsem.signal()
return
}
conn!.execute("UPDATE \"\(StorableTrigram<Float,Int32>.tableName)\" SET random_id = cast(9223372036854775807 * random() as bigint)") { resultSet in
lsem.signal()
}
}
lsem.wait()
return RandomAccessPartition<Float,Int32>(numberOfPartitions: splits, db: self.db)
}
The update will re-randomize the ids, paving the way for the RandomAccessPartition.
Of course the tradeoff in terms of performance is rather big, especially in the initialization phase, but hey, more RAM to do other things while the model is training!