TL;DR

I'm not as rusty as I thought I'd be. And YES that kind of challenge has a place in the coding world (see conclusion)

Just like every year, I had a blast banging my head on the Advent of Code calendar. It so happens that this year, I had a lot less brain power to focus on them due to the exam season at school and the ensuing panicky students, but it could also be because my brain isn't up to spec.

Some of my students are/were doing the challenges, so I didn't want to post anything that would help them, but now that the year is almost over, I wanted to go over the puzzles and give out impressions (and maybe hints).

Easing in

Days 1 to 7 were mostly about setting up the stage, getting into the habit of parsing the input and using the right kind of structure to store the data.

Nothing super hard, it was "just" lists, hashmaps, and trees, until day 4. Day 4 was especially funny to me, because I wrote HoledRange / Domain just for that purpose (disjointed ranges and operations on them). Except I decided to do this year's calendar in Julia, and the library I wrote is for Swift. Just for kicks, I rewrote parts of the library, and I might even publish it.

Days 5, 6 and 7 highlighted the use of stacks, strings, and trees again. Nothing too hard.

Getting harder

My next favorite is day 9. It's about a piece of rope you drag the head of, and have to figure out what the tail does. If you've ever played zig-zaging a shoelace you'll know what I mean. String physics are fun, especially in inelastic cases.

Many ways to do that, but once you realize how the tail catches up to the head when the latter is moved, multi-segmented chains are just a recursive application of the same.

I was waiting for a day 10-like puzzle, as there tends to be one every year, and I majored in compilers all those years ago. State machines, yuuuuuusssssssss.

A lot of puzzles involve path finding after that, which isn't my strong suit for some reason. But since the algorithms are already out there (it was really funny to see the spike in google searches for Dijkstra and A*), it's "just" a matter of encoding the nodes and the edges in a way that works.

Day 13 is fun, if only because it can be instantly solved in some language with eval, which will treat the input as a program. I still wrote my own comparison functions, because I like manipulating numbers, lists and inequalities.

Day 14 is "sand simulation", that is grains of sand that settle in a conical shape that keeps expanding laterally. Once you find the landing point on each ledge and the maximum width of the pile, there's a calculable result. Otherwise, running the simulation works too, there aren't that many grains. For part 2, I just counted the holes rather than the grains.

Day 15 is about union and intersections of disjointed ranges again, except in 2D. Which, with the Manhattan distance approximation, gets back to 1D fairly quickly.

Day 16 stumped quite a few people, because of the explosive nature of path searching. Combinatorics are pretty hard to wrap your head around. Personally, I went for "reachability" within the remaining time, constructed my graph, and explored. It was probably non optimal.

Day 17 make me inordinately proud. Nuff said.

Catching up

Because of the aforementioned  workload, I was late by that point, so I decided to take my time and not complete the challenge by Xmas. Puzzles were getting hard, work was time-consuming, so the pressure needed to go down.

Because of the 3D background that I had, I tackled day 18 with raytracing, which is way over-engineered, but reminded me of the good ole times. Part 2 was trickier with that method, because suddenly I had 2 kinds of "inside".

Day 19 was path finding again, the trick being how to prune paths that didn't lead in a good direction. Probably the one that used up the most memory, and therefore the one I failed the most.

Because of my relative newness to Julia, I had to go through many hoops to solve day 20. As it turns out, screw_dog over on mastodon gave me the bit I lacked to solve it simply, although way after I solved it using other means.

Day 21 goes back to my compiler roots and tree optimizations, and Julia makes the huge integer manipulation relatively easy, so, there. Pretty proud of my solution:

Part 1:   3.709 ms (31291 allocations: 1.94 MiB)
Part 2:   4.086 ms (31544 allocations: 1.95 MiB)

Which, on my relatively slow mac mini is not bad at all! Symbolic linear equation solving (degree one, okay) is a fun thing to think about. I even think that the algorithm I devised would work on trees where the unknown appears on both sides of the tree. Maybe I'll test it some day.

Day 22. Aaaaaaaah day 22. Linked lists get you all the way if and only if you know how to fold a cube from a 2D pattern. I don't so, as many of the other participants, I hardcoded the folding. A general solution just eluded me. It's on my todo list of reading for later.

Day 23 is an interesting variant of Conway's Game of Life, and I don't believe there is a way to simplify a straight up simulation, but I fully accept I could be wrong. So I used no trick, and let the thing run for 40s to get the result.

Day 24 was especially interesting for me, for all the wrong reasons. As I mentioned, graph traversal isn't my forte. But the problem was setup in a way that "worked" for me: pruning useless paths was relatively easy, so the problem space didn't explode too quickly. I guess I should use the same method on previous puzzles that I was super clumsy with.

Finally day 25 is a straight up algorithmic base conversion problem that's a lot of fun for my brain. If you remember how carry works when adding or subtracting numbers, it's not a big challenge, but thinking in base 5 can trip you up.

Conclusion

I honestly didn't believe I could hack it this year. I don't routinely do that kind of problem anymore, I have a lot of things going on at school, on top of dealing with the long tail of Covid and its effects on education. Family life was a bit busy with health issues (nothing life threatening, but still time consuming), and the precious little free time that I had was sure to be insufficient for AoC.

I'm glad I persevered, even if it took me longer than I wished it had. I'm glad I learned how to use Julia better. And I'm happy I can still hack it.

I see here and there grumblings about formal computer science. During and after AoC, I see posts, tweets, toots, etc, saying that the "l33t c0d3" is useless in practical, day-to-day, professional development. Big O notation, formal analysis, made up puzzles that take you into voluntarily difficult territories, all these things aren't a reflection of the skills that are needed nowadays to write good apps, to make good websites, and so on.

It's true. Ish.

You can write code that works without any kind of formal training. Today's computing power and memory availability makes optimization largely irrelevant unless you are working with games or embedded systems, or maybe data science. I mean, we can use 4GB of temporary memory for like 1/4 of a second to parse and apply that 100kB json file, and it has close to no impact on the perceived speed of our app, right? Right. And most of the clever algorithms are part of the standard library anyway, or easily findable.

The problem, as usual, is at scale. The proof-of-concept, prototype, or even 1.0 version, of the program may very well work just fine with the first 100 users, or 1000 or whatever the metric is for success. Once everything takes longer than it should, there are only 3 solutions:

• rely on bigger machines, which may work for a time, but ultimately does not address the problem
• scale things horizontally, which poses huge synchronization issues between the shards
• reduce the technical debt, which is really hard

The first two rely on compute power being relatively cheap. And most of us know about the perils of infrastructure costs. That meme regularly makes the rounds.

It's not about whether you personally can solve some artificially hard problem using smart techniques, so that's ok if you can't do every puzzle in AoC or other coding challenges. It's not about flexing with your big brain capable of intuiting the bigO complexity of a piece of code. It's about being able to think about these problems in a way that challenges how you would normally do it. It's about expanding your intuition and your knowledge about the field you decided to work in.

It's perfectly OK for an architect to build only 1 or 2 level houses, there's no shame in it. But if that architect ever wants to build a 20+ stories building, the way to approach the problem is different.

Same deal with computer stuff. Learning is part of the experience.

I've been really interested in Julia for a while now, tinkering here and there with its quirks and capabilities.

This year, I've decided to try and do the whole of Advent of Code using that language.

First impressions are pretty good:
- map, reduce, and list/array management in general are really nice, being first-class citizens. I might even get over the fact that indices start at 1
- automatic multithreading when iterating over collections means that some of these operations are pretty speedy
- it's included in standard jupyterhub images, meaning that my server install gives me access to a Julia environment if I am not at my computer for some reason

Now it's kind of hard to teach old dogs new tricks, so I'm sure I misuse some of the features by thinking in "other languages". We'll see, but 4 days in, I'm still fairly confident.

First part of doing RNN text prediction with TensorfFlow, in Swift

For all intents and purposes, it's about statistics. The question we are trying to solve is either something along the lines of "given an input X, what is the most probable Y?", or along the lines of "given an input X, what is the probability of having Y?"

Of course, simple probability problems have somewhat simple solutions: if you take a game of chess and ask for a next move based on the current board, you can do all the possible moves and sort them based on the probability of having a piece taken off the board, for instance. If you are designing an autopilot of some kind, you have an "ideal" attitude (collection of yaw, pitch and roll angles), and you calculate the movements of the stick and pedals that will most likely get you closer to that objective. If your last shot went left of the target, chances are, you should go right. Etc etc etc.

But the most interesting problems don't have obvious causality. If you have pasta, tomatoes and ground meat in your shopping basket, maybe your next item will be onions, because you're making some kind of bolognese, maybe it will be soap, because that's what you need, maybe it will be milk, because that's the order of the shelves you are passing by.

Machine learning is about taking a whole bunch of hopefully consistent data (even if you don't know for sure that it's consistent), and use it to say "based on this past data, the probabilities for onions, soap and milk are X, Y, and Z, and therefore the most probable is onions.

The data your are basing your predictive model on is really really important. Maybe your next item is consistent with the layout of the shop. Maybe it is consistent with what other customers got. Maybe it's consistent to your particular habits. Maybe it's consistent with people who are in the same "category" as you (your friends, your socio-economic peers, your cultural peers, ... pick one or more).

So you want a lot of data to have a good prediction, but not all the data, because noise (random data) does not reveal a bias (people in the shop tend to visit the shelves in that order) or doesn't exploit a bias (people following receipes want those ingredients together).

Oh yes, there are biases. Lots of them. Because ML uses past data to predict the future, if the data we use was based on bad practices, recommendations won't be a lot better.

There is a branch of machine learning that starts ex nihilo but it is beyond the scope of this introduction, and generates data based on a tournament rather than on actual facts. Its principle is roughly the same, though.

So, to recap:

• We start with a model with random probabilities, and a series of "truths" ( X leads to Y )
• We try with a Z, see what the model predicts
• We compare it to a truth, and fix the probabilities a little bit so that it matches
• Repeat with as many Zs as possible to be fairly confident the prediction isn't too far off

If you didn't before, now you know why ML takes a lot of resources. Depending on the number of possible Xs, Ys and the number of truths, the probability matrix is potentially humongous. Operations (like fixing the probabilities to match reality) on such enormous structures aren't cheap.

If you want a more detailed oriented explanation, with maths and diagrams, you can read my other attempt at explaining how it works.

Swift TensorFlow

There are a few contenders in the field of "ML SDK", but one of the most known is TensorFlow, backed by Google. It also happens to have a Swift variant (almost every other ML environment out there is either Python or R).

And of course, the documentation is really... lacking, making this article more useful than average along the way.

In their "Why Swift?" motivation piece, the architects make a good case, if a little bit technical, as to why swift makes a good candidate for ML.

The two major takeaways you have to know going in are:

• It's a different build of Swift. You cannot use the one that shipped with Xcode (yet)
• It uses a lot of Python interoperability to work, so some ways of doing things will be a bit alien

The performance is rather good, comparable or better than the regular Python TensorFlow for the tasks I threw at it, so there's that.

But the documentation... My oh my.

Let's take an example: Tensor is, as the name of the framework implies, the central feature of the system. Its documentation is here: https://www.tensorflow.org/swift/api_docs/Structs/Tensor

Sometimes, that page greets me in Greek... But hey, why not. There is little to no way to navigate the hierarchy, other than going on the left side, opening the section (good luck if you don't already know if it's a class, a protocol or a struct you're looking for), and if you use the search field, it will return pages about... the Python equivalents of the functions you're looking for.

Clearly, this is early in the game, and you are assumed to know how regular TensorFlow works before attempting to do anything with STF.

But fret not! I will hold your hand so that you don't need to look at the doc too much.

The tutorials are well written, but don't go very far at all. Oh and if you use that triple Dense layers on more than a toy problem (flower classification that is based on numbers), your RAM will fill so fast that your computer will have to be rebooted. More on that later.

And, because the center of ML is that "nudge" towards a better probability matrix (also called a Tensor), there is the whole @differentiable thing. We will talk about it later.

A good thing is that Python examples (there are thousands of ML tutorials in Python) work almost out of the box, thanks to the interop.

Data Preparation

Which Learning will my Machine do?

I have always thought that text generation was such a funny toy example (if a little bit scary when you think about some of the applications): teach the machine to speak like Shakespeare, and watch it spit some play at you. It's also easy for us to evaluate in terms of what it does and how successful it is. And the model makes sense, which helps when writing a piece on how ML works.

A usual way of doing that is using trigrams. We all know than predicting the next word after a single word is super hard. And our brains tend to be able to predict the last word of a sentence with ease. So, a common way of teaching the machine is to have it look at 3 words to predict a 4th.

I am hungry -> because, the birds flew -> away, etc

Of course, for more accurate results, you can extend the number of words in the input, but it means you must have a lot more varied sentence examples.

What we need to do here is assign numbers to these words (because everything is numbers in computers) so that we have a problem like "guess the function f if f(231,444,12)->123, f(111,2,671)->222", which neural networks are pretty good at.

So we need data (a corpus), and we need to split it into (trigram)->result

Now, because we are ultimately dealing with probabilities and rounding, we need the input to be in Float, so that the operations can wriggle the matrices by fractions, and we need the result to be an Int, because we don't want something like "the result is the word between 'hungry' and 'dumb'".

The features (also called input) and the labels (also called outputs) have to be stored in two tensors (also called matrices), matching the data we want to train our model on.

That's where RAM and processing time enter the arena: the size of the matrix is going to be huge:

• Let's say the book I chose to teach it English has 11148 words in it (it's Tacitus' Germany), that's 11148*3-2 trigrams (33442 lines in my matrices, 4 columns total)
• The way neural networks function, you basically have a function parameter per neuron that gets nudged at each iteration. In this example, I use two 512 parameters for somewhat decent results. That means 2 additional matrices of size 33442*512.
• And operations regularly duplicate these matrices, if only for a short period of time, so yea, that's a lot of RAM and processing power.

Here is the function that downloads a piece of text, and separates it into words:

func loadData(_ url: URL) -> [String] {
let sem = DispatchSemaphore(value: 0)
var result = [String]()

let session = URLSession(configuration: URLSessionConfiguration.default)
//     let set = CharacterSet.punctuationCharacters.union(CharacterSet.whitespacesAndNewlines)
let set = CharacterSet.whitespacesAndNewlines
session.dataTask(with: url, completionHandler: { data, response, error in
if let data = data, let text = String(data: data, encoding: .utf8) {
let comps = text.components(separatedBy: set).compactMap { (w) -> String? in
// separate punctuation from the rest
if w.count == 0 { return nil }
else { return w }
}
result += comps
}

sem.signal()
}).resume()

sem.wait()
return result
}

Please note two things: I make it synchronous (I want to wait for the result), and I chose to include word and word, separately. You can keep only the words by switching the commented lines, but I find that the output is more interesting with punctuation than without.

Now, we need to setup the word->int and int->word transformations. Because we don't want to look at all the array of words every time we want to search for one, there is a dictionary based on the hashing of the words that will deal with the first, and because the most common words have better chances to pop up, the array for the vocabulary is sorted. It's not optimal, probably, but it helps makes things clear, and is fast enough.

func loadVocabulary(_ text: [String]) -> [String] {
var counts = [String:Int]()

for w in text {
let c = counts[w] ?? 0
counts[w] = c + 1
}

let count = counts.sorted(by: { (arg1, arg2) -> Bool in
let (_, value1) = arg1
let (_, value2) = arg2
return value1 > value2
})

return count.map { (arg0) -> String in
let (key, _) = arg0
return key
}
}

func makeHelper(_ vocabulary: [String]) -> [Int:Int] {
var result : [Int:Int] = [:]

vocabulary.enumerated().forEach { (arg0) in
let (offset, element) = arg0
result[element.hash] = offset
}

return result
}

Why not hashValue instead of hash? turns out, on Linux, which this baby is going to run on, the values are more stable with the latter rather than the former, according to my tests.

The data we will work on therefore is:

struct TextBatch {
let original: [String]
let vocabulary: [String]
let indexHelper: [Int:Int]
let features : Tensor<Float> // 3 words
let labels : Tensor<Int32> // followed by 1 word
}

We need a way to initialize that struct, and a couple of helper functions to extract some random samples to train our model on, and we're good to go:

extension TextBatch {
public init(from: [String]) {
let h = makeHelper(v)
var f : [[Float]] = []
var l : [Int32] = []
for i in 0..<(from.count-3) {
if let w1 = h[from[i].hash],
let w2 = h[from[i+1].hash],
let w3 = h[from[i+2].hash],
let w4 = h[from[i+3].hash] {
f.append([Float(w1), Float(w2), Float(w3)])
l.append(Int32(w4))
}
}

let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 }) let labelsT = Tensor<Int32>(l) self.init( original: from, vocabulary: v, indexHelper: h, features: featuresT, labels: labelsT ) } func randomSample(of size: Int) -> (features: Tensor<Float>, labels: Tensor<Int32>) { var f : [[Float]] = [] var l : [Int32] = [] for i in 0..<(original.count-3) { if let w1 = indexHelper[original[i].hash], let w2 = indexHelper[original[i+1].hash], let w3 = indexHelper[original[i+2].hash], let w4 = indexHelper[original[i+3].hash] { f.append([Float(w1), Float(w2), Float(w3)]) l.append(Int32(w4)) } } var rf : [[Float]] = [] var rl : [Int32] = [] if size >= l.count || size <= 0 { let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap {$0 })
let labelsT = Tensor<Int32>(l)
return (featuresT, labelsT)
}
let idx = Int.random(in: 0..<l.count)
rf.append(f[idx])
rl.append(l[idx])
}
}

let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 }) let labelsT = Tensor<Int32>(l) return (featuresT, labelsT) } func randomSample(splits: Int) -> [(features: Tensor<Float>, labels: Tensor<Int32>)] { var res = [(features: Tensor<Float>, labels: Tensor<Int32>)]() var alreadyPicked = Set<Int>() let size = Int(floor(Double(original.count)/Double(splits))) var f : [[Float]] = [] var l : [Int32] = [] for i in 0..<(original.count-3) { if let w1 = indexHelper[original[i].hash], let w2 = indexHelper[original[i+1].hash], let w3 = indexHelper[original[i+2].hash], let w4 = indexHelper[original[i+3].hash] { f.append([Float(w1), Float(w2), Float(w3)]) l.append(Int32(w4)) } } for part in 1...splits { var rf : [[Float]] = [] var rl : [Int32] = [] if size >= l.count || size <= 0 { let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap {$0 })
let labelsT = Tensor<Int32>(l)
return [(featuresT, labelsT)]
}
let idx = Int.random(in: 0..<l.count)
rf.append(f[idx])
rl.append(l[idx])
}
}

let featuresT = Tensor<Float>(shape: [f.count, 3], scalars: f.flatMap { $0 }) let labelsT = Tensor<Int32>(l) res.append((featuresT,labelsT)) } return res } } In the next part, we will see how to set the model up, and train it. From this list, the gist is that most languages can't process 9999999999999999.0 - 9999999999999998.0 Why do they output 2 when it should be 1? I bet most people who've never done any formal CS (a.k.a maths and information theory) are super surprised. Before you read the rest, ask yourself this: if all you have are zeroes and ones, how do you handle infinity? If we fire up an interpreter that outputs the value when it's typed (like the Swift REPL), we have the beginning of an explanation: Welcome to Apple Swift version 4.2.1 (swiftlang-1000.11.42 clang-1000.11.45.1). Type :help for assistance. 1> 9999999999999999.0 - 9999999999999998.0$R0: Double = 2
2> let a = 9999999999999999.0
a: Double = 10000000000000000
3> let b = 9999999999999998.0
b: Double = 9999999999999998
4> a-b
$R1: Double = 2 Whew, it's not that the languages can't handle a simple substraction, it's just that a is typed as 9999999999999999 but stored as 10000000000000000. If we used integers, we'd have:  5> 9999999999999999 - 9999999999999998$R2: Int = 1

Are the decimal numbers broken? 😱

A detour through number representations

Let's  look at a byte. This is the fundamental unit of data in a computer and  is made of 8 bits, all of which can be 0 or 1. It ranges from 00000000 to 11111111 ( 0x00 to 0xff in hexadecimal, 0 to 255 in decimal, homework as to why and how it works like that due by monday).

Put like that, I hope it's obvious that the question "yes, but how do I represent the integer 999 on a byte?" is meaningless. You can decide that 00000000 means 990 and count up from there, or you can associate arbitrary values to the 256 possible combinations and make 999 be one of them, but you can't have both the 0 - 255 range and 999. You have a finite number of possible values and that's it.

Of  course, that's on 8 bits (hence the 256 color palette on old games). On  16, 32, 64 or bigger width memory blocks, you can store up to 2ⁿ different values, and that's it.

The problem with decimals

While  it's relatively easy to grasp the concept of infinity by looking at  "how high can I count?", it's less intuitive to notice that there is the same amount of numbers between 0 and 1 as there are integers.

So,  if we have a finite number of possible values, how do we decide which  ones make the cut when talking decimal parts? The smallest? The most  common? Again, as a stupid example, on 8 bits:

• maybe we need 0.01 ... 0.99 because we're doing accounting stuff
• maybe we need 0.015, 0.025,..., 0.995 for rounding reasons
• We'll just encode the numeric part on 8 bits ( 0 - 255 ), and the decimal part as above

But that's already  99+99 values taken up. That leaves us 57 possible values for the rest of infinity. And that's not even mentionning the totally arbitrary nature of the  selection. This way of representing numbers is historically the first  one and is called "fixed" representation. There are many ways of  choosing how the decimal part behaves and a lot of headache when coding  how the simple operations work, not to mention the complex ones like  square roots and powers and logs.

Floats (IEEE 754)

To  make it simple for chips that perform the actual calculations, floating  point numbers (that's their name) have been defined using two  parameters:

• an integer n
• a power (of base b) p

Such that we can have n x bᵖ, for instance 15.3865 is 153863 x 10^(-4). The question is, how many bits can we use for the n and how many for the p.

The standard is to use 1 bit for the sign (+ or -), 23 bits for n, 8 for p, which use 32 bits total (we like powers of two), and using base 2, and n is actually 1.n.  That gives us a range of ~8 million values, and powers of 2 from -126  to +127 due to some special cases like infinity and NotANumber (NaN).

$$(-1~or~1)(2^{[-126...127]})(1.[one~of~the~8~million~values])$$

In theory, we have numbers from -10⁴⁵ to 1038 roughly, but some numbers can't be represented in that form. For  instance, if we look at the largest number smaller than 1, it's 0.9999999404. Anything between that and 1 has to be rounded. Again, infinity can't be represented by a finite number of bits.

Doubles

The  floats allow for "easy" calculus (by the computer at least) and are  "good enough" with a precision of 7.2 decimal places on average. So when  we needed more precision, someone said "hey, let's use 64 bits instead  of 32!". The only thing that changes is that n now uses 52 bits and p 11 bits.

Coincidentally, double has more a meaning of double size than double precision, even though the number of decimal places does jump to 15.9 on average.

We  still have 2³² more values to play with, and that does fill some  annoying gaps in the infinity, but not all. Famously (and annoyingly),  0.1 doesn't work in any precision size because of the base 2. In 32 bits  float, it's stored as 0.100000001490116119384765625, like this:

(1)(2⁻⁴)(1.600000023841858)

Conversely, after double size (aka doubles), we have quadruple size (aka quads), with 15 and 112 bits, for a total of 128 bits.

Back to our problem

Our value is 9999999999999999.0. The closest possible value encodable in double size floating point is actually 10000000000000000, which should now make some kind of sense. It is confirmed by Swift when separating the two sides of the calculus, too:

2> let a = 9999999999999999.0
a: Double = 10000000000000000

Our  big brain so good at maths knows that there is a difference between  these two values, and so does the computer. It's just that using  doubles, it can't store it. Using floats, a will be rounded to 10000000272564224 which isn't exactly better. Quads aren't used regularly yet, so no luck there.

It's  funny because this is an operation that we puny humans can do very  easily, even those humans who say they suck at maths, and yet those  touted computers with their billions of math operations per second can't  work it out. Fair enough.

The kicker is, there is a litteral infinity of examples such as this one, because trying to represent infinity in a finite number of digits is impossible.