[Dev Diaries] Force Directed Layout

Sometimes, the worlds of computer graphics and physics collide in the most beautiful ways.

Everybody loves UICollectionViews, but after a while, those grids tend to be a bit restrictive. The "flow" layout is fine and all, but what if we wanted to have the cells arranged in an "organic" manner?

Let's talk about forces

In nature, almost everything is about forces. Gravity is probably the one everyone is most familiar with: it accelerates things straight down at roughly 9.8 m/s/s. What that means is that your object starts at 0 m/s, after a second it moves at 9.8 m/s, after another second at 19.6 m/s, etc. When an object is on the ground, gravity still applies, but it's compensated by the floor pushing up with an equal and opposite force.

Good ole Newton says that what matters is the sum of all the forces that apply to an object. From that you can deduce the acceleration the object is subjected to (in m/s/s), and by derivation, the speed of an object at a given time. Speed is the distance moved by unit of time, so from the speed, we can determine the position of an object on which the force applies.

F = ma \implies a = F/m

For the rest of this post, we will assume everything has a mass of 1, because it simplifies the operations a bit.
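
To make that concrete, here is a minimal sketch of one integration step (this isn't code from the layout, just an illustration, with a made-up step function and a fixed time step dt):

import CoreGraphics

// One step of naive (Euler) integration. With m = 1, acceleration == force.
func step(position: CGPoint, velocity: CGVector, force: CGVector,
          dt: CGFloat) -> (position: CGPoint, velocity: CGVector) {
    // a = F/m, and m = 1, so a = F
    let v = CGVector(dx: velocity.dx + force.dx * dt,
                     dy: velocity.dy + force.dy * dt)
    let p = CGPoint(x: position.x + v.dx * dt,
                    y: position.y + v.dy * dt)
    return (p, v)
}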

Example: gravity on a slope

In terms of code, I decided to have a struct representing forces in a way that will give me the operations I will need later:

  • adding two forces together
  • getting the angle of the force in a plane
  • getting the magnitude (length of the vector)

import UIKit

struct Force {
    let dx: CGFloat
    let dy: CGFloat
    static func +(lhs: Force, rhs: Force) -> Force {
        return Force(lhs.dx+rhs.dx,lhs.dy+rhs.dy)
    }
    
    static func /(lhs: Force, rhs: CGFloat) -> Force {
        return Force(lhs.dx/rhs, lhs.dy/rhs)
    }
    
    init(_ x: CGFloat, _ y: CGFloat) {
        dx = x
        dy = y
    }
    
    var angle : CGFloat {
        return CGFloat(atan2(Double(dy), Double(dx)))
    }
    
    var magnitude : CGFloat {
        return sqrt(dx*dx+dy*dy)
    }
}
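
A quick sanity check of the struct, with a 3-4-5 triangle (values picked just for illustration):

let f = Force(3, 4) + Force(0, 0)
f.magnitude // 5.0
f.angle     // ≈ 0.927 rad, i.e. atan2(4, 3)
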
Attraction

We'll need our cells to be attracted to a certain point in the plane. Ideally, if there's only one cell, it'll sit at that point, and if there are two, they will be as close as possible to there, while not overlapping (more on that later).

Gravity has flair, but its inverse-square nature is a bit too powerful. For force directed graphs (like the ones dot produces), we tend to go with its linear cousin: the spring. It is also easier to wrap your head around: you attach all the cells to the center with a spring, and they will try to get back there.

Spring force is expressed with Hooke's law

F = -k•∂
  • k is the stiffness of the spring
  • ∂ is how far we are from the "ideal" distance to the other end of the spring

You can see that compressing a spring produces an "extension" force (it pushes the two ends apart because ∂ is negative), and stretching it produces the opposite "attraction" force (when ∂ is positive).
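
In one dimension, Hooke's law is a one-liner. A minimal sketch (the function and parameter names are mine, and the layout below effectively uses a rest length of 0):

// F = -k•∂, with ∂ = current length minus rest length
func springForce(stiffness k: CGFloat, length: CGFloat, restLength: CGFloat) -> CGFloat {
    let delta = length - restLength
    return -k * delta // compressed (∂ < 0) pushes apart, stretched (∂ > 0) pulls back
}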

Repulsion

We could use springs for repulsion as well, but it'd mean "wiring" every cell to its nearest neighbours and always keeping track of which cells are attached to which other cells. That's too complex to manage. Fortunately, we have another physics law that repels and looks quite natural: magnetism.

Magnets attract, obviously, but they also repel quite strongly. And because we want that repulsion to be extremely high when the objects are very close and negligible when they are far away from each other, this time we kind of need an inverse-square law. Coulomb's law to the rescue!

F = k \frac{Q_1 Q_2}{r^2}
  • k isn't that important for us, but its value is 9×10⁹ N⋅m²⋅C⁻²
  • The two Qs are the electric charges of the objects
  • r is the distance between them

You can see that objects with the same electric charge repel each other, while objects with opposite charges attract each other.

So why didn't we use this for our attraction law? The big problem with this kind of force is that it's computationally expensive: every cell affects every other cell, so calculating all the resulting forces takes time proportional to the square of the number of cells. Spring mechanics are a lot simpler to manage for attraction, but we won't escape that complexity for repulsion.

Sidenote: what's up with that k value? 😳
That's physics for you, I'm afraid. We need to match the formula to the real world, and that's the number that fits what we observe.
For our purposes, we have made-up identical "charges" on our cells anyway, so we'd have k(Q²/r²). Might as well fold the constant into the charge and go with Q³/r².
Of course, that value of Q has no "physical" meaning, it's just convenient.

Let's create a universe

So, we have decided on our forces; now we need something to apply them to. Meet Node, the structure that will handle the physics side of our layout. It needs to hold three pieces of information:

  • the position the cell is at: x and y
  • some way to differentiate it from other cells (because of the repulsion bit): a uuid
  • a way to link it to the data source of the UICollectionView: an IndexPath

Here's the boring first part:

struct Node : Equatable, Hashable {
    var x : CGFloat = 0
    var y : CGFloat = 0
    let uuid = UUID() // for identification purposes
    var indexPath : IndexPath
    
    init(x px: CGFloat = 0, y py: CGFloat = 0, for idx: IndexPath) {
        x = px
        y = py
        indexPath = idx
    }
    
    static func == (lhs: Node, rhs: Node) -> Bool {
        return lhs.uuid == rhs.uuid
    }
    // ...
}

The first interesting bit is the attraction function. It is extremely verbose for educational purposes; you can compact and optimize it quite a bit.

func attraction(center: CGPoint, stiffness: CGFloat) -> Force {
    // Hooke's Law: F = -k•∂ (∂ being the "ideal" distance minus the actual distance)
    let dx = x - center.x
    let dy = y - center.y
    let angle = CGFloat(atan2(Double(dy), Double(dx)))
    let delta = sqrt(dx*dx+dy*dy)
    let intensity = stiffness * delta
    let ix = abs(intensity * cos(angle))
    let fx : CGFloat
    if center.x > x { // positive force to the right
        fx = ix
    } else {
        fx = -ix
    }
    let iy = abs(intensity * sin(angle))
    let fy : CGFloat
    if center.y > y { // positive force to the bottom
        fy = iy
    } else {
        fy = -iy
    }
    return Force(fx,fy)
}

The really big one, although not that complicated from a code point of view, is the repulsion force. It has to take into account every single other node. As with the attraction, it is verbose for the sake of clarity, and can obviously be optimized. One such optimization is to return early with a zero force if the distance is "too great", but that's beyond the scope of what we are trying to do here.

func repulsion(others: [Node], charge: CGFloat) -> Force {
    var totalForce = Force(0,0)
    for n in others.filter({ (on) -> Bool in
        on.uuid != self.uuid // just in case
    }) {
        // Coulomb's Law: F = k(Q1•Q2/r²)
        // Since we're dealing with arbitrary "charges" here, we simplify to F = Q³/r²
        // We want repulsion (Q1 = Q2) and we don't want to deal with big numbers, so that works
        var dx = x - n.x
        var dy = y - n.y
        if dx == 0 && dy == 0 { // wiggle a bit
            let room : CGFloat = 0.05
            dx += CGFloat.random(in: -room...room)
            dy += CGFloat.random(in: -room...room)
        }
        let angle = CGFloat(atan2(Double(dy), Double(dx)))
        let delta = max(0.000001,sqrt(dx*dx+dy*dy)) // do NOT divide by zero you fool
        let intensity = pow(charge,3)/(delta*delta)
        let ix = abs(intensity * cos(angle))
        let fx : CGFloat
        if n.x > x { // the other node is to the right: push left
            fx = -ix
        } else {
            fx = ix
        }
        let fy : CGFloat
        let iy = abs(intensity * sin(angle))
        if n.y > y { // the other node is below: push up
            fy = -iy
        } else {
            fy = iy
        }
        
        totalForce = totalForce + Force(fx,fy)
    }
    
    return totalForce
}

Note the check for divide-by-zero shenanigans (in case two cells overlap); it's covered in two different ways: we wiggle the cells a bit to avoid them being exactly on top of each other, and we don't divide by zero if the wiggling failed. The resulting force is the sum of being repelled by every other cell.

Putting it all together, we'll just need to sum attraction and repulsion. For some reason, I have also included the possibility of adding a force to a node.

func globalForce(center: CGPoint, otherNodes: [Node], stiffness: CGFloat, charge: CGFloat) -> Force {
    let a = attraction(center: center, stiffness: stiffness)
    let r = repulsion(others: otherNodes, charge: charge)
    return a + r
}

static func +(lhs: Node, rhs: Force) -> Node {
    return Node(x: lhs.x+rhs.dx, y: lhs.y+rhs.dy, for: lhs.indexPath)
}
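
Putting the pieces together, one simulation step for a single node could look like this (a sketch using the names above; node would have to be a var):

// one iteration for one node; with m = 1 and a time step of 1,
// the force doubles as a displacement
let force = node.globalForce(center: center,
                             otherNodes: allNodes,
                             stiffness: springStiffness,
                             charge: electricCharge)
node = node + force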

Because we have to compute the global force (which depends on every node) for every node, we end up with square-order complexity. But since the force calculation is independent for each node, we can spread the calculations over as many threads as we can to claw a bit of performance back. Thankfully, I have a Task system handy.

Here is the function that takes all the nodes and computes where they should go next. springStiffness and electricCharge are instance variables of the layout. Its output is a pair composed of the new node positions and the total movement of all the nodes, for a reason that will become clear later.

func computeNewPositions(center: CGPoint, nodes: [Node]) -> (nodes: [Node], movement: CGFloat) {
    // the total movement output lets the caller decide when the layout has settled
    var totalMovement : CGFloat = 0
    var newNodes : [Node] = []
    var computeTasks : [Task] = []
    let lock = NSLock()
    for n in nodes {
        let t = Task() {
            let f = n.globalForce(center: center, otherNodes: nodes, stiffness: self.springStiffness, charge: self.electricCharge)
            let nn = n + f
            lock.lock()
            newNodes.append(nn)
            totalMovement += f.magnitude
            lock.unlock()
        }
        computeTasks.append(t)
    }
    let waitSem = DispatchSemaphore(value: 0)
    let compute = Task.group(computeTasks)
    compute.perform { (outcome) in
        waitSem.signal()
    }
    waitSem.wait()
    
    return (newNodes, totalMovement)
}
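
If you don't have such a Task system handy, DispatchQueue.concurrentPerform is a stock GCD alternative that blocks until every iteration has run. A sketch of the same computation under that assumption:

func computeNewPositionsGCD(center: CGPoint, nodes: [Node]) -> (nodes: [Node], movement: CGFloat) {
    var newNodes: [Node] = []
    var totalMovement: CGFloat = 0
    let lock = NSLock()
    // concurrentPerform parallelizes the iterations and returns when all are done
    DispatchQueue.concurrentPerform(iterations: nodes.count) { i in
        let f = nodes[i].globalForce(center: center, otherNodes: nodes,
                                     stiffness: self.springStiffness, charge: self.electricCharge)
        let nn = nodes[i] + f
        lock.lock()
        newNodes.append(nn)
        totalMovement += f.magnitude
        lock.unlock()
    }
    return (newNodes, totalMovement)
}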

Now's the time to implement a few of the overrides for our custom layout. Let's start with the obvious ones:

protocol ForceDirectedLayoutDelegate {
    func layoutDidFinish()
}
class ForceDirectedLayout : UICollectionViewLayout {
    // in case you want to know
    var delegate : ForceDirectedLayoutDelegate? = nil
    
    // maths
    var springStiffness : CGFloat = 0.02 // max(width,height)/1000 seems ok
    var electricCharge : CGFloat = 10 // max(width,height)/2 seems ok
    var cellSize : CGSize = CGSize(width: 20,height: 20) // obviously too small for real use

    // ...

    override func shouldInvalidateLayout(forBoundsChange newBounds: CGRect) -> Bool {
        return true // because the center changes...
    }
    
    override var collectionViewContentSize: CGSize {
        var totalRect = CGRect(origin: CGPoint.zero, size: self.collectionView?.frame.size ?? CGSize.zero)
        for cachedA in cachedAttributes.values {
            totalRect = totalRect.union(
                CGRect(x: cachedA.center.x-cachedA.size.width*0.5,
                       y: cachedA.center.y-cachedA.size.height*0.5,
                       width: cachedA.size.width, height: cachedA.size.height))
        }
        return totalRect.size
    }
}

The eagle-eyed among you will have noticed a cachedAttributes instance variable. For some reason, UICollectionView itself doesn't cache the current position and size of a cell: if you call collectionView.layoutAttributesForItem(at: idx), it gives you a nil value, forcing you to regenerate the attributes at every stage (and there are many). I have decided to cache the cell attributes internally as a backup, and if Apple decides at some point to return a non-nil value, I will use it. But this forces me to make sure I detect cell deletion: if a cell is removed, I need to remove its cached attributes. You can skip that deletion part if you want, or code it more efficiently; it's just here for completeness' sake.

/// Unfortunately, collectionView.layoutAttributesForItem doesn't seem to be caching the previous attributes
/// We do it ourselves as a backup
fileprivate var cachedAttributes = [IndexPath:UICollectionViewLayoutAttributes]()

/// Since we cache the data ourselves, we need to cleanup if elements are removed
override func prepare() {
    super.prepare()
    
    guard let collection = self.collectionView else {
        cachedAttributes.removeAll()
        return
    }
    let sectionCount = collection.dataSource?.numberOfSections?(in: collection) ?? 1
    var rowCounts = [Int:Int]()
    for s in 0..<sectionCount {
        rowCounts[s] = collection.dataSource?.collectionView(collection, numberOfItemsInSection: s) ?? 0
    }
    for removed in cachedAttributes.keys.filter({ (idx) -> Bool in
        return idx.section >= sectionCount || idx.row >= (rowCounts[idx.section] ?? 0) // hence the dictionary, no index out of bounds
    }) {
        cachedAttributes.removeValue(forKey: removed)
    }
}

And finally, the override that makes everything work: the layout function itself. Unfortunately, if a node changes position, it potentially affects every other node, so we can't update only a partial list. We need to do them all, every single time.

But wait. We have been talking about forces for 2000+ words here. Forces can help you figure out positions, but they aren't positions themselves. Yep, that's right: we have an animated layout. From the current positions of the cells, you calculate where the cells should go next. So... when does the animation end?

That's where the movement output of the computation function comes in. When we reach an equilibrium (like the floor and gravity exactly compensating each other), the movement drops to zero. Of course, in the real world, it's a never-ending process, but we can just decide that the layout stops animating as soon as the total movement of all the cells falls below a certain threshold. I have decided, arbitrarily, that a third of a pixel is imperceptible enough, even on a 3x retina screen, but if you want to stop the animation sooner, feel free to raise that a bit.

So, the mechanics will be as follows:

  • we take the current positions of all the cells
  • we compute their new positions
  • if they didn't move much, we stop
  • if they are still moving quite a bit, we instruct the collection view to keep going by invalidating the layout again
// unfortunately, every node affects every other node, so we can't do partial updates
override func layoutAttributesForElements(in rect: CGRect) -> [UICollectionViewLayoutAttributes]? {
    var attributes : [UICollectionViewLayoutAttributes] = []
    var nodes = [Node]()
    
    if let collectionView = self.collectionView {
        for i in 0..<(collectionView.dataSource?.collectionView(collectionView, numberOfItemsInSection: 0) ?? 0) {
            let idx = IndexPath(row: i, section: 0)
            let currentAttributes : UICollectionViewLayoutAttributes
            if let ca = (collectionView.layoutAttributesForItem(at: idx) ?? cachedAttributes[idx]) {
                currentAttributes = ca
            } else {
                // randomize start positions, just for funsies
                currentAttributes = UICollectionViewLayoutAttributes(forCellWith: idx)
                currentAttributes.center = CGPoint(
                    x: CGFloat.random(in: 0...self.collectionViewContentSize.width),
                    y: CGFloat.random(in: 0...self.collectionViewContentSize.height)
                )
                currentAttributes.size = cellSize
                cachedAttributes[idx] = currentAttributes
            }
            attributes.append(currentAttributes)
            nodes.append(Node(x: currentAttributes.center.x, y: currentAttributes.center.y, for: idx))
        }
        
        let center = self.collectionView?.absoluteCenter ?? CGPoint.zero
        let nextIteration : (nodes: [Node], movement: CGFloat)
        nextIteration = computeNewPositions(
            center: center,
            nodes: nodes)
        
        for n in nextIteration.nodes {
            if let attrsIdx = attributes.firstIndex(where: { $0.indexPath == n.indexPath }) {
                let attrs = attributes[attrsIdx]
                attrs.center = CGPoint(x: n.x, y: n.y)
                attributes[attrsIdx] = attrs
            }
        }
        
        // if it's still moving, keep going
        if nextIteration.movement > 0.3 { // subpixel animation
            DispatchQueue.main.asyncAfter(deadline: DispatchTime.now() + 0.05) {
                self.invalidateLayout()
            }
        } else {
            DispatchQueue.main.async {
                self.delegate?.layoutDidFinish()
            }
        }
        
    }
    
    return attributes
}
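
To try it out, the layout plugs in like any other custom layout. A sketch (inside a view controller; the frame, sizes, and data source are whatever your app uses):

let layout = ForceDirectedLayout()
layout.cellSize = CGSize(width: 44, height: 44)
let collectionView = UICollectionView(frame: view.bounds,
                                      collectionViewLayout: layout)
collectionView.dataSource = self // your usual data source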

Please note that I position new cells randomly, so they could appear in the middle of the other cells. Depending on the effect you want to go for, you can have them come from a specific point, or randomly from the borders, etc.

And that's it! I have made a quick example that adds round cells progressively to a collection view, and here's the result:

You can obviously get the whole thing on GitHub, prepared for use as a Swift package for the upcoming Xcode version.

Autonomous Cars? Not yet...

From Designer News:

Perhaps the biggest technical obstacle, however, is converting human understanding into robotic intelligence. The intelligence that enables human beings to drive a car is largely taken for granted, and replicating it is proving to be a bigger chore than engineers foresaw.

Color me surprised... 🙄

Two things here:

  • "converting human understanding into robotic intelligence" requires knowing what understanding and intelligence are
  • human brains are exceptionally good at filtering out the useless (you don't have to think about chewing, until you choke on something), whereas, for now, computers can only deal with the explicit

Maths != CS

If you read this blog, you might have seen quite a few articles about things that look like maths. This is my classical training seeping in. But, and that's a big one, CS being born out of mathematics doesn't mean that in order to be good at computers, you need to be Ramanujan.

I guess you can say that technically most of CS runs on mathematics, but it's the kind of maths that mathematicians don't like: the kind where you can't use one of the most powerful tools in their arsenal, infinity.

Go back to your last somewhat high-level class involving maths. It's full of little things like "and when you go to infinity" or "when it tends to infinity" or the ever-present "..." that means "carry on to infinity".

When you do a maths proof, either you need infinity (or an infinite process), or you're dealing with a not-that-interesting problem.

The infinity issue

For us CS majors, infinity is a tricky concept. We like it in general, but we hate it when we have to deal with it. The very first case of "bad" infinity every coder faces, earlier than they'd like, is the so-called "infinite loop". Of course, it's not really infinite, as it is bound to the program that's stuck in it, which probably won't run forever, but everyone agrees it's bad.

We don't do infinity. We have finite amounts of RAM, disk space, time, etc. Even the numbers we manipulate tend to have finite minima and maxima. Of course, that realization often comes at the cost of a major catastrophe, when coders forget that infinity is a thing we can't actually do, but hey...

The main thing you have to take away from all that stuff should be that if you manipulate something that grows all the time, or shrinks all the time, or splits all the time, you're in trouble. Because "all the time" tends to infinity.

So, what parts of maths should you be looking at to improve your CS?

The finite arithmetic stuff

Modulo arithmetic (or modular arithmetic) is a way of looking at numbers without infinity. In past articles, I've illustrated (hopefully successfully) that the integer stuff that we do in our programs is always problematic in some way. That is, it's problematic if you don't take the modulus into account:

255_{8 bits} + 1 = 0

A lot of exploits use that bug: if your "admin ID" is 0 and you have UInt16 IDs for your users, then the 65536th user will be admin as well. Maybe you don't plan on having that many users... Buuuuuuuuuuuuuuut...
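
In Swift, the wrap-around has to be asked for explicitly with the overflow operators. A quick illustration of that ID scenario (made-up names, just for the example):

let lastUserID: UInt16 = 65535
let nextUserID = lastUserID &+ 1 // &+ wraps around instead of trapping
print(nextUserID)                // 0: welcome aboard, new "admin"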

Complexity

Ah, the BigO notation. Even if you've never had to calculate it by hand (I have...), you've seen it in algorithm discussions.

Complexity is about evaluating the resources a program will take. Because we live in a world obsessed with speed, most of you think that complexity is about the speed efficiency of an algorithm. And, fair point, more often than not, people will advance a nice BigO notation measuring the rough estimate of the number of operations your algorithm will take:

Dijkstra \, has \, a \, complexity \, of \, O(|V|^{2})

That means that the number of steps for Dijkstra's algorithm is probably going to be in the order of magnitude of the square of the number of vertices.

That's an interesting metric, because it will tell us roughly what happens in a growth scenario:

  • I write my Dijkstra algorithm for my super duper GPS app (I really should use more modern algorithms, but that's the one I know)
  • Being a good developer, I test it on a fairly large case, let's say 1000 crossroads
  • I time the result to 1s. Cool, good enough
  • But with a larger case, say the number of crossroads in a county, the time will grow quadratically:
time(x) = (x/1000)^{2} s
time(1000) = 1 s
time(10000) = 100 s (almost 2 min)
time(100000) = 10000 s (almost 3 hours)

So, okay, maybe you need to optimize your algorithm, or redo it.

But one aspect that tends to be overlooked is that BigO notation doesn't only measure time. Complexity also deals with space. Most data is roughly linear in terms of space complexity: you add one more item to your list, it takes up one more "chunk of space", right?

Unless your data happens to have interdependencies. Take the crossroad example above: adding one vertex (node, choice, etc.) actually adds at least as many edges, because otherwise your crossroad isn't connected to, you know, roads... So the space your "map" takes isn't quite linear; it takes a multiple of your node count to fit. And as soon as any single bit of data may, in theory, be connected to every other bit of data, the space it takes grows quadratically.
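
A quick back-of-the-envelope check of that growth; n(n-1)/2 is the maximum edge count for n fully connected nodes:

for n in [10, 100, 1000] {
    print("\(n) nodes -> up to \(n * (n - 1) / 2) edges")
}
// 10 nodes -> up to 45 edges
// 100 nodes -> up to 4950 edges
// 1000 nodes -> up to 499500 edges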

The recursion problem

This leads me to a problem that is ours and not mathematicians'. Because of the way computers work, if you have a recursive function, say

func factorial(_ x: Int) -> Int {
    if x <= 0 { return 1 } // why not
    return x * factorial(x - 1)
}

Every time factorial is called, it "saves" the state it was in to a special bit of memory, so it can come back to it later. Here, it saves the x value before doing the recursive call, then restores it to be able to do the multiplication.

Whatever memory that state takes (it's more than just x), this simple function will take x times that memory to perform. And since we don't have infinite memory, at some point we will have exhausted it, and our program, which is all neat and perfectly reasonable, will crash... if the modulus kink doesn't get us first.
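
The usual way out is to make the loop explicit, so that the running state lives in a single variable instead of one stack frame per call. A minimal sketch, keeping the same edge-case convention:

func factorialIterative(_ x: Int) -> Int {
    var result = 1
    var i = x
    while i > 1 {
        result *= i // still overflows past 20!, but there's no stack to exhaust
        i -= 1
    }
    return result
}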

Application to the modern Web stuff

We are at a point where we will severely need to optimize our code to use less energy (looking at you, bitcoin), but that's only one side of the problem. Because most stuff goes through the Internet nowadays, and massive calculation infrastructures are reasonably cheap, we have largely forgotten about the need to optimize for space. You can read, everywhere on the web, old farts like me complaining about page sizes (for reference, a typical page on this blog weighs about half a meg, with only half of that transmitted on the pipe thanks to compression; the home page of GitHub weighs a whopping 5MB, 4.3MB compressed).

Optimizing page loads means optimizing transfer times. Transfer time is a function of bandwidth and size.

So, your brand new app talks to your server using JSON files, huh? Did you stop to think that JSON grows with the number of characters it contains?

Even a big number such as 981282348613667519 takes only 8 bytes in memory. But in your favorite text format (JSON, XML, and the like), it takes 18 bytes, more than twice as much. So, if you want to optimize your app's speed, a low-hanging fruit might be to look at ways to transfer as little as possible.
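
You can verify both figures from a playground (the 18 bytes being the UTF-8 length of the digits, before any JSON punctuation):

let big: Int64 = 981282348613667519
print(MemoryLayout<Int64>.size) // 8 bytes in memory
print(String(big).utf8.count)   // 18 bytes as text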

Other areas of mathematical interest for CS people

Geometry, of course. As long as you put pixels on a screen and have a coordinate system in your user inputs (touch, click, etc), you need your trigonometry.

Dealing with animations and vector graphics means having a basic grasp of Bezier curves, and a tiiiiiiiny teeeeeeeeny bit of taste for derivatives and tangents.

If, for some reason, you need to deal with audio, or any other kind of "waveform", you'll get by a lot better by knowing the basics of Fourier transforms, which, I'll freely admit, hurt my brain really badly when I was learning about them. But you only need to know what they do, not how they work, for most uses we developers have to deal with. Don't let the formula scare you: it's just numbers, and no one asks you to prove it works on a mathematical level. If someone is asking that of you, then I hope maths is your thing.

Any kind of modern compression leans heavily on difference stuff, and yes, it helps to have some knowledge in differential mathematics (a.k.a. looking at the evolution of the rate of change). Again, this is a biiiiiiiiiiig and scary field of mathematics to most, but CS can use a fraction of its core concepts to make things work better for us.

If you do 3D, then you live and breathe topology, as well as quaternions, and projection, and there's little I can do for you. Of course, projection stuff is kind of a database thing as well, so if you're doing heavy-duty SQL stuff, you should probably read up on that too.

Machine-learning, and all the other dataset manipulation techniques out there, rely mostly on statistics and algebra.

You said maths != CS

I did. In all the cases above, the simple fact that we take infinity away makes it all a lot simpler in some cases, and a bit more complicated in others.

Knowing about the maths that idealize our problem gives us an interesting perspective to work with, but the fact we don't have the luxury of "etc ad infinitum" makes most of the maths we take inspiration from too broad for us to use directly. We only need special cases, we only use a subset, we only take a specific branch of a field.

It's kind of like our instinct telling us that if we launch a ball straight up in the air, and there's some wind, it might not land exactly where it started from. If we have general knowledge of certain fields of mathematics, we can see the path we expect the ball to take more accurately. We just don't need the whole fields of fluid dynamics and Newtonian physics to compensate for it, just their general trends.

OCaml ❤️

From this page:

This project allows to run OCaml programs on PIC microcontrollers.

What a blast from the past. Emmanuel Chailloux (credited as the "meta-supervisor") was my advisor on my CS master's degree, back when I did "industrial" coding (C, Java and Objective-C) during nights, and "fun" research coding in OCaml during days.

Over the past years (decades, who am I kidding?), I've seen the language go from vaguely obscure academic torture device for CS students to the de facto standard for certain types of development (including Facebook's Messenger), and I use it as a counterexample to my own students' "but you're teaching us something we'll never use in real life" kind of objections.

And now, people may start thinking it's a good tool to assail one of the last bastions of "but code should look like what the machine executes": PIC programming. So cool.

More programming languages, more ways to look at a problem, more varied solutions.

A Little Bit Of Math #1

Aaaaaah the math tricks! When you know 'em, you love 'em, and when you don't, you pay for extra computing resources.

Today's math trick has to do with averages. Averages are easy, right? You take all the numbers in the list, you sum them, and then you divide by the count... Pft, that's no trick!

A_{list} = \frac{\sum_{i=0}^{list.size - 1} list[i]}{list.size}

Except... there is a little something called overflow. Let's take the case of integers, and let's assume we're working with UInt8 values. What's the average of [233,212]? It is 222.5, which gets rounded to 223. But our good ol' summation doesn't work:

 1> let v1 : UInt8 = 233
 2> let v2 : UInt8 = 212
 3> let sum = v1 + v2

EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Depending on who you ask, 233+212 either wraps around or causes an error. 255 is the maximum value, after which there is nothing. Either way, we wouldn't be happy with the wrap-around either: 233 &+ 212 = 189, which gives an average of 94 when divided by 2.
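
With the wrapping operator, you can watch it happen:

let v1: UInt8 = 233
let v2: UInt8 = 212
let sum = v1 &+ v2 // wraps: 445 - 256 = 189
let avg = sum / 2  // 94, nowhere near the expected 223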

Musical Interlude: "Zino, dear, I don't care, I can use BIGGER numbers!"

Yes, you can, up to a point. Most languages have a maximum integer width, and sure, you can probably find unbounded implementations of integers for your language. In Swift, you can check this out, it's really nice. BUT while it's technically possible to handle arbitrary precision, you start hitting all sorts of issues with storing that data ("I love using blobs in my database"), converting it for practical use ("My users can remember 200+ digit numbers easyyyyyyyyy"), etc. Plus, you generally don't want to replace every single Int use in your code with something coming from an external dependency, with all the headache that implies, just for the sake of type safety.

Enter the Maths (royal fanfare ♫♩🎺)

Let's start with a basic observation:

v_1 + v_2 = (\frac{v_1}{2} + \frac{v_2}{2}) * 2

If I divide by two, then multiply by two, I've done nothing. In the case of integers, it's not quite true, as the division rounds down, but for big numbers, it's not that bad. But what does that give us?

Well...

\frac{v_1}{2} + \frac{v_2}{2} \lt Int_{max}

The sum of the two halves will fit in an integer, because each is guaranteed to be no more than half the maximum. Right? Then we can multiply the result by 2 to get the sum, maybe. But it might overflow. Good thing we are trying to get the average, because we were about to divide by two, which cancels out the multiplication.

A_{(233,212)} = ((\frac{233}{2}+\frac{212}{2})*2)/2 = \frac{233}{2}+\frac{212}{2}

Musical Interlude: "Mind => Blown"

Sure, we kind of lose some precision: 233/2 gets truncated to 116 and 212/2 to 106, so the calculated average comes out as 222 instead of 222.5.
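
Side note: if that lost half matters to you, a classic refinement (not used in the rest of this post) adds back the bit that both halves dropped:

func mean(_ a: UInt8, _ b: UInt8) -> UInt8 {
    // a/2 + b/2 drops half a unit from each odd operand;
    // (a & b & 1) restores the whole unit lost when both are odd,
    // giving exactly (a + b) / 2, rounded down
    return a/2 + b/2 + (a & b & 1)
}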

Anyways... Onward and upward! What can we do with a big list of numbers? We could use the same trick and just divide wholesale. The major issue is that we would severely compound the rounding errors. Imagine we're still playing with UInt8 values and you have 200 of them: any of them divided by 200 would result in 0 or maybe 1. Your average wouldn't look very good.

Cue the Return of the Maths (royal fanfare ♫♩🎺)

{\sum_{i=0}^{list.size - 1} list[i]} = {\sum_{i=0}^{list.size - 2} list[i]} + list[list.size - 1]

(As in (x+y+z+t) = (x+y+z) + t)

  • Okay, and?
  • Let's divide by list.size, and we get the average
A_{list} = \frac{\sum_{i=0}^{list.size - 2} list[i] + list[list.size-1]}{list.size}

The left part of the numerator looks familiar; it's almost as if it were the average of the list minus its last element... 😬

All we would need to do is to divide by list.size - 1... But if we multiply and divide by the same thing... 🤔

\frac{(list.size-1)*\frac{\sum_{i=0}^{list.size - 2} list[i]}{list.size-1} + list[list.size-1]}{list.size}

Which is

A_{list} = \frac{(list.size-1)*A_{list-last} + list[list.size-1]}{list.size}

Musical Interlude: Smells like recursion

So... The code will basically look like this:

  • If the list is empty (because we're good programmers and handle edge cases), the result is 0
  • If the list contains one element, the average is easy
  • If the list contains two elements, we can use the divide by two trick, the rounding error shouldn't be that bad
  • If the list contains more elements, we do the average by aggregate, and hope the rounding errors will be somewhat contained.

Side note on the rounding errors: the bigger the divisor, the higher the potential rounding error. But by doing a rolling average, we get a rounding error that builds up as we go through the list rather than being bad at every step. It's not ideal, but it's still better.

So, let's set the stage up: I have a list of big numbers I want the average of.

2988139172152746883
4545331521850540616
5693938727954663282
5884889191787885217
3111881160526182838
8720326064806005009
8427311181199404053
7983003740783657027
2965909035096967706
1211883882534796072
5703029716464526164
8424273336993151821
774296368044414872
14130533330426236
2230589047337383318
8337015733785964014
9153431205551083918
3249272057022384528
8254667294021634003
6758234862357239854

They are all Int64 integers, the widest natively supported signed variant (Int128 has been coming since 2017). They come from a PostgreSQL database that stores big numbers for a very good reason I won't get into.

Now, if I plug these numbers into an unbounded calculator, the average should be 5221577691680052871.55 or so I'm told.

My recursive Swift function looks like this:

func sumMean(_ input: [Int64]) -> Int64 {
    if input.count == 0 { // uninteresting
        return 0
    }
    if input.count == 1 { // easy
        return input[0]
    }
    
    // general trick: divide by two (will introduce rounding errors)
    if input.count == 2 {
        let i1 = input[0] / 2
        let i2 = input[1] / 2
        let mean = (i1+i2) // (/2, then *2)
        return mean
    }
    
    let depth = Int64(input.count) - 1
    // rolling average formula
    let last = input.last!
    let rest = Array(input.dropLast()) // dropLast() returns a slice, so wrap it back into an array
    
    let restMean = sumMean(rest)
    // should be (depth * restMean + last) / (depth + 1), but overflow...
    let num = (restMean/2) + ((last/2)/depth)
    let res = (num / (depth+1)) * depth * 2
    return res
}

The reason why num and res exist is left as an exercise.

Here's the calling code and the output:

var numbers : [Int64] = [
    2988139172152746883,
    4545331521850540616,
    5693938727954663282,
    5884889191787885217,
    3111881160526182838,
    8720326064806005009,
    8427311181199404053,
    7983003740783657027,
    2965909035096967706,
    1211883882534796072,
    5703029716464526164,
    8424273336993151821,
    774296368044414872,
    14130533330426236,
    2230589047337383318,
    8337015733785964014,
    9153431205551083918,
    3249272057022384528,
    8254667294021634003,
    6758234862357239854
]

print(sumMean(numbers))
5221577691680052740

As expected, we have rounding errors creeping in. This isn't the exact mean, but it's close enough: the difference is 131.55, which is a whopping 0.0000000000000025193534936693344360660675977627565% deviation.

As a side note, ordering matters:

  • unsorted and sorted in ascending order yield the same error
  • sorted in descending order yields a 169.55 error margin

Given the scale, it's not a big deal, but keep in mind that this trick is only useful for fairly large numbers in a fairly large list, not for the extremes.