A Little Bit Of Math #1

Aaaaaah the math tricks! When you know 'em, you love 'em, and when you don't, you pay for extra computing resources.

Today's math trick has to do with averages. Averages are easy, right? you take all the numbers in the list, you sum them, and then you divide by the count... Pft, that's no trick!

A_{list} = \frac{\sum_{i=0}^{list.size - 1} list[i]}{list.size}

Except... there is a little something called overflow. Let's take the case of integers, and let's assume we're working with UInt8 objects. What's the average of [233,212]? It is 222.5 which gets rounded to 223. But our good'ol summation doesn't work:

 1> let v1 : UInt8 = 233
 2> let v2 : UInt8 = 212
 3> let sum = v1 + v2

EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Depending on who you ask, 233+212 either wraps around or causes an error. 255 is the maximum value, after which there is nothing. Either way, we wouldn't be happy with the go-around either: 233+212 = 190, which gives an average of 95 when divided by 2.

Musical Interlude : "Zino, dear, I don't care, I can use BIGGER numbers!"

Yes, you can, up to a point. Most languages have a maximum integer width, and sure, you can probably find unbounded implementations of integers for your language. In Swift, you can check this out, it's really nice. BUT while it's technically possible to handle arbitrary precision, you start hitting all sorts of issues with storing that data ("I love using blobs in my database"), converting it for practical use ("My users can remember 200+ digits numbers easyyyyyyyyy"), etc. Plus, you generally don't want to replace every single Int use in your code by something coming from an external dependency, with all the headache that implies, just for the sake of type safety.

Enter the Maths (royal fanfare ♫♩🎺)

Let's start with a basic observation:

v1+v2 = ({\frac{v1}{2} + \frac{v2}{2}})*2

If I divide by two, then multiply by two, I've done nothing. In the case of integers, it's not quite true, as the division will be rounded to the closest value, but for big numbers, it's not that bad. But what does that give us?

Well...

\frac{v1}{2} + \frac{v2}{2} \lt Int_{max}

The sum of the two halves will fit in an integer, because each is guaranteed to be smaller than half the maximum. Right? Then we can multiply the result by 2 to get the sum, maybe. But it might overflow. Good thing we are trying to get the average, because we were about to divide by two, which cancels out the multiplication.

A_{(233,212)} = ((\frac{233}{2}+\frac{212}{2})*2)/2 = \frac{233}{2}+\frac{212}{2}

Musical Interlude: "Mind => Blown"

Sure, we kind of lose some precision: 233/2 will be rounded to 117 so the average calculated will be 223, but it could easily have been rounded down at some point.

Anyways... Onward and upward! What can we do with a big list of numbers? We could use the same trick, and just divide wholesale. The major issue is that we would severely compound the rounding errors. Imagine we're still playing with UInt8 elements and you have 200 of them. Any of them divided by 200 would result in 0 or maybe 1. Your average wouldn't look very good.

Cue the Return of the Maths (royal fanfare ♫♩🎺)

{\sum_{i=0}^{list.size - 1} list[i]} = {\sum_{i=0}^{list.size - 2} list[i]} + list[list.size - 1]

(As in (x+y+z+t) = (x+y+z) + t)

  • Okay, and?
  • Let's divide by list.size, and we get the average
A_{list} = \frac{\sum_{i=0}^{list.size - 2} list[i] + list[list.size-1]}{list.size}

The top-left part looks familiar, it's almost as if it was the average of the list minus the last element... 😬

All we would need to do is to divide by list.size - 1... But if we multiply and divide by the same thing... 🤔

\frac{(list.size-1)*\frac{\sum_{i=0}^{list.size - 2} list[i]}{list.size-1} + list[list.size-1]}{list.size}

Which is

A_{list} = \frac{(list.size-1)*A_{list-last} + list[list.size-1]}{list.size}

Musical Interlude: Smells like recursion

So... The code will basically look like this:

  • If the list is empty (because we're good programmers and handle edge cases), the result is 0
  • If the list contains one element, the average is easy
  • If the list contains two elements, we can use the divide by two trick, the rounding error shouldn't be that bad
  • If the list contains more elements, we do the average by aggregate, and hope the rounding errors will be somewhat contained.

Side note on the rounding errors: the bigger the divider, the higher the rounding error (potentially). But by doing a rolling average, we have a rounding error that worsens as we go through the list rather than being bad at every step. It's not ideal, but it's still better.

So, let's set the stage up: I have a list of big numbers I want the average of.

2988139172152746883
4545331521850540616
5693938727954663282
5884889191787885217
3111881160526182838
8720326064806005009
8427311181199404053
7983003740783657027
2965909035096967706
1211883882534796072
5703029716464526164
8424273336993151821
774296368044414872
14130533330426236
2230589047337383318
8337015733785964014
9153431205551083918
3249272057022384528
8254667294021634003
6758234862357239854

They are all Int64 integers, which is the highest bit native signed variant available (Int128 has been coming since 2017). They come from a PostgreSQL database that stores big numbers for a very good reason I won't get into.

Now, if I plug these numbers into an unbounded calculator, the average should be 5221577691680052871.55 or so I'm told.

My recursive Swift function looks like this:

func sumMean(_ input: [Int64]) -> Int64 {
    if input.count == 0 { // uninteresting
        return 0
    }
    if input.count == 1 { // easy
        return input[0]
    }
    
    // general trick : divide by two (will introduce rounding errors)
    if input.count == 2 {
        let i1 = input[0] / 2
        let i2 = input[1] / 2
        let mean = (i1+i2) // (/2, then *2)
        return mean
    }
    
    let depth = Int64(input.count) - 1
    // rolling average formula
    let last = input.last!
    var rest = [Int64](input)
    rest = rest.dropLast()
    
    let restMean = sumMean(rest)
    // should be (depth * restMean + last) / depth+1, but overflow...
    let num = (restMean/2) + ((last/2)/depth)
    let res = (num / (depth+1)) * depth * 2
    return res
}

The reason for why num and res exist is left as an exercise.

Here's the calling code and the output:

var numbers : [Int64] = [
    2988139172152746883,
    4545331521850540616,
    5693938727954663282,
    5884889191787885217,
    3111881160526182838,
    8720326064806005009,
    8427311181199404053,
    7983003740783657027,
    2965909035096967706,
    1211883882534796072,
    5703029716464526164,
    8424273336993151821,
    774296368044414872,
    14130533330426236,
    2230589047337383318,
    8337015733785964014,
    9153431205551083918,
    3249272057022384528,
    8254667294021634003,
    6758234862357239854
]

print(sumMean(numbers))
5221577691680052740

As expected, we have rounding errors creeping in. This isn't the exact mean, but it's close enough: the difference is 131.55, which is a whopping 0.0000000000000025193534936693344360660675977627565% deviation.

As a side note, ordering matters:

  • unordered and sorted crescendo yield the same error
  • ordered reversed yields a 169.55 error margin

Given the scale, it's not a big deal, but keep in mind that this trick is only useful for fairly large numbers in a fairly large list, not for the extremes.