[Dev Diaries] URL Shortener Style Things

UUIDs are fine, but who doesn't like a decent String instead? It's shorter and it doesn't scare the non-programmers as much!

UUIDs are 128 bits, and I want to use a 64 characters map: a to z, A-Z, and 0-9, plus - and + to round it up.

64 possibilities is equivalent to 6 bits of data, and UUIDs are made of 16 bytes (8 bits of data). What that gives me is a way to split 16 bytes into 22 sixytes (yes I invented that word. So what?)

| 8 8 8 _ | 8 8 8 _ | 8 8 8 _ | 8 8 8 _ | 8 8 8 _ | 8
| 6 6 6 6 | 6 6 6 6 | 6 6 6 6 | 6 6 6 6 | 6 6 6 6 | 6 6

Why? Because 3x8 = 6x4, same number of bits in both.

Now, we redistribute the bits around (Xs are the bits fron the bytes, Ys are the bits from the sixytes):

XXXXXX|XX XXXX|XXXX XX|XXXXXX
YYYYYY|YY YYYY|YYYY YY|YYYYYY

With some shifts and some binary or, we're down from a 36 hexadecimal character string with dashes to a 22 character with a very low probability of punctuation. Of course if you want to disambiguate the symbols like O and 0, you can change the character map, as long as your charmap stays 64 items long.

extension UUID {
    static let charmap = 
["a","b","c","d","e","f","g","h","i","j","k","l","m","n",
"o","p","q","r","s","t","u","v","w","x","y","z",
"A","B","C","D","E","F","G","H","I","J","K","L","M","N",
"O","P","Q","R","S","T","U","V","W","X","Y","Z",
"0","1","2","3","4","5","6","7","8","9","-","+"]
    static let charmapSet =
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-+"

    var tinyWord : String {
        let from = self.uuid
        let bytes = [from.0, from.1, from.2,from.3,from.4,from.5,from.6,from.7,from.8,from.9,
                     from.10, from.11, from.12,from.13,from.14,from.15]
        
        // split in 6-bits ints
        var sbytes : [UInt8] = []
        for i in 0..<5 {
            let b1 = bytes[i*3]
            let b2 = bytes[i*3+1]
            let b3 = bytes[i*3+2]
            
            let sb1 = b1 >> 2
            let sb2 = (b1 & 0x03) << 4 | (b2 >> 4)
            let sb3 = (b2 & 0x0f) << 2 | (b3 >> 6)
            let sb4 = (b3 & 0x3f)
            sbytes += [sb1,sb2,sb3,sb4]
        }
        // all done but the last byte
        sbytes.append(bytes[15]>>2)
        sbytes.append(bytes[15]&0x03)
         
        var result = ""
        for i in sbytes {
            result += UUID.charmap[Int(i)]
        }
        
        return result
    }
}

The reverse procedure is a bit longer, because we have to stage the values in groups of 4 sexytes for 3 bytes, and do a couple of sanity checks.

extension UUID {
    init?(tinyWord: String) {
        if tinyWord.count != 22 || !tinyWord.allSatisfy({ UUID.charmapSet.contains($0) }) { return nil }
        var current : UInt8 = 0
        var bytes : [UInt8] = []
        for (n,c) in tinyWord.enumerated() {
            guard let idx32 = UUID.charmap.firstIndex(of: String(c)) else { return nil }
            let idx = UInt8(idx32)
            if n >= 20 { // last byte
                if n == 20 {
                    current = idx << 2
                } else {
                    current |= idx
                    bytes.append(current)
                }
            } else if n % 4 == 0 { // first in cycle
                current = idx << 2
            } else if n % 4 == 1 { // combine
                current |= idx >> 4
                bytes.append(current)
                current = (idx & 0xf) << 4
            } else if n % 4 == 2 { // combine
                current |= (idx >> 2)
                bytes.append(current)
                current = (idx & 0x3) << 6
            } else {
                current |= idx
                bytes.append(current)
                current = 0
            }
        }
        
        // double check
        if bytes.count != 16 { return nil }
        
        self.init(uuid: (bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7], bytes[8], bytes[9],
                         bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15]))
    }
}

Let's test this!

let u = UUID()
let w = u.tinyWord
print(u.uuidString+" : \(u.uuidString.count)")
print(w+" : \(w.count)")
print(UUID(tinyWord: w)!)
30A5CB6E-778F-4218-A333-3BC8B5A40B65 : 36
mkxlBNEpqHIJmZViTAqlzb : 22
30A5CB6E-778F-4218-A333-3BC8B5A40B65

Now I have a "password friendly" way to pass UUIDs around. Is it a waste of time (because I could just pass the UUIDs around, they aren't that much longer)? Who knows? It makes my shortened URLs a bit less intimidating ūüėĀ

Double Precision (Not)

From this list, the gist is that most languages can't process 9999999999999999.0 - 9999999999999998.0

Why  do they output 2 when it should be 1? I bet most people who've never  done any formal CS (a.k.a maths and information theory) are super  surprised.

Before you read the rest, ask yourself this: if all you have are zeroes and ones, how do you handle infinity?

If  we fire up an interpreter that outputs the value when it's typed (like  the Swift REPL), we have the beginning of an explanation:

Welcome to Apple Swift version 4.2.1 (swiftlang-1000.11.42 clang-1000.11.45.1). Type :help for assistance.
  1> 9999999999999999.0 - 9999999999999998.0
$R0: Double = 2
  2> let a = 9999999999999999.0
a: Double = 10000000000000000
  3> let b = 9999999999999998.0
b: Double = 9999999999999998
  4> a-b
$R1: Double = 2

Whew, it's not that the languages can't handle a simple substraction, it's just that a is typed as 9999999999999999 but stored as 10000000000000000.

If we used integers, we'd have:

  5> 9999999999999999 - 9999999999999998
$R2: Int = 1

Are the decimal numbers broken? ūüėĪ

A detour through number representations

Let's  look at a byte. This is the fundamental unit of data in a computer and  is made of 8 bits, all of which can be 0 or 1. It ranges from 00000000 to 11111111 ( 0x00 to 0xff in hexadecimal, 0 to 255 in decimal, homework as to why and how it works like that due by monday).

Put like that, I hope it's obvious that the question "yes, but how do I represent the integer 999 on a byte?" is meaningless. You can decide that 00000000 means 990 and count up from there, or you can associate arbitrary values to the 256 possible combinations and make 999 be one of them, but you can't have both the 0 - 255 range and 999. You have a finite number of possible values and that's it.

Of ¬†course, that's on 8 bits (hence the 256 color palette on old games). On ¬†16, 32, 64 or bigger width memory blocks, you can store up to 2‚ĀŅ different values, and that's it.

The problem with decimals

While  it's relatively easy to grasp the concept of infinity by looking at  "how high can I count?", it's less intuitive to notice that there is the same amount of numbers between 0 and 1 as there are integers.

So,  if we have a finite number of possible values, how do we decide which  ones make the cut when talking decimal parts? The smallest? The most  common? Again, as a stupid example, on 8 bits:

  • maybe we need 0.01 ... 0.99 because we're doing accounting stuff
  • maybe we need 0.015, 0.025,..., 0.995 for rounding reasons
  • We'll just encode the numeric part on 8 bits ( 0 - 255 ), and the decimal part as above

But that's already  99+99 values taken up. That leaves us 57 possible values for the rest of infinity. And that's not even mentionning the totally arbitrary nature of the  selection. This way of representing numbers is historically the first  one and is called "fixed" representation. There are many ways of  choosing how the decimal part behaves and a lot of headache when coding  how the simple operations work, not to mention the complex ones like  square roots and powers and logs.

Floats (IEEE 754)

To  make it simple for chips that perform the actual calculations, floating  point numbers (that's their name) have been defined using two  parameters:

  • an integer n
  • a power (of base b) p

Such that we can have n x bŠĶĖ, for instance 15.3865 is 153863 x 10^(-4). The question is, how many bits can we use for the n and how many for the p.

The standard is to use 1 bit for the sign (+ or -), 23 bits for n, 8 for p, which use 32 bits total (we like powers of two), and using base 2, and n is actually 1.n.  That gives us a range of ~8 million values, and powers of 2 from -126  to +127 due to some special cases like infinity and NotANumber (NaN).

$$(-1~or~1)(2^{[-126...127]})(1.[one~of~the~8~million~values])$$

In theory, we have numbers from -10‚Āī‚ĀĶ to 1038 roughly, but some numbers can't be represented in that form. For ¬†instance, if we look at the largest number smaller than 1, it's 0.9999999404. Anything between that and 1 has to be rounded. Again, infinity can't be represented by a finite number of bits.

Doubles

The  floats allow for "easy" calculus (by the computer at least) and are  "good enough" with a precision of 7.2 decimal places on average. So when  we needed more precision, someone said "hey, let's use 64 bits instead  of 32!". The only thing that changes is that n now uses 52 bits and p 11 bits.

Coincidentally, double has more a meaning of double size than double precision, even though the number of decimal places does jump to 15.9 on average.

We  still have 2³² more values to play with, and that does fill some  annoying gaps in the infinity, but not all. Famously (and annoyingly),  0.1 doesn't work in any precision size because of the base 2. In 32 bits  float, it's stored as 0.100000001490116119384765625, like this:

(1)(2‚ĀĽ‚Āī)(1.600000023841858)

Conversely, after double size (aka doubles), we have quadruple size (aka quads), with 15 and 112 bits, for a total of 128 bits.

Back to our problem

Our value is 9999999999999999.0. The closest possible value encodable in double size floating point is actually 10000000000000000, which should now make some kind of sense. It is confirmed by Swift when separating the two sides of the calculus, too:

2> let a = 9999999999999999.0
a: Double = 10000000000000000

Our  big brain so good at maths knows that there is a difference between  these two values, and so does the computer. It's just that using  doubles, it can't store it. Using floats, a will be rounded to 10000000272564224 which isn't exactly better. Quads aren't used regularly yet, so no luck there.

It's  funny because this is an operation that we puny humans can do very  easily, even those humans who say they suck at maths, and yet those  touted computers with their billions of math operations per second can't  work it out. Fair enough.

The kicker is, there is a litteral infinity of examples such as this one, because trying to represent infinity in a finite number of digits is impossible.