[Dev Diaries] Todotxt

For reasons I took an interest in playing with Todo.txt. Naturally, I went and looked for a swift package capable of importing and exporting this format. It is surprisingly hard to find something that is somewhat recent and full-featured.

Anyways, I went with the intention of contributing to this package, both because I like its simplicity and the fact that it uses the new DSL semantics for the regexps. And boy, oh boy, it tested me.

The trouble with my brain

We are all products of our history. Regular expressions don't faze me one bit. I used to write parts of compilers, lexers and yaccers are old friends, and the whole lambda reduction way regexps work isn't new.

BUT I know that they are a major hindrance for a lot of newcomers. Their syntax is unlike anything they have ever encountered (well, except for when they use * to mean "zero or more characters"), and don't get me started on the whole capture groups ( especially in Javascript ).

So I came into this project thinking it would be half an hour of clever regexps, and that's it.

Oh noes, there's Builders

In theory, I love builders. SwiftUI may not be my favorite way to build UIs, but it's fun and it works. Ditto for Ktor's HTML/CSS DSLs.

For regular expressions though... to me they fail to capture (haha) the mechanics, and will work only for relatively simple expressions.

let regex = Regex {
    "$"
    Capture {
      OneOrMore(.digit)
      "."
      Repeat(.digit, count: 2)
    }
    Anchor.endOfLine
}

This is equivalent to \$([0-9]+\.[0-9][0-9])$ I guess, and it's actually an example of why the DSL system is easier: $ and . have meaning in regexps and have to be escaped. It makes that line totally undecipherable for most newcomers.

Let's set aside the fact that nesting too many builders leads to incredibly long type inference (when it doesn't fail) and compile time. It comes with the territory, SwiftUI afficionados will tell you.

Let's also set aside the compactness of the regexps versus the DSL type because it's irrelevant to the point. In any language, you can write your code as compact and illegible as you like, it doesn't mean it's better. The tradeoffs you choose for your project in terms of inclusiveness (towards more junior members, or at least with regard to the tech stack you are using), and maintainability (look at you being all clever and writing some l33tk0d3 that you won't be able to debug in 6 months time... I learned that lesson a while ago and have decided that in most cases, readability is better than cleverness).

The problem (for me) with DSLs as applied to regular expressions is that it fails to capture the mechanics of backtracking and cursor position.

How regular expressions kind of work

As anyone who's taken an algorithmics class will tell you, strings are a pain. They can mostly only been read linearly (going from one character to the next or the previous), they have encodings and special cases (pun intended), and what they contain mean something different to the programmer than the actual text it contains (looking at you JSON).

"17.51" kind of is a number, and most definitely a string. If an error creeps in, it becomes something really really tedious to deal with. Like "17.,51". What the hell does THAT mean? 17.0 followed by 51? Or is the comma used in the French way to separate the decimal part, and . for multiplication? 17 x 0.51 ? And don't get me started on dates...

And so we have to define rules and use pattern matching. You can say that the string contains elements separated by commas, and each element is a number. And if it isn't, raise an error.

  • Look for every combination of n digits, followed by an optional point, followed by m optional digits, followed by a comma OR the end of the line
  • You should end up with a list of n∂.m∂ strings, that you can now parse to floating point
  • That should give an expression that looks like this:
    ([0-9]+\.{0,1}[0-9]*)(,|$)

There's a lot of problems to unpack here, like the fact that we kind of have to capture the group matching the separator/end of line to keep the expression simple, the weird dance of square and curly brackets and parenthesiseseseses, the whys and the hows of + and *, and so much more. But this will suffice for a brief explanation of how it works under the hood.

See, the point of it all is state machines and backtracking. When you use the regexp on a string, it keeps track of what it's done, what it's trying to do and where to go next. And if the current attempt fails, it reverts back to a previous state. The experts in regexps will groan and moan because I explain things too casually, and they should, but you don't need the nitty gritty details of how it actually works to grasp what I'm getting at.

So let's look at "17.,.6,51", and try to apply the regular expression.

  • We are at character 0,  and we are looking for 1 or more digits.
  • We find 17, so we look for 0 or 1 dot.
  • We find one, so we have 17., and we look for 0 or more digits
  • We find none, so we're still good, and we look for either a comma or the end of the line
  • We find a comma, so we have a complete match: 17.,
  • We push the result in the list of "found things", and advance the character to the end of the match, 3
  • We look again for a digit or more
  • We find none, so it can't be a match, so we advance by one character, and try again
  • Ditto for the dot
  • We're now at character 5, and try again
  • We have a digit, so we start a new match: 6
  • 0 or more dots, check. 0 or more digits, check. Comma, check. So we have a new match: 6
  • Start again, and we find 51
  • In the end we have 3 matches: 17., 6, and 51 (plus the comma or end of line, that is captured in a separate group)

Hold on. It was .6 not 6! Well... yes, but not according to our rules. We have to have a digit before the dot.

Never you mind, you say, we'll say we can have 0 or more instead of one or more:
([0-9]*\.{0,1}[0-9]*)(,|$)

Yes, and it almost works the same:

  • Up to 51, we get almost identical results, except we do capture .6 instead of 6
  • BUT we're not done. We try the regexp again at the end of the string, because we have to look at it all, and we get:
  • There is 0 digit, followed by 0 dot, followed by 0 digit, followed by the end of the line. So there's an extra match: ""...

The two important parts to grasp from all this are that starting at the character we are at, we try to match the pattern to the rest of the string, and if it fails, we advance by one character, and try again. Current state, and backtracking to try again.

The weeds: lookaheads and negations

We need to talk about a couple of special cases:

  • What if we need to not match something (reject some characters, for example)?
  • What if we need to match things only if up ahead in the string there's something else (or not something else)?

Like, say, the dot should appear only if there are digits afterwards (we say that 17. isn't a valid number but 17.0, and .6, are). Or if we are dealing with dates (😖), the rules for matching depend on whether or not we have a time afterwards.

Using "normal" regexps, negation it is kind of easy: [^abc] will match anything but a, b or c. Note that the caret that is used to usually indicate "beginning of a line" is used for negation within the match... Yea. Oh and you have to list every character you do not wish to match.

So kind of a solution is to use "lookaheads". They are basically patterns that match (or not) things before or after the current character, but do not move the current character cursor. They kind of start their own internal state, disregarding the global state of the parent expression.

.*(?=,) means trying to find a comma, and matching anything ( .* ) up till then.

"17.,.6,51" and this pattern will match "17.,.6". Why is it stopping at the last comma instead of the first? Because .* is greedy, it tries to get as many characters as possible. Comma is a character, so it matches, until it can't anymore because it's the last comma.

We can amend the .* to a .*? which will act as a lazy matcher, stopping at the first occurence, and giving us 17., .6, and two empty strings. Because, yes, after we found 17. we try again starting there, and it does match "anything with a comma at the end".

We can also look backwards and see if anything before the current character matches something, with .*(?<=,), and both negations of these with... !. Because yea, caret was really a weird idea. It'd look like this: .*(?!,) and .*(?!<,).

That was a long weeds expedition.

And the DSLs?

Ok, so the regexp mechanics is messy and requires a lot of concentration, so why the hell not make it prettier and/or more legible? After all I did say that I value maintainability over compactness most of the time.

Here's a regexp to match the title of a todo.txt file:

        let reference = Reference(String.self)
        let titlematch = OneOrMore {
          ChoiceOf {
            One(.whitespace)
            OneOrMore(.word)
          }
          NegativeLookahead {
            OneOrMore {
              OneOrMore(.word)
              One(":")
            }
          }
        }
        let regex = Regex {
          Anchor.startOfLine
          ZeroOrMore {
            ChoiceOf {
              One("x ")
              OneOrMore {
                "("
                (try? Regex("[A-Z]"))!
                ") "
              }
              One(.iso8601Date(timeZone: .gmt))
              One(.whitespace)
            }
          }
          NegativeLookahead {
            ChoiceOf {
              One(.iso8601Date(timeZone: .gmt))
              OneOrMore {
                One(.whitespace)
                OneOrMore(.word)
                One(":")
              }
            }
          }
          Capture(titlematch, as: reference,
                  transform: { word -> String in String(word) })
        }
        
        let match = input.firstMatch(of: regex)
        if let match {
          return match[reference].trimmingCharacters(in: CharacterSet.whitespacesAndNewlines) // because we capture the last space
        } else {
          return nil
        }

reference is the variable that should hold the mattern once it's matched. It's weird, but not any weirder than the numbered capture groups, so there.

Because of the Builder's very complex type inference, I had to split the "two" regexps, otherwise, the compiler failed to build. That's a legibility problem because that thing I store in titlematch is really a part of the bigger regexp, and the call site is now murkier. Plus it's a variable, so does it mean I can use it multiple times? In capture groups? Unclear, so probably not.

Then there is .word... it doesn't mean match a word. It means match a character that could be part of a word. Yes. A single character. When you're not paying attention, it can bite you really bad.

I'll set aside the fact that I capture an extra space every now and again. I'm sure that there is a way to avoid doing it, but the regexp was complicated enough as it was and I spent way too much time on it that I care to admit. It's a peculiarity of todo.txt rather than of the regex system, but a space is used to separate the elemets of the format, except when it's in the title.

Also, why is there an actual regular expression in there? Because insofar as I can understand it, you can't express [A-Z] (any capital latin letter) with this system. It's either .word (any character that can be part of a word, uppercase or not), or something else that's not a character.

Also, also, One doesn't accept a closure like OneOrMore does, so I'm forced to accept multiple priorities ( (LETTER) ) instead of only one, for some reason.

Let's dig into that one, shall we? So we're looking for some text that is after the start of the line, after the optional x that marks completion of the task, the (LETTER) marking the priority, and after the optional two dates (creation and optionally completion), but before any special character ( + and @ denote project and context, respectively) or key:value pair.

If you got that by looking at the code instantly, congratulations. I didn't. And I wrote it.

For reference, the regexp would look something like that (I don't include the date formatting, because I'm lazy and would probably do it after I extracted the matches):
^(x |\([A-Z]\) |[0-9-]| )+([\w ]+)(?! \w+:\w+)

OK, sure, this expression is a bit cryptic, and captures an extra space as well for the title. Because I didn't go into details for the date, it also doesn't quite work, because it lets dates with less or more than 3 components pass, instead of only valid ones.

But these 40-ish lines really aren't that clear to me in the things they will and will not match. Especially with the callsite being split into variables, but that's on Swift, not on the DSL thing. To my mind, the regexp is something that you "slide along" your string, stopping when it matches. Yes it's made of blocks and captures and weird characters, but it's kind of the same "thing" as a string, whereas that visual block syntax seems kind of "orthogonal" to the string, not of the same type, and therefore makes it harder for me to think about them.

But it's probably me and my old habit-encrusted brain. Drop me a line if you find an error, or if you disagree with decent arguments!