[Dev Diaries] Todotxt

For reasons, I took an interest in playing with Todo.txt. Naturally, I went looking for a Swift package capable of importing and exporting this format. It is surprisingly hard to find one that is both reasonably recent and full-featured.

Anyways, I went in with the intention of contributing to this package, both because I like its simplicity and because it uses the new DSL syntax for regexps. And boy, oh boy, did it test me.

The trouble with my brain

We are all products of our history. Regular expressions don't faze me one bit: I used to write parts of compilers, lexers and yaccers are old friends, and the theory of how regexps work underneath isn't new to me.

BUT I know that they are a major hindrance for a lot of newcomers. Their syntax is unlike anything they have ever encountered (well, except for when they use * to mean "zero or more characters"), and don't get me started on the whole capture groups thing (especially in JavaScript).

So I came into this project thinking it would be half an hour of clever regexps, and that's it.

Oh noes, there's Builders

In theory, I love builders. SwiftUI may not be my favorite way to build UIs, but it's fun and it works. Ditto for Ktor's HTML/CSS DSLs.

For regular expressions though... to me they fail to capture (haha) the mechanics, and will work only for relatively simple expressions.

let regex = Regex {
    "$"                          // a literal dollar sign
    Capture {
        OneOrMore(.digit)        // at least one digit
        "."                      // a literal dot
        Repeat(.digit, count: 2) // exactly two digits
    }
    Anchor.endOfLine
}

This is equivalent to \$([0-9]+\.[0-9][0-9])$, I guess, and it's actually an example of why the DSL system is easier: $ and . have special meaning in regexps and have to be escaped when used literally, which makes the classic one-liner totally undecipherable for most newcomers.
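
Quick sanity check of that equivalence, assuming Swift 5.7+ regex literals (this is my scratch code, not the package's):

// Both forms should match "$17.51" at the end of a line and capture "17.51"
let classic = /\$([0-9]+\.[0-9][0-9])$/
if let match = "Total: $17.51".firstMatch(of: classic) {
    print(match.1) // 17.51
}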

Let's set aside the fact that nesting too many builders leads to incredibly long type-inference passes (when inference doesn't fail outright) and compile times. It comes with the territory, SwiftUI aficionados will tell you.

Let's also set aside the compactness of regexps versus the DSL, because it's irrelevant to the point. In any language, you can write your code as compact and illegible as you like; it doesn't make it better. What matters are the tradeoffs you choose for your project in terms of inclusiveness (towards more junior members, or at least members junior in the tech stack you are using) and maintainability (look at you being all clever and writing some l33tk0d3 that you won't be able to debug in 6 months' time... I learned that lesson a while ago and have decided that in most cases, readability beats cleverness).

The problem (for me) with DSLs as applied to regular expressions is that they fail to capture the mechanics of backtracking and cursor position.

How regular expressions kind of work

As anyone who's taken an algorithmics class will tell you, strings are a pain. They can mostly only be read linearly (going from one character to the next or the previous), they have encodings and special cases (pun intended), and what they contain often means something different to the programmer than the actual text (looking at you, JSON).

"17.51" kind of is a number, and most definitely a string. If an error creeps in, it becomes something really really tedious to deal with. Like "17.,51". What the hell does THAT mean? 17.0 followed by 51? Or is the comma used in the French way to separate the decimal part, and . for multiplication? 17 x 0.51 ? And don't get me started on dates...

And so we have to define rules and use pattern matching. You can say that the string contains elements separated by commas, and each element is a number. And if it isn't, raise an error.

  • Look for every combination of n digits, followed by an optional dot, followed by m optional digits, followed by a comma OR the end of the line
  • You should end up with a list of "n digits, optional dot, m digits" strings, which you can then parse as floating point
  • That should give an expression that looks like this:
    ([0-9]+\.{0,1}[0-9]*)(,|$)

There are a lot of problems to unpack here, like the fact that we kind of have to capture the group matching the separator/end-of-line to keep the expression simple, the weird dance of square and curly brackets and parenthesiseseseses, the whys and hows of + and *, and so much more. But this will suffice for a brief explanation of how it works under the hood.
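
If you'd rather see it run than read the dissection, here's the expression dropped into Swift as a literal (a quick sketch, same pattern as above):

let numbers = /([0-9]+\.{0,1}[0-9]*)(,|$)/
for match in "17.5,21,3.14".matches(of: numbers) {
    print(match.1) // 17.5, then 21, then 3.14
}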

See, the point of it all is state machines and backtracking. When you use the regexp on a string, it keeps track of what it's done, what it's trying to do and where to go next. And if the current attempt fails, it reverts back to a previous state. The experts in regexps will groan and moan because I explain things too casually, and they should, but you don't need the nitty gritty details of how it actually works to grasp what I'm getting at.

So let's look at "17.,.6,51", and try to apply the regular expression.

  • We are at character 0, and we are looking for 1 or more digits.
  • We find 17, so we look for 0 or 1 dot.
  • We find one, so we have 17., and we look for 0 or more digits
  • We find none, so we're still good, and we look for either a comma or the end of the line
  • We find a comma, so we have a complete match: 17.,
  • We push the result onto the list of "found things", and advance the cursor past the match, to character 4 (the second dot)
  • We look again for one or more digits
  • We find none (a dot isn't a digit), so it can't be a match; we advance by one character and try again
  • We're now at character 5, and try again
  • We have a digit, so we start a new match: 6
  • 0 or 1 dot, check. 0 or more digits, check. Comma, check. So we have a new match: 6
  • Start again, and we find 51
  • In the end we have 3 matches: 17., 6, and 51 (plus the comma or end of line, that is captured in a separate group)

Hold on. It was .6 not 6! Well... yes, but not according to our rules. We have to have a digit before the dot.

Never you mind, you say, we'll say we can have 0 or more instead of one or more:
([0-9]*\.{0,1}[0-9]*)(,|$)

Yes, and it almost works the same:

  • Up to 51, we get almost identical results, except we do capture .6 instead of 6
  • BUT we're not done. We try the regexp again at the end of the string, because we have to look at it all, and we get:
  • There is 0 digit, followed by 0 dot, followed by 0 digit, followed by the end of the line. So there's an extra match: ""...

The two important parts to grasp from all this are that starting at the character we are at, we try to match the pattern to the rest of the string, and if it fails, we advance by one character, and try again. Current state, and backtracking to try again.
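
Here are both versions side by side in Swift (a sketch; based on the walkthrough above, I'd expect the lax version to sneak in that final empty match):

let input = "17.,.6,51"

let strict = /([0-9]+\.{0,1}[0-9]*)(,|$)/
print(input.matches(of: strict).map { String($0.1) })
// ["17.", "6", "51"]  (the .6 loses its dot)

let lax = /([0-9]*\.{0,1}[0-9]*)(,|$)/
print(input.matches(of: lax).map { String($0.1) })
// ["17.", ".6", "51", ""]  (note the empty match at the end of the line)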

The weeds: lookaheads and negations

We need to talk about a couple of special cases:

  • What if we need to not match something (reject some characters, for example)?
  • What if we need to match things only if up ahead in the string there's something else (or not something else)?

Like, say, the dot should appear only if there are digits afterwards (we say that 17. isn't a valid number but 17.0, and .6, are). Or if we are dealing with dates (😖), the rules for matching depend on whether or not we have a time afterwards.

Using "normal" regexps, negation it is kind of easy: [^abc] will match anything but a, b or c. Note that the caret that is used to usually indicate "beginning of a line" is used for negation within the match... Yea. Oh and you have to list every character you do not wish to match.

So, a kind-of solution is to use "lookaheads". They are basically patterns that match (or not) things before or after the current character, but do not move the current character cursor. They kind of start their own internal state, disregarding the global state of the parent expression.

.*(?=,) means trying to find a comma, and matching anything (.*) up till then.

"17.,.6,51" and this pattern will match "17.,.6". Why is it stopping at the last comma instead of the first? Because .* is greedy, it tries to get as many characters as possible. Comma is a character, so it matches, until it can't anymore because it's the last comma.

We can amend the .* to a .*?, which acts as a lazy matcher, stopping at the first occurrence, and giving us 17., .6, and two empty strings. Because, yes, after we find 17. we try again starting at the comma, and the empty string does match "anything with a comma at the end".
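
The greedy/lazy difference, sketched in Swift (lookahead written in literal syntax):

let input = "17.,.6,51"

let greedy = /.*(?=,)/
print(input.firstMatch(of: greedy)?.output ?? "") // 17.,.6 (up to the LAST comma)

let lazy = /.*?(?=,)/
print(input.matches(of: lazy).map { String($0.output) })
// ["17.", "", ".6", ""] (first occurrence each time, empty matches included)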

We can also look backwards and check whether something before the current character matches, with .*(?<=,), and we get the negation of each of these with... !. Because, yea, the caret was really a weird idea. They look like this: .*(?!,) for the negative lookahead and .*(?<!,) for the negative lookbehind.

That was a long weeds expedition.

And the DSLs?

Ok, so the regexp mechanics are messy and require a lot of concentration, so why the hell not make them prettier and/or more legible? After all, I did say that I value maintainability over compactness most of the time.

Here's a regexp to match the title of a todo.txt file:

let reference = Reference(String.self) // will hold the captured title
let titlematch = OneOrMore {
  ChoiceOf {
    One(.whitespace)
    OneOrMore(.word)
  }
  NegativeLookahead {          // ...as long as no key:value pair follows
    OneOrMore {
      OneOrMore(.word)
      One(":")
    }
  }
}
let regex = Regex {
  Anchor.startOfLine
  ZeroOrMore {                 // the optional prefixes of a task:
    ChoiceOf {
      One("x ")                // completion marker
      OneOrMore {
        "("
        (try? Regex("[A-Z]"))! // priority letter, see below
        ") "
      }
      One(.iso8601Date(timeZone: .gmt)) // creation/completion dates
      One(.whitespace)
    }
  }
  NegativeLookahead {
    ChoiceOf {
      One(.iso8601Date(timeZone: .gmt))
      OneOrMore {              // key:value pair
        One(.whitespace)
        OneOrMore(.word)
        One(":")
      }
    }
  }
  Capture(titlematch, as: reference,
          transform: { word -> String in String(word) })
}

let match = input.firstMatch(of: regex)
if let match {
  return match[reference].trimmingCharacters(in: CharacterSet.whitespacesAndNewlines) // because we capture the last space
} else {
  return nil
}

reference is the variable that will hold the matched text once the regex has run. It's weird, but not any weirder than the numbered capture groups, so there.

Because of the Builder's very complex type inference, I had to split the "two" regexps; otherwise, the compiler failed to build. That's a legibility problem, because that thing I store in titlematch is really a part of the bigger regexp, and the call site is now murkier. Plus, it's a variable, so does that mean I can use it multiple times? In capture groups? Unclear, so probably not.

Then there is .word... It doesn't mean "match a word". It means "match a character that could be part of a word". Yes. A single character. When you're not paying attention, it can bite you really bad.
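
A tiny illustration of the trap (a sketch):

import RegexBuilder

// .word is ONE "word character"...
let oneWordChar = Regex { One(.word) }
print("hello".firstMatch(of: oneWordChar)?.output ?? "") // h

// ...so an actual word needs the explicit repetition
let wholeWord = Regex { OneOrMore(.word) }
print("hello world".firstMatch(of: wholeWord)?.output ?? "") // hello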

I'll set aside the fact that I capture an extra space every now and again. I'm sure there is a way to avoid it, but the regexp was complicated enough as it was, and I spent way more time on it than I care to admit. It's a peculiarity of todo.txt rather than of the regex system, but a space is used to separate the elements of the format, except when it's in the title.

Also, why is there an actual regular expression in there? Because insofar as I can understand it, you can't express [A-Z] (any capital latin letter) with this system. It's either .word (any character that can be part of a word, uppercase or not), or something else that's not a character.

Also, also, One doesn't accept a builder closure like OneOrMore does, so I'm forced to accept multiple priorities ((LETTER)(LETTER)...) instead of only one, for some reason.

Let's dig into that one, shall we? So we're looking for some text that comes after the start of the line, after the optional x that marks completion of the task, after the optional (LETTER) marking the priority, and after the two optional dates (creation, and optionally completion), but before any special character (+ and @ denote project and context, respectively) or key:value pair.

If you got that by looking at the code instantly, congratulations. I didn't. And I wrote it.

For reference, the regexp would look something like this (I don't include the date formatting, because I'm lazy and would probably validate it after extracting the matches):
^(x |\([A-Z]\) |[0-9-]| )+([\w ]+)(?! \w+:\w+)

OK, sure, this expression is a bit cryptic, and it captures an extra space as well for the title. Because I didn't go into details for the date, it also doesn't quite work: it lets dates with fewer or more than 3 components through, instead of only valid ones.

But these 40-ish lines really aren't that clear to me about what they will and will not match. Especially with the call site being split into variables, but that's on Swift, not on the DSL itself. To my mind, a regexp is something that you "slide along" your string, stopping when it matches. Yes, it's made of blocks and captures and weird characters, but it's kind of the same "thing" as a string, whereas the visual block syntax feels "orthogonal" to the string, not of the same type, and that makes it harder for me to think about the two together.

But it's probably me and my old habit-encrusted brain. Drop me a line if you find an error, or if you disagree with decent arguments!


[AoC 2022] Recap

TL;DR

I'm not as rusty as I thought I'd be. And YES that kind of challenge has a place in the coding world (see conclusion)

Just like every year, I had a blast banging my head on the Advent of Code calendar. It so happens that this year, I had a lot less brain power to focus on them due to the exam season at school and the ensuing panicky students, but it could also be because my brain isn't up to spec.

Some of my students are/were doing the challenges, so I didn't want to post anything that would help them, but now that the year is almost over, I wanted to go over the puzzles and give out impressions (and maybe hints).

Easing in

Days 1 to 7 were mostly about setting up the stage, getting into the habit of parsing the input and using the right kind of structure to store the data.

Nothing super hard, it was "just" lists, hashmaps, and trees, until day 4. Day 4 was especially funny to me, because I wrote HoledRange / Domain for exactly that purpose (disjoint ranges and operations on them). Except I decided to do this year's calendar in Julia, and the library I wrote is for Swift. Just for kicks, I rewrote parts of the library, and I might even publish it.

Days 5, 6 and 7 highlighted the use of stacks, strings, and trees again. Nothing too hard.

Getting harder

My next favorite is day 9. It's about a piece of rope whose head you drag around, and you have to figure out what the tail does. If you've ever played at zig-zagging a shoelace, you'll know what I mean. String physics are fun, especially in the inelastic case.

Many ways to do that, but once you realize how the tail catches up to the head when the latter moves, multi-segmented chains are just a recursive application of the same rule.
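
In case "catches up" is cryptic, here's the rule as I understand it, sketched in Swift rather than the Julia I actually used (names are mine):

struct Knot { var x = 0, y = 0 }

// One step of the catch-up: a tail that still touches the head
// (including diagonally) stays put; otherwise it moves one step
// toward the head on each axis.
func catchUp(tail: Knot, head: Knot) -> Knot {
    let dx = head.x - tail.x, dy = head.y - tail.y
    if abs(dx) <= 1 && abs(dy) <= 1 { return tail }
    return Knot(x: tail.x + dx.signum(), y: tail.y + dy.signum())
}

A longer rope is then just that function folded down the chain of knots.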

I was waiting for a day 10-like puzzle, as there tends to be one every year, and I majored in compilers all those years ago. State machines, yuuuuuusssssssss.

A lot of puzzles involve path finding after that, which isn't my strong suit for some reason. But since the algorithms are already out there (it was really funny to see the spike in Google searches for Dijkstra and A*), it's "just" a matter of encoding the nodes and the edges in a way that works.

Day 13 is fun, if only because it can be instantly solved in some languages with eval, which will treat the input as a program. I still wrote my own comparison functions, because I like manipulating numbers, lists and inequalities.

Day 14 is "sand simulation", that is grains of sand that settle in a conical shape that keeps expanding laterally. Once you find the landing point on each ledge and the maximum width of the pile, there's a calculable result. Otherwise, running the simulation works too, there aren't that many grains. For part 2, I just counted the holes rather than the grains.

Day 15 is about unions and intersections of disjoint ranges again, except in 2D. Which, thanks to the Manhattan distance, gets back to 1D fairly quickly.
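
The 2D-to-1D trick, sketched (Swift again instead of the Julia I used; names are mine):

// A sensor at (sx, sy) that reaches out to Manhattan distance r covers,
// on a given row y, the 1D interval below: whatever reach is left once
// we have "spent" |y - sy| going up or down. Nothing if that's negative.
func coverage(sx: Int, sy: Int, r: Int, row y: Int) -> ClosedRange<Int>? {
    let reach = r - abs(y - sy)
    return reach >= 0 ? (sx - reach)...(sx + reach) : nil
}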

Day 16 stumped quite a few people, because of the explosive nature of path searching. Combinatorics are pretty hard to wrap your head around. Personally, I went for "reachability" within the remaining time, constructed my graph, and explored. It was probably suboptimal.

Day 17 made me inordinately proud. Nuff said.

Catching up

Because of the aforementioned workload, I was late by that point, so I decided to take my time and not complete the challenge by Xmas. Puzzles were getting hard, work was time-consuming, so the pressure needed to go down.

Because of my 3D background, I tackled day 18 with raytracing, which is way over-engineered but reminded me of the good ole times. Part 2 was trickier with that method, because suddenly I had 2 kinds of "inside".

Day 19 was path finding again, the trick being how to prune paths that didn't lead in a good direction. Probably the one that used up the most memory, and therefore the one I failed the most.

Because of my relative newness to Julia, I had to jump through many hoops to solve day 20. As it turns out, screw_dog over on Mastodon gave me the bit I was missing to solve it simply, although well after I had solved it by other means.

Day 21 goes back to my compiler roots and tree optimizations, and Julia makes the huge integer manipulation relatively easy, so, there. Pretty proud of my solution:

Part 1:   3.709 ms (31291 allocations: 1.94 MiB)
Part 2:   4.086 ms (31544 allocations: 1.95 MiB)

Which, on my relatively slow Mac mini, is not bad at all! Symbolic linear equation solving (degree one, okay) is a fun thing to think about. I even think the algorithm I devised would work on trees where the unknown appears on both sides of the equation. Maybe I'll test it some day.
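
For the curious, the core idea, sketched in Swift rather than my actual Julia code: treat every subtree as a linear form a*x + b in the unknown, and combine bottom-up.

struct Linear { var a: Double; var b: Double } // represents a*x + b

func combine(_ l: Linear, _ op: Character, _ r: Linear) -> Linear {
    switch op {
    case "+": return Linear(a: l.a + r.a, b: l.b + r.b)
    case "-": return Linear(a: l.a - r.a, b: l.b - r.b)
    case "*": // stays degree one because one side is constant in the puzzle
        return l.a == 0 ? Linear(a: r.a * l.b, b: r.b * l.b)
                        : Linear(a: l.a * r.b, b: l.b * r.b)
    case "/": // the divisor is constant in the puzzle
        return Linear(a: l.a / r.b, b: l.b / r.b)
    default: fatalError("unknown operator")
    }
}

// At the root, left == right, i.e. l.a*x + l.b = r.a*x + r.b:
func solve(_ l: Linear, _ r: Linear) -> Double {
    (r.b - l.b) / (l.a - r.a)
}

Note that solve doesn't care which side holds the unknown, which is why I think the approach extends to both sides.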

Day 22. Aaaaaaaah, day 22. Linked lists get you all the way, if and only if you know how to fold a cube from a 2D pattern. I don't, so, like many of the other participants, I hardcoded the folding. A general solution just eluded me. It's on my reading list for later.

Day 23 is an interesting variant of Conway's Game of Life, and I don't believe there is a way to do better than a straight-up simulation, but I fully accept I could be wrong. So I used no tricks, and let the thing run for 40s to get the result.

Day 24 was especially interesting for me, for all the wrong reasons. As I mentioned, graph traversal isn't my forte. But the problem was set up in a way that "worked" for me: pruning useless paths was relatively easy, so the problem space didn't explode too quickly. I guess I should apply the same method to the previous puzzles I was super clumsy with.

Finally, day 25 is a straight-up algorithmic base-conversion problem that's a lot of fun for my brain. If you remember how the carry works when adding or subtracting numbers, it's not a big challenge, but thinking in base 5 can trip you up.
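
The carry trick, sketched one last time in Swift (digits assumed to be the puzzle's 2, 1, 0, - and =, standing for 2, 1, 0, -1 and -2):

// A remainder of 3 or 4 can't be written with balanced digits directly,
// so we write the negative digit (3 = 5 - 2, 4 = 5 - 1) and carry 1
// into the next place, just like an ordinary addition carry.
func toBalancedBase5(_ value: Int) -> String {
    var n = value, digits = ""
    while n > 0 {
        switch n % 5 {
        case 0, 1, 2: digits = String(n % 5) + digits; n /= 5
        case 3:       digits = "=" + digits; n = n / 5 + 1
        default:      digits = "-" + digits; n = n / 5 + 1
        }
    }
    return digits.isEmpty ? "0" : digits
}

print(toBalancedBase5(2022)) // 1=11-2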

Conclusion

I honestly didn't believe I could hack it this year. I don't routinely do that kind of problem anymore, I have a lot of things going on at school, on top of dealing with the long tail of Covid and its effects on education. Family life was a bit busy with health issues (nothing life threatening, but still time consuming), and the precious little free time that I had was sure to be insufficient for AoC.

I'm glad I persevered, even if it took me longer than I wished it had. I'm glad I learned how to use Julia better. And I'm happy I can still hack it.

I see, here and there, grumblings about formal computer science. During and after AoC, I see posts, tweets, toots, etc, saying that "l33t c0d3" is useless in practical, day-to-day, professional development. Big-O notation, formal analysis, made-up puzzles that take you into voluntarily difficult territory: none of these things is a reflection of the skills needed nowadays to write good apps, make good websites, and so on.

It's true. Ish.

You can write code that works without any kind of formal training. Today's computing power and memory availability make optimization largely irrelevant unless you are working on games or embedded systems, or maybe data science. I mean, we can use 4GB of temporary memory for like 1/4 of a second to parse and apply that 100kB JSON file, and it has close to no impact on the perceived speed of our app, right? Right. And most of the clever algorithms are part of the standard library anyway, or easily findable.

The problem, as usual, is at scale. The proof-of-concept, prototype, or even 1.0 version of the program may very well work just fine with the first 100 users, or 1000, or whatever the metric for success is. Once everything takes longer than it should, there are only 3 solutions:

  • rely on bigger machines, which may work for a time, but ultimately does not address the problem
  • scale things horizontally, which poses huge synchronization issues between the shards
  • reduce the technical debt, which is really hard

The first two rely on compute power being relatively cheap. And most of us know about the perils of infrastructure costs. That meme regularly makes the rounds.

It's not about whether you personally can solve some artificially hard problem using smart techniques, so it's OK if you can't do every puzzle in AoC or other coding challenges. It's not about flexing your big brain by intuiting the big-O complexity of a piece of code. It's about being able to think about these problems in a way that challenges how you would normally do things. It's about expanding your intuition and your knowledge of the field you decided to work in.

It's perfectly OK for an architect to build only 1- or 2-story houses; there's no shame in it. But if that architect ever wants to build a 20+ story building, the way to approach the problem is different.

Same deal with computer stuff. Learning is part of the experience.


[Dev Diaries] Advent of Code

I've been really interested in Julia for a while now, tinkering here and there with its quirks and capabilities.

This year, I've decided to try and do the whole of Advent of Code using that language.

First impressions are pretty good:
- map, reduce, and list/array management in general are really nice, being first-class citizens. I might even get over the fact that indices start at 1
- automatic multithreading when iterating over collections means that some of these operations are pretty speedy
- it's included in standard jupyterhub images, meaning that my server install gives me access to a Julia environment if I am not at my computer for some reason

Now, it's kind of hard to teach old dogs new tricks, so I'm sure I misuse some of the features by thinking in "other languages". We'll see, but 4 days in, I'm still fairly confident.


Procedure All The Things! 🥳

Blender 3.1 is out, and I made a thing, and that makes me so happy. (👍≖‿‿≖)👍

The laundry list of enhancements and features is very long, but 2 are especially important to me right now:

  • support for Metal acceleration on M1 (I can finally make the fans on my new laptop go)
  • procedural geometry nodes

Now, if you read what I write every once in a while, you'll know that I'm a big fan of anything that automates and facilitates manual tasks. I'm a developer after all.

Back on the old blog, I used to rave about the superior procedural texture system in Blender that mostly allowed me to get free of images and UV maps. Granted, I'm no artist, and I tend towards mechanical/naturalistic scenes where procedural textures can be used. It's just so great to set a few parameters and see a decent result without having to spend hours in a pixel editor. Plus, it works at any resolution. So, there's that.

Back when Blender got its revamp and started getting some attention again, the white whale on the forums and in various conversations was "when will we be able to node all the things?". It is a very hard problem to solve, and the Blender community has been at it like maniacs.

And... geometry nodes are finally here. I can finally replace most of my particle systems with a nice, clean, geometry node.

On a lark, I decided to pick my huge island project back up, a re-imagining of the island of Myst that used to take 5h/frame to render because of the millions of leaves, blades of grass, and rocks that the particle system generated. It was very hard to work with, because RAM usage would explode and renders would sometimes fail. Without culling and boolean operations dealing with the field of view, the scene used in the vicinity of 24G, making it all but impossible to render on a GPU that I can afford.

After a little experimenting to get my feet wet, I managed to visually get roughly the same result in less than 2.2G of RAM. That freed me to add... more geometry and more complexity 🤓

And because of the M1 architecture, the ceiling of VRAM usage is very high anyways, so I let it rip and got even more geometry in.

There are somewhere between two and two and a half billion polygons that I know of. Probably more.

Here is the original I based my scene on:

From the Wikipedia page

The end result is this (warning -- 4K image):

Myst Library

It uses 3.7G of VRAM (😳), takes about an hour and a half to render on a laptop (🤪), it's not even as complex as I can make it (🤓), it's all geometry, no bump-map tricks (🤩), and I could finally check that there are indeed fans in my laptop (🙃)

I could do better on the render side; the complexity of the lighting means a lot of artifacts, and, of course, as I said before, I'm no artist, so some of the geometry is a bit iffy. Plus, it's only 1024 samples, because I wanted to be fast-ish.

But Blender continues to impress, and who knows? Maybe they'll add a Do-What-I-Want-Not-What-I-Type button in a future release that will magically enhance my skills.


[Dev Diaries] ELIZA

Back in the olden days...

Before the (oh so annoying) chatbots, before conversational machine-learning, before all of that, there was... ELIZA.

It is a weird little part of computer history that nerds like me enjoy immensely, but that is fairly unknown to the public.

If I ask random people when they think chatting with a bot became a Thing, they tend to respond "the 90s" or later (usually roughly ten years after they were born, for weird psychological reasons).

But back in the 60s, the Turing Test was a big thing indeed. Of course, nowadays, we know that this test, as it was envisioned, isn't that difficult to pass, but back then it was total fiction.

Enter Joseph Weizenbaum, working at MIT in the mid-60s, who decided to simplify the problem of random conversation with a Jedi mind trick: the program would be a stern doctor, not trying to ingratiate itself with the user. We talk to that kind of terse, no-nonsense person often enough that it could reasonably be assumed it wouldn't faze a normal person.

It's not exactly amicable, but it was convincing enough at the time for people to project some personality onto it. It became a real Frankenstein story: Weizenbaum was trying to show how stupid the program was, and how flawed the concept behind man-machine conversation, but users kept talking to it, sometimes even confiding in it as they would in a doctor. And the more Weizenbaum tried to show that it was a useless piece of junk with the same amount of intelligence as your toaster, the more people became convinced it was going to revolutionize psychiatry.

Weizenbaum even felt compelled to write a book about the limitations of computing, and the capacity of the human brain to anthropomorphise the things it interacts with, as if to say that to most people, everything is partly human-like or has human-analogue intentions.

He is considered to be one of the fathers of artificial intelligence, despite his attempts at explaining to everyone who would listen that it was something of a contradiction in terms.

Design

ELIZA was written in SLIP, a language that worked as a subset or an extension of Fortran, and later of ALGOL, and was designed to facilitate the use of compound lists (for instance (x1,x2,(y1,y2,y3),x3,x4)), which was something of a hard-ish thing to do back in the day.

By modern standards, the program itself is fairly simplistic:

  • the user types an input
  • the input is parsed for "keywords" that ELIZA knows about (e.g. I am, computer, I believe I, etc.), which are ranked more or less arbitrarily
  • depending on that "keyphrase", a variety of options are available like I don't understand that or Do computers frighten you?

Where ELIZA goes further than a standard decision tree is that it has access to references. It tries to take parts of the input and mix them into its answer, for example: I am X -> Why are you X?

It does that through something that would later become regular expression groups, plus the transformation of certain words or expressions into their respective counterparts.

For instance, something like I am like my father would be matched to ("I am ", "like my father"), then the response would be ("Why are you X?", "like my father"), then transformed to ("Why are you X?", "like your father"), then finally assembled into Why are you like your father?

Individually, both these steps are simple decompositions and substitutions. Using sed and regular expressions, we would use something like

$ sed -n "s/I am \(.*\)/Why are you \1?/p"
I am like my father
Why are you like my father?
$ echo "I am like my father" | sed -n "s/I am \(.*\)/Why are you \1?/p" | sed -n "s/my/your/p"
Why are you like your father?

Of course, ELIZA has a long list of my/your, me/you, ..., transformations, and multiple possibilities for each keyword, which, with a dash of randomness, allows the program to respond differently if you say the same thing twice.
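
To make the whole loop concrete, here's a toy version in Swift (the rule table is entirely made up and far smaller than the real script; Swift 5.7 regex literals assumed):

import Foundation

// Each rule: a pattern, and canned responses where %@ receives the
// reflected fragment.
let rules: [(pattern: Regex<(Substring, Substring)>, responses: [String])] = [
    (/I am (.+)/, ["Why are you %@?", "How long have you been %@?"]),
    (/I believe (.+)/, ["Do you really think %@?"]),
]
let reflections = ["my": "your", "me": "you", "i": "you"]

func eliza(_ input: String) -> String {
    for rule in rules {
        guard let match = input.firstMatch(of: rule.pattern) else { continue }
        // the my/your, me/you, ... substitution pass
        let reflected = match.1.split(separator: " ")
            .map { reflections[$0.lowercased()] ?? String($0) }
            .joined(separator: " ")
        // a dash of randomness, so saying the same thing twice
        // gets a different answer
        return rule.responses.randomElement()!
            .replacingOccurrences(of: "%@", with: reflected)
    }
    return "Please, go on."
}

print(eliza("I am like my father")) // Why are you like your father?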

But all in all, that's it. ELIZA is a very very simple program, from which emerges a complex behavior that a lot of people back then found spookily humanoid.

Taking a detour through (gasp) JS

One of the available "modern" implementations of ELIZA is in JavaScript, as are most things. Now, those who know me will have figured out fairly quickly that I have very little love for that language. But having a distaste for it doesn't mean I don't need to write code in it every now and again, and I had heard so much about the bafflement people feel when using regular expressions in JS that I had to try for myself. After all, two birds, one stone, etc... learn a feature of JS I didn't know, and resurrect an old friend.

As I said before, regular expressions (or regexs, or regexps) are relatively easy to understand, but a lot of people find them difficult to write. I'll just give you a couple of simple examples to get in the mood:

[A-Za-z]+;[A-Za-z]+

This will match any text that has 2 words (whatever the case of the letters) separated by a semicolon. Note the differentiation between uppercase and lowercase.
Basically, it says that I want to find a series of letters of length at least 1 (+), followed by ;, followed by another series of letters of length at least 1.

.*ish

Point (.) is a special character that means "any character", and * means "0 or more", so here I want to find anything ending in "ish"

Now, when you do search and replace (as is the case with ELIZA), or at least search and extract, you might want to know what is in this .* or [A-Za-z]+. To do that, you use groups:

(.*)ish

This will match the same strings of letters, but by putting it in parenthesiseseseseseseseseses (parenthesiiiiiiiiiiiii? damn. anyway), you instruct the program to remember it. It is then stored in variables with the very imaginative names of \1, \2, etc...

So in the above case, if I apply that regexp to "easyish", \1 will contain "easy"

Now, because you have all these special characters like the point and the parentheses and whatnot, you need to differentiate between the actual "." and "any character". We escape special characters with a backslash: \.

([A-Za-z]+)\.([A-Za-z]+)

This will match any two words with upper and lower case letters joined by a dot (and not any character, as would be the case if I didn't use \), and remember them in \1 and \2

Of course, there are a lot of crazy special cases and special characters, so, yes, regexps can be really hard to build. For reference, the Internet found me a regexp that looks for email addresses:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Yea... Moving on.

Now, let's talk about JavaScript's implementation of regular expressions. Spoiler alert: it's weird if you have used regexps in any language other than Perl. That's right, JS uses the Perl semantics.

In most languages, regular expressions are represented by strings. It is a tradeoff: you can manipulate them like any string (get their length, replace portions, build them out of string variables, etc), but it makes escaping nightmarish:

"^\\s*\\*\\s*(\\S)"

Because \ escapes the character that follows, you need to escape the escaper to keep it around: if you want \. as part of your regexp, more often than not, you need to type "\\." in your code. It's quite a drag, but the upside is that they work like any other string.

Now, in JS (and perl), regexps are a totally different type. They are not between quotes, but between slashes (eg /^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/). On one hand, you don't have to escape the slashes anymore and they more closely resemble the actual regexp, but on the other hand, they are harder to compose or build programmatically.

As I said, it's a different tradeoff, and to each their own.

Where it gets bonkers is how you use them. Because the class system is... what it is, and because there is no operator overloading, you can't really get the syntactic elegance of Perl, so it's kind of a bastard system where you might type something like

var myRe = /d(b+)d/;
var isOK = "cdbbdbsbz".match(myRe); // not null, because "dbbd" is in the string

match and matchAll aren't too bad, in the sense that they return the list of matching substrings (here, only one), or null, so the result does have some kind of meaning.

The problem arises when you need to use the dreaded exec function in order to use the regexp groups, or when you use the g flag in your regexp.

The returned thing (I refuse to call it an object) is both an array and a hashmap/object at the same time.

In result[0] you have the matched substring (here it would be "dbbd"), and in result[X] you have the \X equivalents (here \1 would be "bb", so that's what you find in result[1]). So far so not too bad.

But this array also behaves like an object: result.index gives you the index of "the match" which is probably the first one.

Not to mention that you use string.match(regex), but regex.exec(string)

const text = 'cdbbdbsbz';
const regex = /d(b+)d/g;
const found = regex.exec(text);

console.log(found);          // Array ["dbbd", "bb"]
console.log(found.index);    // 1
console.log(found["index"]); // 1

So, the result is a nullable array that sometimes works as an object. I'll let that sink in for a bit.

This is the end

Once I got the equivalence down pat, it was just a matter of copying the data and rewriting a few functions, and ELIZA was back, as a library, so that I can use it in CLI tools, iOS apps, or macOS apps.

When I'm done fixing the edge cases and tinkering with the ranking system, I might even publish it.

In the meantime, ELIZA and I are rekindling an old friendship on my phone!