Back in the olden days...
Before the (oh so annoying) chatbots, before conversational machine-learning, before all of that, there was... ELIZA.
It is a weird little part of computer history that nerds like me enjoy immensely, but that is fairly unknown to the general public.
If I ask random people when they think chatting with a bot became a Thing, they tend to respond "the 90s" or later (usually roughly ten years after they were born, for weird psychological reasons).
But back in the 60s, the Turing Test was a big thing indeed. Of course, nowadays, we know that this test, as it was envisioned, isn't that difficult to pass, but back then it was total fiction.
Enter Joseph Weizenbaum, working at MIT in the mid 60s, who decided to simplify the problem of open-ended conversation with a Jedi mind trick: the program would play a stern doctor, not trying to ingratiate itself with the user. We talk to that kind of terse, no-nonsense person often enough that it could reasonably be assumed a normal user wouldn't be fazed by it.
It's not exactly amicable, but it was convincing enough at the time for people to project some personality onto it. It became a real Frankenstein story: Weizenbaum was trying to show how stupid the program was, and how stupid the whole concept of man-machine conversation was, but users kept talking to it, sometimes even confiding in it as they would in a doctor. And the more Weizenbaum tried to show that it was a useless piece of junk with the same amount of intelligence as your toaster, the more people became convinced it was going to revolutionize the world of psychiatry.
Weizenbaum even felt compelled to write a book about the limitations of computing, and the capacity of the human brain to anthropomorphise the things it interacts with, as if to say that to most people, everything is partly human-like or has human-analogue intentions.
He is considered one of the fathers of artificial intelligence, despite his attempts at explaining to everyone who would listen that the term was something of a contradiction in terms.
Design
ELIZA was written in SLIP, a language that worked as a subset or an extension of Fortran, and later ALGOL, and was designed to facilitate the use of compound lists (for instance (x1,x2,(y1,y2,y3),x3,x4)), which was something of a hard-ish thing to do back in the day.
By modern standards, the program itself is fairly simplistic:
- the user types an input
- the input is parsed for "keywords" that ELIZA knows about (eg "I am", "computer", "I believe I", etc), which are ranked more or less arbitrarily
- depending on that "keyphrase", a variety of options are available, like "I don't understand that" or "Do computers frighten you?"
Where ELIZA goes further than a standard decision tree is that it has access to references: it tries to take parts of the input and mix them into its answer, for example "I am X" -> "Why are you X?".
It does that through something that would later become regular expression groups, and by transforming certain words or expressions into their respective counterparts.
For instance, something like "I am like my father" would be matched to ("I am ", "like my father"), then the response would be ("Why are you X?", "like my father"), then transformed to ("Why are you X?", "like your father"), and finally assembled into "Why are you like your father?".
Individually, both these steps are simple decompositions and substitutions. Using sed and regular expressions, we would use something like:
$ sed -n "s/I am \(.*\)/Why are you \1?/p"
I am like my father
Why are you like my father?
$ echo "I am like my father" | sed -n "s/I am \(.*\)/Why are you \1?/p" | sed -n "s/my/your/p"
Why are you like your father?
Of course, ELIZA has a long list of my/your, me/you, ..., transformations, and multiple possibilities for each keyword, which, with a dash of randomness, allows the program to respond differently if you say the same thing twice.
But all in all, that's it. ELIZA is a very very simple program, from which emerges a complex behavior that a lot of people back then found spookily humanoid.
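To make the whole mechanism concrete, here is a minimal Javascript sketch of that loop. The keyword list, decomposition patterns, reflection table and template placeholder below are made up for illustration; they are not Weizenbaum's actual script data, just the general shape of it:
// Pronoun/verb swaps applied to the captured fragment ("my father" -> "your father")
const reflections = { "my": "your", "me": "you", "i": "you", "am": "are" };

// Each keyword has a rank, a decomposition pattern and possible reassembly templates
const keywords = [
  { rank: 2, pattern: /I am (.*)/i, answers: ["Why are you {0}?", "How long have you been {0}?"] },
  { rank: 1, pattern: /computer/i,  answers: ["Do computers frighten you?"] },
];

function reflect(fragment) {
  return fragment
    .split(/\s+/)
    .map(word => reflections[word.toLowerCase()] || word)
    .join(" ");
}

function respond(input) {
  // Try keywords by decreasing rank, keep the first one that matches
  const sorted = [...keywords].sort((a, b) => b.rank - a.rank);
  for (const kw of sorted) {
    const match = kw.pattern.exec(input);
    if (match) {
      // Pick a reassembly template (a dash of randomness) and plug the reflected fragment back in
      const template = kw.answers[Math.floor(Math.random() * kw.answers.length)];
      return template.replace("{0}", reflect(match[1] || ""));
    }
  }
  return "I don't understand that";
}

console.log(respond("I am like my father")); // e.g. "Why are you like your father?"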
Taking a detour through (gasp) JS
One of the available "modern" implementations of ELIZA is in Javascript, as are most things. Now, those who know me figure out fairly quickly that I have very little love for that language. But having a distaste for it doesn't mean I don't need to write code in it every now and again, and I had heard so much about the bafflement people feel when using regular expressions in JS that I had to try it for myself. After all, two birds, one stone, etc... Learn a feature of JS I do not know, and resurrect an old friend.
As I said before, regular expressions (or regexs, or regexps) are relatively easy to understand, but a lot of people find them difficult to write. I'll just give you a couple of simple examples to get in the mood:
[A-Za-z]+;[A-Za-z]+
This will match any text that has two words (whatever the case of the letters) separated by a semicolon. Note the distinction between uppercase and lowercase.
Basically, it says that I want to find a series of letters of length at least 1 (the +), followed by a ;, followed by another series of letters of length at least 1.
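A quick way to check that (jumping ahead to JS a little; test simply tells you whether the regexp matches somewhere in the string), on made-up inputs:
const twoWords = /[A-Za-z]+;[A-Za-z]+/;
console.log(twoWords.test("Hello;world"));   // true
console.log(twoWords.test("Hello ; world")); // false, spaces are not letters
console.log(twoWords.test("Hello;"));        // false, nothing after the semicolon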
.*ish
The dot (.) is a special character that means "any character", and * means "0 or more", so here I want to find anything ending in "ish".
Now, when you do search and replace (as is the case with ELIZA), or at least search and extract, you might want to know what is inside this .* or [A-Za-z]+. To do that, you use groups:
(.*)ish
This will match the same strings of letters, but by putting it in parenthesiseseseseseseseseses (parenthesiiiiiiiiiiiii? damn. anyway), you instruct the program to remember it. It is then stored in variables with the very imaginative names of \1, \2, etc...
So in the above case, if I apply that regexp to "easyish", \1 will contain "easy".
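Or, in JS terms (again jumping ahead a little, match will come back later):
const ishRe = /(.*)ish/;
const result = "easyish".match(ishRe);
console.log(result[0]); // "easyish" -- the whole match
console.log(result[1]); // "easy"    -- what the group captured, aka \1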
Now, because you have all these special characters like the dot and the parentheses and whatnot, you need a way to differentiate between the actual "." and "any character". We escape those special characters with a backslash (\):
([A-Za-z]+)\.([A-Za-z]+)
This will match any two words with upper and lower case letters joined by a dot (and not by any character, as would be the case if I didn't use \), and remember them in \1 and \2.
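For instance, on a made-up file name:
const dotted = /([A-Za-z]+)\.([A-Za-z]+)/;
const parts = "readme.txt".match(dotted);
console.log(parts[1], parts[2]);         // "readme" "txt"
console.log("readme-txt".match(dotted)); // null, "-" is not a literal dot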
Of course, we have a lot of crazy special cases and special characters, so, yes, regexps can be really hard to build. For reference, the Internet found me a regexp that looks for email addresses:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Yea... Moving on.
Now, let's talk about Javascript's implementation of regular expressions. Spoiler alert: it's weird if you have used regexps in any language other than perl. That's right, JS uses the perl semantics.
In most languages, regular expressions are represented by strings. It is a tradeoff: you can manipulate them like any other string (get their length, replace portions of them, build them out of string variables, etc), but it makes escaping nightmarish:
"^\\s*\\*\\s*(\\S)"
Because \ escapes the character that follows, you need to escape the escaper to keep it around: if you want \. as part of your regexp, more often than not, you need to type "\\." in your code. It's quite a drag, but the upside is that they work like any other string.
Now, in JS (and perl), regexps are a totally different type. They are not between quotes, but between slashes (eg /^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/). On one hand, you don't have to double up the backslashes anymore and the literal more closely resembles the actual regexp, but on the other hand, they are harder to compose or build programmatically.
As I said, it's a different tradeoff, and to each their own.
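For what it's worth, JS actually gives you both forms: the literal between slashes, and a RegExp constructor that takes a string, double escaping included. A quick side-by-side on a made-up example:
// Built from a string: every backslash has to be doubled
const fromString = new RegExp("^\\s*\\*\\s*(\\S)");
// The same regexp as a literal: what you see is what you get
const literal = /^\s*\*\s*(\S)/;

const line = "   * bullet point";
console.log(fromString.exec(line)[1]); // "b"
console.log(literal.exec(line)[1]);    // "b"

// The string form is easier to build programmatically, though
const word = "father";
const dynamic = new RegExp("I am (.* " + word + ")");
console.log(dynamic.exec("I am like my father")[1]); // "like my father"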
Where it gets bonkers is how you use them. Because the class system is... what it is, and because there is no operator overloading, you can't really get the syntactic elegance of perl, so it's kind of a bastard system where you might type something like
var myRe = /d(b+)d/;
var isOK = "cdbbdbsbz".match(myRe); // not null because "dbbd" is in the string
match and matchAll aren't too bad: match returns the list of matching substrings (here, only one) or null, and matchAll returns an iterator over the matches, so the result does have some kind of meaning.
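For instance (extending the test string a little so that there are two matches):
const sample = "cdbbdbsbz dbd";
console.log(sample.match(/d(b+)d/g));
// ["dbbd", "dbd"] -- with the g flag, just the matched substrings, no groups
console.log("no match here".match(/d(b+)d/g));
// null
for (const m of sample.matchAll(/d(b+)d/g)) {
  console.log(m[0], m[1]); // "dbbd" "bb", then "dbd" "b"
}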
The problem arises when you need to use the dreaded exec function in order to use the regexp groups, or when you use the g flag in your regexp.
The returned thing (I refuse to call it an object) is both an array and a hashmap/object at the same time.
In result[0] you have the matched substring (here it would be "dbbd"), and in result[X] you have the \X equivalents (here \1 would be "bb", so that's what you find in result[1]). So far, so not too bad.
But this array also behaves like an object: result.index gives you the index of "the match", which is probably the first one.
Not to mention that you use string.match(regex), but regex.exec(string):
const text = 'cdbbdbsbz';
const regex = /d(b+)d/g;
const found = regex.exec(text);
console.log(found);
console.log(found.index);
console.log(found["index"]);
Array ["dbbd", "bb"]
1
1
So, the result is a nullable array that sometimes works as an object. I'll let that sink in for a bit.
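One more quirk while we're at it: with the g flag, exec is also stateful. Each call resumes searching just past the previous match (via the regexp's lastIndex), which is why you typically call it in a loop:
const re = /d(b+)d/g;
const input = "cdbbdbsbz dbd";
let m;
while ((m = re.exec(input)) !== null) {
  // each call to exec() resumes at re.lastIndex, just past the previous match
  console.log(m[0], m[1], "at index", m.index);
}
// dbbd bb at index 1
// dbd b at index 10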
This is the end
Once I got the equivalence down pat, it was just a matter of copying the data and rewriting a few functions, and ELIZA was back, as a library, so that I could use it in CLI tools, iOS apps, or macOS apps.
When I'm done fixing the edge cases and tinkering with the ranking system, I might even publish it.
In the meantime, ELIZA and I are rekindling an old friendship on my phone!