You Will Never Take The Debugging Out!

Continuing on the somewhat long-winded, grandiloquent course on software development, something that sticks out these days is the way people convince themselves that bug-free applications can exist. It’s like the Loch Ness Monster: in theory it might exist, and there’s no way to disprove it totally, but every attempt at finding it so far has failed.

Mind you, I’m not saying it’s not a good objective to have a bug-free application rolling out. It’s just very very very unlikely.

A computer is a very complex ecosystem: a lot of pieces of software are running on the same hardware and can have conflicting relationships. The frameworks you are using might have gaping holes or hidden pitfalls. Your own work might deal with problems in a way that might not be fully working at the next revision of the OS you support. Or you were rushed the day you wrote these ten lines of code that now make the application blow up in mid-air.

And that’s OK!

As long as the bugs are acknowledged and fixed in a timely fashion, they are part of the natural life-cycle of an application. Contrary to biological life, which allows for a certain margin of error and kind of self-corrects, computer programs either work or don’t; they can’t really swerve back onto the right track. That’s why most people see bugs differently than they see mistakes in other aspects of life: computers are rather unforgiving appliances, and software relies on them and expresses itself solely through them. And given that computers are put in the hands of people who by and large don’t expect to have to learn the software to be able to use it, that’s a recipe for disappointment.

Back when I used to teach, I would tell my students of the very first application I released (DesInstaller), and the first “bug report” I got. It went along the lines of “Your software is a piece of crap! Every time I hit cmd alt right-shift F1 and enter, it crashes!”.

First off, this was absolutely true. The application did indeed crash on such occasions. Therefore it is a genuine bug. The real question is “how in hell would I have found that bug in my development cycle?”. I can’t even type that shortcut without hurting my fingers in the process, so the chances of me finding that crash condition were pretty much nil.

When I write an application, it’s hard for me to imagine a user would use it in a barbaric fashion. Whenever I test it, I have to somehow change the way I think to put myself in the user’s shoes. It is incredibly hard to do as you probably know from personal experience. However, somehow, we have to do it. What we just cannot do is to add on top of that the layer of complexity that is the interaction with other pieces of software competing for the same resources. It would be like trying to figure out in advance where the next person who’ll bounce into you in the street will be.

Anyway, I digress. This piece is not about explaining why there will probably never be any bug-free application out there; it’s about the mindset we have to put ourselves in when making an application: debugging is vital, and is here to stay.

So, right off the bat, resources have to be allocated to debugging and the debugging skillset has to be acquired by any person or company toddling in the software business. QA is not optional, and fine tuning isn’t either.

Basically, once a bug has been reported, there are several ways to deal with it depending on the ramifications of the bug (is the application unusable, or is it a forgivable glitch?), the depth of the bug (is it something caused by just one exact and somewhat small cause, or is it a whole family of problems rolled into one visible symptom?), the fixability of the bug (will correcting it imply 10 lines of changed code, or 50% of the base to be re-written?), and the probable durability of the fix (will it hold for a long time, or is it something that will break again at the next OS update?). Identifying these factors is crucial, and hard. Exceptionally so in some cases.

1/ Ramifications

This is a relatively (compared to the others, at any rate) easy thing to identify. You are supposed to know what your target audience is, after all. If your users are seasoned pros, they might overlook the fact that your application crashes .01% of the time of their day-to-day use of it.
Broader audiences might be trickier, because public opinion is such a volatile thing: bad publicity could sink your product’s sales real fast. Then again, the general bugginess of some systems/applications out there and the lack of public outcry seem to indicate the general public is kind of lenient, provided the bugs are fixed and not too frequent.

2/ Depth

That is probably the hardest thing to figure out. Having sketchy circumstances and symptoms, especially in a somewhat big piece of software, makes finding out the real depth of a bug little more than an educated guess. Crash logs help, of course, but even crash logs (the exact state of the computer program at the moment of the crash) don’t tell the whole story.

I trust that any developer worth his/her salt won’t leave any divide-by-zero or somesuch bugs in the code, especially if time has been made for the QA leg of the development cycle. Therefore, when I talk about bugs, I’m not really talking about that easily-fixed family of bugs, where having the exact position in the code where it crashed tells the whole story.

Complex bugs tend to come from complex reasons. Knowing where it happens helps. How you got to this precise point with this precise state has yet to be figured out.

Has the user chosen an action that should be illegal and therefore should have been filtered out before we got to that point? Is there a flaw in the reasoning our lines of code are built on (aka “building sand castles”)? Is there an unforeseen bug in the underlying framework or library we are using? Is there a hardware component to this bug?

The approach I tend to use in these cases is bottom up. It’s not the shortest way by a long shot, but it tends to root out other potential bugs in the process:

  1. I start from the line in my code where the application crashed (as soon as I have found it, which can be an adventure in itself)
  2. I seek all the calling paths that might have taken me to this point in my code (using a caller/callee graph such as the ones Doxygen generates)
  3. I prune these paths based on logic: in all these branches, I remove every truly impossible one (due to arguments or logic gates such as if/else)
  4. I then consolidate the branches together as much as possible (if this branch is executed then this one is as well, so might as well group them together) to minimize the variability of the input set. Behind these barbaric words is a very simple concept: find out how many switches you have at the top that are truly independent from each other.
  5. I build a program that uses these branches, takes a set of values corresponding to the entry set, and runs through every combination of values for each input entry

Sometimes you can do it all in your head if the program is small, but in any program that has a few thousand lines of code, being thorough generally means going through this process.

Once I get there, though, what I have is a list of every permutation causing a crash. The rule of thumb is, the shorter this list, the shallower the bug.
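As a rough illustration of step 5, here is a minimal Python sketch of such a harness. Everything in it is made up for the example: render_preview stands in for the code under investigation, and its three inputs (has_selection, use_cache, zoom) stand in for the independent switches found in step 4. It simply tries every combination and records the ones that blow up.

```python
from itertools import product


def find_crashing_inputs(func, input_space):
    """Call func with every combination of inputs and collect the ones that raise."""
    names = list(input_space)
    crashes = []
    for values in product(*(input_space[name] for name in names)):
        args = dict(zip(names, values))
        try:
            func(**args)
        except Exception as exc:
            crashes.append((args, type(exc).__name__))
    return crashes


# Hypothetical function under investigation (not from any real application).
def render_preview(has_selection, use_cache, zoom):
    items = ["a", "b", "c"] if has_selection else []
    if use_cache and zoom > 1:
        # The cached path silently assumes there is at least one item.
        return items[0] * zoom
    return "".join(items) * zoom


# The independent "switches" identified while consolidating the branches.
input_space = {
    "has_selection": [True, False],
    "use_cache": [False, True],
    "zoom": [1, 2],
}

crashes = find_crashing_inputs(render_preview, input_space)
for args, error in crashes:
    print(args, "->", error)
```

Out of the eight combinations, only one crashes here (no selection, cache on, zoom above 1), which by the rule of thumb above would point to a fairly shallow bug.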

3/ Fixability

Counter-intuitively, this might actually be completely independent from the depth of the bug. A very deep bug that has a very small footprint in terms of causes can take as little as one line of code to fix (example: using an unsigned int where a signed one was needed). The time it takes to find the bug is not really related to the time it takes to fix it.
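To make the unsigned-versus-signed example concrete, here is a small sketch. Python integers have no signedness, so it borrows the fixed-width types from the standard ctypes module to mimic what a C-style 32-bit variable would hold; the function names and the capacity/used scenario are invented for the illustration.

```python
import ctypes


def remaining_unsigned(capacity, used):
    # Mimics storing the result in an unsigned 32-bit integer: it wraps around.
    return ctypes.c_uint32(capacity - used).value


def remaining_signed(capacity, used):
    # Mimics storing the result in a signed 32-bit integer: it can go negative.
    return ctypes.c_int32(capacity - used).value


print(remaining_unsigned(3, 5))  # 4294967294
print(remaining_signed(3, 5))    # -2
```

A check such as "is there any room left?" passes happily on the first value and fails on the second. The fix is a one-word change of type, but nothing about the symptom points at it.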

The problem here is more like taking the opposite path from finding the bug: if I change this to fix the root cause of my bug, what impact will it have on anything that was built on top of the function the bug was in? In many ways, if you want to be systematic about it, this process is actually longer than the previous one: starting from the line of code you just fixed, you have to examine every path containing that section and look hard for any implied modification.

A real world example I could take is changing a cogwheel in a mechanism: are all the connected ones the right size too? if it changes another wheel, what impact does it have on the ones connected to it? etc etc.

It can be very long and very tiresome, or it can be a walk in the park, depending on the redundancy and the structure of your program.
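One way to get a first list of the cogwheels to inspect is to walk the call graph backwards from the function you changed. The sketch below assumes you already have the graph as a plain dictionary (for instance, transcribed from whatever tool you use to produce caller graphs); the function names in it are hypothetical.

```python
from collections import deque


def affected_by_change(call_graph, changed):
    """Return every function that directly or indirectly calls `changed`."""
    # Invert the graph: for each callee, remember who calls it.
    callers = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk up the chain of callers.
    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical call graph: each function maps to the functions it calls.
call_graph = {
    "main": ["load_file", "render"],
    "load_file": ["parse_header"],
    "render": ["parse_header", "draw"],
    "draw": [],
    "parse_header": [],
}

print(sorted(affected_by_change(call_graph, "parse_header")))
# ['load_file', 'main', 'render']
```

This only tells you where to look, not what to change: each function in the result still has to be read with the fix in mind.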

4/ Durability

This is the trickiest one of them all, because sometimes, there is just not enough information to figure it out. It’s not hard per se, but it depends on so many factors that the best thing you will achieve is mostly a bet.

The first two factors to consider are actually points 2 and 3, especially if what you’re tasked to give is an estimate of the time it will take to fix a bug or how many resources will be needed to fix it. Since those two can be really hard to evaluate, anything that stands on them is bound to be even shakier.

Then you have to factor in the general quality of the platform the program is running on (do they introduce new bugs every now and then? Are any of the frameworks you based your application on susceptible to change?), the kind of users you have (do they tinker a lot? Is your program part of a chain of tools that might change as well?), and the time that can realistically be invested in finding out what the bug actually is and how to fix it.

Time to conclude, I guess. Bugs are here to stay. Accept it from the beginning and rather than hoping for the best, prepare for the worst.

For developers, make debugging-friendly code. That means factoring your code as much as possible (if you fix a bug once, it’s fixed everywhere), having clear types and possible values for your parameters (none of that “if I cast it, it stops complaining”), and, I know it’s a little old-fashioned but, output debugging data in human readable format at various choke points.
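Here is a small sketch of what those last two points could look like in practice, using only Python’s standard logging and enum modules. The Unit type and the format_size function are invented for the example: the parameter can only take the values the type allows, bad input is rejected loudly at the entry point, and the choke point logs what came in and what went out in plain text.

```python
import logging
from enum import Enum

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("app.export")


class Unit(Enum):
    """The only units the function accepts (hypothetical example type)."""
    BYTES = 1
    KILOBYTES = 1024


def format_size(size_in_bytes: int, unit: Unit) -> str:
    # Refuse anything outside the documented types and values.
    if not isinstance(unit, Unit):
        raise TypeError(f"unit must be a Unit, got {unit!r}")
    if size_in_bytes < 0:
        raise ValueError(f"size_in_bytes must be >= 0, got {size_in_bytes}")

    # Choke point: record the inputs and the output in human-readable form.
    logger.debug("format_size(size_in_bytes=%d, unit=%s)", size_in_bytes, unit.name)
    result = f"{size_in_bytes / unit.value:.1f} {unit.name.lower()}"
    logger.debug("format_size -> %r", result)
    return result


print(format_size(2048, Unit.KILOBYTES))  # 2.0 kilobytes
```

When a crash report comes in, the log then shows which values reached the choke point, rather than leaving you to reconstruct them from a stack trace.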

For project managers, make time for QA and debugging. It’s not shameful to have a bug in the product. It is shameful, though, to have a stupid bug that could easily have been fixed if the dev team had had one more day. Don’t assume the developer who asks for a little more time, especially if he’s able to tell you the reasoning behind it, is a lazy bastard who should have done his job better the first time around. There is no certain metric for the chance of a bug happening. My rule of thumb is that there will be a minor bug for every few user-centered features, and one major bug every couple of thousand lines of code.

And for end-users, be firm, but fair. While the number of bugs is not linked to the size of the company putting the application out there, finding out about one takes time and resources. If you paid for a piece of software, you sure have a right to ask the developer to fix it. And any developer who is proud of his work will definitely fix it. It may take a little while, though, depending on the bug, the size of the company, the number of products they have out there and the size of said products. Even I, as a developer, don’t know how long it will take to fix a bug, so don’t expect anything instantaneous. Better have a good surprise than insanely high expectations, right?
That being said, if your requests are ignored after a couple of releases, you have every right to remove your patronage and stop paying for the software or the service. It’s up to you to decide if the fix is vital to your workflow or not. But think twice before advocating foul play.

