The Lost Art of Defensive Programming

When I was a programmer lass, back in the days when the internet was spelled ARPANET and www wasn’t yet a thing, I learned my craft from the teachings of giants.

I worked for a company that was very big on fault tolerance because stock exchanges and bank ATM networks trusted them with their money.  Things couldn’t go down for no reason, and you couldn’t lose data – that would cost money.

Even simple things like system maintenance and installing and configuring software had to work or be backed out in very careful ways, and often under very restricted conditions. There were windows for doing these operations, and they were carefully planned and audited, and any operation that resulted in an extended window cost money.

Other people’s money or stock transactions or phone calls or health information was on the line, so the companies that used these platforms were literally betting their business and their reputation on our reliability.

So I learned to anticipate errors and protect data using transactions that were guaranteed, even in the face of failures, and came to understand that the transaction monitor was the heart of the business. For maintenance operations and manageability, where I spent most of my time, I learned to anticipate problems and make it very clear what was going wrong. And I learned to be wary of conditions I’d never seen before and couldn’t imagine happening under normal circumstances, and to make sure that they were visible, that the software failed fast, and that things were recoverable, either backwards or forwards.

I also learned the most basic, fundamental part of debugging – step through your code. First as a mental desk check: back when I started, it took a while to trot over to the computer room to thread 8″ reel tapes so I could move new software to the test systems, and even longer if I had to generate a new system image. So I learned to read through and ask at each statement, “Is this right? What could go wrong here?” It doesn’t take long for small changes, and it saves lots of time.

And then I learned to step through the code with a debugger (though, yes, I can debug with print statements too), watching statement by statement, to see that the values returned from calls matched my expectations, both good and bad. And to remind myself to ask, “What can go wrong here?”

Because that’s really the key for defensive programming. It’s not to ask, “How is this going to go in a happy-day case?” but “What can go wrong here?  How can the program detect it?  How can it report the anomaly so users can understand how to address it?  Can it keep going in the face of this situation? Or does this violate something fundamental so that execution must stop/reset?”

Why? Because things change – memory and swap space fill up, network configurations differ, someone forgets to run as an admin user, system configurations evolve, networks hiccup, and on and on and on. And things fail in mysterious ways.

The art of defensive programming seems to be lost today, especially in scripting and high-level languages that make it easier to just chain things together in a happy-day string of commands. “I know that if I split this string at the ‘>’ and take the next 8 characters then I get the current time in 24 hour format” only works if the output doesn’t change (I know, let’s prepend the date!) and someone doesn’t use the special character > for something unexpected and if there is an actual time there and it’s hh:mm:ss with no subsecond decimals and 24 hour format, and on and on and on….
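Here is a minimal sketch of what a less trusting version of that split might look like in Python. The line format (a ‘>’ followed immediately by hh:mm:ss) and the function name `extract_time` are my assumptions for illustration, not any real tool’s output. The point is that the code states exactly what it expects to find and refuses anything else, rather than slicing 8 characters and hoping.

```python
import re
from datetime import datetime, time

# Assumed format for this sketch: a '>' immediately followed by hh:mm:ss,
# with nothing time-like (digits, '.', ':') trailing the seconds.
_TIME_AFTER_MARKER = re.compile(r">(\d{2}:\d{2}:\d{2})(?![\d.:])")


def extract_time(line: str) -> time:
    """Return the 24-hour time that follows '>' in line, or raise ValueError."""
    matches = _TIME_AFTER_MARKER.findall(line)
    if len(matches) != 1:
        # Zero matches: the output format changed. Several: it is ambiguous.
        raise ValueError(
            f"expected exactly one '>hh:mm:ss' field, found {len(matches)} in {line!r}"
        )
    try:
        # strptime rejects out-of-range values such as 25:00:00.
        return datetime.strptime(matches[0], "%H:%M:%S").time()
    except ValueError as exc:
        raise ValueError(
            f"field {matches[0]!r} is not a valid 24-hour time in {line!r}"
        ) from exc
```

A prepended date, subsecond decimals, or a second time field all end in a `ValueError` that quotes the offending line, instead of a plausible-looking wrong answer.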

I regularly encounter scripts that blithely assume that all is well when doing simple things, like copying files to nodes in a cluster or grepping to pull out the line that has the space available for a particular file system, or that assume that if the command fails, there is a single well-known reason and the way to work around it is thus-and-so, no additional error checking required. Just mush on. Any other failure gets blame-the-user, “You should configure things my way” (your /etc/hosts contains two different aliases for a server? Should only contain one! And the first one MUST be the one this program is looking for!) or gets ignored.

Most recently, I had lots of fun when I got errors in a multi-step function right after a warning that said, “Warning – you might see errors in the next phase but that’s OK”. I got errors, but apparently not the right errors, so I shouldn’t have ignored them because they were NOT ok. As a user, how am I supposed to know which errors the author had in mind as ignorable? I call these fire alarm problems, as in, “we’re testing the fire alarm between 10:00 and 12:00, so if you hear alarm bells, it’s not a fire.” Well…. it might be….

Let me get all curmudgeonly for a second. “Move fast and break things” is cool and all, but it does NOT translate into “keep fixing one problem at a time until you teach your code not to barf on the errors that you’ve encountered in testing and first-round deployment in the rarefied air that is your small, not-scaled test environment.” Code in the wild encounters plenty of situations you haven’t seen in testing…. And if you’re delivering a product or stack, dollars to doughnuts, it will encounter all kinds of things you’ve never seen before.

The easiest way that I’ve found to deal with this, especially in a dynamic administration or manageability kind of world, is to code for the result I expect, rather than what I don’t want to see. Every place something can go wrong, define what “done” looks like and test for that. If I’ve just created a file, does the file exist where I think it does? If I’ve just written to it, then not only should the write count be what I expect, but the file should have the expected data in it.
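As a rough sketch of that file example in Python (the function name `write_and_verify` is mine, purely for illustration), “done” is spelled out as three things that must all be true: the write count matches, the file exists where I think it does, and reading it back yields what I wrote.

```python
import os
import tempfile
from pathlib import Path


def write_and_verify(path: Path, data: bytes) -> None:
    """Write data to path, then confirm the result looks like 'done'."""
    with open(path, "wb") as f:
        written = f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to the device before checking

    # Check 1: the write count is what I expect.
    if written != len(data):
        raise OSError(f"short write to {path}: {written} of {len(data)} bytes")

    # Check 2: the file exists where I think it does.
    if not path.is_file():
        raise OSError(f"{path} does not exist after a write that reported success")

    # Check 3: the file holds the expected data.
    actual = path.read_bytes()
    if actual != data:
        raise OSError(
            f"{path} holds {len(actual)} bytes that differ from the {len(data)} written"
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_and_verify(Path(tmp) / "settings.conf", b"retries=3\n")
        print("verified")
```

Each check fails with a message that names the path and the numbers involved, so whoever sees it does not have to guess which of the three expectations was violated.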

It’s really simple, and really hard at the same time.

In the end, there are really two kinds of checks – checks you add for debugging only (notice that I checked the success of my file write two different ways), and checks for unexpected failures that can happen any time.  The former checks can be removed in production code (checking the write count should be sufficient after the program is done, or even simply validated when stepping through via a debugger).  The latter should always be there (accessing the file you created a second ago might fail for any number of reasons, so check every single time). When in doubt, leave it in, and tune performance later.  Speed may be sexy, but only until you hit a wall:  Prematurely optimizing these out using an  “of course it works!” programming methodology inevitably costs more than the performance overhead of leaving them in.

And check for positives and what you expect to happen, not negatives, since it’s easier to think of ways that things can go right than it is to be sure that you’re covering all the ways that things go wrong. Network names can be standardized differently, clustered logical servers might not have consecutive server names or match the order of the alias names, clocks might not be synchronized, dates might be returned in different formats, filenames can have embedded special characters. And those are the simple obscure bugs.

Finally, if a known non-blocking error can occur, check for it explicitly (down to parsing the error/exception info and ensuring that it precisely matches the signature you’re looking for – a file-doesn’t-exist error on the correct file name) and either suppress it or change it to an info message. Anything that doesn’t match that signature is still an error and should be reported as one.
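In Python, a sketch of that kind of narrow suppression might look like the following. The helper name `remove_if_present` and the logger name are assumptions of mine. Only one signature is tolerated: a file-doesn’t-exist error on the exact path that was requested. Everything else propagates.

```python
import errno
import logging
import os
import tempfile

log = logging.getLogger("cleanup")  # hypothetical logger name


def remove_if_present(path: str) -> bool:
    """Delete path. Return True if removed, False if it was already absent."""
    try:
        os.remove(path)
    except FileNotFoundError as exc:
        # Tolerate only the known, harmless case: ENOENT on the file we named.
        if exc.errno == errno.ENOENT and exc.filename == path:
            log.info("nothing to remove, %s does not exist", path)
            return False
        raise  # same exception type, different signature: still an error
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "stale.lock")
        print(remove_if_present(target))  # the file was never created
```

A permissions problem, or a path that turns out to be a directory, raises a different exception and is not swallowed, which is the whole point.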

And when something does go wrong, for the love of all that is programmatically kind and good, please include all the information you can when reporting the issue, ideally with a unique identifier for the location and invocation.  It will help the person trying to understand why a formerly well-behaved program has flipped out at 3am… and the person trying to understand the situation might well be you.
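One way to sketch that in Python, with every name here (`describe_failure`, `INVOCATION_ID`, the `COPY-017` location code) invented for illustration: tag each failure message with a fixed code for the place in the program, an ID generated once per run, and whatever context was in hand at the time.

```python
import os
import platform
import uuid

# Generated once per run, so every message from this invocation can be correlated.
INVOCATION_ID = uuid.uuid4().hex


def describe_failure(location: str, action: str, exc: BaseException, **context) -> str:
    """Build a one-line failure report carrying location, invocation, and context."""
    details = " ".join(f"{key}={value!r}" for key, value in sorted(context.items()))
    message = (
        f"[{location}] invocation={INVOCATION_ID} "
        f"host={platform.node()} pid={os.getpid()} "
        f"action={action!r} error={type(exc).__name__}: {exc} {details}"
    )
    return message.rstrip()


if __name__ == "__main__":
    try:
        open("/no/such/dir/payload.tar", "rb")
    except OSError as err:
        # "COPY-017" is a hypothetical location code, unique to this call site.
        print(describe_failure("COPY-017", "open source file", err,
                               src="/no/such/dir/payload.tar", node="node3"))
```

The location code lets someone search the source for the exact call site, and the invocation ID lets them pull every related line out of a shared log at 3am.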