Do you want to hear about that one time all the apparatus of modern software engineering – unit tests and continuous integration – actually helped the programmer to break code, rather than fix it?
I was working on some code a bunch of other people wrote. It wasn’t a particularly large code base, but it had grown organically, with contributions from several people. There was a suite of hundreds of unit and integration tests, and a continuous integration system that ran the tests on every commit. It’s a good, modern setup, right?
I was going to make one tiny change to a small part of the code. I opened it up and started to follow the relevant logic. I kept going over the code and I just couldn’t understand how it worked. In fact, I decided, it actually didn’t work! But it was passing all the tests. I mean ALL of the hundreds of tests. And this code had been in the codebase for years! So, of course, the logical conclusion was that I just didn’t understand the code.
But first, I would do a test. I constructed a small input which I thought would crash the code, and I ran it. And it did crash the code! What were the odds that I would be able to craft a simple, straightforward example that would crash code that was passing hundreds of tests and had been in production for years?
Now, I couldn’t ask the people who wrote the code about it anymore, so I don’t know how this actually happened, but here is my best guess: someone made a set of changes to the code base. Then they ran the test suite and found failures. I suspect this person or persons were under severe time pressure. Instead of systematically tracing each failure, they started to guess what the problem was, and based on these guesses they put special-case clauses into the code: “if this, then that” patterns in multiple places, piled up, until all the tests passed again. So what they had done was figure out, somewhat randomly, how to make the new code pass the tests.
Purveyors of fine statistical methods will detect here a whiff of that dreaded malodor – overfitting. Instead of teaching the code the general principles of the problem it was supposed to solve, the programmer(s) had caused the code to effectively memorize the answers to the particular set of questions posed by the test suite.
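To make the pattern concrete, here is a toy sketch of what such overfitted code looks like – entirely hypothetical, not the actual code I found. The function is nominally supposed to return the largest element of a list, but it has been patched, one special case per failing test, until the suite went green:

```python
def max_value(xs):
    """Supposedly returns the largest element of xs."""
    # Special cases bolted on under time pressure, one per failing test:
    if xs == [3, 1, 2]:
        return 3
    if xs == [5]:
        return 5
    if xs == [-1, -7]:
        return -1
    # Any input the test suite never exercised falls through to here.
    raise ValueError("unexpected input")

# The entire "test suite" -- all green:
assert max_value([3, 1, 2]) == 3
assert max_value([5]) == 5
assert max_value([-1, -7]) == -1
```

Every test passes, yet the function solves nothing in general: the first input that isn’t in the memorized list, say `max_value([2, 9, 4])`, blows up.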
Now here I come, with my ad hoc (n+1)th test case. And the machine falls over, because it hasn’t been programmed to solve the problem – only taught to produce the correct answers to that particular set of tests.
This episode was very interesting to me from a sociological point of view. Here was a case where all the modern apparatus of software development – unit tests, continuous integration – worked exactly opposite to their intended effect.