It's crucial to remember that some bugs are easy to fix, and some are not. In many cases, the manifestation of the bug (e.g., an incorrect calculation on a utility bill) makes it immediately obvious to the computer programmer (a) where the bug is located, and (b) how to fix the problem. But in some cases, it's not easy to figure out a solution to the bug, even when we know where it's located; to compound the problem, sometimes we inadvertently introduce new bugs when we attempt to fix an old bug. Sometimes the act of fixing the bug also requires tedious, time-consuming "clean-up" activities, especially in the case of corrupted databases; without knowing the details, I would imagine this to be part of the problem that PNM Gas and Electric is facing.
- The problem was apparently widespread enough that the company felt compelled to send a letter to every customer. If it had been only a handful of people, they could have been contacted individually. But if a large number of customers were affected, then it will probably take months to correct all of the mistakes even after the computer bugs have been discovered and fixed.
- If it had been possible to find and fix the computer bugs within a matter of a few days, it's highly unlikely that PNM would have bothered sending a letter to every one of its customers; at most, an apology might have been included in the next month's utility bill. But the text of the letter clearly indicates that, as of the date that the letter was written, the debugging effort was still underway; the best we can hope for is that the problems will be "cleared up soon." Indeed, the company's web site indicates that the problems began when a new system was installed on November 30, 1998; the web site also shows a January 1999 press release, and a February 1999 "open letter" to customers who had not received any bills for more than two months. Note that the company did not promise any particular date for finishing the repairs; if the bugs had been simple to fix, one would imagine a confident statement that "we'll have all of these problems resolved no later than the end of this month." I admire PNM's honesty, and I congratulate the managers who resisted the temptation to make a promise that they might not have been able to keep.
In the worst case, we don't even know where the bug is located. And by the nature of Y2K projects (and, for that matter, any software development project), those are the kinds of bugs we're most likely to encounter once the computer system has been put into operation. After all, the "easy" bugs will have been found and fixed during the normal testing that precedes operational use of the system. It's the bugs we didn't find that remain lurking in the software, popping up and wreaking havoc after the system has been put into operation -- and while some of them are simple problems in an obscure part of the program that hardly anyone uses (and that therefore nobody bothered to test), most of them are the difficult, subtle, complex bugs.
The most difficult bugs usually involve "interface" problems between separately developed software systems; or interface bugs between a software system and a piece of computer hardware; or an interface problem that involves several different vendors of hardware, system software (e.g., operating systems), communications software (e.g., LAN networking software), and application programs. The bug may be located in hardware/software component A, but might not make its existence known there; instead, it may corrupt a piece of data that is sent on to component B. Component B may exacerbate the problem by further corrupting the data before passing it on to component C; and the situation may continue in this fashion until, finally, component X, much further downstream, "blows up" because the data it's dealing with is so badly corrupted. The focus of attention will initially be on X, and it might be correctly observed that X should have done more thorough error-checking to enable it to reject the erroneous data without blowing up. But where did the bad data come from? Not from X, and not from its immediate predecessor; the search moves laboriously backwards, often involving products and components developed by different vendors, or different business partners. Everyone ends up pointing fingers at one another, and one hears a chorus of complaints: "It's IBM's fault!" "It's the damn telephone company!" "Microsoft screwed up again!" "I told you we should never have trusted the software from XYZ supplier company!" Meanwhile, the clock is ticking. Hours pass, then days. It can literally be weeks before the problem is fixed; indeed, it's quite possible that PNM is going through just such an ordeal.
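To make this "downstream blowup" pattern concrete, here's a deliberately simplified sketch (in Python, with invented component names -- nothing here is specific to PNM or to any real vendor). The bug lives in the first component, but the only visible failure occurs two components downstream:

```python
# Hypothetical sketch of a multi-component pipeline: a bad date produced by one
# component only causes a visible failure several components downstream.
from datetime import date

def component_a(raw_year: int) -> dict:
    # The bug lives here: a two-digit year is passed along unchanged.
    return {"billing_year": raw_year}          # e.g. 0 instead of 2000

def component_b(record: dict) -> dict:
    # B doesn't validate; it simply "enriches" the already-corrupt record.
    record["billing_period"] = f"{record['billing_year']}-01"
    return record

def component_x(record: dict) -> date:
    # X finally rejects the data, far from where the corruption was introduced.
    year = record["billing_year"]
    if year < 1900:
        raise ValueError(f"corrupt billing year: {year}")
    return date(year, 1, 1)

record = component_a(raw_year=2000 % 100)      # the classic two-digit-year mistake
record = component_b(record)
try:
    component_x(record)
except ValueError as err:
    # The failure surfaces in X, two steps away from the real bug in A.
    print(f"component_x failed: {err}")
```

In a real multi-vendor installation, each of those "components" would be a separate product from a separate supplier -- which is precisely why the finger-pointing and the backwards search take so long.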
And then there are the embedded systems. They too may contain Y2K bugs that were overlooked during testing efforts, and they too may fail when post-2000 dates are encountered. But there's a significant difference between the repair of embedded systems, and the repair of "pure" software systems. In the latter case, the faulty computer instructions are rewritten or replaced, and the corrected program is re-compiled or re-assembled in a matter of minutes (in the case of a very large software system, it may take somewhat longer, but rarely more than an hour or two). The recompiled "binary" or "object" program can then be reinstalled on the computer, and life returns to normal.
But in the case of an embedded system, the faulty chip, logic board, or controller must be physically removed and replaced with a Y2K-compliant version. If a compliant spare is immediately available, the swap may take only moments (unless, of course, an entire production line has to be shut down, or the faulty component has to be retrieved from the ocean floor). The more likely scenario, however, is that the "spare" unit has the same Y2K logic error as the one that just failed! In that case, the computer engineers will have to contact the vendor, determine whether a Y2K-compliant version exists at all, and then place an order for immediate delivery. But "immediate" may be a matter of days; if hundreds of other companies are experiencing similar problems with the same embedded system (e.g., a faulty chip within a popular PBX telephone switchboard), there may be a backlog of orders to fill. Once again, days drag into weeks; the three-day winter snowstorm metaphor is unlikely to be relevant.
What percentage of the Y2K bugs are likely to be the quick, easy-to-fix variety, and what percentage are likely to be the difficult, time-consuming variety? Alas, nobody knows. Let me repeat this: nobody knows. If you don't believe me, try a simple test: ask your friendly bank, or your friendly telephone company, or your friendly electric utility, to provide you with statistics on the "mean time to repair" (MTTR) for all of their software problems during, say, the past five years. Software "maturity" surveys by organizations like the Software Engineering Institute strongly suggest that less than 10% of the software organizations around the world even have such data available -- and it's highly unlikely they'll share such data with anyone outside the organization.
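For the record, MTTR itself is trivial arithmetic -- the hard part is that very few organizations have the underlying repair records at all. Here's a minimal sketch, using an invented repair log, of the kind of figure you'd be asking them to produce:

```python
# Hypothetical repair log (hours to fix each defect); a real organization would
# need years of such data to quote a credible MTTR figure.
repair_times_hours = [2, 5, 1, 30, 4, 170, 3, 8, 45, 2]

mttr = sum(repair_times_hours) / len(repair_times_hours)
over_three_days = sum(1 for t in repair_times_hours if t > 72)

print(f"MTTR: {mttr:.1f} hours")                 # 27.0 hours for this made-up log
print(f"Fixes taking longer than 3 days: {over_three_days} of {len(repair_times_hours)}")
```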
Some organizations might use the familiar argument, "But historical data doesn't matter! Y2K projects are different!" To which you can respond by saying, "Fine -- show me the data from the testing phase of your Y2K projects. How many bugs did you find, and what was the MTTR for those bugs? How many new bugs did you introduce for each old Y2K bug that you removed?" If you can get a plausible answer to such questions, aggregated on an industry-wide basis, and validated by an independent auditing firm (e.g., something like the General Accounting Office in the federal government), then you might have a credible argument for the three-day snowstorm metaphor.
Otherwise, we can only make an educated guess about the likely outcome. Here's one such guess: let's assume that the Gartner Group is correct in its estimate that 30% of U.S. firms that have finished repairing their mission-critical systems will suffer at least one mission-critical failure. Let's also assume, optimistically, that 80% of those failures will be of the simple variety, requiring less than three days to fix. That still leaves 20% that will take longer than three days, potentially much longer. That means 30% times 20%, or a total of 6%, of the U.S. firms will be facing at least one mission-critical failure requiring more than three days to repair. By itself, that's not in the "doomsday" category, though it will certainly be more than a winter snowstorm if that 6% happens to include your local utility company, or your local hospital, or your local bank. But don't forget the earlier part of our analysis: while it's possible that only a small portion of the Fortune 500 companies will be suffering from long-term mission-critical problems, they'll also be suffering from failures of their non-mission-critical systems. And this will be compounded by an even larger number of failures on the part of small businesses (some of whom are suppliers to the Fortune 500), small towns (where the employees of the Fortune 500 companies live), and small countries (who represent import/export trading partners for the Fortune 500 companies). Unless I'm missing something very basic here, I don't understand how the cumulative effect of all these problems will result in something as modest as a three-day winter snowstorm.
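For readers who want to check the arithmetic, here's the same back-of-the-envelope estimate expressed in a few lines of Python; the two input percentages are the assumptions stated above, not measured data:

```python
# Back-of-the-envelope version of the estimate in the text.
share_with_mission_critical_failure = 0.30    # Gartner Group estimate cited above
share_of_failures_fixed_within_3_days = 0.80  # optimistic assumption

share_facing_long_outage = share_with_mission_critical_failure * \
    (1 - share_of_failures_fixed_within_3_days)

print(f"{share_facing_long_outage:.0%} of firms face a mission-critical "
      "failure taking more than three days to repair")   # 6%
```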
If it's that obvious and straightforward, then why do government leaders continue to advocate the three-day snowstorm metaphor so strenuously? This is not a casual question: as Y2K guru Steve Davis points out, the Federal Emergency Management Agency (FEMA) used to recommend that citizens stockpile 14 days of food and water in case of unexpected emergencies such as blizzards, hurricanes, or earthquakes. As Davis points out on his web site:
"You may be interested to know that while FEMA is only recommending two to three days of preparedness for Y2K, they have traditionally recommended two weeks of preparedness as a short-term disaster supply kit. Due to the conflict that this presented to them, they removed this two week guidance from their web site. However, I had a copy of it so you can still find it here. See the FEMA document "Food and Water in an Emergency" which says:

"Short-Term Supplies: Even though it is unlikely that an emergency would cut off your food supply for two weeks, you should prepare a supply that will last that long. The easiest way to develop a two-week stockpile is to increase the amount of basic foods you normally keep on your shelves."

This suggests that FEMA is conveying the message, "We used to suggest that you stockpile two weeks of food for hurricanes, blizzards, earthquakes, and other emergencies. But now we're focusing on Y2K, and we think it will be less severe than those other kinds of emergencies. In fact, we'd prefer that you forget about all those other kinds of emergencies and restrict your preparedness efforts to a mere 2-3 days of stockpiling for this mild 'bump in the road' called Y2K."
So, what should we conclude about all of this? There are only a few possible explanations for the government's determined efforts to popularize the three-day winter snowstorm metaphor:
- The government is privy to some amazing secrets about a positive outcome to the Y2K situation that veteran software engineers (like me) have never heard about, and could not figure out on their own.
- The government doesn't really know what will happen, but hopes that a combination of edicts, mandates, orders, and the power of positive thinking will somehow accomplish Y2K miracles that have never happened in the software industry in the past 30 years.
- The government actually does understand that things are likely to be far worse than publicly admitted, but has decided that it's not a good idea to say so.

Every citizen needs to decide for himself which of these three explanations is most plausible. And based on that decision, each citizen needs to decide for himself how his personal plans for Y2K preparedness will be affected by the public advice and recommendations from the government. I have my own opinion about all of this, and I plan to discuss it at length in a future essay; but in the meantime, it's important that you form your own opinion so that you can make your own decisions. Take some time this evening or this weekend, find a quiet place, and think about this: does it really make sense to assume that Y2K won't be any worse than a three-day winter snowstorm, or should you plan for something more serious? As for me: I've come to the conclusion that, instead of a three-day bump in the road, it's more likely that Y2K will be "the year of living dangerously."