18 December 2008

Computers feel that they should "crash", too!

I was discussing gold vs. bits in a computer with Yogi in a thread below here and mentioned that gold would still be around when computers failed. Yogi quipped, "Is that something you have been waiting for since they were invented?" Well, he knows that I am in computers for my living, so I don't know what he was thinking. My answer was that I have lived with computers failing for the entire 30 years of my career. It's what I do for a living: program computers and fix failing computer systems. The number of ways that computers can fail is almost too numerous to count.

Computers are dependent on so many support systems that it's a wonder they work as well as they do. Moreover, they are so complex on their own that they can't be counted on to last for very long at all. As the amount of memory in a computer increased, the number of random errors occurring in that memory increased with it. These errors are caused by alpha particles striking the memory cells and flipping a stored bit, an event called a bit error. When a bit error occurs, a computer may be forced to shut down. The solution was to use Error Checking and Correcting memory, or ECC. Well, how does an error get detected, much less corrected? The answer is that it takes an extra 7 bits for every 32 bits to check for and correct a single-bit error in each 32-bit word. So the amount of memory required in most mission-critical systems is roughly 22% greater, purely because of the susceptibility of computers to failure from a single bit error.
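To make that concrete, here is a minimal sketch (not real ECC hardware logic, just the textbook Hamming single-error-correcting scheme) of how those extra bits locate and flip back a corrupted bit. For 32 data bits this scheme needs 6 check bits, plus one overall parity bit for double-error detection, which is where the 7 extra bits per 32 come from; the example below uses 8 data bits to keep the output small.

```python
# Sketch of Hamming SEC encoding with an extra overall parity bit (SECDED).
# Not production code -- just enough to show how a single flipped bit can
# be both detected and located, then corrected.

def hamming_encode(data_bits):
    """data_bits: list of 0/1. Returns codeword plus an overall parity bit."""
    m = len(data_bits)
    r = 0
    while 2 ** r < m + r + 1:       # number of check bits needed
        r += 1
    n = m + r
    code = [0] * (n + 1)            # 1-indexed positions 1..n
    j = 0
    for pos in range(1, n + 1):     # data goes in non-power-of-two positions
        if pos & (pos - 1):
            code[pos] = data_bits[j]
            j += 1
    for i in range(r):              # check bit at 2**i covers positions with bit i set
        p = 2 ** i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    overall = 0
    for pos in range(1, n + 1):
        overall ^= code[pos]
    return code[1:] + [overall]     # the overall bit enables double-error detection

def hamming_syndrome(codeword):
    """Returns the 1-based position of a single flipped bit, or 0 if clean."""
    n = len(codeword) - 1           # last element is the overall parity bit
    syndrome = 0
    for pos in range(1, n + 1):
        if codeword[pos - 1]:
            syndrome ^= pos
    return syndrome

data = [1, 0, 1, 1, 0, 0, 1, 0]     # 8 data bits; the same idea scales to 32
word = hamming_encode(data)
word[4] ^= 1                        # simulate an alpha-particle bit flip
pos = hamming_syndrome(word)
word[pos - 1] ^= 1                  # correct the single-bit error in place
print("corrected bit at position", pos)
```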

Sun found out in early 2000 what a problem it could be to design a CPU running at 400MHz without ECC in the cache. The 440MHz and 400MHz SPARC CPUs had problems with single-bit errors. Sun had parity checking on their bus paths, and when a parity error occurred they were forced to stop the CPU. Well, it turned out that in their multi-CPU systems this was causing machines to halt quite frequently, i.e. more than once a month per system in some cases. I had several big customers that were victims of this particular failure, and it was devastating when it happened because the system failed and rebooted. Let's not even get into what happens to a mission-critical application when that occurs in the middle of a busy day.
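Here is a small illustration (my own sketch, not Sun's actual hardware logic) of why parity alone forces a halt: a single parity bit tells you that some bit in the word flipped, but not which one, so the data cannot be repaired and the only safe move is to stop.

```python
# Parity can detect a single-bit error but cannot locate it.

def parity(word_bits):
    p = 0
    for b in word_bits:
        p ^= b
    return p

word = [1, 0, 1, 1, 0, 1, 0, 0]
stored_parity = parity(word)

word[3] ^= 1                      # a single bit flips in the cache

if parity(word) != stored_parity:
    # we know the word is corrupt, but every bit is an equally likely
    # culprit -- nothing to do but report a fatal error and halt
    print("cache parity error detected: data unrecoverable, halting CPU")
```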

Of course, people designing computer-based systems have been aware of the need for redundancy for years; this is why things started moving from mainframes to servers in a distributed environment in the mid 90s. The distributed approach allows for an increase in the capacity of the system, but it leads to greater complexity in the software. Why, you ask? Well, consider that all those computers now have to communicate and coordinate in order to divvy up the work they are doing. That requires more complex software, and it also requires a large number of additional communication paths that weren't there on the monolithic mainframe. It also puts a lot more computers out there, with lots more opportunities for failures to be introduced. The mean time between failures decreases with the number of computers in the system, so a computer failure in a distributed system is much more likely than in a mainframe system. However, because of the redundancy, the failure of a single computer need not bring the system down, though it often slows it down.
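A back-of-the-envelope sketch of that shrinking MTBF: assuming independent failures and a purely illustrative per-server MTBF of 50,000 hours (my number, not from any vendor), the expected time until the first failure somewhere in the cluster falls roughly as MTBF divided by the number of machines.

```python
# How quickly "time to the first failure anywhere" shrinks as you add servers.

single_mtbf_hours = 50_000          # hypothetical figure for one server

for n in (1, 10, 100, 1000):
    cluster_mtbf = single_mtbf_hours / n
    print(f"{n:5d} servers -> first failure expected every "
          f"{cluster_mtbf:8.1f} hours ({cluster_mtbf / 24:7.1f} days)")
```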

The third area of failure that began to occur with greater regularity in these distributed systems was with disks. When a disk fails, the data on it is usually lost immediately. It's true that in some cases much of the data can be recovered, but not usually. So we saw the emergence of what are called Redundant Arrays of Inexpensive Disks, or RAID. Depending on the RAID level, this can require two to three times the amount of disk capacity relative to the actual data stored, and all those extra disks further decrease the MTBF. The good news with RAID arrays is that when a disk fails, no data is lost. But if the failed disk is not replaced, there is a real risk of losing data. So RAID arrays are always being updated with bigger, newer, faster models.
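A toy sketch of the XOR parity idea used by RAID levels such as RAID 5 (the disk contents below are made up): the parity block is the XOR of the data blocks, so when any single disk dies its contents can be recomputed from the survivors, as long as it is replaced before a second disk fails.

```python
# Rebuilding a failed "disk" from the surviving disks plus the parity block.
from functools import reduce

disk_a = bytes([0x12, 0x34, 0x56, 0x78])
disk_b = bytes([0x9a, 0xbc, 0xde, 0xf0])
disk_c = bytes([0x0f, 0x1e, 0x2d, 0x3c])

def xor_blocks(*blocks):
    # byte-wise XOR across all the blocks
    return bytes(reduce(lambda x, y: x ^ y, chunk) for chunk in zip(*blocks))

parity = xor_blocks(disk_a, disk_b, disk_c)

# disk_b fails: reconstruct it from the remaining disks and the parity
rebuilt_b = xor_blocks(disk_a, disk_c, parity)
print(rebuilt_b == disk_b)          # True -- no data lost
```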

Because RAID arrays were restricted to a single computer, Storage Area Networks, or SANs, were invented. These are even more complicated devices that allow several big computers to share storage over an optical network. Of course they require multiple data paths from multiple computers into them, so they are far more complex than even a simple RAID array. All of this complexity creates a tremendous need for people to service it, and it consumes lavish amounts of power, both directly through the components themselves and indirectly to keep everything cool. This stuff will literally burn itself up if it's not cooled night and day. Of course that costs money and energy, two things we seem to be running short of in the US of late.
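For a feel of the multipathing idea, here is a simplified sketch (hypothetical path names, not any real SAN or multipath API): each host sees the same storage through several independent paths, and I/O falls over to the next path when one is down.

```python
# Failover across redundant paths to shared storage (illustrative only).

PATHS = ["hba0:fabric_a", "hba1:fabric_b"]   # hypothetical redundant paths
FAILED_PATHS = {"hba0:fabric_a"}             # pretend fabric A just lost a switch

def path_is_healthy(path):
    # stand-in for a real path health check
    return path not in FAILED_PATHS

def read_block(block_id):
    # try each path in order, failing over when one is unhealthy
    for path in PATHS:
        if path_is_healthy(path):
            return f"block {block_id} read via {path}"
    raise IOError("all paths to the storage array are down")

print(read_block(42))               # falls over to hba1:fabric_b
```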

OK, you might think, wow, what a mess; can it get worse? Well, of course it can, because we haven't even begun to touch on how software can cause computers to fail, or how overloading a computer system can cause it to fail. Those topics will be the subject of another thread sometime in the future. For now, just consider yourselves lucky that most of you only have to deal with one or a few computers failing you from time to time.

My experience with computers is why I deride bits in a computer on a regular basis on this forum. I know that to those of you who are ignorant of all this it seems like I am some sort of Luddite, but I promise you, I am anything but that.

1 comment:

Anonymous said...

We should strive for simplicity rather than complexity. Less not more.

Having been around computers most of my life, I am aware of how extremely delicate they really are. The points of failure are nearly infinite; it boggles the mind that they work at all.