Failure is Not an Option: Technology

A quick introduction to the laws of probability

Nobody can predict when a given system is going to fail, but we can make predictions about how many systems are going to fail in a given period of time. To do this, we have to have a basic understanding of probability theory. Normally, this will be done by your S&M person.

The probability of any event e occurring (in a given period of time) is P_e, which is a floating point number in the range of 0 to 1, inclusive. An event which cannot happen has a P_impossible of 0. An event which must happen has a P_certain of 1. If you toss a fair coin, there are only two possibilities, P_heads=0.5 and P_tails=0.5. Note that 0.5+0.5=1.0.

The probability that the computer will be down in a given unit of time is the Mean Time To Repair (MTTR) divided by the Mean Time Between Failures (MTBF): P_failure=MTTR/MTBF. So, if the MTTR is 2 hours and the MTBF is 9000 hours (roughly a year), then P_failure=2/9000=0.000222.
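The arithmetic can be sketched in a few lines of Python. The function name `hourly_failure_probability` is made up for illustration, and the sketch assumes MTTR and MTBF are given in the same units.

```python
def hourly_failure_probability(mttr_hours: float, mtbf_hours: float) -> float:
    """Approximate probability that a machine is down in a given hour: MTTR / MTBF."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return mttr_hours / mtbf_hours


p_failure = hourly_failure_probability(2, 9000)
print(f"{p_failure:.6f}")  # prints 0.000222
```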

Note that the event either occurs or does not occur. So, for example, if the machine kernel panics and reboots automatically, that is a failure for whatever quantum of time you use. Even if the machine reboots in three minutes, if you measure reliability in hours, it is considered down for the whole hour. Of course, if the machine kernel panics and reboots three times in that hour, it's still considered a single failure, but that's an unusual failure mode. In fact, kernel panics followed by a successful reboot are pretty rare, at least in the Linux world. It is more common for a hardware problem to take the machine down and leave it down. My intuition from years of watching computers break suggests to me that an hour is about the right quantum of time for discussing such things, but that's just my preference.

The probability that both of two independent events will occur is the product of the probabilities of the individual events. So, for example, if the P_failure of a computer is 0.000222222 and you have two such computers, then the probability that the two of them will fail in the same hour is:
P_{failure_1 ∩ failure_2}=P_{failure_1}*P_{failure_2}=0.000222222*0.000222222=4.93826E-08
(∩ is the symbol for intersection). There are 365.25*24=8766 hours in a year, so the probability that the two computers will fail in the same hour in a given year is approximately 8766*4.93826E-08 or 0.000432888.
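As a rough check of the two-computer figures, here is a minimal Python sketch. It assumes the failures are independent and uses P_failure=2/9000. The variable names are illustrative only. Multiplying the hourly probability by 8766 is an approximation, so the sketch also computes the exact "at least one bad hour in the year" value for comparison.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

p_failure = 2 / 9000

# Both machines down in the same hour (independent events multiply).
p_both_same_hour = p_failure * p_failure

# Approximation used in the text: add up the hourly probabilities over a year.
p_both_in_year_approx = HOURS_PER_YEAR * p_both_same_hour

# Exact form under the same assumptions: 1 - P(no bad hour all year).
p_both_in_year_exact = 1 - (1 - p_both_same_hour) ** HOURS_PER_YEAR

print(f"{p_both_same_hour:.5e}")
print(f"{p_both_in_year_approx:.9f}")
print(f"{p_both_in_year_exact:.9f}")
```

The two yearly figures agree to about six decimal places, which is why the simpler multiplication is good enough here.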

We usually quote reliability in terms of "nines". Since reliability is the complement of failure, P_success=1-P_failure. In the example above, 1-0.000432888=0.999567. This is three nines reliability in one year.
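Counting nines can be sketched as follows. The helper `nines` is hypothetical, and it assumes the convention that "nines" means the number of leading 9s in 1-P_failure. It takes the failure probability rather than the reliability, because subtracting a number very close to 1 from 1 in floating point loses precision.

```python
import math


def nines(p_failure: float) -> int:
    """Count the leading nines of the reliability 1 - p_failure."""
    if not 0 < p_failure <= 1:
        raise ValueError("p_failure must be in (0, 1]")
    # 0.000432888 -> -log10 is about 3.36 -> 3 nines
    return math.floor(-math.log10(p_failure))


print(nines(0.000432888))  # prints 3
```

Values that sit exactly on a power of ten (such as 0.001) may be sensitive to floating point rounding, so treat the result there with some care.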

The astute reader will ask: why calculate the odds of failure instead of the odds of success? We can, and we come up with the same answer. The probability that at least one of two independent events will occur is the sum of the probabilities of the events minus the probability that both occur, or P_{success_1 ∪ success_2}=P_{success_1}+P_{success_2} - P_{success_1}*P_{success_2} (the symbol ∪ means union). The subtracted term may seem counterintuitive, but think about it using this table (where the column widths and heights are proportional to the probabilities):

		P_e1	P_~e1
	Probability	.7	.3
P_e2	.8	P_e2∩e1=.56	P_e2∩~e1=.24
P_~e2	.2	P_~e2∩e1=.14	P_~e2∩~e1=.06

Note that .56+.14=0.7, and so on, and .56+.24+.14+.06=1.0. The problem with the intuitive thought of simply adding the two probabilities is that the case of P_e2∩e1 is counted twice: .7+.8=1.5, but .7+.8-.56=.94. Another way to think of it is
P_{success_1 ∪ success_2}=P_{success_2 ∩ fail_1}+P_{success_2 ∩ success_1}+P_{fail_2 ∩ success_1}
P_{success_1 ∪ success_2}=P_{success_2}*P_{fail_1}+P_{success_2}*P_{success_1}+P_{fail_2}*P_{success_1}
In the case of the numbers above,
P_{success_1 ∪ success_2} = 8998/9000 * 2/9000 + 8998/9000 * 8998/9000 + 2/9000 * 8998/9000=0.999999950617284. But that's for a given hour. To find the probability of success over a year, you must compute P_{success_year}=1-(1-P_{success_hour})*8766, which gives 0.999567, which is again 3 nines. So the reason we do these calculations in terms of failure is that it's easier to work with probabilities of intersections than with probabilities of unions.
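The union can be computed three equivalent ways, and a short Python sketch makes it easy to confirm that they agree. The function names are illustrative only, and all three assume the two events are independent.

```python
def p_either(p1: float, p2: float) -> float:
    """Sum of the two probabilities minus the doubly counted overlap."""
    return p1 + p2 - p1 * p2


def p_either_disjoint(p1: float, p2: float) -> float:
    """Sum of the three non-overlapping cells of the table."""
    return p2 * (1 - p1) + p2 * p1 + (1 - p2) * p1


def p_either_complement(p1: float, p2: float) -> float:
    """One minus the probability that neither event occurs."""
    return 1 - (1 - p1) * (1 - p2)


# Table values: P_e1 = .7, P_e2 = .8
print(f"{p_either(0.7, 0.8):.2f}")  # prints 0.94

# Two computers, each up with probability 8998/9000 in a given hour.
p_hour = p_either(8998 / 9000, 8998 / 9000)
p_year = 1 - 8766 * (1 - p_hour)
print(f"{p_year:.6f}")  # prints 0.999567
```

The complement form is the failure-based calculation in disguise, which is the point of the paragraph above: the intersection of failures is the easy quantity to compute.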

If you have N servers, any one of which is capable of doing the job, then P_{simultaneous_failure}=P_{failure_1}*P_{failure_2}*...*P_{failure_N}. If all the machines are equally likely to fail, then P_{simultaneous_failure}=P_{failure_1}^N, which is much easier to calculate. It should be a small number. For example, if P_failure=2/9000=0.000222222 and you have 8 computers, any 1 of which is up to carrying the load, then P_{simultaneous_failure}=0.000222222^8 or 5.94698*10^-30. So 8 computers is probably overkill.
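A minimal sketch of the N-server case, assuming independent and equally likely failures. The name `p_all_fail` is hypothetical.

```python
def p_all_fail(p_failure: float, n: int) -> float:
    """Probability that all n independent machines are down in the same hour."""
    if n < 1:
        raise ValueError("need at least one machine")
    return p_failure ** n


for n in (1, 2, 4, 8):
    print(n, f"{p_all_fail(2 / 9000, n):.3e}")
```

Printing a few values of N side by side shows how quickly the probability collapses, which is why a handful of redundant machines is usually plenty.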

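What about the probability of a failure if you need M computers out of N computers? The system fails when more than N-M computers are down in the same hour, so you sum the binomial probabilities of every such outcome: P_failure=Σ_{k=N-M+1..N} C(N,k)*P_{failure_1}^k*(1-P_{failure_1})^(N-k), where C(N,k) is the number of ways to choose k machines out of N. The first term dominates when P_{failure_1} is small. So, if P_{failure_1}=2/9000 and you need 4 computers out of 8, then P_failure is about C(8,5)*(2/9000)^5, or roughly 3*10^-17. Calculating the same answer using probabilities of success is left as an exercise for the reader.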
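One way to work the M-out-of-N case, assuming independent and equally likely failures, is a binomial sum over every outcome in which too many machines are down. This is a sketch under those assumptions, and the function name is made up for illustration.

```python
from math import comb


def p_fewer_than_m_up(p_failure: float, n: int, m: int) -> float:
    """Probability that fewer than m of n independent machines are up."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    # The service fails when the number of failed machines k exceeds n - m.
    return sum(
        comb(n, k) * p_failure ** k * (1 - p_failure) ** (n - k)
        for k in range(n - m + 1, n + 1)
    )


p = 2 / 9000
print(f"{p_fewer_than_m_up(p, 8, 1):.3e}")  # any 1 of 8 is enough
print(f"{p_fewer_than_m_up(p, 8, 4):.3e}")  # need 4 of 8
```

With M=1 the sum has a single term and reduces to P_failure^N, the simultaneous-failure case.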

A complicating factor

The discussion above assumes that the probabilities are independent. This is true for most but not all failure modes. A fire in the computer room, power failure (the most common failure mode), cooling failure, a fire below the computer room (see the stories: the fried computers), earthquake, insurrection, and volcanic eruption are all disasters that will affect all computers. Simultaneous resignation of the operations staff is an interesting disaster: it has happened.

To deal with these possibilities, you've got to have a remote facility for disaster recovery (DR). DR is a whole chapter unto itself.
