Concepts of Probability and Reliability
Probability may be defined as a ratio of specific outcomes to total (possible) outcomes. If you were to flip a coin, there are really only two possibilities1 for how that coin may land: face-up (“heads”) or face-down (“tails”). The probability of a coin falling “tails” is thus one-half (1/2), since “tails” is but one specific outcome out of two total possibilities. Calculating the probability (P) is a matter of setting up a ratio of outcomes:

P = (number of specific outcomes) / (total number of possible outcomes)

P(“tails”) = 1/2 = 0.5
This may be shown graphically by displaying all possible outcomes for the coin’s landing (“heads” or “tails”), with the one specific outcome we’re interested in (“tails”) highlighted for emphasis:
The probability of the coin landing “heads” is of course exactly the same, because “heads” is also one specific outcome out of two total possibilities.
If we were to roll a six-sided die, the probability of that die landing on any particular side (let’s say the “four” side) is one out of six, because we’re looking at one specific outcome out of six total possibilities:
If we were to roll the same six-sided die, the probability of that die landing on an even-numbered side (2, 4, or 6) is three out of six, because we’re looking at three specific outcomes out of six total possibilities:
As a ratio of specific outcomes to total possible outcomes, the probability of any event will always be a number ranging in value from 0 to 1, inclusive. This value may be expressed as a fraction (1/2), as a decimal (0.5), or as a verbal statement (e.g. “three out of six”). A probability value of zero (0) means a specific event is impossible, while a probability of one (1) means a specific event is guaranteed to occur.
Probability values realistically apply only to large samples. A coin tossed ten times may very well fail to land “heads” exactly five times and land “tails” exactly five times. For that matter, it may fail to land on each side exactly 500,000 times out of a million tosses. However, so long as the coin and the coin-tossing method are fair (i.e. not biased in any way), the experimental results will approach2 the ideal probability value as the number of trials approaches infinity. Ideal probability values become less and less certain as the number of trials decreases, and become completely useless for singular (non-repeatable) events.
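The convergence of experimental results toward the ideal probability value may be demonstrated with a short simulation. The following is a minimal sketch (the helper name and the fixed random seed are my own choices, not from the text):

```python
import random

def estimated_probability(trials, seed=1):
    """Estimate P("tails") by simulating repeated tosses of a fair coin."""
    rng = random.Random(seed)  # fixed seed for repeatability
    tails = sum(rng.random() < 0.5 for _ in range(trials))
    return tails / trials
```

With only a handful of trials the estimate may stray far from 0.5, but as the number of trials grows large, the estimate closes in on the ideal probability value.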
A familiar application of probability values is the forecasting of meteorological events such as rainfall. When a weather forecast service provides a rainfall prediction of 65% for a particular day, it means that out of a large number of days sampled in the past having similar measured conditions (cloud cover, barometric pressure, temperature and dew point, etc.), 65% of those days experienced rainfall. This past history gives us some idea of how likely rainfall will be for any present situation, based on similarity of measured conditions.
Like all probability values, forecasts of rainfall are more meaningful with greater samples. If we wish to know how many days with measured conditions similar to those of the forecast day will experience rainfall over the next ten years (3650 days total), the forecast probability value of 65% will be quite accurate. However, if we wish to know whether or not rain will fall on any particular (single) day having those same conditions, the value of 65% tells us very little. So it is with all measurements of probability: precise for large samples, ambiguous for small samples, and virtually meaningless for singular conditions3.
In the field of instrumentation – and more specifically the field of safety instrumented systems – probability is useful for the mitigation of hazards based on equipment failures where the probability of failure for specific pieces of equipment is known from mass production of that equipment and years of data gathered describing the reliability of the equipment. If we have data showing the probabilities of failure for different pieces of equipment, we may use this data to calculate the probability of failure for the system as a whole. Furthermore, we may apply certain mathematical laws of probability to calculate system reliability for different equipment configurations, and therefore minimize the probability of system failure by optimizing those configurations.
Just like weather predictions, predictions of system reliability (or conversely, of system failure) become more accurate as the sample size grows larger. Given an accurate probabilistic model of system reliability, a system (or a set of systems) with enough individual components, and a sufficiently long time-frame, an organization may accurately predict the number of system failures and the cost of those failures (or alternatively, the cost of minimizing those failures through preventive maintenance). However, no probabilistic model will accurately predict which component in a large system will fail tomorrow, much less precisely 1000 days from now.
The ultimate purpose, then, in probability calculations for process systems and automation is to optimize the safety and availability of large systems over many years of time. Calculations of reliability, while useful to the technician in understanding the nature of system failures and how to minimize them, are actually more valuable (more meaningful) at the enterprise level. At the time of this writing (2009), there is already a strong trend in large-scale industrial control systems to provide more meaningful information to business managers in addition to the basic regulatory functions intrinsic to instrument loops, such that the control system actually functions as an optimizing engine for the enterprise as a whole4, and not just for individual loops. I can easily foresee a day when control systems additionally calculate their own reliability based on manufacturer’s test data (demonstrated Mean Time Between Failures and the like), maintenance records, and process history, offering forecasts of impending failure in the same way weather services offer forecasts of future rainfall.
Probability mathematics bears an interesting similarity to Boolean algebra in that probability values (like Boolean values) range between zero (0) and one (1). The difference, of course, is that while Boolean variables may only have values equal to zero or one, probability variables range continuously between those limits. Given this similarity, we may apply standard Boolean operations such as NOT, AND, and OR to probabilities. These Boolean operations lead us to our first “laws” of probability for combination events.
The logical “NOT” function
For instance, if we know the probability of rolling a “four” on a six-sided die is 1/6, then we may safely say the probability of not rolling a “four” is 5/6, the complement of 1/6. The common “inverter” logic symbol is shown here representing the complementation function, turning a probability of rolling a “four” into the probability of not rolling a “four”:
Symbolically, we may express this as a sum of probabilities equal to one:

P(“four”) + P(not “four”) = 1

1/6 + 5/6 = 1
We may state this as a general “law” of complementation for any event (A):

P(A) + P(not A) = 1

or, equivalently: P(not A) = 1 − P(A)
The complement of a probability value finds frequent use in reliability engineering. If we know the probability value for the failure of a component (i.e. how likely it is to fail), then we know the reliability value (i.e. how likely it is to function properly) will be the complement of its failure probability. To illustrate, consider a device with a failure probability of 0.00001. Such a device could be said to have a reliability (R) value of 0.99999, or 99.999%, since 1 − 0.00001 = 0.99999.
The logical “AND” function
The AND function regards probabilities of two or more intersecting events (i.e. where the outcome of interest only happens if two or more events happen together, or in a specific sequence). Another example using a die is the probability of rolling a “four” on the first toss, then rolling a “one” on the second toss. It should be intuitively obvious that the probability of rolling this specific combination of values will be less (i.e. less likely) than rolling either of those values in a single toss, since both events must occur for the combination to occur. The shaded field of possibilities (36 in all) demonstrates the unlikelihood of this sequential combination of values compared to the unlikelihood of either value on either toss:
As you can see, there is but one outcome matching the specific criteria out of 36 total possible outcomes. This yields a probability value of one-in-thirty-six (1/36) for the specified combination, which is the product of the individual probabilities (1/6 × 1/6 = 1/36). This, then, is our second law of probability:

P(A and B) = P(A) × P(B)
Consider as a practical example a double-block valve assembly, where two shutoff valves are installed in series so that process fluid flow may be stopped by either valve. Suppose valve 1 has a probability of failing open (failing to shut when needed) of 0.0002, and valve 2 a probability of 0.0003. With these two valves in service, the probability of neither valve successfully shutting off flow (i.e. both valve 1 and valve 2 failing on demand; remaining open when they should shut) is the product of their individual failure probabilities:
P(assembly fail) = P(valve 1 fail open) × P(valve 2 fail open)
P(assembly fail) = 0.0002 × 0.0003
P(assembly fail) = 0.00000006 = 6 × 10−8
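The AND law lends itself to a one-line calculation. A minimal sketch, using the failure probabilities from the valve example (the function name `p_and` is mine, chosen for clarity, and independence of the failures is assumed):

```python
def p_and(*probabilities):
    """Probability of independent events all occurring: the product of
    their individual probabilities (the AND law)."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result

# Both block valves failing open on demand:
p_assembly_fail = p_and(0.0002, 0.0003)  # 6 x 10^-8
```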
An extremely important assumption in performing such an AND calculation is that the probabilities of failure for each valve are not related. For instance, if the failure probabilities of both valve 1 and valve 2 were largely based on the possibility of a certain residue accumulating inside the valve mechanism (causing the mechanism to freeze in the open position), and both valves were equally susceptible to this residue accumulation, there would be virtually no advantage to having double block valves. If said residue were to accumulate in the piping, it would affect both valves practically the same. Thus, the failure of one valve due to this effect would virtually ensure the failure of the other valve as well. The probability of simultaneous or sequential events being the product of the individual events’ probabilities is true if and only if the events in question are completely independent.
We may illustrate the same caveat with the sequential rolling of a die. Our previous calculation showed the probability of rolling a “four” on the first toss and a “one” on the second toss to be 1/6 × 1/6, or 1/36. However, if the person throwing the die is extremely consistent in their throwing technique and the way they orient the die after each throw, such that rolling a “four” on one toss makes it very likely to roll a “one” on the next toss, the sequential events of a “four” followed by a “one” would be far more likely than if the two events were completely random and independent. The probability calculation of 1/6 × 1/6 = 1/36 holds true only if all the throws’ results are completely unrelated to each other.
Another, similar application of the Boolean AND function to probability is the calculation of system reliability (R) based on the individual reliability values of components necessary for the system’s function. If we know the reliability values for several crucial system components, and we also know those reliability values are based on independent (unrelated) failure modes, the overall system reliability will be the product (Boolean AND) of those component reliabilities. This mathematical expression is known as Lusser’s product law of reliabilities:
Rsystem = R1 × R2 × R3 × ... × Rn
As simple as this law is, it is surprisingly unintuitive. Lusser’s Law tells us that any system depending on the performance of several crucial components will be less reliable than the least-reliable crucial component. This is akin to saying that a chain will be weaker than its weakest link!
To give an illustrative example, suppose a complex system depended on the reliable operation of six key components in order to function, with the individual reliabilities of those six components being 91%, 92%, 96%, 95%, 93%, and 92%, respectively. Given individual component reliabilities all greater than 90%, one might be inclined to think the overall reliability would be quite good. However, following Lusser’s Law we find the reliability of this system (as a whole) is only 65.3%. In his excellent text Reliability Theory and Practice, author Igor Bazovsky recounts the German V1 missile project during World War Two, and how early assumptions of system reliability were grossly inaccurate5. Once these faulty assumptions of reliability were corrected, development of the V1 missile resulted in greatly increased reliability until a system reliability of 75% (three out of four) was achieved.
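Lusser’s Law is easily computed as a running product. A minimal sketch using the six example reliabilities (the function name is mine, and independent failure modes are assumed per the law’s requirement):

```python
from functools import reduce

def system_reliability(component_reliabilities):
    """Lusser's product law: the reliability of a system of crucial,
    independently-failing components is the product of their reliabilities."""
    return reduce(lambda acc, r: acc * r, component_reliabilities, 1.0)

components = [0.91, 0.92, 0.96, 0.95, 0.93, 0.92]
r_system = system_reliability(components)  # roughly 0.653, i.e. 65.3%
```

Note that the result (about 65.3%) is well below the least reliable component (91%), illustrating how the system is weaker than its weakest link.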
The logical “OR” function
The OR function regards probabilities of two or more redundant events (i.e. where the outcome of interest happens if any one of the events happen). Another example using a die is the probability of rolling a “four” on either the first toss or on the second toss. It should be intuitively obvious that the probability of rolling a “four” on either toss will be more (i.e. more likely) than rolling a “four” on a single toss, since two tosses give us twice as many opportunities to roll the desired number. The shaded field of possibilities (36 in all) demonstrates the likelihood of this either/or result compared to the likelihood of either value on either toss:
As you can see, there are eleven outcomes matching the specific criteria out of 36 total possible outcomes (the outcome with two “four” rolls counts as a single trial matching the stated criteria, just as all the other trials containing only one “four” roll count as single trials). This yields a probability value of eleven-in-thirty-six (11/36) for the specified combination. This result may defy your intuition, if you assumed the OR function would be the simple sum of individual probabilities (1/6 + 1/6 = 2/6, or 1/3), as opposed to the AND function’s product of probabilities (1/6 × 1/6 = 1/36). In truth, there is an application of the OR function where the probability is the simple sum, but that will come later in this presentation.
For now, a way to understand why we get a probability value of 11/36 for our OR function with two input probabilities is to derive the OR function from other functions whose probability laws we already know with certainty. From Boolean algebra, DeMorgan’s Theorem tells us an OR function is equivalent to an AND function with all inputs and outputs inverted:
We already know the complement (inversion) of a probability is the value of that probability subtracted from one (P(not A) = 1 − P(A)). This gives us a way to symbolically express the DeMorgan’s Theorem definition of an OR function in terms of an AND function with three inversions:

P(A or B) = 1 − [P(not A) × P(not B)]
Knowing that P(not A) = 1 − P(A) and P(not B) = 1 − P(B), we may substitute these inversions into the triple-inverted AND function to arrive at an expression for the OR function in simple terms of P(A) and P(B):

P(A or B) = 1 − [1 − P(A)] × [1 − P(B)]
Distributing terms on the right side of the equation:

P(A or B) = 1 − [1 − P(B) − P(A) + P(A) × P(B)]

P(A or B) = P(A) + P(B) − P(A) × P(B)
This, then, is our third law of probability:

P(A or B) = P(A) + P(B) − P(A) × P(B)
Inserting our example probabilities of 1/6 for both P(A) and P(B), we obtain the following probability for the OR function:

P(A or B) = 1/6 + 1/6 − (1/6 × 1/6)

P(A or B) = 12/36 − 1/36 = 11/36
This confirms our previous conclusion of there being an 11/36 probability of rolling a “four” on the first or second rolls of a die.
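The 11/36 result may also be verified by brute force, enumerating all 36 equally likely outcomes of two tosses and comparing the count against the third probability law:

```python
from itertools import product

# All 36 equally likely outcomes of two six-sided die tosses:
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes containing at least one "four" (inclusive OR of the two tosses):
matches = [pair for pair in outcomes if 4 in pair]

p_enumerated = len(matches) / len(outcomes)   # 11/36 by direct count
p_formula = 1/6 + 1/6 - (1/6) * (1/6)         # P(A) + P(B) - P(A)P(B)
```

Both approaches agree, which is a useful sanity check on the derived OR law.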
A similar application of the OR function is seen when we are dealing with exclusive events. For instance, we could calculate the probability of rolling either a “three” or a “four” in a single toss of a die. Unlike the previous example where we had two opportunities to roll a “four,” and two sequential rolls of “four” counted as a single successful trial, here we know with certainty that the die cannot land on “three” and “four” in the same roll. Therefore, the exclusive OR probability (XOR) is much simpler to determine than a regular OR function:

P(“three” or “four”) = 1/6 + 1/6 = 2/6 = 1/3
This is the only type of scenario where the function probability is the simple sum of the input probabilities. In cases where the input probabilities are mutually exclusive (i.e. they cannot occur simultaneously or in a specific sequence), the probability of one or the other happening is the sum of the individual probabilities. This leads us to our fourth probability law:

P(A or B) = P(A) + P(B)     (if A and B are mutually exclusive)
We may return to our example of a double-block valve assembly for a practical application of OR probability. When illustrating the AND probability function, we focused on the probability of both block valves failing to shut off when needed, since both valve 1 and valve 2 would have to fail open in order for the double-block assembly to fail in shutting off flow. Now, we will focus on the probability of either block valve failing to open when needed. While the AND scenario was an exploration of the system’s unreliability (i.e. the probability it might fail to stop a dangerous condition), this scenario is an exploration of the system’s unavailability (i.e. the probability it might fail to resume normal operation).
Each block valve is designed to be able to shut off flow independently, so that the flow of (potentially) dangerous process fluid will be halted if either or both valves shut off. The probability that process fluid flow may be impeded by the failure of either valve to open is thus a simple (non-exclusive) OR function:

P(flow impeded) = P(valve 1 fail shut) + P(valve 2 fail shut) − P(valve 1 fail shut) × P(valve 2 fail shut)
A practical example of the exclusive-or (XOR) probability function may be found in the failure analysis of a single block valve. If we consider the probability this valve may fail in either condition (stuck open or stuck shut), and we have data on the probabilities of the valve failing open and failing shut, we may use the XOR function to model the system’s general unreliability. We know that the exclusive-or function is the appropriate one to use here because the two “input” scenarios (failing open versus failing shut) absolutely cannot occur at the same time:

P(valve fail) = P(fail open) + P(fail shut)
Summary of probability laws
The complement (inversion) of a probability:

P(not A) = 1 − P(A)

The probability of intersecting events (where both must happen either simultaneously or in specific sequence) for the result of interest to occur:

P(A and B) = P(A) × P(B)

The probability of redundant events (where either or both may happen) for the result of interest to occur:

P(A or B) = P(A) + P(B) − P(A) × P(B)

The probability of exclusively redundant events (where either may happen, but not simultaneously or in specific sequence) for the result of interest to occur:

P(A or B, exclusive) = P(A) + P(B)
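The four laws summarized above may be expressed as small functions for quick calculation. A minimal sketch (the function names are illustrative, not from any standard library):

```python
def p_not(p_a):
    """Complement: probability of an event NOT occurring."""
    return 1.0 - p_a

def p_and(p_a, p_b):
    """Intersecting (independent) events: both must occur."""
    return p_a * p_b

def p_or(p_a, p_b):
    """Redundant (non-exclusive) events: either or both may occur."""
    return p_a + p_b - p_a * p_b

def p_xor(p_a, p_b):
    """Mutually exclusive events: either may occur, but never both."""
    return p_a + p_b
```

For example, `p_or(1/6, 1/6)` reproduces the 11/36 result from the two-toss die example, while `p_xor(1/6, 1/6)` reproduces the 1/3 result for rolling a “three” or a “four” in a single toss.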
In reliability engineering, it is important to be able to quantify the reliability (or conversely, the probability of failure) for common components, and for systems composed of those components. As such, special terms and mathematical models have been developed to describe probability as it applies to component and system reliability.
Perhaps the first and most fundamental measure of (un)reliability is the failure rate of a component or system of components, symbolized by the Greek letter lambda (λ). The definition of “failure rate” for a group of components undergoing reliability tests is the instantaneous rate of failures per number of surviving components:

λ = (dNf / dt) / Ns

Where,
λ = Failure rate
Nf = Number of components failed during testing period
Ns = Number of components surviving during testing period
t = Time
The unit of measurement for failure rate (λ) is inverted time units (e.g. “per hour” or “per year”). An alternative expression for failure rate sometimes seen in reliability literature is the acronym FIT (“Failures In Time”), in units of 10−9 failures per hour. Using a unit with a built-in multiplier such as 10−9 makes it easier for human beings to manage the very small λ values normally associated with high-reliability industrial components and systems.
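The FIT conversion is a simple scaling by 10^-9. A sketch with hypothetical helper names:

```python
def lambda_to_fit(lam_per_hour):
    """Convert a failure rate in failures-per-hour to FIT
    (1 FIT = 1e-9 failures per hour)."""
    return lam_per_hour / 1e-9

def fit_to_lambda(fit):
    """Convert a FIT value back to failures-per-hour."""
    return fit * 1e-9
```

For example, a component with λ = 5 × 10^-7 failures per hour rates 500 FIT, a far more convenient number to read and compare.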
Failure rate may also be applied to discrete-switching (on/off) components and systems of discrete-switching components on the basis of the number of on/off cycles rather than clock time. In such cases, we define failure rate in terms of cycles (c) instead of in terms of minutes, hours, or any other measure of time (t):

λ = (dNf / dc) / Ns
Failure rate may be constant, or it may be subject to change over time, depending on the type and age of a component (or system of components). A common graphical expression of failure rate is the so-called bathtub curve showing the typical failure rate profile over time from initial manufacture (brand-new) to wear-out:
This curve profiles the failure rate of a large sample of components (or a large sample of systems) as they age. Failure rate begins at a relatively high value starting at time zero due to defects in manufacture. Failure rate drops off rapidly during a period of time called the burn-in period where defective components experience an early death. After the burn-in period, failure rate remains relatively constant over the useful life of the components. Any failures occurring during the “useful life” period are due to random mishaps. Toward the end of the components’ working lives when the components enter the wear-out period, failure rate begins to rise until all components eventually fail. The mean (average) life of a component (tm) is the time required for one-half of the components surviving up until the wear-out time (tw) to fail, the other half failing after the mean life time.
Several important features are evident in this “bathtub” curve. First, component reliability is greatest between the times of burn-in and wear-out. For this reason, many manufacturers of high-reliability components and systems perform their own burn-in testing prior to sale, so that the customers are purchasing products that have already passed the burn-in phase of their lives. An important measure of reliability is MTBF, or Mean Time Between Failures. If the component or system in question is not repairable, the expression Mean Time To Failure (MTTF) is often used instead6. As shown on the bathtub curve, MTBF is the reciprocal of failure rate during the useful life period. This is the period of time where failure rate is at a constant, low value, thus making MTBF a rather large value. Whereas failure rate (λ) is measured in reciprocal units of time (e.g. “per hour” or “per year”), MTBF is simply expressed in units of time (e.g. “hours” or “years”).
Reliability during the useful life period, where failure rate is constant, may be modeled as an exponential decay function of time:

R = e^(−λt)

Where,
R = Reliability as a function of time (sometimes shown as R(t))
e = Euler’s constant (≈ 2.71828)
λ = Failure rate (assumed to be a constant during the useful life period)
t = Time
Knowing that failure rate is the mathematical reciprocal of mean time between failures (MTBF), we may re-write this equation in terms of MTBF as a “time constant” (τ) for random failures during the useful life period:

R = e^(−t/τ)     (where τ = MTBF)
Thus, reliability exhibits the same asymptotic approach to zero over time that we would expect from a first-order decay process such as a cooling object (approaching ambient temperature) or a capacitor discharging to zero volts. A practical example of this equation in use would be the reliability calculation for a Rosemount model 1151 analog differential pressure transmitter (with a demonstrated MTBF value of 226 years as published by Rosemount) over a service life of 5 years following burn-in:

R = e^(−5/226) ≈ 0.978

In other words, the transmitter has approximately a 97.8% probability of operating without failure over those 5 years.
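The exponential reliability model is straightforward to evaluate numerically. A minimal sketch (function name is mine; the figures are those of the Rosemount 1151 example):

```python
import math

def reliability(t, mtbf):
    """R(t) = e^(-t/MTBF), assuming a constant failure rate (useful-life
    period). Both arguments must use the same time units."""
    return math.exp(-t / mtbf)

# Rosemount 1151 example: MTBF of 226 years, 5-year service period
r_5yr = reliability(5, 226)  # roughly 0.978, i.e. about 97.8%
```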
Reliability, as previously defined, is the probability that a component or system will perform as designed when needed. Like all probability values, reliability is expressed as a number ranging between 0 and 1, inclusive. A reliability value of zero (0) means the component or system is totally unreliable (i.e. it is guaranteed to fail). Conversely, a reliability value of one (1) means the component or system is completely reliable (i.e. guaranteed to properly perform when needed). The mathematical complement of reliability is referred to as PFD, an acronym standing for Probability of Failure on Demand. Like reliability, this is also a probability value ranging from 0 to 1, inclusive. A PFD value of zero (0) means there is no probability of failure (i.e. it is guaranteed to properly perform when needed), while a PFD value of one (1) means it is completely unreliable (i.e. guaranteed to fail).
Obviously, a system designed for high reliability should exhibit a large R value (very nearly 1) and a small PFD value (very nearly 0). Just how large R needs to be (how small PFD needs to be) is a function of how critical the component or system is to the fulfillment of our human needs. The degree to which a system must be reliable in order to fulfill our modern expectations is often surprisingly high. Suppose someone were to tell you the reliability of electric power service to a neighborhood in which you were considering purchasing a home was 99 percent (0.99). This sounds rather good, doesn’t it? However, when you actually calculate how many hours of “blackout” you would experience in a typical year given this degree of reliability, the results are seen to be rather poor (at least to modern American standards of expectation). If the reliability value for electric power in this neighborhood is 0.99, then the unreliability is 0.01:

Downtime = 0.01 × 8760 hours/year = 87.6 hours of “blackout” per year, on average
99% doesn’t look so good now, does it? Let’s suppose an industrial manufacturing facility requires steady electric power service all day and every day for its continuous operation. This facility has back-up diesel generators to supply power during utility outages, but they are budgeted only for 5 hours of back-up generator operation per year. How reliable would the power service need to be in order to fulfill this facility’s operational requirements? The answer may be calculated simply by determining the unreliability (PFD) of power based on 5 hours of “blackout” per year’s time:

PFD = 5 hours / 8760 hours = 0.00057

R = 1 − PFD = 1 − 0.00057 = 0.99943 = 99.943%
Thus, the utility electric power service to this manufacturing facility must be 99.943% reliable in order to fulfill the expectations of no more than 5 hours (average) back-up generator usage per year.
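The two worked examples above reduce to a pair of simple conversions between reliability and expected yearly downtime. A sketch with illustrative names:

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours

def downtime_hours_per_year(reliability):
    """Expected hours of outage per year for a given reliability value."""
    return (1.0 - reliability) * HOURS_PER_YEAR

def required_reliability(max_downtime_hours_per_year):
    """Reliability needed to keep expected outage within a yearly budget."""
    return 1.0 - max_downtime_hours_per_year / HOURS_PER_YEAR
```

Evaluating `downtime_hours_per_year(0.99)` gives the 87.6 hours of yearly blackout from the neighborhood example, while `required_reliability(5)` gives the 99.943% figure from the manufacturing facility example.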
A common order-of-magnitude expression of desired reliability is the number of “9” digits in the reliability value. A reliability value of 99.9% would be expressed as “three nine’s” and a reliability value of 99.99% as “four nine’s.”
1To be honest, the coin could also land on its edge, which is a third possibility. However, that third possibility is so remote as to be negligible in the presence of the other two.
2In his excellent book, Reliability Theory and Practice, Igor Bazovsky describes the relationship between true probability (P) calculated from ideal values and estimated probability (P̂) calculated from experimental trials as a limit function: P = lim N→∞ P̂, where N is the number of trials.
3Most adults can recall instances where a weather forecast proved to be completely false: a prediction for rainfall resulting in a completely dry day, or vice versa. In such cases, one is tempted to blame the weather service for poor forecasting, but in reality it has more to do with the nature of probability, specifically the meaninglessness of probability calculations in predicting singular events.
4As an example of this shift from basic loop control to enterprise optimization, consider the case of a highly automated lumber mill where logs are cut into lumber not only according to minimum waste, but also according to the real-time market value of different board types and stored inventory. Talking with an engineer about this system, we joked that the control system would purposely slice every log into toothpicks in an effort to maximize profit if the market value of toothpicks suddenly spiked!
5According to Bazovsky (pp. 275-276), the first reliability principle adopted by the design team was that the system could be no more reliable than its least-reliable (weakest) component. While this is technically true, the mistake was to assume that the system would be as reliable as its weakest component (i.e. the “chain” would be as strong as its weakest link). This proved to be too optimistic, as the system would still fail due to the failure of “stronger” components even when the “weaker” components happened to survive. After noting the influence of “stronger” components’ unreliabilities on overall system reliability, engineers somehow reached the bizarre conclusion that system reliability was equal to the mathematical average of the components’ reliabilities. Not surprisingly, this proved even less accurate than the “weakest link” principle. Finally, the designers were assisted by the mathematician Erich Pieruschka, who helped formulate Lusser’s Law.
6Since most high-quality industrial devices and systems are repairable for most faults, MTBF and MTTF are interchangeable terms.
7One could even imagine some theoretical component immune to wear-out, but still having finite values for failure rate and MTBF. Remember, λuseful and MTBF refer to chance failures, not the normal failures associated with age and extended use.
8For example, the Rosemount model 3051C differential pressure transmitter has a suggested useful lifetime of 50 years (based on the expected service life of tantalum electrolytic capacitors used in its circuitry), while its demonstrated MTBF is 136 years.