«Minimally Informative Prior Distributions for PSA PSAM-10 Dana L. Kelly Robert W. Youngblood Kurt G. Vedros June 2010 This is a preprint of a paper ...»
This is a preprint of a paper intended for publication in a journal or proceedings. Since changes may be made before publication, this preprint should not be cited or reproduced without permission of the author.

This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, or any of their employees, makes any warranty, expressed or implied, or assumes any legal liability or responsibility for any third party’s use, or the results of such use, of any information, apparatus, product or process disclosed in this report, or represents that its use by such third party would not infringe privately owned rights. The views expressed in this paper are not necessarily those of the United States Government or the sponsoring agency.
Minimally Informative Prior Distributions for PSA

Dana L. Kelly^{a,1}, Robert W. Youngblood^{a}, and Kurt G. Vedros^{a}

^{a} Idaho National Laboratory, Idaho Falls, ID USA

Abstract: A salient feature of Bayesian inference is its ability to incorporate information from a variety of sources into the inference model, via the prior distribution (hereafter simply “the prior”).
However, over-reliance on old information can lead to priors that dominate new data. Some analysts seek to avoid this by trying to work with a minimally informative prior distribution. Another reason for choosing a minimally informative prior is to avoid the often-voiced criticism of subjectivity in the choice of prior. Minimally informative priors fall into two broad classes: 1) so-called noninformative priors, which attempt to be completely objective, in that the posterior distribution is determined as completely as possible by the observed data, the best-known example in this class being the Jeffreys prior, and 2) priors that are diffuse over the region where the likelihood function is non-negligible, but that incorporate some information about the parameters being estimated, such as a mean value. In this paper, we compare four approaches in the second class, with respect to their practical implications for Bayesian inference in Probabilistic Safety Assessment (PSA). The most commonly used such prior, the so-called constrained noninformative prior, is a special case of the maximum entropy prior. This is formulated as a conjugate distribution for the most commonly encountered aleatory models in PSA, and is correspondingly mathematically convenient; however, it has a relatively light tail, and this can cause the posterior mean to be overly influenced by the prior in updates with sparse data. A more informative prior that is capable, in principle, of dealing more effectively with sparse data is a mixture of conjugate priors. A particular diffuse nonconjugate prior, the logistic-normal, is shown to behave similarly for some purposes. Finally, we review the so-called robust prior. Rather than relying on the mathematical abstraction of entropy, as does the constrained noninformative prior, the robust prior places a heavy-tailed Cauchy prior on the canonical parameter of the aleatory model.
Keywords: PRA, Bayesian inference, prior distribution.
1. INTRODUCTION
A salient feature of Bayesian inference is its ability to incorporate information from a variety of sources into the inference model, via the prior distribution (hereafter simply “the prior”). Done properly, Bayesian inference integrates old information and new information into an evidence-based state-of-knowledge distribution. However, if the situation being evaluated is changing with time, then over-reliance on old information in formulating the prior can lead to priors that excessively dominate new data.
Some analysts seek to avoid this by trying to work with a minimally informative (less direct but synonymous terms are diffuse, weak, and vague) prior distribution. Another reason for choosing a minimally informative prior is to avoid the often-voiced criticism of subjectivity in the choice of prior.
Minimally informative priors fall into two broad classes: 1) so-called noninformative priors, which attempt to be completely objective, in that the posterior distribution is determined as completely as possible by the observed data (the best-known example in this class is the Jeffreys prior); and 2) priors that are diffuse over the region where the likelihood function is non-negligible, but that incorporate some information about the parameters being estimated, such as a mean value. The reader is referred to (1) for a thorough review of prior distributions in the first class. In this paper, we compare four approaches in the second class, with respect to their practical implications for Bayesian inference in PSA. The most commonly used such prior, the so-called constrained noninformative prior (CNIP) (2), is a special case of the maximum entropy prior, which is discussed by (3) and others.
1 Dana.Kelly@inl.gov
The CNIP is formulated as a conjugate distribution for the most commonly encountered aleatory models in PSA, and is correspondingly mathematically convenient; but it has a relatively light tail, and is correspondingly somewhat unresponsive to updates with sparse data, an issue discussed in (4) in the context of the Mitigating System Performance Index. Other issues with maximum entropy priors are discussed by (5) and (6). A more informative prior that is capable, in principle, of dealing more effectively with sparse data is a mixture of conjugate priors, as discussed by (7) and (8). A particular diffuse nonconjugate prior, the logistic-normal, is shown to behave similarly for some purposes.
Finally, we review the so-called robust prior, first described by (5). Rather than relying on the mathematical abstraction of entropy, as does the constrained noninformative prior, the robust prior places a heavy-tailed Cauchy prior on the canonical parameter of the aleatory model.
2. CONSTRAINED NONINFORMATIVE PRIOR
The constrained noninformative prior (CNIP) is, as pointed out by (6), a type of maximum entropy prior distribution. Prior to the advent of the CNIP in (2), the most prevalent definition of entropy in PSA was the straightforward extension of the Shannon entropy to the case of a continuous variable:

$$H[\pi] = -\int \pi(\theta) \ln \pi(\theta) \, d\theta$$

The CNI prior uses a definition of entropy due to Jaynes (3), which defines entropy as the negative of the Kullback-Leibler distance between $\pi(\theta)$ and the “natural” noninformative prior, $\pi_{NI}(\theta)$:²

$$H[\pi] = -\int \pi(\theta) \ln \frac{\pi(\theta)}{\pi_{NI}(\theta)} \, d\theta$$

There is ambiguity as to what the “natural” noninformative prior should be, and (2) adopted the Jeffreys prior for $\pi_{NI}(\theta)$. The attraction of the CNI prior is that, like the Jeffreys prior, it is invariant to reparameterization. However, like the maximum entropy prior under the extended Shannon definition, the CNI prior can fail to exist, even in simple models such as exponential time to failure. Also note that in the case of continuous distributions, entropy under either definition is often negative.
In the setting of a binomial aleatory model, the unknown parameter in the above equations, $\theta$, is equal to $p$, the probability of failure on demand in each Bernoulli trial. In this case, the CNI prior cannot be written down in closed form, but can be approximated well by a beta distribution with first parameter approximately equal to 0.5, and second parameter determined from the specified mean constraint. The maximum entropy prior under the extended Shannon definition can be written in closed form as a truncated exponential distribution, as given in (6), and for small values of $p$, it is approximately an exponential distribution with rate equal to the reciprocal of the specified mean constraint.³ The figure below shows these two maximum entropy prior densities for a mean of 0.001. Note the vertical asymptote at zero, which is characteristic of the CNI prior. The asymptote is inherited from the Jeffreys prior, which is a beta(0.5, 0.5) distribution.
2 Under this definition, the Shannon entropy is the negative of the Kullback-Leibler distance from a uniform distribution. Thus, the maximum entropy prior under the extended Shannon definition is as close as possible (in terms of K-L distance) to a uniform distribution.
3 Because of space constraints, we treat only the binomial model explicitly. Note that the CNI prior for the Poisson($\lambda t$) model is a gamma distribution with shape parameter = 0.5 and rate parameter = 1/(2 × mean). Under the extended Shannon definition, the maximum entropy prior is exponential with rate = 1/mean.
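The contrast between the two maximum entropy densities can be sketched numerically. The following is a minimal illustration, not the authors' code. It assumes the beta approximation to the CNI prior with the parameter values (0.498, 498) quoted in Section 2.1, and it approximates the rate of the truncated exponential by the reciprocal of the mean (1000), as the text suggests for small p. The function names are hypothetical.

```python
from math import exp, lgamma, log

def beta_pdf(p, a, b):
    """Density of a beta(a, b) distribution at 0 < p < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1.0) * log(p) + (b - 1.0) * log(1.0 - p))

def truncated_exp_pdf(p, rate):
    """Density of an exponential(rate) distribution truncated to [0, 1]."""
    return rate * exp(-rate * p) / (1.0 - exp(-rate))

# Assumed values: beta approximation to the CNI prior (Section 2.1), and
# rate ~ 1/mean for the extended-Shannon maximum entropy prior.
CNI_ALPHA, CNI_BETA = 0.498, 498.0
SHANNON_RATE = 1.0 / 0.001

for p in (1e-6, 1e-4, 1e-3, 1e-2):
    cni = beta_pdf(p, CNI_ALPHA, CNI_BETA)
    shannon = truncated_exp_pdf(p, SHANNON_RATE)
    print(f"p = {p:.0e}: CNI density = {cni:.4g}, Shannon max-entropy density = {shannon:.4g}")
```

Near p = 0 the beta density grows without bound while the truncated exponential stays close to its rate, which is the asymptote the text describes.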
[Figure 1: Two maximum entropy densities for p, both with mean = 0.001 (vertical axis: density). The CNI prior displays a vertical asymptote at p = 0.]

To specify the maximum entropy prior, one must first specify the definition of entropy.
So, as pointed out by (6), there is an unavoidable arbitrariness in “maximum entropy priors,” because there is no clear definition of entropy for a continuous random variable. However, entropy is a measure of uncertainty, and so one would expect a maximum entropy prior, which in a specialized mathematical sense is maximally uncertain, to be minimally “stiff” (maximally responsive) in terms of how it responds to data in Bayesian updating. But does this turn out to be the case?
2.1 Bayesian Updating of CNI Prior
Throughout this paper, we will take as a running example the failure to start of a motor-driven pump, assumed to have a mean failure probability of 0.001. We will assume that there are 50 demands on this pump, and that failures to start are described by a binomial distribution with parameters p and 50.
From (2), the beta distribution that approximates the CNI prior has parameters $\alpha = 0.498$ and $\beta = 498$.
The posterior distribution is beta($\alpha + x$, $\beta + 50 - x$), where $x$ is the number of failures observed in 50 demands.
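The conjugate update above amounts to simple arithmetic on the beta parameters. The following minimal sketch applies it to the running example for a few hypothetical failure counts. The prior parameters (0.498, 498) and the 50 demands come from the text, and the helper names are illustrative assumptions.

```python
def beta_posterior(alpha, beta, failures, demands):
    """Conjugate update: beta prior plus binomial data gives a beta posterior."""
    return alpha + failures, beta + demands - failures

def beta_mean(alpha, beta):
    """Mean of a beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Beta approximation to the CNI prior with mean ~0.001 (values from the text).
ALPHA, BETA, DEMANDS = 0.498, 498.0, 50

print(f"prior mean = {beta_mean(ALPHA, BETA):.3e}")
for x in range(4):  # hypothetical numbers of observed failures
    a_post, b_post = beta_posterior(ALPHA, BETA, x, DEMANDS)
    print(f"x = {x}: posterior beta({a_post:.3f}, {b_post:.3f}), "
          f"mean = {beta_mean(a_post, b_post):.3e}")
```

The posterior mean is (0.498 + x)/548.498, so it is linear in x. The prior contributes the equivalent of about 498.5 demands, compared with the 50 actually observed.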
3. MIXTURE PRIOR
Refs. (7) and (8) proposed the use of a mixture of two conjugate prior distributions, one representing performance of a degraded component, the other representing performance of a component in its normal state. Of these two mixture prior models, the one most applicable to the present situation is the “variable-constituent” prior, described in (8). This prior was originally formulated in the context of performance assessment; the presumption is that performance (e.g., fail-to-start probability) can vary with time, and the application of the prior is to assess current performance based on current data. As mentioned in the Introduction, in an application such as this, it is clearly inappropriate to bias the prior towards long-term average performance; such a bias would be a case of old information dominating the new.
For simplicity, the treatment will consider only two performance states: “good” and “degraded.” The implementation is straightforward even if more performance states are introduced (e.g., slightly degraded, average, …), but this is not warranted for purposes of illustration. The mixture prior is then formulated in terms of a probability distribution conditional on being in the good state, a probability distribution conditional on being in the degraded state, and a mixture parameter representing the
probability of being in the degraded state:
$$g_{\text{mix}}(p) = (1 - \pi)\, g_{\text{conj}}(p; \alpha_0, \beta_0) + \pi\, g_{\text{conj}}(p; \alpha_1, \beta_1) \qquad (1)$$

where $\pi$ is the probability of being in the degraded state, $g_{\text{conj}}$ is the natural conjugate distribution (beta in this case, which is conjugate to the binomial distribution), and $\alpha$ and $\beta$ are the parameters of the distribution, the subscript “0” denoting the parameters of the distribution conditional on being in the “good” state and the subscript “1” denoting the parameters of the distribution conditional on being in the degraded state. Thus, the mixture prior contains five parameters: $\pi$, $\alpha_0$, $\beta_0$, $\alpha_1$, and $\beta_1$. Data are observed, and then the five parameters are updated by Bayes’ theorem to give their posterior values.
The mixture prior is conjugate, in the sense that the posterior distribution has the same form as the prior distribution, differing only in the parameter values. Because the constituent distributions $g_i$ are updated, this model was called the variable-constituent model by (8). Ref. (7) describes a variant, the “fixed-constituent model,” in which the prior has the same form, but only the mixture parameter is updated. We will not discuss this variant further here.
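One way to carry out the variable-constituent update is sketched below, using the standard result for mixtures of conjugate priors. Each beta constituent is updated as usual, and the mixing weight is reweighted by each constituent's beta-binomial marginal likelihood. This is an illustrative sketch rather than the implementation used in (7) or (8). The function names are hypothetical, and the numerical values are those the paper gives for its example problem later in this section.

```python
from math import exp, lgamma

def log_beta_fn(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(alpha, beta, x, n):
    """Log beta-binomial marginal likelihood of x failures in n demands.
    The binomial coefficient is omitted because it is common to both
    constituents and cancels in the weight update."""
    return log_beta_fn(alpha + x, beta + n - x) - log_beta_fn(alpha, beta)

def update_mixture(pi, a0, b0, a1, b1, x, n):
    """Return the posterior values of the five mixture parameters."""
    w_good = (1.0 - pi) * exp(log_marginal(a0, b0, x, n))
    w_degraded = pi * exp(log_marginal(a1, b1, x, n))
    pi_post = w_degraded / (w_good + w_degraded)
    return pi_post, a0 + x, b0 + n - x, a1 + x, b1 + n - x

def mixture_mean(pi, a0, b0, a1, b1):
    """Mean of the two-component beta mixture."""
    return (1.0 - pi) * a0 / (a0 + b0) + pi * a1 / (a1 + b1)

# Example-problem parameter values quoted in the paper.
PRIOR = (0.01, 0.498, 547.3, 1.5, 148.5)

for x in range(4):  # hypothetical numbers of failures in 50 demands
    post = update_mixture(*PRIOR, x, 50)
    print(f"x = {x}: P(degraded | data) = {post[0]:.4f}, "
          f"posterior mean = {mixture_mean(*post):.3e}")
```

Working with log marginal likelihoods avoids overflow in the gamma functions. The weights are exponentiated only at the end, which is adequate for the small counts considered here.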
The mixture prior was shown by (8) to behave differently from the CNI prior in examples of practical interest. For the regions of parameter space explored in (8), when the observed number of failures was small, the posterior mean increased more slowly with the number of failures than the posterior mean from updating the CNI prior; as the number of observed failures grew, it increased more rapidly than the CNI update. This behavior, among other features, makes the mixture prior an interesting candidate for performance assessment applications, but the need to quantify its five parameters is seen as a significant practical disadvantage.
The parameters of the mixture prior were chosen using the approach described in (7). In particular, the overall mean was taken to be 0.001, and the mean of the degraded state was taken to be a factor of ten worse than the overall mean. The value of the mixing parameter, which is the prior probability of being in the degraded state, was taken to be 0.01. This allows us to calculate the mean of the good state, which is $9.1 \times 10^{-4}$. To determine the shape of the beta distribution describing the uncertainty in $p$ for each state, (7) used a CNI prior for the good state, and chose a beta distribution with first parameter equal to 1.5 for the degraded state, allowing the second parameter to be calculated from the known mean of 0.01. The value of 1.5 ensures that the mode of the degraded state will be much greater than the mean of the good state. The parameter values of the mixture prior for the example problem are $\pi = 0.01$, $\alpha_0 = 0.498$, $\beta_0 = 547.3$, $\alpha_1 = 1.5$, $\beta_1 = 148.5$.
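The arithmetic behind these parameter values can be checked with a short sketch. It assumes the good-state first parameter is the value 0.498 quoted in the text, and it solves the beta mean relation for the second parameter in each state. The variable and function names are illustrative.

```python
def beta_second_param(alpha, mean):
    """Solve mean = alpha / (alpha + beta) for beta."""
    return alpha * (1.0 - mean) / mean

overall_mean = 0.001
pi = 0.01                           # prior probability of the degraded state
degraded_mean = 10 * overall_mean   # a factor of ten worse than the overall mean

# The overall mean is the weighted average of the two state means.
good_mean = (overall_mean - pi * degraded_mean) / (1.0 - pi)

alpha0 = 0.498                      # first parameter for the good state (from the text)
beta0 = beta_second_param(alpha0, good_mean)
alpha1 = 1.5                        # first parameter chosen for the degraded state
beta1 = beta_second_param(alpha1, degraded_mean)

# Mode of a beta(a, b) distribution with a > 1 and b > 1.
degraded_mode = (alpha1 - 1.0) / (alpha1 + beta1 - 2.0)

print(f"good-state mean = {good_mean:.3e}")
print(f"beta0 = {beta0:.1f}, beta1 = {beta1:.1f}")
print(f"degraded-state mode = {degraded_mode:.3e}")
```

The computed second parameters agree with the quoted values of 547.3 and 148.5, and the degraded-state mode exceeds the good-state mean, as the text requires.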
4. NONCONJUGATE DIFFUSE PRIORS