Mutation Research, 272 (1992) 133-137 © 1992 Elsevier Science Publishers B.V. All rights reserved 0165-1161/92/$05.00
MUTENV 08839
Overdispersion of aggregated genetic data

K.O. Bowman a and M.A. Kastenbaum b

a Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6367, USA and b 16933 Timberlakes Drive, Fort Myers, FL 33908, USA

(Received 13 December 1991) (Revision received 10 March 1992) (Accepted 28 April 1992)
Keywords: Aggregated data; Chromosomal aberrations; Overdispersion
This research was supported under contract ERD89828 with the Center for Indoor Air Research by Martin Marietta Energy Systems, Inc., under contract DE-AC05-84OR21400 with the U.S. Department of Energy.
Correspondence: Dr. K.O. Bowman, Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6367, USA.

Tables of sample sizes, published in this journal (Kastenbaum and Bowman [4]), have proved useful to geneticists who face the problem of testing the significance of observed increases in mutation frequencies. One assumption upon which these tables were derived is that the observed number of mutations is a Poisson variable. Yet this assumption may not be met in practice. Auerbach [1] noted that, more often than not, the reported number of mutations is, in fact, the sum of mutations observed in several replications of the same experiment. Such replication may be expected to introduce an additional source of variation that would not be accounted for if only the aggregate were to be considered in the analysis. Failure to account for this increased variation will affect statistical analyses that are employed to determine the significance levels of differences between groups or experimental treatments. In particular, deviations from the Poisson, together with the use of Poisson-based statistical tests, could lead to an excess of "false positive" results. To the radiation biologist, a "false positive" would mean that the effect of radiation may be judged more damaging than it actually is. To the environmentalist, it would mean that an apparent insult to the environment may be judged more dangerous than it actually is. Such false claims impact seriously on "exposed" populations, and they carry with them high emotional and societal costs. They should, therefore, be avoided or reduced, wherever possible.

This problem may be addressed by assuming that the number of observations (mutations) in each replication (sampled cluster) follows a Poisson (or a binomial) distribution, and that each such distribution is characterized by its own parameter $\lambda$ (or $p$), as the case may be. Mathematical expressions for the distributions of such aggregates exist, and have been shown to be adequate characterizations of the aggregated data. More specifically, Greenwood and Yule [3] showed how a modified Poisson results in a negative binomial distribution, and Skellam [7] derived a beta-binomial distribution from a modified binomial.
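The extra variation introduced by replication can be made explicit with a short worked equation. This is a sketch under the mixture assumption just described, not a derivation taken from the paper: $X$ (the count in a randomly chosen replication) and $\lambda$ (that replication's own Poisson parameter) are notation introduced here for illustration.

```latex
\begin{aligned}
E(X) &= E\bigl[E(X \mid \lambda)\bigr] = E(\lambda), \\
\operatorname{Var}(X) &= E\bigl[\operatorname{Var}(X \mid \lambda)\bigr]
                        + \operatorname{Var}\bigl[E(X \mid \lambda)\bigr]
                       = E(\lambda) + \operatorname{Var}(\lambda).
\end{aligned}
```

Whenever $\lambda$ differs between replications, $\operatorname{Var}(\lambda) > 0$, so the variance of the aggregate exceeds its mean, whereas a single Poisson variable has variance equal to its mean.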
In the following sections we describe these two distributions, and present the formulas necessary for estimating the respective sets of parameters. We show how, once the parameters are estimated, the corresponding probabilities and "expected" frequencies may be readily computed, with no greater grasp of mathematics than is required to fit data to a binomial or a Poisson distribution. These methods, applied to the data of Bender et al. [2] on spontaneous chromosomal aberrations in human blood lymphocytes, illustrate how effective the negative binomial distribution can be in characterizing overdispersed Poisson data.
The Poisson mixture

Bender et al. counted the aberrations in 200 cells from each of 429 individuals, and examined the distribution of aberrations per cell in the aggregate, as well as the distribution of cells with aberrations within the 429 "clusters" of 200 cells each. Their sampling format is such that each cluster of cells may be thought of as having been drawn from a different population, that is, from 429 different individuals. In such a situation, if the underlying distribution of aberration frequencies in each individual is Poisson, there is good reason to expect that the aggregate will be "overdispersed", that is, the variance of the aggregate will be larger than a Poisson variance. Just such a mixture of Poissons was considered by Greenwood and Yule. They supposed that each cluster in their sample was characterized by its own Poisson parameter, $\lambda$, and that the distribution of these $\lambda$'s could be characterized by a gamma or Pearson Type III distribution, with parameters $r$ and $c$. The joint distribution of the aggregate then reduces to a negative binomial distribution in which the probability of $x$ aberrations per cell may be expressed as:

$$P(x) = \frac{(x+r-1)!}{x!\,(r-1)!}\; p^{r} (1-p)^{x},$$

where

$$p = \frac{c}{c+1}.$$

It follows that

$$P(x) = \frac{(x+r-1)!}{x!\,(r-1)!} \left(\frac{c}{c+1}\right)^{r} \left(\frac{1}{c+1}\right)^{x},$$

for $x = 0, 1, 2, \ldots$; $c > 0$, $r > 0$ (Kendall and Stuart [5]). This leads to the following recursive relationships, which can be seen to facilitate desk calculations:

$$P(x=0) = P(0) = \left(\frac{c}{c+1}\right)^{r},$$

$$P(x=1) = P(1) = \frac{r}{c+1}\left(\frac{c}{c+1}\right)^{r} = \frac{r}{c+1}\,P(0),$$

$$P(x=2) = P(2) = \frac{r(r+1)}{2(c+1)^{2}}\left(\frac{c}{c+1}\right)^{r} = \frac{r+1}{2(c+1)}\,P(1),$$

and so on. In general,

$$P(x+1) = \frac{r+x}{(c+1)(x+1)}\,P(x),$$

for $x = 0, 1, 2, \ldots$. The mean of this distribution is $\mu = r/c$, and its variance is $\sigma^{2} = r(1 + 1/c)/c$. If $\mu$ is estimated by the sample mean, $\bar{x}$, and $\sigma^{2}$ by the sample variance, $s^{2}$, the two parameters, $r$ and $c$, may be estimated as follows:

$$\hat{c} = \bar{x}/(s^{2} - \bar{x}),$$

and

$$\hat{r} = \bar{x}^{2}/(s^{2} - \bar{x}).$$

Applications
Bender et al. tested the adequacy of the Poisson distribution to characterize the various classes of aberrations they observed in their series. Data in Tables 2 and 3 of their paper suggest that a Poisson fit should be rejected for the classes of
aberrations designated as chromatid deletions, chromosome deletions, and rings and dicentrics. We reanalyzed these data on the assumption that they are aggregated clusters, better characterized as mixtures of Poissons. As such they would be expected to be overdispersed, and therefore more apt to follow the negative binomial distribution suggested by Greenwood and Yule.

Example 1: Bender et al. [2], Table 3.

(a) Chromatid deletions: $\bar{x} = 1.284$, $s^{2} = 1.643$, $\hat{c} = 3.579$, and $\hat{r} = 4.597$.

x       obs. freq.   P(x)     exp. freq.   (O - E)   (O - E)^2/E
0       143          0.322    138.1        +4.9      0.174
1       132          0.323    138.6        -6.6      0.314
2        85          0.198     84.9        +0.1      0.000
3        39          0.095     40.8        -1.8      0.079
4        18          0.039     16.7        +1.3      0.101
5+       12          0.023      9.9        +2.1      0.446
Total   429          1.000    429.0         0.0

$X^{2} = 1.114$, d.f. = 3, and $0.7 < p < 0.8$.

The probabilities in this table were computed as follows:

$$\begin{aligned}
P(0) &= (3.579/4.579)^{4.597} = 0.322 \\
P(1) &= (0.322)(4.597)/4.579 = 0.323 \\
P(2) &= (0.323)(5.597)/[2(4.579)] = 0.198 \\
P(3) &= (0.198)(6.597)/[3(4.579)] = 0.095 \\
P(4) &= (0.095)(7.597)/[4(4.579)] = 0.039 \\
P(5+) &= 1 - \sum_{x=0}^{4} P(x) = 1 - 0.977 = 0.023
\end{aligned}$$

(b) Chromosome deletions: $\bar{x} = 0.562$, $s^{2} = 0.796$, $\hat{c} = 2.396$, and $\hat{r} = 1.346$.

x       obs. freq.   P(x)     exp. freq.
0       267          0.625    268.1
1       110          0.248    106.4
2        33          0.086     36.9
3        14          0.028     12.0
4         3          0.009      3.9
5+        2          0.004      1.7
Total   429          1.000    429.0

$X^{2} = 1.13$, d.f. = 3, and $0.7 < p < 0.8$.

(c) Rings and dicentrics: $\bar{x} = 0.289$, $s^{2} = 0.374$, $\hat{c} = 3.39$, and $\hat{r} = 0.980$.

x       obs. freq.   P(x)     exp. freq.
0       334          0.776    332.9
1        73          0.173     74.2
2        17          0.039     16.7
3         3          0.009      3.9
4         2          0.003      1.3
Total   429          1.000    429.0

$X^{2} = 0.61$, d.f. = 2, and $0.7 < p < 0.8$.

Example 2: Bender et al. [2], Table 2.

(a) Chromatid deletions: $\bar{x} = 0.00634$, $s^{2} = 0.00676$, $\hat{c} = 15.0952$, and $\hat{r} = 0.0957$.

x       obs. freq.   P(x)       exp. freq.
0       85274        0.99388    85274.9
1         510        0.00591      507.1
2          14        0.00020       17.2
3           2        0.00001        0.8
Total   85800        1.00000    85800.0

$X^{2} = 2.4$, d.f. = 1, and $0.7 < p < 0.8$.

(b) Chromosome deletions: $\bar{x} = 0.00279$, $s^{2} = 0.00303$, $\hat{c} = 11.2339$, and $\hat{r} = 0.0313$.

x       obs. freq.   P(x)       exp. freq.
0       85571        0.99733    85571.2
1         220        0.00255      219.0
2           8        0.00011        9.3
3           1        0.00001        0.5
Total   85800        1.00000    85800.0

$X^{2} = 0.69$, d.f. = 1, and $0.7 < p < 0.8$.

(c) Rings and dicentrics: $\bar{x} = 0.00145$, $s^{2} = 0.00156$, $\hat{c} = 12.5652$, and $\hat{r} = 0.0182$.

x       obs. freq.   P(x)       exp. freq.
0       85682        0.99861    85680.7
1         112        0.00134      115.0
2           6        0.00005        4.3
Total   85800        1.00000    85800.0
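The desk calculation of Example 1(a) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' own procedure: the function names `fit_negative_binomial`, `negative_binomial_probs` and `pearson_chi_square` are hypothetical, and the inputs are the rounded summary values quoted in the example, so the moment estimates agree with the printed $\hat{c}$ and $\hat{r}$ only to about two decimals.

```python
def fit_negative_binomial(mean, variance):
    """Moment estimates (c, r) from the sample mean and sample variance."""
    if variance <= mean:
        raise ValueError("no overdispersion: sample variance must exceed the mean")
    c = mean / (variance - mean)          # c-hat = xbar / (s^2 - xbar)
    r = mean * mean / (variance - mean)   # r-hat = xbar^2 / (s^2 - xbar)
    return c, r


def negative_binomial_probs(c, r, x_max):
    """Return [P(0), ..., P(x_max)] from the recursion, and the tail P(x > x_max)."""
    probs = [(c / (c + 1.0)) ** r]        # P(0) = (c / (c + 1))^r
    for x in range(x_max):
        # P(x + 1) = (r + x) / ((c + 1)(x + 1)) * P(x)
        probs.append(probs[-1] * (r + x) / ((c + 1.0) * (x + 1)))
    return probs, 1.0 - sum(probs)


def pearson_chi_square(observed, expected):
    """Pearson X^2 = sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Example 1(a): chromatid deletions in 429 individuals (Bender et al., Table 3).
labels = ["0", "1", "2", "3", "4", "5+"]
observed = [143, 132, 85, 39, 18, 12]

# Moment estimates from the rounded mean and variance quoted in the text.
c_hat, r_hat = fit_negative_binomial(1.284, 1.643)

# Probabilities from the estimates as printed in the example (assumption:
# using the printed values keeps the output close to the published table).
probs, tail = negative_binomial_probs(3.579, 4.597, x_max=4)
cell_probs = probs + [tail]               # the last cell pools x >= 5
expected = [sum(observed) * p for p in cell_probs]
chi_square = pearson_chi_square(observed, expected)

for label, o, p, e in zip(labels, observed, cell_probs, expected):
    print(f"{label:>2}  {o:4d}  {p:.3f}  {e:6.1f}")
print(f"c = {c_hat:.3f}, r = {r_hat:.3f}, X2 = {chi_square:.3f}")
```

With the printed estimates, the first five probabilities agree with the tabulated 0.322, 0.323, 0.198, 0.095 and 0.039 to three decimals. The expected frequencies and $X^{2}$ come out slightly different from the published table, which appears to multiply the rounded probabilities by 429, whereas the sketch carries full precision through the recursion.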
Papworth's u-statistic

The data for rings and dicentrics (Example (2c)) leave no degrees of freedom for the Pearson $X^{2}$ test of the goodness of the negative binomial fit. The number of cells (3), less the number of estimated parameters (2), less the number of constraints (1), leaves zero degrees of freedom for this test. Yet it is obvious that the negative binomial provides a remarkably good characterization of these data. Bender et al. [2] use the Papworth test statistic, u, as a measure of the adequacy of the Poisson distribution to characterize these data. This statistic was shown (Savage [6]) to have a limiting Normal distribution, and it is independent of degrees of freedom. As a consequence, the probabilities associated with values of u as a test statistic may be read from any standard table of the Normal distribution. We could have used the Papworth u-statistic to test the adequacy of the negative binomial as a characterization of these data. Instead we opted for the Pearson $X^{2}$ test. Our choice was not frivolous. Rather it was based on our investigations of the underlying mathematical theory, which indicated that, in general and whenever both are applicable, the $X^{2}$ approximation does the job at least as well as the Normal approximation.

The binomial mixture

One alternative to the mixture of Poissons is the mixture of binomials considered by Skellam. Here each cluster of size $n$ is characterized by its own binomial parameter, $p$. The distribution of these $p$'s is characterized, in turn, by a beta distribution, with parameters $a$ and $b$. The joint distribution of this aggregate reduces to a beta-binomial, or Skellam distribution, in which the probability of $x$ aberrations may be expressed as:
$$P(x) = \binom{n}{x} \frac{\Gamma(a+x)\,\Gamma(b+n-x)\,\Gamma(a+b)}{\Gamma(a+b+n)\,\Gamma(a)\,\Gamma(b)},$$

where $x = 0, 1, 2, \ldots, n$; $a > 0$; $b > 0$; and $\Gamma(a) = (a-1)!$. Calculation of successive probabilities may be simplified by applying the following recursive relationships:

$$P(0) = \frac{(b+n-1)(b+n-2)\cdots(b+1)\,b}{(a+b+n-1)(a+b+n-2)\cdots(a+b+1)(a+b)},$$

$$P(1) = \frac{na}{b+n-1}\,P(0),$$

$$P(2) = \frac{(n-1)(a+1)}{2(b+n-2)}\,P(1),$$

and so forth. In general

$$P(x+1) = \frac{(n-x)(a+x)}{(x+1)(b+n-x-1)}\,P(x),$$

for $x = 0, 1, 2, \ldots, n-1$. The mean of this distribution is $\mu = na/(a+b)$, and the variance is $\sigma^{2} = nab(a+b+n)/[(a+b)^{2}(a+b+1)]$.

Discussion

In general, the sampling scheme by which data are collected plays an important role in the characterization of the aggregate. When the procedure involves a sampling of clusters (cluster sampling), the aggregated data will be overdispersed. On the other hand, stratified sampling, like "blocking" in experimental design, is employed to reduce variability, and will, therefore, generally yield aggregated data that are underdispersed.

Just as the binomial goes to the Poisson under certain limiting conditions on $n$ and $p$, so the beta-binomial tends, in the limit, to the negative binomial. In the limit, the characterizations of such mixtures may have broader implications under other data sampling conditions. For example, our recent investigations into the distributional properties of sister-chromatid exchanges (SCE) in cells from normal humans have led us to consider the classical Pearson-type distributions (Kendall and Stuart) as appropriate characterizations. Our preliminary observations on a large series of cells, some exhibiting up to 32 SCEs per cell, indicate that some Pearson types do remarkably well at characterizing these overdispersed data. (These SCE data appear in a paper by Michael A. Bender and his collaborators in the 1992 volume of this journal.) That the hypergeometric distribution, which arises in sampling from a finite population, tends, in the limit, to the Pearson-type distributions is particularly intriguing. Currently we are examining these and other phenomena, and will report on our results in a later communication.

References
1 Auerbach, C. (1970) Remark on the "Tables for determining the statistical significance of mutation frequencies", Mutation Res., 10, 256.
2 Bender, M.A., R.J. Preston, R.C. Leonard, B.E. Pyatt and P.C. Gooch (1990) On the distribution of spontaneous chromosomal aberrations in human peripheral blood lymphocytes in culture, Mutation Res., 244, 215-220.
3 Greenwood, M., and G.U. Yule (1920) An inquiry into the nature of frequency distributions of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents, J. Roy. Statist. Soc., Ser. A, 83, 255-279.
4 Kastenbaum, M.A., and K.O. Bowman (1970) Tables for determining the statistical significance of mutation frequencies, Mutation Res., 9, 527-549.
5 Kendall, M.G., and A. Stuart (1969) The Advanced Theory of Statistics, Vol. 1, Distribution Theory, 3rd edn., Hafner, New York.
6 Savage, J.R.K. (1970) Sites of radiation induced chromosome exchanges, Current Topics Radiation Res., 6, 129-194.
7 Skellam, J.G. (1948) A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials, J. Roy. Statist. Soc., Ser. B, 10, 257-261.