Overdispersion of aggregated genetic data


Mutation Research, 272 (1992) 133-137 © 1992 Elsevier Science Publishers B.V. All rights reserved 0165-1161/92/$05.00


MUTENV 08839

K.O. Bowman a and M.A. Kastenbaum b

a Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6367, USA and b 16933 Timberlakes Drive, Fort Myers, FL 33908, USA

(Received 13 December 1991) (Revision received 10 March 1992) (Accepted 28 April 1992)

Keywords: Aggregated data; Chromosomal aberrations; Overdispersion

This research was supported under contract ERD-89-828 with the Center for Indoor Air Research by Martin Marietta Energy Systems, Inc., under contract DE-AC05-84OR21400 with the U.S. Department of Energy.

Correspondence: Dr. K.O. Bowman, Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6367, USA.

Tables of sample sizes published in this journal (Kastenbaum and Bowman [4]) have proved useful to geneticists who face the problem of testing the significance of observed increases in mutation frequencies. These tables were derived on the assumption that the observed number of mutations is a Poisson variable. Yet this assumption may not be met in practice. Auerbach [1] noted that, more often than not, the reported number of mutations is in fact the sum of the mutations observed in several replications of the same experiment. Such replication may be expected to introduce an additional source of variation that is not accounted for if only the aggregate is considered in the analysis. Failure to account for this increased variation will affect the statistical analyses that are employed to determine the significance levels of differences between groups or experimental treatments. In particular, deviations from the Poisson, together with the use of Poisson-based statistical tests, could lead to an excess of "false positive" results.

To the radiation biologist, a "false positive" would mean that the effect of radiation may be judged more damaging than it actually is. To the environmentalist, it would mean that an apparent insult to the environment may be judged more dangerous than it actually is. Such false claims impact seriously on "exposed" populations, and they carry with them high emotional and societal costs. They should, therefore, be avoided or reduced wherever possible.

This problem may be addressed by assuming that the number of observations (mutations) in each replication (sampled cluster) follows a Poisson (or a binomial) distribution, and that each such distribution is characterized by its own parameter $\lambda$ (or $p$), as the case may be. Mathematical expressions for the distributions of such aggregates exist, and have been shown to be adequate characterizations of the aggregated data. More specifically, Greenwood and Yule [3] showed how a modified Poisson results in a negative binomial distribution, and Skellam [7] derived a beta-binomial distribution from a modified binomial.

In the following sections we describe these two distributions, and present the formulas necessary for estimating the respective sets of parameters. We show how, once the parameters are estimated, the corresponding probabilities and "expected" frequencies may be readily computed, with no greater grasp of mathematics than is required to fit data to a binomial or a Poisson distribution. These methods, applied to the data of Bender et al. [2] on spontaneous chromosomal aberrations in human blood lymphocytes, illustrate how effective the negative binomial distribution can be in characterizing overdispersed Poisson data.

The Poisson mixture

Bender et al. counted the aberrations in 200 cells from each of 429 individuals, and examined the distribution of aberrations per cell in the aggregate, as well as the distribution of cells with aberrations within the 429 "clusters" of 200 cells each. Their sampling format is such that each cluster of cells may be thought of as having been drawn from a different population, that is, from one of 429 different individuals. In such a situation, if the underlying distribution of aberration frequencies in each individual is Poisson, there is good reason to expect that the aggregate will be "overdispersed", that is, the variance of the aggregate will be larger than a Poisson variance.

Just such a mixture of Poissons was considered by Greenwood and Yule [3]. They supposed that each cluster in their sample was characterized by its own Poisson parameter, $\lambda$, and that the distribution of these $\lambda$'s could be characterized by a gamma or Pearson Type III distribution, with parameters $r$ and $c$. The joint distribution of the aggregate then reduces to a negative binomial distribution in which the probability of $x$ aberrations per cell may be expressed as:

$$P(x) = \frac{(x+r-1)!}{x!\,(r-1)!}\; p^{r} (1-p)^{x},$$

where

$$p = \frac{c}{c+1}.$$

It follows that

$$P(x) = \frac{(x+r-1)!}{x!\,(r-1)!} \left(\frac{c}{c+1}\right)^{r} \left(1 - \frac{c}{c+1}\right)^{x} = \frac{(x+r-1)!}{x!\,(r-1)!} \left(\frac{c}{c+1}\right)^{r} \left(\frac{1}{c+1}\right)^{x},$$

for $x = 0, 1, 2, \ldots$; $c > 0$, $r > 0$ (Kendall and Stuart [5]). This leads to the following recursive relationships, which facilitate desk calculations:

$$P(x=0) = P(0) = \left(\frac{c}{c+1}\right)^{r},$$

$$P(x=1) = P(1) = \frac{r}{c+1}\left(\frac{c}{c+1}\right)^{r} = \frac{r}{c+1}\,P(0),$$

$$P(x=2) = P(2) = \frac{r(r+1)}{2(c+1)^{2}}\left(\frac{c}{c+1}\right)^{r} = \frac{r+1}{2(c+1)}\,P(1),$$

and so on. In general,

$$P(x+1) = \frac{r+x}{(c+1)(x+1)}\,P(x),$$

for $x = 0, 1, 2, \ldots$. The mean of this distribution is $\mu = r/c$, and its variance is $\sigma^2 = r(1 + 1/c)/c$. If $\mu$ is estimated by the sample mean, $\bar{x}$, and $\sigma^2$ by the sample variance, $s^2$, the two parameters, $r$ and $c$, may be estimated as follows:

$$\hat{c} = \bar{x}/(s^2 - \bar{x}),$$

and

$$\hat{r} = \bar{x}^2/(s^2 - \bar{x}).$$
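The moment estimators and the recursion above lend themselves to a few lines of code. The following is a minimal sketch, not the authors' own program: the function names `nb_moment_estimates` and `nb_probabilities` are hypothetical, and the inputs are the rounded summary statistics quoted for chromatid deletions in Example 1(a), so the estimates come out close to, but not exactly equal to, the published values of 3.579 and 4.597.

```python
def nb_moment_estimates(mean, var):
    """Method-of-moments estimates (c_hat, r_hat) for the negative binomial.

    Uses c = mean / (var - mean) and r = mean**2 / (var - mean).
    Hypothetical helper name, introduced here for illustration only.
    """
    excess = var - mean
    if excess <= 0:
        raise ValueError("sample variance must exceed the sample mean")
    return mean / excess, mean * mean / excess


def nb_probabilities(r, c, x_max):
    """Return [P(0), ..., P(x_max)] from P(x+1) = (r + x) / ((c + 1)(x + 1)) * P(x)."""
    probs = [(c / (c + 1.0)) ** r]  # P(0) = (c / (c + 1))**r
    for x in range(x_max):
        probs.append(probs[-1] * (r + x) / ((c + 1.0) * (x + 1)))
    return probs


# Rounded summary statistics for Example 1(a); 429 individuals in the series.
c_hat, r_hat = nb_moment_estimates(1.284, 1.643)
probs = nb_probabilities(r_hat, c_hat, 4)
tail = 1.0 - sum(probs)  # probability of 5 or more aberrations

print(f"c_hat = {c_hat:.3f}, r_hat = {r_hat:.3f}")
for x, p in enumerate(probs):
    print(f"P({x}) = {p:.3f}, expected frequency = {429 * p:.1f}")
print(f"P(5+) = {tail:.3f}, expected frequency = {429 * tail:.1f}")
```

Because each probability is obtained from the previous one by a single multiplication, no factorials of the non-integer parameter $r$ ever need to be evaluated.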

Applications

Bender et al. tested the adequacy of the Poisson distribution to characterize the various classes of aberrations they observed in their series. Data in Tables 2 and 3 of their paper suggest that a Poisson fit should be rejected for the classes of aberrations designated as chromatid deletions, chromosome deletions, and rings and dicentrics. We re-analyzed these data on the assumption that they are aggregated clusters, better characterized as mixtures of Poissons. As such they would be expected to be overdispersed, and therefore more apt to follow the negative binomial distribution suggested by Greenwood and Yule.

Example 1: Bender et al. [2], Table 3.

(a) Chromatid deletions: $\bar{x} = 1.284$, $s^2 = 1.643$, $\hat{c} = 3.579$, and $\hat{r} = 4.597$.

x       obs. freq.   P(x)    exp. freq.   (O - E)   (O - E)^2/E
0       143          0.322   138.1        +4.9      0.174
1       132          0.323   138.6        -6.6      0.314
2       85           0.198   84.9         +0.1      0.000
3       39           0.095   40.8         -1.8      0.079
4       18           0.039   16.7         +1.3      0.101
5+      12           0.023   9.9          +2.1      0.446
Total   429          1.000   429.0        0.0       1.114

$X^2 = 1.114$, d.f. = 3, and $0.7 < p < 0.8$.

The probabilities in the table were computed recursively:

$P(0) = (3.579/4.579)^{4.597} = 0.322$
$P(1) = (0.322)(4.597)/4.579 = 0.323$
$P(2) = (0.323)(5.597)/[2(4.579)] = 0.198$
$P(3) = (0.198)(6.597)/[3(4.579)] = 0.095$
$P(4) = (0.095)(7.597)/[4(4.579)] = 0.039$
$P(5+) = 1 - \sum_{x=0}^{4} P(x) = 1 - 0.977 = 0.023$

(b) Chromosome deletions: $\bar{x} = 0.562$, $s^2 = 0.796$, $\hat{c} = 2.396$, and $\hat{r} = 1.346$.

x       obs. freq.   P(x)    exp. freq.
0       267          0.625   268.1
1       110          0.248   106.4
2       33           0.086   36.9
3       14           0.028   12.0
4       3            0.009   3.9
5+      2            0.004   1.7
Total   429          1.000   429.0

$X^2 = 1.13$, d.f. = 3, and $0.7 < p < 0.8$.

(c) Rings and dicentrics: $\bar{x} = 0.289$, $s^2 = 0.374$, $\hat{c} = 3.39$, and $\hat{r} = 0.980$.

x       obs. freq.   P(x)    exp. freq.
0       334          0.776   332.9
1       73           0.173   74.2
2       17           0.039   16.7
3       3            0.009   3.9
4       2            0.003   1.3
Total   429          1.000   429.0

$X^2 = 0.61$, d.f. = 2, and $0.7 < p < 0.8$.

Example 2: Bender et al. [2], Table 2.

(a) Chromatid deletions: $\bar{x} = 0.00634$, $s^2 = 0.00676$, $\hat{c} = 15.0952$, and $\hat{r} = 0.0957$.

x       obs. freq.   P(x)      exp. freq.
0       85274        0.99388   85274.9
1       510          0.00591   507.1
2       14           0.00020   17.2
3       2            0.00001   0.8
Total   85800        1.00000   85800.0

$X^2 = 2.4$, d.f. = 1, and $0.7 < p < 0.8$.

(b) Chromosome deletions: $\bar{x} = 0.00279$, $s^2 = 0.00303$, $\hat{c} = 11.2339$, and $\hat{r} = 0.0313$.

x       obs. freq.   P(x)      exp. freq.
0       85571        0.99733   85571.2
1       220          0.00255   219.0
2       8            0.00011   9.3
3       1            0.00001   0.5
Total   85800        1.00000   85800.0

$X^2 = 0.69$, d.f. = 1, and $0.7 < p < 0.8$.

(c) Rings and dicentrics: $\bar{x} = 0.00145$, $s^2 = 0.00156$, $\hat{c} = 12.5652$, and $\hat{r} = 0.0182$.

x       obs. freq.   P(x)      exp. freq.
0       85682        0.99861   85680.7
1       112          0.00134   115.0
2       6            0.00005   4.3
Total   85800        1.00000   85800.0
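The goodness-of-fit figures quoted in the examples can be checked with a short calculation. The sketch below is illustrative only: the helper names `pearson_chi_square` and `degrees_of_freedom` are hypothetical, and the inputs are the observed and expected frequencies tabulated for Example 1(a). The degrees of freedom are taken as the number of cells, less the two estimated parameters, less one constraint on the total. With these inputs the statistic should come out close to the tabulated 1.114.

```python
def pearson_chi_square(observed, expected):
    """Return the Pearson statistic, the sum over cells of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_cells, n_estimated_params=2, n_constraints=1):
    """Cells, less estimated parameters (r and c), less the constraint on the total."""
    return n_cells - n_estimated_params - n_constraints


# Example 1(a): chromatid deletions, frequencies as tabulated in the text.
observed = [143, 132, 85, 39, 18, 12]
expected = [138.1, 138.6, 84.9, 40.8, 16.7, 9.9]

x2 = pearson_chi_square(observed, expected)
df = degrees_of_freedom(len(observed))
print(f"X^2 = {x2:.3f}, d.f. = {df}")
```

Converting the statistic to a tail probability requires the chi-square distribution function, which is not in the Python standard library; a printed table or a statistics package can be used for that step.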

Papworth's u-statistic

The data for rings and dicentrics (Example 2(c)) leave no degrees of freedom for the Pearson $X^2$ test of the goodness of the negative binomial fit. The number of cells (3), less the number of estimated parameters (2), less the number of constraints (1), leaves zero degrees of freedom for this test. Yet it is obvious that the negative binomial provides a remarkably good characterization of these data.

Bender et al. [2] use the Papworth test statistic, u, as a measure of the adequacy of the Poisson distribution to characterize these data. This statistic was shown (Savage [6]) to have a limiting Normal distribution, and it is independent of degrees of freedom. As a consequence, the probabilities associated with values of u as a test statistic may be read from any standard table of the Normal distribution.

We could have used the Papworth u-statistic to test the adequacy of the negative binomial as a characterization of these data. Instead we opted for the Pearson $X^2$ test. Our choice was not frivolous. Rather, it was based on our investigations of the underlying mathematical theory, which indicated that, in general and whenever both are applicable, the $X^2$ approximation does the job at least as well as the Normal approximation.

The binomial mixture

One alternative to the mixture of Poissons is the mixture of binomials considered by Skellam [7]. Here each cluster of size $n$ is characterized by its own binomial parameter, $p$. The distribution of these $p$'s is characterized, in turn, by a beta distribution, with parameters $a$ and $b$. The joint distribution of this aggregate reduces to a beta-binomial, or Skellam, distribution, in which the probability of $x$ aberrations may be expressed as:

$$P(x) = \binom{n}{x} \frac{\Gamma(a+x)\,\Gamma(b+n-x)\,\Gamma(a+b)}{\Gamma(a+b+n)\,\Gamma(a)\,\Gamma(b)},$$

where $x = 0, 1, 2, \ldots, n$; $a > 0$; $b > 0$; and $\Gamma(a) = (a-1)!$. Calculation of successive probabilities may be simplified by applying the following recursive relationships:

$$P(0) = \frac{(b+n-1)(b+n-2)\cdots(b+1)\,b}{(a+b+n-1)(a+b+n-2)\cdots(a+b+1)(a+b)},$$

$$P(1) = \frac{na}{b+n-1}\,P(0),$$

$$P(2) = \frac{(n-1)(a+1)}{2(b+n-2)}\,P(1),$$

and so forth. In general,

$$P(x+1) = \frac{(n-x)(a+x)}{(x+1)(b+n-x-1)}\,P(x),$$

for $x = 0, 1, 2, \ldots, n-1$. The mean of this distribution is $\mu = na/(a+b)$, and the variance is

$$\sigma^2 = \frac{nab(a+b+n)}{(a+b)^2(a+b+1)}.$$
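As with the negative binomial, the beta-binomial recursion is easy to program. The sketch below is an illustration under stated assumptions, not part of the original analysis: the function names are hypothetical, and the values of $n$, $a$ and $b$ are arbitrary illustration values that do not come from the data of Bender et al. It builds all $n + 1$ probabilities from $P(0)$ and compares the resulting mean and variance with the closed-form expressions for $\mu$ and $\sigma^2$.

```python
def beta_binomial_probabilities(n, a, b):
    """Return [P(0), ..., P(n)] for the beta-binomial via the recursion in the text."""
    # P(0) = b(b+1)...(b+n-1) / [(a+b)(a+b+1)...(a+b+n-1)]
    p0 = 1.0
    for i in range(n):
        p0 *= (b + i) / (a + b + i)
    probs = [p0]
    # P(x+1) = (n - x)(a + x) / [(x + 1)(b + n - x - 1)] * P(x), for x = 0..n-1
    for x in range(n):
        probs.append(probs[-1] * (n - x) * (a + x) / ((x + 1) * (b + n - x - 1)))
    return probs


def mean_and_variance(probs):
    """Mean and variance of a distribution given as probabilities of 0, 1, 2, ..."""
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var


# Arbitrary illustration values (an assumption, not taken from the paper).
n, a, b = 20, 2.0, 30.0
probs = beta_binomial_probabilities(n, a, b)
mean, var = mean_and_variance(probs)

formula_mean = n * a / (a + b)
formula_var = n * a * b * (a + b + n) / ((a + b) ** 2 * (a + b + 1))

print(f"sum of probabilities = {sum(probs):.6f}")
print(f"mean     = {mean:.4f} (formula {formula_mean:.4f})")
print(f"variance = {var:.4f} (formula {formula_var:.4f})")
```

Agreement between the computed and closed-form moments is a convenient check that the recursion has been entered correctly.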

Discussion

In general, the sampling scheme by which data are collected plays an important role in the characterization of the aggregate. When the procedure involves a sampling of clusters (cluster sampling), the aggregated data will be overdispersed. On the other hand, stratified sampling, like "blocking" in experimental design, is employed to reduce variability, and will, therefore, generally yield aggregated data that are underdispersed.

Just as the binomial goes to the Poisson under certain limiting conditions on $n$ and $p$, so the beta-binomial tends, in the limit, to the negative binomial. In the limit, the characterizations of such mixtures may have broader implications under other data sampling conditions. For example, our recent investigations into the distributional properties of sister-chromatid exchanges (SCE) in cells from normal humans have led us to consider the classical Pearson-type distributions (Kendall and Stuart [5]) as appropriate characterizations. Our preliminary observations on a large series of cells, some exhibiting up to 32 SCEs per cell, indicate that some Pearson types do remarkably well at characterizing these overdispersed data. (These SCE data appear in a paper by Michael A. Bender and his collaborators in the 1992 volume of this journal.) That the hypergeometric distribution, which arises in sampling from a finite population, tends, in the limit, to the Pearson-type distributions is particularly intriguing. Currently we are examining these and other phenomena, and will report on our results in a later communication.
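The limiting relationship between the beta-binomial and the negative binomial mentioned in this Discussion can be examined numerically. The text does not spell out the limiting conditions, so the sketch below assumes one standard regime: $a = r$ is held fixed while $n$ grows and $b = cn$, so that $n/(b+n)$ stays equal to $1/(c+1)$. The function names are hypothetical, and the values of $r$ and $c$ are simply the estimates from Example 1(a), used here as convenient illustration values.

```python
import math


def negative_binomial_probabilities(r, c, x_max):
    """[P(0), ..., P(x_max)] from P(x+1) = (r + x) / ((c + 1)(x + 1)) * P(x)."""
    probs = [(c / (c + 1.0)) ** r]
    for x in range(x_max):
        probs.append(probs[-1] * (r + x) / ((c + 1.0) * (x + 1)))
    return probs


def beta_binomial_head(n, a, b, x_max):
    """First x_max + 1 beta-binomial probabilities (requires x_max <= n)."""
    # Accumulate P(0) in logarithms so that large n causes no underflow.
    log_p0 = sum(math.log((b + i) / (a + b + i)) for i in range(n))
    probs = [math.exp(log_p0)]
    for x in range(x_max):
        probs.append(probs[-1] * (n - x) * (a + x) / ((x + 1) * (b + n - x - 1)))
    return probs


def max_abs_difference(n, r, c, x_max=5):
    """Largest gap between the two distributions over x = 0..x_max.

    Assumed limiting regime (not stated in the paper): a = r, b = c * n.
    """
    bb = beta_binomial_head(n, r, c * n, x_max)
    nb = negative_binomial_probabilities(r, c, x_max)
    return max(abs(p - q) for p, q in zip(bb, nb))


for n in (10, 100, 1000, 10000):
    print(n, max_abs_difference(n, r=4.597, c=3.579))
```

Under this assumed regime the printed differences should shrink as $n$ increases, which is the numerical counterpart of the limiting statement.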

References

1 Auerbach, C. (1970) Remark on the "Tables for determining the statistical significance of mutation frequencies", Mutation Res., 10, 256.

2 Bender, M.A., R.J. Preston, R.C. Leonard, B.E. Pyatt and P.C. Gooch (1990) On the distribution of spontaneous chromosomal aberrations in human peripheral blood lymphocytes in culture, Mutation Res., 244, 215-220.

3 Greenwood, M., and G.U. Yule (1920) An inquiry into the nature of frequency distributions of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents, J. Roy. Statist. Soc., Ser. A, 83, 255-279.

4 Kastenbaum, M.A., and K.O. Bowman (1970) Tables for determining the statistical significance of mutation frequencies, Mutation Res., 9, 527-549.

5 Kendall, M.G., and A. Stuart (1969) The Advanced Theory of Statistics, Vol. 1, Distribution Theory, 3rd edn., Hafner, New York.

6 Savage, J.R.K. (1970) Sites of radiation induced chromosome exchanges, Current Topics Radiation Res., 6, 129-194.

7 Skellam, J.G. (1948) A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials, J. Roy. Statist. Soc., Ser. B, 10, 257-261.