Signal Processing 83 (2003) 777 – 788 www.elsevier.com/locate/sigpro
Deduction of a gene regulatory relationship framework from gene expression data by the application of graphical Gaussian modeling Sachiyo Aburatania , Satoru Kuharaa , Hiroyuki Tohb;∗ , Katsuhisa Horimotoc;∗ a Laboratory
of Molecular Gene Technics, Graduate School of Genetic Resources Technology, Kyushu University, Hakozaki 6101, Higashiku, Fukuoka 8128581, Japan b Laboratory of Genome Informatics, Bioinformatics Center, Institute of Chemical Research, Kyoto University, Uji, Kyoto 6110011, Japan c Laboratory of Biostatistics, Human Genome Center, Institute of Medical Science, University of Tokyo, 461 Shirokanedai, Minatoku, Tokyo 1088639, Japan Received 11 June 2002; received in revised form 16 August 2002
Abstract Recently, we developed an automatic system for deducing a framework of regulatory relationships from gene expression data on a genomic scale. One of the merits of our system is that it simultaneously performs the gene classi1cation and the relation inference from a large amount of gene expression data by combining the graphical Gaussian modeling with standard multivariate statistical techniques. In this paper, we describe our modi1cations of the previous system to estimate the cluster boundaries, by setting a userde1ned threshold for measuring the linear relationship between the pro1les of clusters. The modi1ed system was applied to a whole set of yeast expression pro1les of 6152 genes available at a web site, and the results obtained by the present analyses were statistically evaluated by a simulation that calculates the chance probability of the inferred framework of regulatory relationships, in addition to the statistical evaluation of the clusters by a previous method. In particular, the feasibility of the modi1ed system is demonstrated by the comparison between two frameworks inferred from two sets of clusters that were estimated by distinctive thresholds. ? 2002 Elsevier Science B.V. All rights reserved.
1. Introduction Advances in microarray techniques have enabled us to monitor the expression levels of thousands or tens of thousands of genes simultaneously under
∗
Corresponding authors. Tel.: +81354495467 (H. Toh), +81774383092 (K. Horimoto). Email addresses:
[email protected] (H. Toh),
[email protected] (K. Horimoto).
multiple conditions [7,16,19]. To elucidate the underlying information in the complex expression data, various types of methods have been proposed, in terms of two analytical approaches. One is gene classi1cation by detecting similar pro1le patterns, and the other is network inference by modeling the expression changes of the genes over the measured conditions. Cluster analyses and other approaches are usually adopted in the classi1cation of genes sharing similar pro1le patterns. The techniques of cluster analysis include hierarchical and nonhierarchical types
01651684/03/$  see front matter ? 2002 Elsevier Science B.V. All rights reserved. doi:10.1016/S01651684(02)004760
778
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
[2,5,8,21,22]. One of the important issues in gene classi1cation is the estimation of gene groups. In hierarchical clustering, the cluster number is determined by visual inspection and consideration of the gene information, while the cluster number is estimated by some statistical measurements in a heuristic way in nonhierarchical clustering. An approach that combines cluster analysis with sequence motif detection has also been proposed to estimate the gene groups [25]. Since some ambiguity in the cluster analysis exists in determining the gene groups, distinctive methods from clustering techniques have also been recently proposed [3,18]. For the direct inference of a genetic network from the pro1les, some mathematical models have been adopted: Boolean networks [1,20], diDerential equations [6], and Bayesian networks [10,13]. Apart from the gene classi1cation, most direct inferences are applied to smaller selected sets of gene expression pro1les, rather than the actual data on a genomic scale. This is partly due to the lengthy computational time needed to infer the relationships between combinational pairs of many genes. Furthermore, the diEculty of inferring regulatory networks aDects the composition of the expression pro1les. In many pro1le data sets, the expression changes of more than thousands of genes have been measured under at most two hundreds of conditions. For example, the expression changes of about 6200 genes in yeast have been measured under about 200 diDerent conditions. The situation indicates that many genes show quite similar patterns of expression pro1les under such restricted number of measured conditions. Thus, the features of the present pro1le data on a genomic scale principally prevent us from inferring the explicit relationship between each gene. In other words, the preprocessing of the pro1les on a genomic scale may be expected to be the 1rst step to reveal the whole regulatory relationship. Recently, we developed two methods for expression pro1le analysis [15,23]: one is a method for the automatic estimation of cluster boundaries in a hierarchical clustering, and the other is a method for the inference of a genetic network by the application of graphical Gaussian modeling (GGM) [27]. These two newly developed methods have been synthesized into a system, named Automatic System for Inferring A Network (ASIAN), for automatically deduc
ing the framework of a genetic network, by only the input of a large amount of expression pro1le data [14,24]. In this paper, we describe new modi1cations to the previous system to increase its ability to utilize biological information about the genes whose expression levels are monitored. The classi1cation and the inferred framework estimated by the modi1ed system are evaluated statistically by the calculation of the chance probability of gene classi1cation in the previous work [22], and the simulation designed for the inferred framework with reference to the known regulatory systems. The feasibility of the system is demonstrated here by analyzing the expression pro1les of whole genes available at a web site. 2. Expression prole data The gene expression pro1le data analyzed here were cited from Gasch et al. [11] (http://genomewww5. stanford.edu/MicroArray/SMD/). The data comprise the expression pro1les of 6152 yeast (Saccharomyces cerevisiae) genes that were measured under 173 conditions. Although some values were missing among the 173 conditions, the missing values in the sets of pro1les were replaced by the averages over the existing values under the same conditions. 3. System for deducing a framework from expression data 3.1. Architecture of system The two methods that we recently developed for microarray analysis [15,23] have been synthesized into a system, named ASIAN [14,24]. The system is divided into the following three parts: (I) the hierarchical clustering of pro1les, (II) the estimation of cluster number along the dendrogram in (I), and (III) the application of GGM to the averaged pro1les over the members of each cluster obtained in (II). Here, we describe the essential parts of each step. Step I: A hierarchical clustering is applied to the gene expression pro1les. The metric for measuring the distance between the pro1les is the Euclidean distance
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
between the Pearson correlation coeEcients, i.e., n (1) dij = (ril − rjl )2 ; l=1
where n is the total number of genes, and rij is the Pearson correlation coeEcient between the i and j genes of the expression pro1les that are measured under nc conditions, pik , (k = 1; 2; : : : ; nc ): nc (pik − pi )(pjk − pj ) rij = k=1 ; (2) nc nc 2· 2 (p − p ) (p − p ) ik i jk j k=1 k=1 where pJ i is the arithmetic average of pik over nc conditions. Thus, the similarity between the patterns of expression pro1les is de1ned based on the correlation coeEcient in the present system. In contrast to the distance only between pro1les, which is frequently adopted in gene clustering [8], the present metric is robust for both missing and abnormal values. Then, the metric de1ned in Eq. (1) is analyzed by a standard hierarchical clustering technique, the unweighted pairgroup method using arithmetic averages (UPGMA) [12]. Step II: The cluster boundaries along the dendrogram obtained in step I are estimated based on the statistical properties of the pro1les. The variance inLation factor (VIF) is utilized to estimate the linear relationship between the pro1les in the clusters along the dendrogram. In the multiple regression analysis, the existence of a high linear relationship among the explanatory variables is known as multicollinearity, and the variables that are involved in the multicollinearity are diagnosed by the VIF, as follows: VIFi = rii−1 ;
(3)
where rii−1 is the ith diagonal element of the inverse of the correlation coeEcient matrix (CCM) between explanatory variables [9]. In a CCM between m explanatory variables, therefore, m VIFs are calculated. The magnitude of the VIF increases when the correlation among any independent variable increases. Since the VIF is utilized in exploratory analyses with no signi1cance test, any cutoD value must be based on practical considerations. When the explanatory variables in the Eq. (3) correspond to the gene pro1les, the VIF expresses the degree of linear relationship between
779
the pro1les. Given that the threshold in the diagnosis of multicollinearity is set to be vc , the linear relationship of the ith variable exists when VIFi is larger than vc : in the previous system, the vc value was 1xed to be a popular threshold of 10.0 [9]. The mVIFs are assessed with the following condition: max{VIFi } ¡ vc
for i = 1; 2; : : : ; m:
(4)
If the condition is satis1ed, then no linear relationship exists in the m sets of pro1les, and in contrast, the linear relationship still exists in the pro1les, if the condition is not satis1ed. Thus, the maximum number of clusters with no linear relationship is searched for along the dendrogram. As described above, the estimation of cluster boundaries is based on only the CCM between the pro1les, regardless of the clustering technique. Step III: A network between the clusters whose boundaries were determined in step II is inferred by the application of GGM. As a preprocessing of the application of GGM, the averaged pro1les of the jth condition for each cluster obtained in step II are calculated, i.e., pJ j(k) =
1 nk
nk
pij ;
(5)
i∈clusterk
where nk is the number of members in the kth cluster. Then, the averaged pro1les over nc conditions in k clusters are subjected to GGM by a stepwise and iterative algorithm [26], in order to evaluate which pair of clusters is conditionally independent. A partial correlation coeEcient matrix (PCCM) is calculated from the inverse of CCM between the averaged pro1les, and an element in PCCM with a minimum absolute value is replaced to be zero. Subsequently, the CCM restored from the PCCM is tested with the original CCM by calculating the deviance, which is expressed as follows: dev = nc ln
Ri+1  ; Ri 
(6)
where Ri+1 and Ri are the determinants of the CCMs at the i + 1th and ith iteration steps. The dev is known approximately to follow a 2 distribution with one degree of freedom. The iteration of the replacement of minimum values with 0 value in PCCM is stopped in the present study, when the probability of deviance is
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
less than 0.05, which is a familiar signi1cance level in the statistical test. Finally, a model of the PCCM is obtained. When the partial correlation coeEcient for a cluster pair is equal to 0, the cluster pair is conditionally independent, indicating no connection between the clusters in the network. 3.2. Modi;cation of the system In the previous system, the threshold for VIF was 1xed to a value of 10.0, which is a familiar cutoD value in the multiple regression analysis [9]. In the present system, we designed a userde1ned threshold for VIF in estimating the cluster number. The new modi1cation of the threshold for VIF is expected to increase the performance of the system. First, the variety of cluster numbers enables us to compare several sets of connections between clusters. The networks between the clusters can be investigated from a hierarchical viewpoint with any amount of biological knowledge on the gene classes and the gene regulatory systems. Secondly, since the pro1le data sets are composed of diDerent numbers of genes, measured under diDerent conditions from diDerent organisms, the userde1ned threshold for VIF is useful for explanatory analyses of the pro1le data. Thirdly, since the cluster number increases as the VIF is set to a larger value, a variable setting of the VIF threshold also serves to estimate the maximum number of clusters derived from the pro1le data analyzed. Notably, the gene groups obtained by the present procedure are composed of the gene members that show a similar linear relationship between the pro1les, whenever any VIF thresholds are set. Although one parameter for the threshold of VIF is added to the system, the improvement enables us to perform a heuristic search for the gene classi1cation and the gene regulatory systems with reference to the biological information, as described below. 4. Analysis of yeast expression data 4.1. Estimation of cluster boundaries Along the dendrogram constructed by UPGMA, the cluster boundaries were estimated by the calculation of VIF with various thresholds. The cluster numbers
60
Number of clusters
780
50 40 30 20 10 0 0
50
100
Threshold of VIFs Fig. 1. Cluster numbers estimated by userde1ned thresholds for VIF.
estimated with six thresholds for VIF are plotted in Fig. 1. In the conventional threshold of 10.0, the cluster number is estimated to be 30, and the cluster numbers increase as the thresholds for VIF increase. In particular, the curve of the thresholds for VIF vs. the cluster number saturates in the threshold of about 50.0: the cluster number is 60 when the threshold for VIF is 50.0 and the number is 61 when the threshold is 100.0. The saturation of the curve suggests a limitation of the division into the clusters that mutually show no linear relationship, in the present 6182 gene pro1les measured under 173 conditions. With these situations in mind, the genes are classi1ed into 30 and 60 clusters in the present study. In the 30 clusters, the numbers of members range from 6 to 727, while they are from 3 to 727 in the 60 clusters. In the comparison between the two clusters, the clusters with a large number of members in the 30 clusters tend to keep their members in the 60 clusters, while the small clusters are divided into even smaller clusters in the 60 clusters. The detailed comparison between the two sets of clusters is discussed in terms of the gene function in the next section. The lists of all of the members in the two cases are available at our web sites (http://www.ged.sagamed.ac.jp/horimoto/ microarray3 or http://www.beri.co.jp/∼protein).
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
781
4.2. Inference of network between clusters
5. Evaluation of the present analyses
The probabilities of dev in the 30 and 60 clusters were iteratively calculated (see web supplement for detail). In the present study, when the probability of dev to stop the iteration was set to 0.05, as the standard signi1cance level in the test, the iteration in the 30 clusters was stopped when the probability at the 180th step was 0.004, and the iteration in the 60 clusters was stopped when the probability at the 822nd step was 0.04. The numbers of iteration steps before stopping the iteration in the two clusters corresponded to the numbers of elements of PCCM that were replaced with 0.0. The results of the application of GGM to the 30 and the 60 clusters are schematically shown in Fig. 2. The inferred networks are shown, focusing on the zero or nonzero features of the elements of the obtained PCCMs, instead of the explicit values of the partial correlation coeEcients. This is because the values of the partial correlation coeEcients did not precisely reLect the experimentally observed degrees of regulation, by the calculation of the averaged pro1les over the members in the same clusters. Therefore, the accuracy of our approach will be examined, based only on the presence or absence of connections between the clusters. Out of 435 elements in the 30 clusters and 1770 elements in the 60 clusters, 179 (41.1%) and 821 elements (46.4%) were replaced with 0.0, respectively, by the iterative procedure of GGM. Thus, 179 and 821 connections were broken oD, and the remaining 256 (58.9%) and 949 (53.6%) connections were established. In the 30 clusters, the maximum number of connections of a cluster was 22 in cluster 5, while the minimum number was 11 in cluster 12. In the 60 clusters, the maximum number was 43 in cluster 30, while the minimum number was 20 in cluster 41. Notably, the numbers of connections in each cluster have little correlation with the numbers of members in the corresponding cluster: the correlation coeEcients were −0:278 and 0.206 in the 30 and the 60 clusters, respectively. Thus, the connections are established or broken oD, depending on the properties of the pro1les, not on the number of members.
5.1. Comparison of cluster boundaries in terms of gene function The clustering of the genes by the present procedure was evaluated by the correspondence between the members of each cluster and the biological information of gene function. For this purpose, each member of the clusters was mapped to 120 functional categories at the second level in the Martinsried Institute of Protein Science functional classi1cation scheme (MIPS) database [17], and the chance probability for observing the frequencies of genes in particular functional categories within a cluster was estimated with the use of the hypergeometric distribution by Tavazoie et al. [22]. Clusters that were signi1cantly enriched for genes with similar functions are compared between the 30 and the 60 clusters in Table 1. In the 30 clusters, 31 gene groups with a signi1cant probability were found in 12 clusters, and in the 60 clusters, 34 groups were found in 14 clusters. In both cases, not all of the clusters showed a signi1cant enrichment for gene function. This is partly because the members of such clusters may participate in multiple classically de1ned processes. Note that the number of clusters with signi1cant enrichment and the corresponding gene groups in the 30 clusters are almost the same as those in the 60 clusters. As a 1rst approximation, therefore, the variation of threshold for VIF causes little diDerence in the gene classi1cation. Furthermore, the pro1les of genes belonging to a category are well classi1ed, regardless of the cluster number in the present system. The enrichment of clusters may also be evaluated in terms of the genes involved in environmental changes [11]. One of the notable features of the present classi1cation is that the genes involved in the environmental stress response (ESR), which are indicated in Figs. 3 and 4 of Gasch et al. [11], were concentrated in a few clusters, consistent with the previous work. The highest numbers of genes repressed in the ESR, 552 genes, were found in cluster 12 of the 30 clusters and in cluster 15 of the 60 clusters, and the most genes induced in the ESR were found in clusters 9, 13, and 14 of the 30 clusters, and clusters 12, 17, and 19 of the 60 clusters. Another feature is that diDerentially expressed isoenzymes, which are shown in Fig. 5 of Gasch et al. [11],

1

16

2

3

4

49

5

13 47 58

6

7

9

23 54

8

36 42 10 12 18 25 11 48 14 15 17 24 19 20 26 27 56 21 41 44 59 22 29 37 28 30

60 31 32 35 53 33 50 34 38 46 39 40 43 52 45 51 57 55
Fig. 2. Inferred networks between clusters by GGM. The connections between clusters are shown in the 30 clusters (the lower part), and in the 60 clusters (the upper part). Open and closed cells indicate the zero and the nonzero partial correlation coeEcients between the clusters, respectively. The solid and broken lines indicate the cluster boundaries in the 30 and the 60 clusters.
30
29
28
27
25 26
24
23
22
21
20
19
18
17
16
15
14
13
11 12
10
9
8
7
6
5
4
2 3
1
782 S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
401
457
176 129
202
347
727
426
656
1
2
3 4
7
9
12
13
14
42
29 14 25 70 27 14 6 8
15
Ccompound and carbohydrate metabolism Tricarboxylicacid pathway Respiration Sugar and carbohydrate transporters Peroxisomal transport Peroxisomal organization
84 10 28 18 8 19
Nucleotide metabolism 40 rRNA transcription 57 rRNA synthesis 19 rRNA processing 38 tRNA transcription 24 tRNA synthesis 12 Ribosomal proteins 125 Translation 37 tRNAsynthetases 19 Ccompound and carbohydrate metabolism 52 Metabolism of energy reserves 12
Mitochondrial organization
Protein modi1cation Nuclear transport Vesicular transport Organization of cytoplasm Organization of endoplasmatic reticulum Vesicular transport Cytokinesis Nuclear transport —
Nitrogen and sulfur metabolism
24 8.E10 19 8.E05 2.E08 2.E07 5.E05 2.E09
1 16 2.E05 2 9.E06 5.E07 3.E06 2.E05 3.E06 3 3.E05 4 7.E06 49 — 7 9 23 54 4.E06 10 12 18 25 3.E08 15 2.E15 1.E08 2.E15 1.E05 4.E06 8.E16 7.E16 4.E09 9.E06 17 4.E06
6.E05
115 656
311
176 51 78 118 20 51 13 70 89 92 96 727
244 157 457
— 2.E07 2.E05 9.E06 5.E07 3.E06 2.E05 3.E06 — — 6.E05 — — — — — — — 3.E08 2.E15 1.E08 2.E15 1.E05 4.E06 8.E16 7.E16 4.E09 1.E07 1.E06 3.E05 — 8.E10 8.E05 2.E08 2.E07 5.E05 2.E09
Members P in a category
— Nitrogen and sulfur metabolism 12 Protein modi1cation 29 Nuclear transport 14 Vesicular transport 25 Organization of cytoplasm 70 Organization of endoplasmatic reticulum 27 Vesicular transport 14 — — Nuclear organization 30 — — — — — — — Nucleotide metabolism 40 rRNA transcription 57 rRNA synthesis 19 rRNA processing 38 tRNA transcription 24 tRNA synthesis 12 Ribosomal proteins 125 Translation 37 tRNAsynthetases 19 Ccompound and carbohydrate metabolism 46 Metabolism of energy reserves 11 Stress response 22 — Ccompound and carbohydrate metabolism 84 Tricarboxylicacid pathway 10 Respiration 28 Sugar and carbohydrate transporters 18 Peroxisomal transport 8 Peroxisomal organization 19
Cluster Members MIPS category number in a cluster
Cluster Members MIPS category number in a cluster
Members P in a category
60 clusters
30 clusters
Table 1 Comparison between clusters with statistical signi1cance for functional category enrichment in 30 and 60 clusters
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788 783
120
215
21
33
17
21
23
26
6 5 6 4 8 13
26
3.E07 3.E07 2.E07 6.E05 1.E05 7.E05
—
1.E05
—
16
53
40
33
21
101
35
34
154 75 41 21 143 14 24 13 49 32 39 98
20 26 27 56 21 41 44 59 22 29 37 32
— — — — Recombination and DNA repair — — — mRNA transcription — — Glycolysis and gluconeogenesis Proteolysis Organization of endoplasmatic reticulum Proteolysis Organization of endoplasmatic reticulum Proteolysis Organization of endoplasmatic reticulum Respiration Transport ATPases ABC transporters Mitochondrial transport Homeostasis of other ions Mitochondrial organization Nuclear organization
MIPS category
6 13 13 19 15 5 5 6 5 6 4 8 13
15
10
Members in a category
— — — — 4.E05 — — — 1.E05 — — 1.E05 4.E07 7.E07 1.E12 2.E08 2.E05 3.E05 3.E07 3.E07 2.E07 6.E05 1.E05 7.E05
P
The P values were calculated by using the cumulative hypergeometric probability distribution for 1nding at least the observed genes from a particular functional category with a cluster size, according to Tavazoie et al. [22]. Since 120 MIPS functional categories were tested for each cluster, P values greater than 8 × 10−4 are not listed, as their total expectation within the cluster would be greater than 0.01.
Respiration Transport ATPases ABC transporters Mitochondrial transport Homeostasis of other ions Mitochondrial organization Nuclear organization
—
mRNA transcription
—
194
1.E06
16
51
mRNA transcription
291
15
Members in a cluster
Cluster number
P
MIPS category
Cluster Members number in a cluster
Members in a category
60 clusters
30 clusters
Table 1 (continued)
784 S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
are classi1ed in separate clusters, which is also consistent with the previous work. The genes induced in the ESR are found in the clusters indicated above, while their counterparts were found in clusters distinct from those involved in the induced genes: clusters 2, 3, 8, 22 and 25 of the 30 clusters, and clusters 2, 3, 26, 33, and 42 of the 60 clusters. Although a more detailed evaluation is beyond the present work, the framework of the classi1cation of genes involved in ESR is described well in these analyses. 5.2. Comparison of two networks between clusters The connections between the frameworks of the 30 and the 60 clusters are compared in Fig. 2. When a cluster in the 30 clusters is divided into multiple clusters in the 60 clusters, the connections in the framework of the 60 clusters change. For example, cluster 1 (nitrogen and sulfur metabolism, see in Table 1) and cluster 13 (Ccompound and carbohydrate metabolism and metabolism of energy reserves) in the 30 clusters are each divided into two clusters in the 60 clusters (clusters 1 and 16, and clusters 17 and 24). The divided clusters are connected in the two cases. Given that neither of clusters in the two cases (clusters 1 and 24 in the 60 clusters) has been characterized in terms of gene function, the clusters are expected to include the genes that are related with the other clusters characterized by each gene function. Although detailed investigations of the members in each cluster are needed to attain a 1nal result in the above cases, a comparison of the frameworks may provide clues to investigate the gene regulatory relationship. The present framework was further evaluated with reference to the gene relationships involved in the environmental changes [11]. Interestingly, the separate clusters, including the various isoenzymes described in the above section, show a remarkable feature between the inferred networks of the 30 and 60 clusters in Fig. 2. Out of the nine types of isoenzymes that are shown in Fig. 5 of Gasch et al. [11], seven types are connected in the network of 30 clusters, and in contrast, only two types are found in the connected clusters of the 60 clusters. For example, the isoenzymes of hexokinase, phosphoglucomutase, and 6Pglucose dehydrogenase are found in clusters 2 and 13 of the 30 clusters and clusters 2 and 17 of the 60 clusters. In
785
the inferred network between the 30 clusters, clusters 2 and 13 are connected, while clusters 2 and 17 are not connected in the network between the 60 clusters. The diDerence is directly attributable to the division of cluster 13 of the 30 clusters into clusters 17 and 24 of the 60 clusters, as shown in Table 1. Since clusters 2 and 24 are connected in the network of 60 clusters, one of the clusters including the genes induced by ESR, cluster 17, might be related with cluster 2, including transportrelated genes, through cluster 24, which is not characterized in terms of gene function. Although the comparison in the present study was limited, in terms of the relationship between the clusters including isoenzymes, a more systematic comparison of the frameworks between the diDerent clusters estimated by the present system will be investigated in the near future. 5.3. Regulatory relationship in inferred frameworks The inferred networks were evaluated by the correspondence of the connections with the genes in known regulatory relationships. In the correspondence, we examined whether the connection was established between the clusters to which the genes in a regulatory system belong. In the present pro1le data, we found 94 transcription factors and their regulating genes, which were experimentally identi1ed and were divided into four stages according to the degree of direct interaction with a transcription factor, with reference to the Yeast Protein Database (http://www.proteome.com/) (see the web supplement for detail). In total, 1263 pairs of genes with regulatory relationships were contained in the present data. In the inferred networks in the 30 and 60 clusters, 825 (65.3%) and 815 (64.5%) pairs, respectively, were correctly found in the same cluster or connected clusters. Although the percentages of correctly inferred relationships depend on the fraction of connected clusters in the inference, the percentages of correctness are larger than the percentages of connected clusters (58.9% and 53.6%, respectively) in both clusters. To estimate the correctness of the present analyses further, the chance probability for the correspondence was calculated by the following simulation. One hundred matrices were generated, in which the elements were randomly composed of 0 and 1, indicating a connected cluster and a notconnected one,
786
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
respectively. In each matrix, the matrix size and the number of elements with a value of 1 are equal to the size of the original matrix and the number of connections established by the present system, respectively. The matrices thus generated were corresponded with the known systems, and the numbers of correctly inferred connections were counted. Finally, the number of correctly inferred connections was transformed into a probability, under the assumption that one hundred of the numbers of correctly inferred connections in the hypothetical matrices approximate a normal distribution. In the above simulation, the probabilities of correctness of the total cases in the 30 and the 60 clusters are 0.014 and 0.001, respectively, which are signi1cant, especially for the 60 clusters. To evaluate the nature of the inferred frameworks further, the probabilities of the four stages were estimated in Table 2. A remarkable feature of the probabilities in the four stages is that the probabilities decrease as the degree of direct interaction of the transcription factor with the regulated genes increases. The present system, therefore, reveals that the expression changes by the direct interaction between the transcription factor and the regulated genes are reLected well in the pro1le data. In other words, the present system might discriminate the direct from indirect interactions of regulated genes with the transcription factor, and serve as a strategy to 1nd the unknown genes regulated by a transcription factor, apart from the sequence analyses of the upstream regions of the genes. At any rate, both probabilities in the inferred frameworks between the clusters strongly indicate that the present analyses appropriately reconstruct the regulatory relationships of the transcription factors. 5.4. Comparison with the standard test of the partial correlation coe=cient We evaluated the inferred framework by GGM further, from a statistical viewpoint. Since the chance probability of a partial correlation coeEcient can be estimated by the ttest for a null hypothesis of no partial correlation [4], the elements of PCCM obtained by stepwise procedures of GGM were compared with the elements of the initial PCCM with the same signi1cance level by the ttest. The partial
correlation coeEcient with the probability of 0.05 is expressed by r=
t0:05; nc −m 2 t0:05; nc −m
+ (nc − m)
;
(7)
where t0:05; nc −m is the t value with the probability of 0.05 and the (nc − m) degree of freedom. The comparison between the elements with a 0.05 signi1cance level in GGM and the ttest is shown in Table 3. As for the nonconnected elements in the PCCMs, all of the elements replaced with 0 in GGM have less than a 0.05 signi1cance level by the ttest in the 30 and 60 clusters, while there are no elements with more than a 0.05 signi1cance level. In contrast, the connected elements in both PCCMs clearly show no separation at the 0.05 signi1cance level. This indicates that the ttest is more severe than GGM in terms of the evaluation of signi1cance. Indeed, the total numbers of elements at the 0.05 signi1cance level are 306 and 1457 in the 30 and 60 clusters, respectively. Given that about 70.3% (306 connections) and 82.3% (1457) of the connections are broken oD from the standard test, however, the chance probabilities are estimated to be 0.186 and 0.226, respectively, which are not signi1cant at a conventional level. Consequently, GGM may be more suitable for inferring a genetic network than the standard test. The standard test may be useful to reduce the computational time in the operation of GGM. In the restoration of CCM from PCCM, the computational time increases as the iteration steps proceed. As a preprocessing step before the application of GGM, the elements of PCCM in a signi1cance level determined by the ttest are replaced by a 0 value, and then the PCCM thus preprocessed is subjected to GGM. In the present case, however, the probabilities of the clusters being connected by GGM but not by the ttest were estimated to be 0.335 and 0.415 in the 30 and 60 clusters, respectively, and subsequently, the corresponding steps were only 63 out of 179 steps and 201 out of 822 steps. Furthermore, the restoration of the corresponding CCM from the preprocessed PCCM failed, due to the small value in the numerical calculation. Thus, the reduction of the computational time in the operation of GGM remains as a future endeavor.
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
787
Table 2 Agreement of inferred frameworks with known regulatory systems
1st
2nd
3rd
4th
No. of regulatory pairs
940
216
69
38
30 clusters
Correctly inferred pairs P
607 (90) 0.015
132 (23) 0.145
54 (21) 0.197
32 (18) 0.233
60 clusters
Correctly inferred pairs P
607 (73) 0.002
128 (16) 0.122
52 (19) 0.111
28 (18) 0.593
The genes regulated by a transcription factor are divided into four groups, with reference to the Yeast Protein Database (http://www.proteome.com/). The ‘1st’ group indicates that the genes are directly regulated by the transcription factor, and the ‘2nd’ group indicates that the genes are regulated by the gene products in the ‘1st’ group. The ‘3rd’ and ‘4th’ groups indicate that the genes are regulated by the gene products in the preceding groups, respectively. The numbers in parentheses are the number of genes that belong to the same cluster as a transcription factor. Table 3 Comparison between the numbers of clusters with signi1cant probability by GGM and the ttest
ttest (30 clusters)
GGM
Connected Not connected
ttest (60 clusters)
0:05 6
0:05 ¿
0:05 6
0:05 ¿
129 0
127 179
313 0
636 821
6. Concluding remarks An automatic throughput for gene classi1cation and relation inference between gene classes is realized by the present system from a large amount of expression pro1les, with promising results in terms of the biological information about gene function and gene regulation. In particular, the improvement of the gene classi1cation, by setting a userde1ned threshold for VIF, allows the user to compare frameworks between diDerent clusters. Although the present system cannot infer the gene regulatory network that refers to the connections between the genes, the present system is a powerful tool for providing frameworks of regulatory relationships as a preprocessing step of a more detailed analysis for inferring a whole regulatory network between genes. Acknowledgements This study was partially supported by the Research for the Future Program of the Japan Society for the Promotion of Science. One of the authors (K. H.) was
supported in part by a GrantinAid for Scienti1c Research on Priority Areas (C) “Genome Information Science” from the Ministry of Education, Science, Sports, and Culture of Japan (grant 14015229). References [1] T. Akutsu, S. Miyano, S. Kuhara, Algorithms for inferring qualitative models of biological networks, Pac. Symp. Biocomput. 5 (2000) 290–301. [2] U. Alon, N. Barkai, D.A. Notterman, G. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA 96 (1999) 6745–6750. [3] O. Alter, P.O. Brown, D. Botstein, Singular value decomposition for genomewide expression data processing and modeling, Proc. Natl. Acad. Sci. USA 97 (2000) 10101–10106. [4] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd Edition, Wiley, New York, 1984. [5] A. BenDor, R. Shamir, Z. Yakhini, Clustering gene expression patterns, J. Comput. Biol. 6 (1999) 281–297. [6] T. Chen, H.L. He, G.M. Church, Modeling gene expression with diDerential equations, Pac. Symp. Biocomput. 4 (1999) 17–28.
788
S. Aburatani et al. / Signal Processing 83 (2003) 777 – 788
[7] J. DeRisi, V. Iyer, P. Brown, Exploring the metabolic genetic control of gene expression on a genomic scale, Science 278 (1997) 680–686. [8] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genomewide expression patterns, Proc. Natl. Acad. Sci. USA 95 (1998) 14863–14868. [9] R.J. Freund, W.J. Wilson, Regression Analysis, Academic Press, San Diego, 1998. [10] N. Friedman, M. Linial, I. Nachman, D. Pe’er, Using Bayesian networks to analyze expression data, J. Comput. Biol. 7 (2000) 601–620. [11] A.P. Gasch, P.T. Spellman, C.M. Kao, O. CarmelHarel, M.B. Eisen, G. Storz, D. Botstein, P.O. Brown, Genomic expression programs in the response of yeast cells to environmental changes, Mol. Biol. Cell 11 (2000) 4241–4257. [12] A.D. Gordon, Classi1cation, Chapman & Hall, London, 1981. [13] A.J. Hartemink, D.K. GiDord, T.S. Jaakkola, R.A. Young, Combining location and expression data for principled discovery of genetic regulatory network models, Pac. Symp. Biocomput. 7 (2002) 437– 449. [14] K. Horimoto, S. Aburatani, S. Kuhara, H. Toh, ASIAN— automatic system for inferring a network from gene expression pro1les, Res. Commun. Biochem. Cell Mol. Biol. (2003), in press. [15] K. Horimoto, H. Toh, Statistical estimation of cluster boundaries in gene expression pro1le data, Bioinformatics 17 (2001) 1143–1151. [16] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, E.L. Brown, DNA expression monitoring by hybridization to high density oligonucleotide arrays, Nat. Biotechnol. 14 (1996) 1657–1680. [17] H.W. Mewes, D. Frishman, C. Gruber, B. Geier, D. Haase, A. Kaps, K. Lemcke, G. Mannhaupt, F. PfeiDer, C. Schuller, et al., MIPS: a database for genomes and protein sequences, Nucl. Acids Res. 28 (2000) 37–40.
[18] E. Segal, B. Taskar, A. Gasch, N. Friedman, D. Koller, Rich probabilistic models for gene expression, Bioinformatics 17 (2001) S243–S252. [19] D. Shalon, S.J. Smith, P.O. Brown, A DNA microarray system for analyzing complex DNA samples using twocolor Luorescent probe hybridization, Genome Res. 6 (1996) 639–645. [20] R. Somogyi, C.A. Shiegoski, Modeling the complexity of genetic networks: understanding multigene and pleiotropic regulation, Complexity 1 (1996) 45–63. [21] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, T.R. Golub, Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic diDerentiation, Proc. Natl. Acad. Sci. USA 96 (1999) 2907–2912. [22] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church, Systematic determination of genetic network architecture, Nat. Genet. 22 (1999) 281–285. [23] H. Toh, K. Horimoto, Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling, Bioinformatics 18 (2002) 287–297. [24] H. Toh, K. Horimoto, System for automatically inferring a genetic network from expression pro1les, J. Biol. Phys. 28 (2002) 449– 464. [25] J. Vilo, A. Brazma, I. Jonassen, A. Robinson, E. Ukkonen, Mining for putative regulatory elements in the yeast genome using gene expression data, in: Proceedings of Eighth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, 2000, pp. 384 –394. [26] N. Wermuth, E. Scheidt, Fitting a covariance selection to a matrix. Algorithm AS 105, Appl. Statist. 26 (1977) 88–92. [27] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley, Chichester, 1990.