
Computers in Biology and Medicine (journal homepage: www.elsevier.com/locate/cbm)

Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples

Jinlong Shi, Zhigang Luo
School of Computer, National University of Defense Technology, Changsha 410073, Hunan, China

Article history: Received 20 November 2009; accepted 30 June 2010
Keywords: Gene expression; nonlinear dimensionality reduction; visualization; clustering analysis; cancer tissue

Abstract

Gene expression data are the representation of nonlinear interactions among genes and environmental factors. Computational analysis of these data is expected to yield knowledge of gene functions and disease mechanisms. Clustering is a classical exploratory technique for discovering similar expression patterns and functional modules. However, gene expression data usually have high dimensionality and relatively few samples, which is the main difficulty in applying clustering algorithms. Principal component analysis (PCA) is commonly used to reduce the dimensionality of the data before clustering. However, PCA estimates the similarity between expression profiles from the Euclidean distance, which cannot reveal the nonlinear connections between genes. This paper uses nonlinear dimensionality reduction (NDR) as a preprocessing strategy for feature selection and visualization, and then applies clustering algorithms to the reduced feature spaces. To estimate the effectiveness of NDR for capturing biologically relevant structures, a comparative analysis of NDR and PCA is carried out on five real cancer expression datasets. Results show that NDR can perform better than PCA in visualization and clustering analysis of complex gene expression data.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

High-density microarrays can measure genome-wide gene activities and have generated large amounts of gene expression data, which are believed to represent the complex interactions between genes and environmental factors. Many mathematical models and algorithms have been developed for the computational analysis and interpretation of these data. Classification and clustering are the two most common techniques, and they have been widely used in the analysis of gene expression data. Golub et al. [1] showed that classification of gene expression data could distinguish cancer classes of human acute leukemias. Alon et al. [2] applied clustering analysis of oligonucleotide arrays to separate tumor and normal tissues. Bittner et al. [3] studied the molecular classification of cutaneous malignant melanoma. Eisen et al. [4] applied hierarchical clustering (HC) to identify groups of co-expressed or co-regulated genes. Tavazoie et al. [5] used the K-means algorithm to uncover new regulons and their putative cis-regulatory elements. These successful applications inspired an explosion of such analyses; many new algorithms have been developed for cancer diagnosis [6], classification of tumor samples [7,8], gene selection [9], visualization of expression patterns [10], and so on.

E-mail addresses: [email protected] (J.L. Shi), [email protected] (Z.G. Luo). doi:10.1016/j.compbiomed.2010.06.007

A common problem in the computational analysis of gene expression data is the 'curse of dimensionality', which prevents most classical statistical algorithms from generating meaningful results. The problem derives from the fact that the number of genes is usually much larger than the number of samples in high-throughput experiments. To exploit these data effectively, a preprocessing step must be applied to select representative factors by dimensionality reduction. Many papers have used PCA for this purpose: Hornquist [11] used PCA to avoid overfitting their computational models, Howley [12] showed that PCA could increase the classification accuracy of high-dimensional spectral data, and Sharov [13] developed a web-based PCA tool for the analysis of microarray data. Although PCA effectively preserves most of the variance in the data, Yeung and Ruzzo [14] have shown that combining clustering algorithms with PCA does not necessarily improve, and often degrades, cluster quality. Essentially, PCA performs a linear transformation of the data based on the Euclidean distance between expression profiles; it therefore cannot characterize the nonlinear interactions between genes and environmental factors. Compared with PCA, NDR algorithms such as isometric mapping (ISOMAP) can reveal the nonlinear geometric distribution of data by embedding high-dimensional data into a low-dimensional space. Nilsson et al. [15] exploited ISOMAP to capture biologically relevant structures in microarray data. Dawson et al. [16] applied ISOMAP to three large high-density oligonucleotide microarray datasets to characterize sample phenotype clusters. Weng et al. [17] have shown that ISOMAP is


J. Shi, Z. Luo / Computers in Biology and Medicine 40 (2010) 723–732

helpful in revealing nonlinear structural knowledge. Lee et al. [18] further investigated the efficacy of ISOMAP and several other nonlinear dimensionality reduction methods in classifying gene and protein expression data. This paper aims to study the effectiveness of NDR in capturing cluster structures and in the visualization of gene expression data. ISOMAP is used to reduce the dimensionality of gene expression data, and then hierarchical clustering (HC) and K-means are applied to the reduced feature spaces. To evaluate the effectiveness of NDR for capturing cluster structures, a comparative analysis of ISOMAP and PCA is carried out on five real cancer datasets. The rest of the paper is organized as follows. Section 2 gives a brief introduction to the basic theories and algorithms of PCA and ISOMAP and discusses criteria of clustering quality. Section 3 details the experimental results, and Section 4 gives the conclusions and some necessary discussion.

2. Methods

Mathematically, microarray data can be represented as a numerical N × M matrix, which records the expression of N genes under M different samples. Usually N ≫ M holds, which makes dimensionality reduction an essential step for further analysis of expression data. This paper studies the distribution of cancer tissue samples; thus genes are regarded as different variables, and the expression levels in a sample become one observation of the experiment.

2.1. PCA

PCA is a classical statistical technique for multivariate data, which projects the main features into a low-dimensional space while preserving most of the variance. It integrates related variables into several uncorrelated new variables via linear transformations to obtain a simpler data representation. The basic theory and algorithm of PCA are described below; details can be found in [19]. Given a data matrix X of dimension N × M, rows correspond to observations and columns to variables. The principal components can be found in three computing steps:

1. To eliminate the effect of different measurement scales, a standardization process is applied to obtain the standardized matrix $\tilde{X}$ of the original

Table 1
Gene expression datasets of human tumors.

Dataset       | Simple description                                                   | Samples | Genes | Reference
SRBCT         | Four distinct diagnostic categories of small, round blue-cell tumors | 83      | 2308  | Khan et al. [26]
DLBCL         | Diffuse large B-cell lymphomas and follicular lymphomas              | 77      | 6817  | Shipp et al. [27]
Nine tumors   | Cell line chemosensitivity of nine categories of human tumors        | 60      | 6817  | Staunton et al. [28]
Brain tumors  | Five categories of embryonal tumors of the central nervous system    | 99      | 5920  | Pomeroy et al. [29]
Leukemia      | Human acute leukemias consisting of AML and ALL (ALL-T, ALL-B)       | 72      | 5327  | Golub et al. [1]

[Fig. 1 shows four panels: 2 PCs for PCA analysis, 3 PCs for PCA analysis, 2-dimensional embedding for ISOMAP analysis, and 3-dimensional embedding for ISOMAP analysis; sample classes EWS, RMS, BL, NB.]

Fig. 1. Visualization of the SRBCT dataset using PCA and ISOMAP. As can be seen, ISOMAP generates a better visualization than PCA; three of the four subtypes, RMS, BL and NB, are distinctly grouped together.


matrix X:
$$\tilde{X} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_M) = \left( \frac{x_1 - \bar{x}_1}{\sqrt{s_{11}}}, \frac{x_2 - \bar{x}_2}{\sqrt{s_{22}}}, \ldots, \frac{x_M - \bar{x}_M}{\sqrt{s_{MM}}} \right)$$
Here $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M$ and $s_{11}, s_{22}, \ldots, s_{MM}$ are, respectively, the mean values and variances of the corresponding variable vectors.
2. Compute the covariance matrix of $\tilde{X}$, then perform a spectral decomposition to obtain the eigenvalues and their corresponding eigenvectors:
$$C = \tilde{X}' \tilde{X} = U \Lambda U'$$
Here $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$, and $U = (u_1, u_2, \ldots, u_M)$; $\lambda_i$ and $u_i$ are, respectively, the $i$th eigenvalue and corresponding eigenvector of the covariance matrix $C$.
3. Compute the cumulative energy content and determine the number of principal components from a preset threshold. Supposing this number to be $K$, the $i$th principal component can be computed as $\tilde{X} u_i$, and the reduced-dimensional subspace is $\tilde{X} U_K$ of dimension $N \times K$.

From the last step, it can be seen that each principal component is simply a linear combination of the raw variables.

2.2. ISOMAP

ISOMAP [20] has been one of the most widely used nonlinear dimensionality reduction algorithms in recent years. By replacing the Euclidean distance with the geodesic distance, which can effectively


characterize the global geometric structure of the data, ISOMAP extends standard multidimensional scaling (MDS) to deal with nonlinear relationships among data points, based on manifold theory. Since it preserves the intrinsic relative relationships between data points, ISOMAP is able to find the essential low-dimensional embedding manifold and to represent the nonlinear distribution of the dataset in a low-dimensional space, which is convenient for visualization and further analysis. The key to ISOMAP is determining the geodesic distance between data points. Given a data matrix X of dimension N × M, rows can be regarded as coordinate vectors of data points, and approximate geodesic distances are then calculated to represent their relative relationships. In Tenenbaum's [20] original paper, the geodesic distance is obtained by treating neighboring points and faraway points differently. Tenenbaum's algorithm can be summed up in three steps, each of which is briefly described in the following. More details of the operating scheme are shown in Algorithm 1.

1. Construct the neighborhood graph G. By defining the neighboring points of each point as its k nearest neighbors, or as the points within a certain Euclidean distance, the topological relationships can be well defined. Further, connecting neighboring points and setting each edge length to the corresponding Euclidean distance yields a weighted graph, the neighborhood graph G.
2. Compute the geodesic distance for each pair of points. Given two points $P_i$ and $P_j$, the geodesic distance $d_G(P_i, P_j)$ is computed as the shortest path on graph G. For neighboring points, $d_G(P_i, P_j)$ equals the Euclidean distance $d(P_i, P_j)$, while for faraway points the shortest path can be determined via

[Fig. 2 shows four panels: 2 PCs for PCA analysis, 3 PCs for PCA analysis, 2-dimensional embedding for ISOMAP analysis, and 3-dimensional embedding for ISOMAP analysis; sample classes DLBCL and FL.]

Fig. 2. Visualization of the DLBCL dataset using PCA and ISOMAP. It can be seen that ISOMAP performs better than PCA.



classical graph algorithms such as Dijkstra's algorithm [21]. Further, the geodesic distance matrix can be obtained: $D_G = [d_G^2(P_i, P_j)]_{i,j=1}^{N}$.
3. Construct the d-dimensional embedding. The low-dimensional embedding manifold is obtained by applying the MDS algorithm [22] to the geodesic distance matrix $D_G$. The appropriate number of dimensions can be selected according to the requirements at hand.

Algorithm 1. ISOMAP algorithm
Input: Gene expression matrix of N × M, neighborhood parameter k, embedding dimension K.
Output: Reduced data matrix of N × K.
1: Assign k neighbors to each data point
2: Connect each point to its neighbors to construct the neighborhood graph G; set the edge length of $P_i P_j$ to the Euclidean distance $d(P_i, P_j)$
3: Initialize geodesic distances. For any two neighboring points, $d_G(P_i, P_j) = d(P_i, P_j)$; otherwise, set $d_G(P_i, P_j) = \infty$
4: Apply Dijkstra's algorithm [21] to find the shortest path $P_u, \ldots, P_k, \ldots, P_v$ between each pair of points $P_u$, $P_v$
5: Sum the edge lengths along the shortest path to obtain the geodesic distance $d_G(P_u, P_v)$
6: Compute the geodesic distance matrix as $D_G = [d_G^2(P_u, P_v)]_{u,v=1}^{N}$
7: Apply the MDS algorithm [22] to $D_G$ to obtain the K-dimensional embedding.

2.3. Comparison of PCA and ISOMAP

PCA and ISOMAP are two of the most widely used algorithms for dimensionality reduction. The former is a representative linear technique, and the latter reflects the main strategies of NDR. There are many differences between them:

PCA uses the Euclidean distance to measure the similarity between gene expression profiles, while ISOMAP is based on the geodesic distance. From the study of differential geometry [23], the geodesic distance is the generalization of Euclidean distance to high-dimensional manifold spaces, which can better characterize the relative locations of data points. Gene expression data are the quantitative representation of cellular inner states and reflect the interactions between genes, which are obviously high-dimensional, complex nonlinear processes. Compared with the Euclidean distance, the geodesic distance can better reveal the direct and indirect interactions between genes.

PCA extracts the main orientations along which the data cloud is most extended through a linear transformation, and finds several PCs, orthogonal to each other, which are linear combinations of the original variables [24]. ISOMAP, by contrast, tries to find a low-dimensional embedding space by applying MDS to the geodesic distance matrix, with all the data points scattered on a manifold surface. With the distance between any two points unchanged, the transformation preserves the global geometric structure of the gene expression data [15], which promises good cluster structures in the reduced feature space.
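The geodesic-distance computation at the heart of ISOMAP (steps 1 and 2 of Algorithm 1) can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the authors' implementation: the helper names `k_nearest_graph` and `geodesic_distances` are ours, and Floyd-Warshall is used in place of Dijkstra's algorithm for brevity (the resulting shortest-path distances are the same).

```python
import math

def k_nearest_graph(points, k):
    """Weighted adjacency matrix: each point is linked to its k nearest
    neighbours by Euclidean distance (edges are made symmetric)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    adj = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        order = sorted(range(n), key=lambda j: dist[i][j])
        for j in order[1:k + 1]:          # order[0] is the point itself
            adj[i][j] = adj[j][i] = dist[i][j]
        adj[i][i] = 0.0
    return adj

def geodesic_distances(adj):
    """Shortest-path (geodesic) distances on the neighbourhood graph via
    Floyd-Warshall; O(n^3), fine for small sample counts. Points in
    disconnected components stay at math.inf."""
    n = len(adj)
    d = [row[:] for row in adj]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d
```

Per step 6 of Algorithm 1, the entries would then be squared before handing the matrix to MDS.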

[Fig. 3 shows four panels: 2 PCs for PCA analysis, 3 PCs for PCA analysis, 2-dimensional embedding for ISOMAP analysis, and 3-dimensional embedding for ISOMAP analysis; sample classes NSCLC, Colon, Breast, Ovary, Leukemia, Renal, Melanoma, Prostate, CNS.]

Fig. 3. Visualization of the nine tumors dataset using PCA and ISOMAP. Compared with PCA, ISOMAP distinguishes the melanoma subtype.


From the viewpoint of computational complexity, PCA depends on the spectral decomposition of the empirical covariance matrix. For visualization and clustering of tissue samples, PCA has to decompose a covariance matrix of very large dimension. Our numerical experiments also show that, for the same dataset, PCA runs much slower than ISOMAP.

PCA is a non-parametric, deterministic procedure, while ISOMAP has two parameters: the neighborhood size and the number of embedding dimensions. In particular, a well-chosen number of neighbors can produce striking results [20].
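As a concrete illustration of the three-step PCA procedure of Section 2.1, a minimal pure-Python sketch follows. It is not the authors' code: the helper names are invented, and the leading eigenvectors are extracted by power iteration with deflation purely to keep the sketch self-contained; in practice a standard eigensolver would be used.

```python
import math, random

def standardize(X):
    """Step 1: centre each column and scale by its standard deviation."""
    n, m = len(X), len(X[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (X[i][j] - mean) / sd
    return out

def leading_eigvecs(C, k, iters=500, seed=0):
    """Steps 2-3: top-k eigenvectors of the symmetric matrix C by power
    iteration with deflation (illustrative stand-in for a full
    spectral decomposition)."""
    rng = random.Random(seed)
    m = len(C)
    C = [row[:] for row in C]           # work on a copy; deflation mutates it
    vecs = []
    for _ in range(k):
        v = [rng.random() for _ in range(m)]
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(m)) for i in range(m))
        vecs.append(v)
        for i in range(m):              # deflate: remove the found component
            for j in range(m):
                C[i][j] -= lam * v[i] * v[j]
    return vecs

def pca_scores(X, k):
    """Project the standardized data onto the first k principal components."""
    Z = standardize(X)
    n, m = len(Z), len(Z[0])
    C = [[sum(Z[t][i] * Z[t][j] for t in range(n)) / n for j in range(m)]
         for i in range(m)]
    U = leading_eigvecs(C, k)
    return [[sum(row[j] * u[j] for j in range(m)) for u in U] for row in Z]
```

Each score vector is a linear combination of the original (standardized) variables, as noted at the end of Section 2.1; the sign of each component is arbitrary, as with any eigendecomposition.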

2.4. Clustering algorithms

To evaluate the effectiveness of NDR for capturing cluster structures, two classical clustering algorithms, HC and K-means, are applied to the reduced feature spaces. These two algorithms are representatives of two widely used clustering approaches. HC uses agglomerative or divisive strategies and divides the data into a sequence of nested partitions, where partitions at one level are joined as clusters at the next level. The number of clusters can be read off directly at a suitable level, according to requirements. K-means, in contrast, is one of the most widely used center-based clustering algorithms; it uses a partitioning strategy to assign objects to a fixed number of clusters. The algorithm regards the data vectors as a point set in a high-dimensional space. Given the input number of clusters, K-means randomly selects a centroid for each cluster and allocates each data point to the cluster whose centroid is nearest. After the necessary optimization steps, K-means can generate a good clustering solution.

2.5. Agreement between two clustering results

To evaluate the quality of the assignments produced by a clustering algorithm, some evaluation criterion must be used. A common statistic is the Rand index [25]. Given two partitions C and P of the same dataset S, $C = \{C_1, C_2, \ldots, C_m\}$ and $P = \{P_1, P_2, \ldots, P_n\}$, such that $\bigcup_{i=1}^{m} C_i = S = \bigcup_{j=1}^{n} P_j$ and $C_i \cap C_j = P_i \cap P_j = \emptyset$ for $i \neq j$, $1 \le i \le m$ and $1 \le j \le n$. Four statistics must be determined in order to calculate the Rand index. Let a be the number of pairs of objects within the same cluster in both partitions, d the number of pairs of objects in different clusters in both partitions, b the number of pairs of objects within the same cluster in partition C but in different clusters in partition P, and c the number of pairs in different clusters in partition C but within the same cluster in partition P. The Rand index can then be computed as $(a + d)/(a + b + c + d)$, which indicates the fraction of agreement between the two partitions. In this paper, partitions C and P represent, respectively, the external criteria and the clustering results.
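The Rand index just defined can be computed directly from the definition. The sketch below is illustrative (the function name `rand_index` is ours); `labels_c` and `labels_p` give the cluster label of each object under partitions C and P respectively.

```python
from itertools import combinations

def rand_index(labels_c, labels_p):
    """Rand index between two partitions of the same objects:
    (a + d) / (a + b + c + d) over all object pairs, where a counts
    pairs grouped together in both partitions and d counts pairs
    separated in both."""
    agree = 0
    pairs = list(combinations(range(len(labels_c)), 2))
    for i, j in pairs:
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c == same_p:    # covers both a (True, True) and d (False, False)
            agree += 1
    return agree / len(pairs)
```

A value of 1.0 means the two partitions agree on every pair; identical labelings under a renaming of the cluster ids still score 1.0, since only co-membership matters.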

3. Experimental results

3.1. Datasets

In this section, five real tumor datasets are used in several numerical experiments; see Table 1. These datasets have been widely used as benchmarks in the computational analysis of microarray

[Fig. 4 shows four panels: 2 PCs for PCA analysis, 3 PCs for PCA analysis, 2-dimensional embedding for ISOMAP analysis, and 3-dimensional embedding for ISOMAP analysis; sample classes MED, MAL, AT/RT, Normal, PNET.]

Fig. 4. Visualization of the brain tumors dataset using PCA and ISOMAP. ISOMAP clearly separates normal samples from tumors.



data. Statnikov et al. [8] evaluated multicategory classification methods for gene expression-based cancer diagnosis. Liu et al. [10] tested the performance of nonnegative matrix factorization. Lee et al. [18] investigated the efficacy of NDR for classification of gene expression data. After filtering of missing values and standardization, these datasets are used to evaluate the effectiveness of NDR for capturing cluster structures. For convenience of comparison and evaluation, the datasets are visualized and preprocessed by ISOMAP and PCA separately; then the HC and K-means clustering algorithms are applied to the reduced feature spaces.

[Fig. 5 shows four panels: 2 PCs for PCA analysis, 3 PCs for PCA analysis, 2-dimensional embedding for ISOMAP analysis, and 3-dimensional embedding for ISOMAP analysis; sample classes ALL-B, ALL-T, AML.]

Fig. 5. Visualization of the leukemia dataset using PCA and ISOMAP. AML and ALL are better separated by ISOMAP than by PCA, and a rough distinction between ALL-T and ALL-B is also revealed.

[Fig. 6 shows, for the SRBCT dataset, the Rand index curves (HC-PCA, Kmeans-PCA, HC-ISOMAP, Kmeans-ISOMAP, HC-Raw, Kmeans-Raw) and the residual variance curves for the PCA and ISOMAP analyses.]

Fig. 6. Rand index and variance curves for PCA and ISOMAP analysis of the SRBCT dataset. The left panel displays the Rand index under different numbers of PCs/dimensions for HC and K-means clustering of the reduced feature spaces and the raw data; the right panels show the residual variance of the PCA and ISOMAP analyses.


The small, round blue-cell tumors (SRBCT) dataset is a compendium of four tumor types with similar appearance on routine histology, comprising neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL) and the Ewing family of tumors (EWS). Khan et al. [26] exploited these data to train and test the performance of an artificial neural network (ANN) classifier. Diffuse large B-cell lymphoma (DLBCL) is a common lymphoid malignancy in adults. To construct clinical outcome models, Shipp et al. [27] collected gene expression levels from patients under different chemotherapy regimens and classified these patients into two categories, DLBCL and follicular lymphoma (FL). The nine tumors dataset is a collection of gene expression levels under different drug treatments, measured on a panel of 60 human cancer cell lines via oligonucleotide arrays. Staunton et al. [28] used this dataset to predict the chemosensitivity of chemical compounds and divided the tissues into nine diagnostic categories. The brain tumors dataset represents gene expression levels of embryonal tumors of the central nervous system. Pomeroy et al. [29] built a classification system to distinguish the different molecular characters of five tumor categories: medulloblastoma (MED), malignant glioma (MAL), atypical teratoid/rhabdoid tumors (AT/RT), primitive neuroectodermal tumors (PNET) and normal cerebellum. The leukemia dataset is a well-known benchmark for classification and clustering of cancer expression data. It contains samples of acute myelogenous leukemia (AML) and acute lymphoblastic leukemia (ALL); the latter can be further divided into T and B subtypes. Based on this dataset, Golub et al. [1] demonstrated the

[Fig. 7 shows, for the DLBCL dataset, the Rand index curves (HC-PCA, Kmeans-PCA, HC-ISOMAP, Kmeans-ISOMAP, HC-Raw, Kmeans-Raw) and the residual variance curves for the PCA and ISOMAP analyses.]

Fig. 7. Rand index and variance curves for PCA and ISOMAP analysis of the DLBCL dataset. The left panel displays the Rand index under different numbers of PCs/dimensions for HC and K-means clustering of the reduced feature spaces and the raw data; the right panels show the residual variance of the PCA and ISOMAP analyses.

[Fig. 8 shows, for the nine tumors dataset, the Rand index curves (HC-PCA, Kmeans-PCA, HC-ISOMAP, Kmeans-ISOMAP, HC-Raw, Kmeans-Raw) and the residual variance curves for the PCA and ISOMAP analyses.]

Fig. 8. Rand index and variance curves for PCA and ISOMAP analysis of the nine tumors dataset. The left panel displays the Rand index under different numbers of PCs/dimensions for HC and K-means clustering of the reduced feature spaces and the raw data; the right panels show the residual variance of the PCA and ISOMAP analyses.



feasibility of cancer classification from gene expression data and the prediction of new cancer subtypes.

3.2. Visualization

To evaluate the effectiveness of NDR for capturing cluster structures in gene expression data, ISOMAP and PCA are applied separately to the five datasets for visualization analysis. The high-dimensional gene expression data are reduced to 2- and 3-dimensional spaces, which clearly reveal the distribution of the sample points. Fig. 1 shows the visualization of the SRBCT dataset. Compared to PCA, ISOMAP produces a better visualization: all four subtypes are partitioned into different clusters, and BL and NB are distinctly grouped together. Figs. 2-5 display the visualization of the other four datasets; as can be seen, ISOMAP can partition different subtypes and performs much better than PCA.

3.3. Clustering the reduced feature spaces


Further, HC and K-means are applied to cluster the reduced feature spaces with different dimensions generated by PCA or ISOMAP. The values of reduced dimensions are set to 2–30 for both PCA and ISOMAP, and the neighboring parameter of ISOMAP is set as k¼3. Rand index is computed for each clustering, and values for different dimensions are plotted as curves. For the convenient of comparison, rand indexes of each dataset are plotted in one ﬁgure for different analysis. HC-PCA and

0

5

0

5

10

15 20 25 # of different PCs Variance curve for ISOMAP analysis

30

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

30

10 15 20 # of different dimensions

25

30

Fig. 9. Rand index and variance curves for PCA and ISOMAP analysis of brain tumors dataset. The left displays the values of rand index under different PCs/dimensions for HC and K-means clustering of the reduced feature spaces and raw data, the right shows residual variance of PCA and ISOMAP analysis.

10 Residual variance

LEUKEMIA dataset 0.6

HC-PCA Kmeans-PCA HC-ISOMAP Kmeans-ISOMAP HC-Raw Kmeans-Raw

0.55 0.5

8 6 4 2 0 0

0.4

5

10 15 20 25 # of different PCs Variance curve for ISOMAP analysis

0.35 0.3

Residual variance

Rand Index

0.45

Variance curve for PCA analysis

x 108

0.25 0.2 0.15 0.1 0

5

10 15 20 # of different PCs/dimensions

25

30

30

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0

5

10 15 20 # of different dimensions

25

30

Fig. 10. Rand index and variance curves for PCA and ISOMAP analysis of leukemia dataset. The left displays the values of rand index under different PCs/dimensions for HC and K-means clustering of the reduced feature spaces and raw data, the right shows residual variance of PCA and ISOMAP analysis.

J. Shi, Z. Luo / Computers in Biology and Medicine 40 (2010) 723–732

Table 2 Rand index for raw data, PCA-reduced and ISOMAP-reduced feature spaces. Dataset

731

Conﬂict of interest statement None declared.

Average rand index Hierarchical clustering

K-means

ISOMAP

PCA

Raw

ISOMAP

PCA

Raw

0.098 0.429 0.478 0.667 0.262

0.064 0.402 0.367 0.617 0.204

0.067 0.406 0.351 0.617 0.253

0.079 0.202 0.467 0.414 0.191

0.031 0.167 0.206 0.354 0.111

0.033 0.134 0.252 0.343 0.112

Acknowledgments Nine tumors Brain tumors Leukemia DLBCL SRBCT

Kmeans-PCA indicate HC and K-means clustering of PCA-reduced feature spaces, HC-ISOMAP and Kmeans-ISOMAP denote HC and K-means clustering of ISOMAP-reduced feature spaces, and HC-Raw, Kmeans-Raw indicate the clustering to raw data with different clustering algorithms separately. Figs. 6–10 are the respective results for SRBCT, DLBCL, nine tumors, brain tumors and leukemia datasets. On each ﬁgure, the left plot displays the values of rand index, and the right plots show the residual variances of PCA and ISOMAP analysis. Table 2 presents the average values of rand index for all the clustering results with different dimensions. As can be seen from these table and ﬁgures, ISOMAP performs much better than PCA, either based on HC clustering or K-means clustering.

4. Conclusions and discussions ‘Large variables and small samples’ is always a prominent character for gene expression data, which makes dimensionality reduction an essential step for further analysis. This paper compares the performance of PCA and ISOMAP for visualization and clustering of cancer tissue samples. Five real cancer tissue expression datasets are used to evaluate these two approaches of dimensionality reduction. As we can see, ISOMAP produces much better visualization of expression data than PCA, and reveals different subcategories of cancers. Also, further clustering analysis to the reduced feature spaces and raw data show that dimensionality reduction via ISOMAP can preserve the cluster structures of gene expression data, preprocessing with ISOMAP can improve the performance of clustering analysis of gene expression data. Genes interact with each other and form complex regulatory networks and as a consequence, the functional relations between genes are full of nonlinearity, so linear statistical techniques such as PCA cannot generate a good model. Even some variants can perform better than primal PCA, such as generalized PCA [30,31] which can segment data points onto several linear hyperplanes. However, the intrinsic linearity prevents PCA from extracting nonlinear patterns within gene expression data. Comparatively, ISOMAP is a more attractive approach. Supposing all the gene expression states are distributed on a high-dimensional manifold surface, ISOMAP can embed the high-dimensional data into a lowdimensional manifold space with nonlinear geometric property of data preserved. The visualization and clustering analysis of these ﬁve real cancer datasets have shown that ISOMAP performs much better than PCA in revealing the category structures of cancer tissue samples. 
Although ISOMAP is sensitive to data noises which are prevailing in microarray experiments, the subsequent clustering of reduced feature spaces does not signiﬁcantly improve the clustering accuracy for cancer tissue samples, it has shown the beneﬁt and importance of taking nonlinearity into account, which will enlighten the following studies to pay more attentions to nonlinear properties of gene expression data.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 60673018) and the National High-Tech Development 863 Program of China (Grant no. 2007AA01Z106).

References

[1] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, E. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537.
[2] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences of the United States of America 96 (6) (1999) 6745–6750.
[3] M. Bittner, P. Meltzer, et al., Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature 406 (6795) (2000) 536–540.
[4] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proceedings of the National Academy of Sciences of the United States of America 95 (12) (1998) 14863–14868.
[5] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church, Systematic determination of genetic network architecture, Nature Genetics 22 (7) (1999) 281–285.
[6] A.M. Bagirov, B. Ferguson, S. Ivkovic, G. Saunders, J. Yearwood, New algorithms for multi-class cancer diagnosis using tumor gene expression signatures, Bioinformatics 19 (14) (2003) 1800–1807.
[7] R. Alexandridis, S. Lin, M. Irwin, Class discovery and classification of tumor samples using mixture modeling of gene expression data—a unified approach, Bioinformatics 20 (16) (2004) 2545–2552.
[8] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, S. Levy, A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics 21 (5) (2005) 631–643.
[9] C. Zheng, D.S. Huang, et al., Tumor clustering using nonnegative matrix factorization with gene selection, IEEE Transactions on Information Technology in Biomedicine 13 (4) (2009) 599–607.
[10] W. Liu, K. Yuan, D. Ye, Reducing microarray data via nonnegative matrix factorization for visualization and clustering analysis, Journal of Biomedical Informatics 41 (3) (2008) 602–606.
[11] M. Hornquist, J. Hertz, M. Wahde, Effective dimensionality for principal component analysis of time series expression data, BioSystems 65 (2–3) (2002) 147–156.
[12] T. Howley, M.G. Madden, M.-L. O'Connell, A.G. Ryder, The effect of principal component analysis on machine learning accuracy with high-dimensional spectral data, Knowledge-Based Systems 19 (2006) 363–370.
[13] A.A. Sharov, D.B. Dudekula, M.S.H. Ko, A web-based tool for principal component and significance analysis of microarray data, Bioinformatics 21 (10) (2005) 2548–2549.
[14] K.Y. Yeung, W.L. Ruzzo, Principal component analysis for clustering gene expression data, Bioinformatics 17 (9) (2001) 763–774.
[15] J. Nilsson, T. Fioretos, M. Hoglund, M. Fontes, Approximate geodesic distances reveal biologically relevant structures in microarray data, Bioinformatics 20 (6) (2004) 1176–1183.
[16] K. Dawson, R.L. Rodriguez, W. Malyj, Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using isomap, a nonlinear algorithm, BMC Bioinformatics 6 (2005) 195–211.
[17] S. Weng, C. Zhang, Z. Lin, X. Zhang, Mining the structural knowledge of high-dimensional medical data using isomap, Medical & Biological Engineering & Computing 43 (3) (2005) 410–412.
[18] G. Lee, C. Rodriguez, A. Madabhushi, Investigating the efficacy of nonlinear dimensionality reduction schemes in classifying gene and protein expression studies, IEEE/ACM Transactions on Computational Biology and Bioinformatics 5 (3) (2008) 368–384.
[19] I. Jolliffe, Principal Component Analysis, second ed., Springer, 2002.
[20] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[21] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, The MIT Press, 2001.
[22] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice-Hall, 2007.
[23] S.S. Chern, W.H. Chen, K.S. Lam, Lectures on Differential Geometry, World Scientific, Beijing, 2006.
[24] I.K. Fodor, A survey of dimension reduction techniques, Technical Report, U.S. Department of Energy, 2002.
[25] L. Hubert, P. Arabie, Comparing partitions, Journal of Classification 1 (2) (1985) 193–218.


[26] J. Khan, J.S. Wei, M. Ringnér, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, P.S. Meltzer, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine 7 (6) (2001) 673–679.
[27] M. Shipp, K. Ross, P. Tamayo, A. Weng, J. Kutok, R. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. Pinkus, et al., Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning, Nature Medicine 8 (1) (2002) 68–74.
[28] J. Staunton, D. Slonim, H. Coller, P. Tamayo, M. Angelo, J. Park, U. Scherf, J. Lee, W. Reinhold, J. Weinstein, et al., Chemosensitivity prediction by transcriptional profiling, Proceedings of the National Academy of Sciences of the United States of America 98 (19) (2001) 10787–10792.
[29] S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. Sturla, M. Angelo, M. McLaughlin, J. Kim, L. Goumnerova, P. Black, C. Lau, Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature 415 (6870) (2002) 438–442.
[30] R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis (GPCA), IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (12) (2005) 1945–1959.
[31] Y. Ma, H. Derksen, W. Hong, J. Wright, Segmentation of multivariate mixed data via lossy data coding and compression, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (9) (2007) 1–17.

Jinlong Shi was born in Henan province, China, in 1980. He received his B.Sc. and M.Sc. degrees from the National University of Defense Technology in 2002 and 2004, respectively. He is now a Ph.D. candidate in the School of Computer at the National University of Defense Technology. His research interests include bioinformatics, manifold learning and data mining.

Zhigang Luo was born in Hunan province, China, in 1962. He received his Ph.D. degree from the School of Computer at the National University of Defense Technology, where he is now a professor and supervisor of Ph.D. candidates. His research interests include parallel computing and bioinformatics.