- Email: [email protected]

S1532-0464(19)30277-1 https://doi.org/10.1016/j.jbi.2019.103357 YJBIN 103357

To appear in:

Journal of Biomedical Informatics

Received Date: Revised Date: Accepted Date:

29 May 2019 27 November 2019 12 December 2019

Please cite this article as: Rafique, O., Mir, A.H., A Topological Approach for Cancer Subtyping from Gene Expression Data, Journal of Biomedical Informatics (2019), doi: https://doi.org/10.1016/j.jbi.2019.103357

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier Inc.

A Topological Approach for Cancer Subtyping from Gene Expression Data Omar Rafique1 , A.H.Mir2 1,2

Machine Learning Lab, National Institute of Technology, Srinagar, JK

Abstract Background: Gene expression data contains key information which can be used for subtyping cancer patients. However, computational methods suffer from ’curse of dimensionality’ due to very high dimensionality of omics data and therefore are not able to clearly distinguish between the discovered subtypes in terms of separation of survival plots. Methods: To address this we propose a framework based on Topological Mapper algorithm. The novelty of this work is that we suggest a method for defining the filter function on which the mapper algorithm heavily depends. Survival analysis of the discovered cancer subtypes is carried out and evaluated in terms of minimum pairwise separation between the Kaplan-Meier plots. Furthermore, we present a method to measure the separation between the discovered subtypes based on hazard ratios. Results: Five cancer genomics datasets obtained from The Cancer Genome Atlas portal have been used for comparisons with Robust Sparse Correlation-Otrimle (RSC-Otrimle) and Similarity Network Fusion (SNF). Comparisons show that the minimum pairwise life expectancy difference (in days) between the discovered subtypes for lung, colon, breast, glioblastoma and kidney cancers is 107, 204, 20, 88 and 425 days, respectively, for the proposed methodology whereas it is only 69, 43, 6, 61 and 282 days for RSC-Otrimle and 9, 95, 18, 60 and 148 days for SNF. Hazard ratio analysis also shows that the proposed methodology performs better in four of the five datasets. A visual inspection of Kaplan-Meier plots reveals that the proposed methodology achieves lesser overlap in Kaplan-Meier plots especially for lung, breast and kidney cases. Furthermore, relevant genetic pathways for each subtype have been obtained and pathways which can be possible targets for treatment have been discussed. Conclusion: The significance of this work lies in individualized understanding of cancer from patient to patient which is the backbone of precision medicine. Keywords: Topological Mapper; t-distributed Stochastic Neighbour Embedding; Kaplan-Meier survival plots; Hazard ratio; Pathway analysis.

1 Corresponding

Author: [email protected] , omarrafique [email protected]

Preprint submitted to Journal of LATEX Templates

December 16, 2019

Abstract Background: Gene expression data contains key information which can be used for subtyping cancer patients. However, computational methods suffer from ’curse of dimensionality’ due to very high dimensionality of omics data and therefore are not able to clearly distinguish between the discovered subtypes in terms of separation of survival plots. Methods: To address this we propose a framework based on Topological Mapper algorithm. The novelty of this work is that we suggest a method for defining the filter function on which the mapper algorithm heavily depends. Survival analysis of the discovered cancer subtypes is carried out and evaluated in terms of minimum pairwise separation between the Kaplan-Meier plots. Furthermore, we present a method to measure the separation between the discovered subtypes based on hazard ratios. Results: Five cancer genomics datasets obtained from The Cancer Genome Atlas portal have been used for comparisons with Robust Sparse Correlation-Otrimle (RSC-Otrimle) algorithm and Similarity Network Fusion(SNF). Comparisons show that the minimum pairwise life expectancy difference (in days) between the discovered subtypes for lung, colon, breast, glioblastoma and kidney cancers is 107, 204, 20, 88 and 425 days, respectively, for the proposed methodology whereas it is only 69, 43, 6, 61 and 282 days for RSC-Otrimle and 9, 95, 18, 60 and 148 days for SNF. Hazard ratio analysis also shows that the proposed methodology performs better in four of the five datasets. A visual inspection of Kaplan-Meier plots reveals that the proposed methodology achieves lesser overlap in Kaplan-Meier plots especially for lung, breast and kidney cases. Furthermore, relevant genetic pathways for each subtype have been obtained and pathways which can be possible targets for treatment have been discussed. Conclusion: The significance of this work lies in individualized understanding of cancer from patient to patient which is the backbone of Precision Medicine. Keywords: Topological Mapper; t-distributed Stochastic Neighbour Embedding; Kaplan-Meier survival plots; Hazard ratio; Pathway analysis.

1. Introduction One of the aims of Precision Medicine is to achieve better prognosis of chronic diseases such as cancer, diabetes, epilepsy and obesity by prescribing treatment based on individualized understanding of disease from patient to patient [1, 2]. A group of patients suffering from seemingly the same disease is further divided into subtypes [2] where patients sharing some common attributes belong to the same subtype. With the advent of high throughput technologies, extensive and large molecular cancer databases have been created such as the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas Preprint submitted to Journal of LATEX Templates

November 26, 2019

(TCGA). Microarray techniques are used to measure gene expression which is represented as a matrix, in which each element gives the expression level of a gene for a particular patient [3, 4]. The genetic data so formed contains key information which can be used to understand the variability of disease from patient to patient and computational tools have been able to extract this key information for subtyping cancer patients [5, 6, 7]. These days, statistical and machine learning approaches based on dimensionality reduction [8, 9] matrix factorization [10], k-means, hierarchical clustering and spectral clustering [11, 12, 13, 14, 15] have been used to identify subtypes of patients for improved precision oncology. Furthermore, attempts have been made to integrate multiple omic data such as DNA genome sequence, RNA expression and DNA methylation for patient subtype discovery [15, 16, 17]. Similarity network fusion (SNF) [17] is one such method that combines multiple omic datasets by constructing networks of patients and then SNF iteratively updates each of the networks with information from the other networks, making them more similar resulting in a final singular dataset which represents the combined information of all the constituent datasets. However, gene expression data is collected by measuring thousands of gene expressions from a relatively small number of patients [18, 19]. This imbalance in number of genes and patients leads to computational issues, some of which are explained below are listed below: 1) Loss of qualitative significance of a neighbourhood: With increase in the number of dimensions, the contrast between the distances of the farthest point and the nearest point (from another point) becomes qualitatively indistinguishable(1) [20]. Given any three points x1 , x2 , x3 where d(x1 , x2 ) d(x1 , x3 ) then lim

dim→∞

d(x1 , x2 ) − d(x1 , x3 ) →0 d(x1 , x3 )

(1)

Where d() is a distance metric and dim is dimensionality. Common distance metrics like Minkowski and Euclidean metrics are not good distance measures in high dimensional scenarios [20, 21]. Due to these issues the notion of a neighbourhood looses its qualitative significance [22]. Moreover, Clustering algorithms based on the aforementioned proximity measures are not guaranteed to work well with high dimensional data.

2) Distortion of spectral components of covariance matrix: Consider a matrix Xn,p where p genes (columns) are measured for n samples (rows). Then covariance between j th and k th genes is given by: n

sjk =

1 X (xij − x ¯j )(xik − x ¯k ) n − 1 i=1

(2)

Here x ¯j is the mean of j th column. The sample covariance matrix S = [sjk ], is a pxp matrix, composed of all sjk , as its entries. With p n in gene expression data the spectral components of S cannot be accurately determined due to lack of enough samples required to predict the behaviour of the covariance matrix S. Spurious spectral components emerge and biases are introduced into the spectral components [7, 23]. As the aforementioned issues arise due to high dimensionality, dimensionality reduction techniques like Principal Component Analysis (PCA) are used prior to any kind of clustering algorithms. PCA is a statistical 2

technique that assumes that maximum information is in the direction of maximum variance of data. These directions of large variances are estimated from the eigenvectors of the covariance matrix of data and all those directions which do not contain significant information are not retained [24, 25, 26]. PCA has been shown to successfully extract multivariate patterns in gene expression from brain tissue samples[27]. However, due to the functional relationship between genes being non-linear (leading to a non-linear shape of genetic data manifold), PCA cannot guarantee extraction of clinically and statistically relevant patterns form genetic data due to its linearity assumption [28]. Moreover, distortion of spectral components due to high dimensionality influences the outcome of PCA, therefore, other manifold learning techniques such as Locally Linear Embedding (LLE) [29] and Isometric Mapping (ISOMAP) [30] have been used to better understand the structure of gene expression data. For instance, LLE was used for microarray classification by removing redundant genes [31] and ISOMAP was used to extract information about intrinsic dimensionality of gene expression data thus revealing structural knowledge [32]. Once dimensionality is reduced, gene expression data is analyzed using clustering algorithms such as k-means and hierarchical clustering. [11] used an intelligent k-means clustering on gene expression which does not require prior information about the number of clusters and [12] proposed a genetic weighted kmeans algorithm for clustering of gene expression data. Authors in [13] have proposed an improved k-means for cancer subtyping based on gene expression data. Similarly, the average-linkage hierarchical clustering algorithm was successfully used to cluster similarity matrices of gene expression data [14]. All these methods use some form of a similarity measure are sensitive to noise and do not guarantee a good separability of the discovered subtypes in terms of Kaplan-Meier plots and hazard ratios. Recently, this approach of dimensionality reduction followed by clustering for cancer subtyping was adapted by [33] in an attempt to discover clinically and statistically significant and robust cancer subtypes having increased separability between the cancer subtypes measured in terms of separation among KaplanMeier survival plots [34]. They used the Robust and Sparse Correlation matrix estimator method (RSC) [7] to estimate the correlation matrix between samples of high dimensional data. The RSC has been shown to mitigate the effects of high-dimensionality and data contamination. The eigenvalues and eignvectors of this matrix are obtained and data is then projected onto an eigen space spanned by the eigenvectors with the largest eigenvalues. In this reduced space, clustering is done using Otrimle clustering algorithm [35, 36] which attempts to cluster data using multivariate Gaussian distributions and uses a small constant density to capture outliers and noise. Clusters thus obtained subtype patients into groups having maximum separation in terms of Kaplan-Meier survival curves, validated with log rank p-values < 0.05. However, [33] is not a good choice when the data is distributed around a complex shaped manifold in higher dimensional space because of its dependence on eigenvalues and eigenvectors of covariance matrix. This is because: a) Projection of data onto largest eigenvectors may not be ideal, because clusters may be present along eigenvectors with smaller eigenvalues and b) The underlying assumption is that the largest spectral components contain the 3

maximum information about the data. In other words, directions along large variances are considered as significant information and directions along smaller variances are considered as noise, which may not always be true. In this paper, we attempt to achieve greater separation and less overlap between survival plots whilst avoiding the issues due to high dimensionality by implementing a framework based on Topological Mapper algorithm borrowed from Topological Data Analysis [37, 38] and a non-linear-dimensionality reduction and visualization technique t-distributed Stochastic Neighbour Embedding (t-SNE) [39, 40] which does not depend on the covariance matrix of data. Mapper aims to preserve the topological and geometric details of high dimensional data and represents it in the form of a graph. Apart from being able to efficiently produce a compressed representation of high dimensional data and being coordinate independent, it is also robust to noise and small distortions in the input because of its coordinate deformation independence property [38]. Moreover, both, Mapper and the t-SNE preserve the local neighbourhood of data-points instead of the global neighbourhood which is desirable in high-dimensional space where, as discussed above, large distances do not carry much meaning (1) [20]. In this work use Barnes Hut implementation of t-SNE to define the filter function which lies at the core of the Mapper algorithm which, in turn, is used as a clustering algorithm. The rest of the paper is structured as follows: We briefly explain the t-distributed Stochastic Neighbour Embedding and Topological mapper. Then we present an outline of our methodology and propose a novel way to define the filter function on which the output of Mapper heavily depends. We also propose a novel way to quantify separation of the discovered cancer subtypes. What follows is the result, comparisons and pathway analysis for identifying biological characteristics of each subtype.

2. Preliminary Theory In this section we describe t-SNE and the Mapper Algorithm. t-SNE is a dimensionality reduction and data visualization technique and it has been used to define the filter function on which the topological mapper heavily depends. 2.1 t-Distributed Stochastic Neighbor Embedding Filter (t-SNE). t-SNE is a non-linear dimensionality reduction technique [39] which, unlike PCA, does not depend on the covariance matrix, does not assume the directions of maximum variance contains maximum information and unlike the linear dimensionality reduction methods, it is able to preserve the structure of a non-linear manifold. For the gene expression data Xn,p , with n patients and p genes, we consider the case when p >> n. t-SNE maps it to a lower dimensional representation Yn,k , such that the neighboring points in high-dimensional-space are forced to remain neighboring points in the low-dimensional space, thus preserving local structure. Let Xn,p = {x1 , x2 , x3 , ....., xn }, where each xi is p dimensional and Yn,k = {y1 , y2 , y3 , ....., yn }, where each yj is k dimensional and k << p. t-SNE learns a k dimensional representation of every point in p 4

dimensions. t-SNE defines two symmetric probabilities pij and qij where (pij = pji ) and (qij = qji ) ∀ i, j, in order to measure the similarities of samples in high and low dimensions respectively.

pij = Where exp( pi|j = P

m6=i

pi|j + pj|i 2n

(3)

−|xi −xj |2 ) 2σi2

; i 6= j

2

−xm | exp( −|xi2σ ) 2

(4)

i

Here σi (known as perplexity) controls how many nearest neighbours are taken into account when constructing the embedding in the low-dimensional space (i.e. k dimensional space). In other words it is the variance of the Gaussian centered at data point xi . Indices of data points are given by i, j and m. To measure the similarities between data points in lower (k) dimensional space, a normalized Student-t distribution distribution is used. The similarity between two points yi and yj in the lower dimensional space is given as: 1/(1 + |yi − yj |2 ) qij = P 1

(5)

m6=i 1+|yi −ym |2

pji is the probability that xi chooses xj as its neighbour and qji is the probability that yi chooses yj as its neighbour. To find the locations of the points in low-dimensional space i.e. yi and yj , the distance to be minimized between these two densities is measured using the KL Divergence between the joint distributions P and Q. Therefore, the cost function is given as: min(yi ,yj ) KL(P ||Q) =

X i6=j

pji log

pji qji

(6)

We have used Barnes-Hut t-Distributed Stochastic Neighbour Embedding (bht-SNE) [40] implementation of the above discussed t-SNE. bht-SNE is an implementation of t-SNE which reduces the complexity of t-SNE, form O(n2 ) to O(n log n) by approximating the infinitesimal probabilities by zero in the high dimensional space using vantage-point-trees and secondly, uses the Barnes-Hut algorithm to approximate the large distances in the lower dimensional embedding. This approximation is controlled by a parameter known as θ. Though a technical discussion on θ is given in [40], for the sake of continuity, here we provide only an intuitive explanation of θ. Consider a set of n samples Xc = {xc,1 , xc,2 ..., xc,j , ..xc,n } and an individual sample xs which is not in Xc . We want to compute dist{xs , xc,i , ∀ xc,i ∈ Xc }. In order to reduce the number of computations, one of the approximations used is that the set of samples Xc is approximated by a single point if the set of samples Xc is significantly farther away from xs Let v be the representative spread of the set of samples Xc and W be the distance of xs form the representative center (xc ) of Xc . If v/W << 1 then xs is significantly far away from the set of samples Xc . (Here it may be helpful to imagine the set of samples Xc as a cluster of points). Small v and large W means that 5

the spread of the cluster is sufficiently small and therefore the distance of xs from each point in the cluster can be approximated as W . i.e. dist{xs , xc,i , ∀ xc,i ∈ Xc } ≈ W Infact, if we let v/W = 0, it will corresponds to no approximation in the distance measurements and v/W = 1 corresponds to maximum approximation. Therefore, if v/W is represented by θ then the range of θ is given as 0 ≤ θ ≤ 1. (Making an approximation here means that the instead of making n computations to calculate n distances in dist{xs , xc,i , ∀ xc,i ∈ Xc }, we approximate all the n distances as W thus reducing the number of calculations). The reader is advised to refer to [40] for a detailed discussion on θ. The output of bht-SNE changes unpredictably with the degree of approximation controlled by hyper-parameter θ and this makes tuning θ difficult. To illustrate this, we used lung gene expression data consisting of 107 patient samples and 12042 genes (downloaded from TGCA portal) as input to bht-SNE algorithm. Keeping σi and output dimensionality fixed at 5 and 1 respectively, θ is changed from 0 to 1 in steps of 0.001. Corresponding mean, standard deviation and skewness of the one dimensional output of bht-SNE are measured for every value of θ. In figure 1, plots A, B and C show the variation in the mean, standard deviation and skewness of the output of bht-SNE. It is clear that the aforementioned statistical measures are quite sensitive to variation

C 0.4

B

0.2 −0.2

0.0

Skewness

20 15 10

Std. Deviation

25

2e−15 0e+0 1e−15 2e−15

A

5

−3e−15

Mean

of θ and change seems random.

0.0

0.2

0.4

Theta0.6

0.8

1.0

0.0

0.2

0.4

Theta

0.6

0.8

1.0

0.0

0.2

0.4

Theta

0.6

0.8

1.0

Figure 1: Figure1. A, B and C show the variation of mean, standard deviation and skewness with Θ for 1-dimensional embedding obtained by applying bht-SNE on lung gene expression data.

2.2. Topological Mapper. Mapper algorithm [37, 38] aims to preserve the topological and geometric details of high dimensional data and represents it in the form of a graph. The data can be a set of points represented as a point cloud in IRn . The Mapper takes this data and transforms it into a graph where the vertices are clusters of data points and two vertices are connected if they have common data points between them. An example of how a Mapper can be used to gain insight into a point cloud data is shown in figure 2 (adapted from [38]). We start with a point cloud in the shape of a hand (figure 2.A). Then we define a filter function as the vertical distance of a data point from the center of the hand. This vertical distance is visualized in the 6

C

A

D

B

Figure 2: The TDA Mapper. A. Data as a point cloud in shaped as a hand. B. The one dimensional filter function colour coded as per the distance from the center of the hand. C. Data points coloured according to the filter function and divided among Np number of consecutive overlapping intervals known as bins. D. The final graphical representation of the data. Vertices correspond to data clusters and two vertices are connected if they share one or more data points.

form of colour coding in figure 2.B. The data points in the hand are coloured according to the filter function and the hand divided into Np number of consecutive overlapping intervals known as bins (or covers) (figure 2.C). Here the bins are 6 overlapping portions of the hand. In each of these bins, hierarchical clustering is performed. This step is also known as local or partial clustering as it is performed only on a portion of data. Here it is shown that clustering gives three clusters for the bin which contains three finger tips. These are represented by three red vertices in figure 2.D. These three red vertices are not connected because they don’t share any data points between them but they are connected to the vertices below because they share data points with the vertices below, which is possible because the covers were allowed to overlap This process obtains a graph which represents the topological and geometrical structure of data (figure 2.D). The vertices are clusters of data and two vertices are connected if they share one or more data points. The resolution of the final graph depends on the parameters of the mapper i.e. the number of overlapping bins (Np ) and the percentage overlap among them (Op ). If Np = 2 then the hand would be divided into two bins where all the fingers would be clustered into one vertex and the thumb would be clustered with the palm into another single vertex, thus leading to loss of the topological structure of the hand. A badly chosen filter 7

function and the percent overlap also lead to distortion of topological structure. Topological methods are robust to small deformations and therefore are able to handle noisy data [41, 42] which is important in the context of gene expression data.

3. Proposed Methodology The proposed methodology is based on the Mapper and t-SNE algorithms. Given a gene expression data set Xn,p with n rows corresponding to patients and p columns corresponding to genes we attempt to cluster the n samples such that greater separation in terms of Kaplan-Meier survival plots and hazard ratios is achieved compared to the existing works in literature [33, 17]. The proposed methodology consists of following steps: (1) Calculation of distance matrix D: For any two rows (samples) x and y of Xn.p , let xd = [x1 − x ¯, x2 − x ¯, ..., xk − x ¯]0 and yd = [y1 − y¯, y2 − y¯, ..., yk − y¯]0 with xi and yi being the ith coordinates and x ¯ and y¯ being the respective means. Pearson’s correlation is given by rP earson (x, y) =

x0d yd |xd ||yd |

(7)

The distance measure, d(x, y) used in this work given as d(x, y) =

1 − rP earson (x, y) 2

(8)

The distance matrix is given as D = [d(x, y)] is an nxn matrix with all d(x, y) as its entries. (2) Defining the filter function f: D is given as input to bht-SNE and a lower dimensional representation, {IRdim ; dim = 1, 2, 3} is obtained by varying the hyper-parameter θ of bht-SNE. We chose the case where dim = 1 as the filter function. (3) Clustering using TDA Mapper: Clustering using the mapper algorithm is done as follows: a. After mapping Xn,p onto one dimensional space using the filter function defined in step 2, this output of bht-SNE is partitioned into Np number of overlapping bins. The number of covers (Np ) is chosen to be equal to number of clusters C (as reported by [22,31]), and the percentage overlap, Op = 10%. b. In each bin, partial hierarchical clustering is carried out and each cluster is represented by a vertex in final graph. Two vertices are connected if they share a data point. (4) Evaluating clinical and statistical significance of obtained clusters: Survival and hazard ratio analysis is carried out and the cancer subtypes are evaluated in terms of their separation calculated using minimum restricted life expectancy difference (RLEDmin ) and minimum hazard ratio (HRmin ) described below. Pathway over-representation analysis is also carried out to find potential genetic targets for treatment. 8

Patients

Genes t-SNE dimensionality reduction

B. t-SNE preserves the local neighbourhood. Thus similar points are placed nearby in the lower dimensional space.

C. The data in the filter space is divided into overlapping covers (or bins).

Clustering

A. Gene expression data represented as a point cloud. Colours are representative of different patient clusters in high dimensional space.

Binning

Graphical representation

Evaluation

F. On the clusters thus obtained, survival and pathway over-representation analyses are carried out.

E. The output of the Mapper is a graph and each vertex is treated as a cluster.

D. Hierarchical clustering is carried out in each bin. This is also known as local/partial clustering. .

Figure 3: The proposed methodology is constituted of the following steps: A. Gene expression matrix is viewed as a point cloud. Here colours represent different patient clusters in high dimensional space. B. The point cloud is fed as input to t-SNE to obtain the filter function. t-SNE reduces dimensionality of data whilst preserving the local neighbourhood of data points. Therefore, the data points with same colour are shown forming clusters. (For illustration purposes we have shown a two dimensional output of dimensionality reduction technique. In the proposed methodology it is one.) C. Next, the data in the lower dimensional space is fragmented into overlapping bins. D. Hierarchical clustering is carried out in each bin and each cluster represents a node in the final graph in E. Each node is a cluster and two nodes are connected if the underlying clusters share one or more data points. F. Evaluation is carried out with survival and pathway over-representation analyses.

Figure 3 illustrates the proposed methodology and it is explained in detailed below. Compared to other distance metrics like euclidean, we preferred Pearson correlation because its value is 9

between -1 and 1. Thus the problem of loss of contrast discussed in the introduction can be very significant if euclidean distance is used but if correlation is used, this range is compressed between −1 to 1 and so difference between two distances will not be large and thus can address the loss of contrast problem. Moreover, Pearson correlation is supported in literature to work well with gene expression data [43, 44, 45]. The next step is to obtain the filter function for which we use the dimensionality reduction method bht-SNE. In the context of this work, the dimensionality of data was reduced to one i.e. a one dimensional output of bht-SNE is used as a filter. The number of bins, Np , used to partition the data is predefined to be equal to the number of clusters in [33] Therefore, for lung we have Np = number of vertices in the graph = 4. It is 5 for breast, and 3 for colon and GBM datasets. However, Np = 2 for kidney dataset and not 3 as [33] has reported. A close examination of [33] reveals that the number of noiseless clusters for kidney dataset is only two and out of the 3 clusters reported by them, one cluster is exclusively made up of noise samples i.e. those samples which their algorithm was unable to put into a subtype. Then partial clustering using hierarchical clustering is carried out in each bin. This identifies local clusters the interaction among which is represented by connecting them. As such the proposed methodology allows a patient sample to be shared between two local clusters (vertices of the mapper output graph). For kidney, GBM, colon, lung and breast datasets, there are 6, 21, 8, 9, and 6 such samples, respectively. It is pertinent to mention that there are two ways to deal with such samples in the proposed methodology before testing the alternate hypothesis which is that there are more than one subtypes. 1.) Removing the samples shared among different subtypes would lead to a more significant p-value because the null hypothesis is that there are no underlying separate clusters in the given data (the separation of which is measured by RLEDmn and HRmin ) which is strengthened if there are common samples among different clusters. 2.) We have reported all the results for the case in which the shared samples are not removed. Though this weakens our comparisons, the proposed methodology still outperforms other comparative methodologies in the comparisons. For all the datasets except the breast cancer dataset, the mapper produced exactly Np number of vertices, however, for breast cancer data for which the number of clusters has been reported as 5 in [17, 33], the Mapper did not produce a graph with exactly 5 vertices. There were spurious vertices which contained less than three data points and which could not be considered as significant clusters on the basis of the number of data points they contained. However, for 5 values of θ, the mapper produced a graph which had at least 5 vertices containing more than three patient samples. (The number of data points in each cluster for all data sets for all methods is reported in section 4 of the supplementary material). The output graph of the Mapper is obtained and each vertex is taken as a cluster. This clustering is validated through hypothesis testing and separation is calculated in terms of minimum restricted life expectancy (RLEDmin ) proposed in [33] and a novel way to measure the separation in terms of minimum hazard ratio (HRmin ). They are explained below. 10

1. Minimum restricted life expectancy: After Kaplan-Meier survival plots have been plotted for each cluster, the cluster separation in terms of minimum restricted life expectancy difference (RLEDmin ) proposed in [33] is used to obtain a measure of separation between Kaplan-Meier survival plots. Let the Kaplan-Meier survival plot estimate for the ith cluster be represented by sˆi (t). This estimate is a function of time. Since this is the estimate of the survival as a function of time, the mean survival time (M ST ) of an individual in the ith cluster will be given by T

Z M STi =

sˆi (t)dt 0

If we assume that in each cluster the individual survives for at least T = T ∗ days, then we define restricted mean survival time (RMST) of an individual in ith cluster is given as Z T∗ RM STi = sˆi (t)dt 0

The difference between the restricted mean survival times of ith and j th survival plots is called the restricted life expectancy difference (RLED) and is given by RLEDij = |RM STi − RM STj | RLED gives a measure of the separation between the two survival plots and is equal to the difference between the areas under the ith and j th survival plots estimated using the Kaplan-Meier estimator. A large value of RLED represents a large separation between the two survival plots. Thus, this measure is clinically significant because it basically relates to the difference in time periods for which individuals belonging to two cancer subtypes (corresponding to the survival plot estimates sˆi (t) and sˆj (t)) are expected to live. If the number of survival curves is greater than two then RLED is calculated for each pair and the minimum value of RLED i.e. RLEDmin is taken as the overall cluster separation, i.e. RLEDmin = min {RLEDij ; i, j = 0, 1, ...C and i 6= j}

(9)

Here RLEDmin is the minimum restricted life expectancy difference and C is the total number of survival plots (equals the number of discovered subtypes/clusters). 2. Minimum hazard ratio: Here we present a method to measure the separation between the discovered subtypes in terms of a simple one number summary i.e. minimum hazard ratio (HRmin ). We use hazard ratio to measure the magnitude of separation between two curves in each Kaplan-Meier plot described in figure 4. A greater hazard ratio would imply greater separation between the survival curves. Hazard rate is the instantaneous rate of death of an individual in a subtype. It is defined as P r(t ≤ T < t + dt) dt→0 dt.S(t)

λ(t) = lim

(10)

Where S(t) is the survival function defined as the probability that the survival time is greater than t i.e. S(t) = P r(T > t), where T denoted the time of event. Hazard ratio λ(l1 , l2 ) is the ratio of two hazards 11

of two subtypes i.e. λ(l1 , l2 ) = λ1 (t)/λ2 (t). For more C > 2 subtypes, we define minimum hazard ratio (HRmin ) as: HRmin = min {λ(li , lj ) ; i, j = 0, 1, 2, ..., C and i 6= j}

(11)

To use the hazard ratio as a measure of separation between the proposed subtypes, it is required that the hazard ratio does not change with time. For this, the proportional hazard assumption should hold. We use Schoenfeld residual analysis to check this assumption. The assumption is tested graphically, or better, with a non-significant p-value. In the Schoenfeld tests, a non-significant p-value or a flat solid line in the residual plot would imply uncorrelatedness between disease progression of a subtype and the follow-up time of the patient. This is desirable because for dependable long term diagnosis the proposed subtypes should not vary with time.

4. Results and Discussions RLEDmin , HRmin and visual inspection of plots to check for less overlap have been used for comparisons with [17, 33]. Greater RLEDmin is desirable as it is related to the difference between life expectancies of two patients belonging to two closest subtypes. RLEDmin is required to be supported by a significant log rank p-value. For the purpose of this paper it is measured in terms of number of days. Greater HRmin is desirable as it is related to the death rates of two patients belonging to two closest subtypes. Associated is the global Schoenfeld test p-value which is required to be insignificant.Visual inspection of the Kaplan-Meier plots gives an idea of the overlap among the survival curves. 4.1. Data. We validated out method by applying it on five datasets for lung, kidney, colon, brain and breast cancers available at The Genomics Data Commons website (https://portal.gdc.cancer.gov/). The datasets correspond to LSCC (lung), KRCCC (kidney), COAD (colon), GBM (glioblastoma) and BIC (breast) cancers. Each dataset contains information about the DNA genome sequence, RNA expression and DNA methylation and survival data. Imputation and normalization of dataset was performed and all those patients having survival times greater than 5 years were not used for analysis (i.e. T ∗ = 1825 days) which reduced the number of samples to 96, 89, 92, 205 and 89 for lung, kidney, colon, glioblastoma and breast respectively. Five-year survival is a standard metric used in cancer research and treatment. The usefulness of five-year survival for measuring progress in cancer treatment has been broadly discussed in literature [46, 47, 48, 49]. Moreover, beyond five years death may occur due to reasons other than the disease [33] Also, no feature engineering was applied prior to the proposed methodology. Table 1 gives a brief summary of the datasets.

12

Table 1: A brief summary of datasets

Dataset

Samples

Dimensionality

LSCC

107

12042

KRCCC

122

17899

COAD

92

17814

GBM

215

12042

BIC

105

17814

4.2. Hyper-parameter Tuning. The proposed methodology is based on the Mapper algorithm which uses tSNE to generate the filter function. The hyper-parameters to be pre-adjusted in the proposed framework are percentage overlap (Op ), number of bins (Np ), theta (θ), and perplexity (σi ). The first two are contributed by the mapper and are explained in section 2.2 and the latter two by the t-SNE algorithm which are explained in section 2.1. The hyper-parameters are tuned on gene expression data of lung, colon, breast, kidney and glioblastoma cancers. The aim was to find the values of these parameters such that RLEDmin , HRmin were large and the overlap among the Kaplan-Meier plots was as minimum. A simple grid search could be implemented to find the values of Op , Np , θ, and σi for obtaining the largest possible values of RLEDmin and HRmin , but finding the Kaplan-Meier plots with least overlap requires manual observation as there is not metric to measure the overlap. Moreover, there is no formal method for parameter selection of the mapper method either. The proposed methodology was found to be fairly robust to changes in all the hyper-parameters except θ to which, as discussed in section 2.1, the methodology was found to be very sensitive. The values for Np , Op and perplexity were fixed as C (C is the number of clusters as proposed by [33]), 10%, and 5 respectively for all datasets. For each gene expression dataset fed into the t-SNE algorithm, θ was varied as 0 < θ < 1 in steps of 0.001. Every corresponding output of t-SNE was tested as a filter function for the Mapper. Changing θ led to change in the filter function which resulted in variation in subtypes at the output of the Mapper. For all the statistically significant subtypes (calculated as log-rank p-values ≤ 0.005), RLEDmin was recorded. Statistically significant subtypes were obtained at multiple values of θ. Figure 4 shows the variation of RLEDmin with θ for all gene expression datasets. RLEDmin for all the subtypes which were not statistically significant have been suppressed to zero. Statistically significant results were obtained at multiple values of θ and those values of θ for which RLEDmin was ≥ RLEDmin of Otrimle were checked greater values of HRmin and visually less overlap among the survival plots. Though, changing the hyper-parameters led to change in the comparison parameters (e.g. HRmin ) at the output, the comparison parameters of the proposed methodology remained significantly better than those of RSC-Otrimle and SNF 13

for (at least), the following ranges of the hyper-parameters.

Np = C & (5 < Np < 30) ; δ = 5 5 < Op < 30 ; δ = 5 1 < σi < 15 ; δ = 1 Where δ is the step size. The results were compared with state of art RSC-Otrimle algorithm [33] and Similarity Network Fusion (SNF) [17]. For RSC-Otrimle the optimal values for two parameters: number of components of the spectral decomposition m, and the eigenratio constraint parameter γ were selected as reported by the authors. SNF was implemented using the CancerSubtypes package in R programming. The number of neighbours n was varied through (number of clusters -1) to (number of samples - number of clusters + 1) and α (hyperparameter in scaled exponential similarity kernel) was varied from 0.1 to 1 with step size of 0.1. In all the cases only those outputs were retained for which the log-rank p-value was significant (≤ 0.05). The results have been reported in table 2, figure 5, 6 and 7. Referring to table 2, under the RLED Analysis heading we have reported the values of RLEDmin and the corresponding log-rank p-values. Comparisons show that among the discovered subtypes for colon, lung, breast, kidney and GBM cancers, the two closest subtypes have a separation of 204, 107, 20, 425 and 88 days , respectively, for the proposed methodology whereas it is only 43, 69, 6, 282 and 61 days for RSC-Otrimle and 95, 9, 18, 148 and 60 days for SNF. We have reported only those results for which the log-rank p-values are < 0.05 for all the datasets and for all the methodologies. The proposed methodology achieves the above mentioned separation with more significant p-values than RSC-Otrimle and SNF for all except the GBM dataset where the result of RSCOtrimle is statistically more significant with a p-value of 0.0247 whereas, the proposed methodology has a p-value of 0.031. In table 2, under the HR Analysis heading we have reported the HRmin values and the corresponding Global Schoenf eld T est (GST ) p-values. Note that we desire p-values ≤ 0.05 and > 0.05 for statistically validating RLEDmin and HRmin respectively. Again the proposed methodology achieves better (greater) HRmin values for all except the GBM dataset where SNF performs better. The significance of HRmin is that the minimum pairwise difference between the rate of death per unit time among the discovered subtypes. A graphical comparison of Kaplan-Meier plots for all the datasets is shown in figure 5, 6 and 7 for lung, colon and breast datasets respectively. The Kaplan-Meier plots for kidney and glioblastoma are provided in the supplementary document (supplementary document figures 1 and 2). A visual inspection of the plots reveals that for the proposed methodology the Kaplan-Meier plots are better separated achieving lesser overlap. A lesser overlap is clinically significant because it means that the disease progression of an individual in a group can be clearly delineated without considering the interference from other subtypes. 14

200

B

100

0

0

50

50

RLED(min)

RLED(min)

100

150

150

A

0

0.2

0.4

Theta

0.6

0.8

0

1.0

0.20

0.40

C

0.60

0.80

1.0

0.6

0.8

1.0

Theta

60

RLED(min)

40

10

0

0

20

5

RLED(min)

15

80

20

D

0

0.2

0.4

Theta

0.6

0.8

0

1.0

0.2

0.4

Theta

200 0

100

RLED(min)

300

400

E

0

0.2

0.4

Theta

0.6

0.8

1.0

Figure 4: Plots A, B, C, D and E show the variation of RLEDmin with theta for lung, colon, breast, GBM and kidney gene expression datasets. Only those values are considered for which log rank p-values for Kaplan-Meier plots are statistically significant (less than 0.05). All the other values of RLEDmin have been suppressed to zero in the above plots.

15

Table 2: Comparison with RSC-Otrimle and SNF. Under the RLED Analysis heading are the values of RLEDmin and the corresponding log-rank p-values.

Under the HR Analysis heading are the HRmin values and the corresponding

Global Schoenf eld T est (GST ) p-values. Note that we desire p-values ≤ 0.05 and > 0.05 for statistically validating RLEDmin and HRmin respectively. The corresponding values of the parameters is are given under Hyperparameters column. Clus is the number of clusters/subtypes.

Data Method RLED Analysis HR Analysis Hyper-parameters Clus ∗ RLEDmin p − val HRmin

Lung

GST

θ

m γ n α

Proposed 107.477 0.00033 1.5845

0.8017 0.715 - -

RSC-Otri 68.9026

0.0304

1.3326

0.3601 -

SNF

0.0039

1.3824

0.7998 -

0.014

2.9649

0.8903 0.072 - -

9.4762

Colon Proposed 204.1 RSC-Otri 43.1970

0.0168 1.22x10−8 0.6896 -

SNF

0.0259

95.4090

- -

4

11 10 - -

4

- - 56 0.1

4

- -

3

28 3 - -

3

1.817

0.5596 -

1.718

0.806 0.868 -

- -

5

1.263

0.7877 -

2 5

- -

5

1.328

0.9873 -

- -

GBM Proposed 88.5006 0.031

1.3156

0.3783 0.305 -

RSC-Otri 61.1765

0.0247

1.1628

0.8925 -

SNF

0.0351

1.7646

0.2637 -

Kidney Proposed 425.414 0.0019

3.1319

0.1551 0.557 - -

RSC-Otri 282.2017 0.0023

2.836

0.609 -

SNF

1.5466

0.7242 -

Breast Proposed 20.2313 0.013 RSC-Otri 6.2321 SNF

0.0449

18.45833 0.044

60.2256

148.4316 0.0327

- - 13 0.9

44 0.2

3

5

- -

3

17 inf - -

3

- -

32 0.1

3

- -

2

7 20 - -

3

- - 34 0.6

3

*Measured in days Figure 8 shows the Schoenfeld Residual plots for lung dataset obtained with the proposed methodology. The plots for all other datasets are given in supplementary document. The non-significant p-values and a flat solid line in the residual plots implies the desired uncorrelatednes between disease progression of a subtype and the follow-up time of the patient. From the residual plots and the non-significant Global Schoenfeld Test (GST) p-values for all datasets it is concluded that the disease progression of the discovered subtypes and the follow-up time are uncorrelated thus satisfying the desired requirements. The proposed framework provides statistically significant subtypes with large RLEDmin and HRmin for multiple values of θ. Aiming clinically and statistically significant outputs, we have chosen those outcomes where RLEDmin and HRmin are large, the survival curves are least overlapping and p-values calculated using log-rank test are significant (≤ 0.05) for RLED analysis and insignificant (> 0.05) for HR analysis. 16

++ ++ ++ + ++ + + + ++ + +

C

1.0

Strata: + C=1 + C=2 + C=3 + C=4

++ ++ + +++

++ + ++ ++

0.75

+

+

+

+

+

++ + + +

+

+

+

+

+ + +

0.25

+ ++ +++ +

+

+

+

++

0.5

+

+

+ +

+++ +

+

+

+

+

++ + +

0.5

+ + ++ +

0.25

p = 0.03

+

+ +++ + ++ + +++ ++ ++

0.75

++

+

0.25

+

p = 0.00033

+ + ++

Strata: + C=1 + C=2 + C=3 + C=4

+ +

++

0.75

0.5

1.0

Strata: + C=1 + C=2 + C=3 + C=4

Survival Probability

Survival Probability

B

A

++ + ++

Survival Probability

1.0

+

p = 0.004

+

+

0.0

0.0 500

0

1000

1500

Time(Days)

2000

0.0 500Time(Days)1000

0

2000

1500

500Time(Days)1000

0

1500 1750 2000

Figure 5: Plots A, B and C represent the Kaplan-Meier plots for proposed, Otrimle and SNF algorithms respectively for lung dataset. The plots represent the variation of survival probability of each subtype with time (in days). For the proposed methodology θ = 0.715.

+++ ++ + + ++

+

+ +++ +++ +

1.0

+

+++ ++ + ++ ++

+ +

0.5 +

Strata:+ C=0 + C=1 + C=2

++

+ ++++ + +

0.0 500 Time(Days)1000

+

+

+

+

+

0.5

Strata: + C=1 + C=2 + C=3

+ + ++++ ++

+

+

+

++ +

+

+

++

0.5

++

+

0.25

p = 0.017

1500

+

0.75

0.0 0

++ +++ ++

+

+

0.75

0.25

p = 0.014

1.0

++++

+

Survival Probability

Survival Probability

++ + ++ + +

+

0.25

+

++ +

0.75

C

B

Strata: + C=1 + C=2 + C=3

+

Survival Probability

A 1.0

p = 0.026 0.0

500 Time(Days)1000

0

1500

0

500 Time(Days) 1000

1500

Figure 6: Plots A, B and C represent the Kaplan-Meier plots for proposed, Otrimle and SNF algorithms respectively for colon dataset. The plots represent the variation of survival probability of each subtype with time (in days). For the proposed methodology θ = 0.072.

+

B

Strata: + C=1 + C=2 + C=3 + C=4 + C=5 + + ++ +

+++

1.0

+

+

+ + + ++

+

+ + +

Survival probability

0.5 + +

0.25

+

0.0

++

+ ++ +

+

+ ++++ + ++ + +

+

p = 0.045

1500

2000

+

+

+

+

+ +

+ ++

+ ++

+

+++

0.75

0.5

0.25

+

p = 0.044 0.0

0.0 500 Time(Days)1000

++

++ ++

+

0.5

0.25

p = 0.013

+ +

Strata:+ C=1+ C=2 + C=3 + C=4 + C=5

+++ ++ + + +++ ++ ++ ++++ + ++ ++++ + + ++++ ++++++ +++++

+ ++ +

0.75

0.75

0

C

+

1.0 +

+ +

+

+++ + ++++

+++

Survival probability

Strata: C=1 C=2 C=3 + C=4 + C=5

+ ++ + +++ + +++++++ + ++++++++ ++ ++ ++ +++ ++ ++ + ++ + +

++

Survival Probability

A

++ ++ + + +++ + ++ ++++ + + +++ ++++ + + ++ ++ + 1.0 +++++ + +++++ ++++++++++ + + + ++ + +

0

500 Time(Days)1000

1500

2000

0

500

1000

Time(Days)

1500 1750 2000

Figure 7: Plots A, B and C represent the Kaplan-Meier plots for proposed, Otrimle and SNF algorithms respectively for breast dataset. The plots represent the variation of survival probability of each subtype with time (in days). For the proposed methodology θ = 0.868.

17

Beta(t) for C4

31 63 4 SIT p: 0.7074 2 0 -2 -4 -6 31 63 10 5 0 -5

Beta(t) for C2

5

Beta(t) for C3

700 780 Time

1100

1400

1700

430

700 780 Time

1100

1400

1700

63

430

700 780 Time

1100

1400

1700

B

SIT p: 0.2195

GST p: 0.3601

0 -5 31 2.5 0.0 -2.5 -5.0

5

63

500

700

790 Time

1200

1400

1700

500

700 790 Time

1200

1400

1700

500

700 790 Time

1200

1400

1700

SIT p: 0.2384

31

Beta(t) for C4

GST p: 0.8017

430

SIT p: 0.924

31

63

SIT p: 0.9096

0 -5

Beta(t) for C2

20 15 10 5 0

Beta(t) for C3

31

10

63

C

SIT p: 0.7229

31

63

GST p: 0.7998

500

700 790 Time

1200

1400

1700

500

700 790 Time

1200

1400

1700

500

700 790 Time

1200

1400

1700

SIT p: 0.4501

5 0 31

Beta(t) for C4

A

SIT p: 0.6626

-2.5 -5.0

Beta(t) for C3

Beta(t) for C2

2.5 0.0

40 20 0

63

SIT p: 0.5867

31

63

SIT = Schoenfeld Individual Test, GST = Global Schoenfeld Test Figure 8: Plots A, B and C are the Schoenfeld residual plots of the proposed methodology, Otrimle and SNF respectively for lung dataset. The insignificant p-value and a flat solid line are desired for proportional hazard assumption to hold. Though all the methodologies produce the desired results, for the lung data, the p values for the proposed methodology are desirably more insignificant.

18

5. Pathway Analysis After obtaining the subtypes of each cancer biological characteristics of each subtype (cluster) were identified as follows. R package Limma [50] was used to find the up-regulated and down-regulated genes (DEGs) in each subtype. Then pathway over-representation analyses were performed using KEGG database and pathways related to the up-regulated and down-regulated genes in each subtype were obtained. Statistically significant pathways are over-represented in different subtypes have been discussed as they can be possible treatment targets. Figure 9, 10 and 11 show some of the most statistically significant pathways obtained from KEGG pathway over-representation analyses for the proposed methodology on lung, colon and breast datasets respectively. Due to space reasons the corresponding figures for kidney and glioblastoma are provided in the supplementary material (Supplementary document figure 7 and 8). Lung Cancer: For lung cancer 431 over-represented pathways were reported out of which 90, 40, 90 and 41 pathways were from down-regulated genes and 33, 88, 34 and 15 pathways were from up-regulated genes in subtypes 1 to 4 respectively. Figure 9 shows the most significant over-represented pathways for lung cancer. Endocytosis, which is a potential target in lung cancer treatment [51, 52, 53]is from down-regulated genes in first subtype and up-regulated in second subtype. Calcium signaling pathway from up-regulated genes in first and from down-regulated genes in third cluster is influenced by the expression of specific Ca2+ channels which are characterizing features of lung cancer. An alteration can lead to cell proliferation or apoptosis [54, 55]. Cytokine-cytokine receptor interaction comes from up-regulated genes in second cluster and down-regulated genes in the third cluster. Cytokines are proteins involved in inflammatory processes and have been helpful in risk prediction of non-small cell lung cancer among women [56]. Cell adhesion molecules (CAMs) over-represented in second and third clusters have been found to be preferentially expressed in human small cell lung cancer (SCLC) [57] The study revealed that 145-kDa protein is the isoform of neural-CAM that is expressed in SCLC cell lines. Thyroid hormone signaling pathway is altered in the fourth cluster by up-regulated genes and by down-regulated genes in the first cluster. Thyroid hormones (TH) are key hormones cellular processes involving apoptosis, and metabolism. Physiological concentrations of hormone Tetraiodothyronine (T4), synthesized in the thyroid gland is related to in-vitro proliferation of human non-small cell and small cell cancer cells [58, 59, 60]. Neuroactive ligand-receptor interaction which is associated with cancer [61] is significantly over-represented from up-regulated genes in 1st and 4th clusters and from down-regulated genes in 2nd and 3rd clusters. Sphingolipid signaling pathway from down-regulated genes in first cluster is shown to be severely altered in lung cancer [62, 63]. Colon Cancer: For colon cancer 201 over-represented pathways were discovered out of which 17, 50, 79 pathways were from down-regulated genes and 23, 11 and 21 pathways were from up-regulated genes in subtypes 1 to 3 respectively. Figure 10 shows the most significant over-represented pathways for colon cancer. For example, Oxidative Phosporylation (OXPHOS) comes from down-regulated genes in first cluster and 19

Cl1−others

Cl2−others

Cl3−others

Cl4−others

Endocytosis Lysosome Fc gamma R−mediated phagocytosis Yersinia infection Sphingolipid signaling pathway Hepatocellular carcinoma Phagosome Prostate cancer Hepatitis B Non−small cell lung cancer Thyroid hormone signaling pathway Axon guidance NF−kappa B signaling pathway Ferroptosis Cell cycle RNA transport Huntington disease Oxidative phosphorylation Fanconi anemia pathway Signaling pathways regulating pluripotency of stem cells Thermogenesis Parkinson disease Proteasome Homologous recombination Taste transduction Carbon metabolism DNA replication Alzheimer disease Tyrosine metabolism Retrograde endocannabinoid signaling Neuroactive ligand−receptor interaction Glutathione metabolism Olfactory transduction Alcoholism Cytokine−cytokine receptor interaction Hematopoietic cell lineage Complement and coagulation cascades Viral protein interaction with cytokine and cytokine receptor Inflammatory bowel disease (IBD) Type I diabetes mellitus Cell adhesion molecules (CAMs) Glutamatergic synapse Allograft rejection Rheumatoid arthritis cAMP signaling pathway Calcium signaling pathway Systemic lupus erythematosus Graft−versus−host disease Leishmaniasis Asthma Cholinergic synapse Viral myocarditis Fat digestion and absorption Bile secretion Ribosome Amino sugar and nucleotide sugar metabolism Peroxisome Mismatch repair

GeneRatio 0.025 0.050 0.075 0.100

p.adjust 0.04 0.03 0.02 0.01

down

up

down

up

down

up

down

up

Figure 9: KEGG pathways over-representation analysis for the proposed methodology on lung dataset. The gray shade of each point gives the relevance of the corresponding pathway in terms of p-values. The size of the point is related to the ratio of genes in that pathway. Since a pathway can result due to down or up regulated genes, the dots are placed in the appropriate column for each cluster.

up-regulated genes in third cluster. OXPHOS inhibitors could therefore be used as a potential treatment [64]. cAM P signaling pathway is altered in first subtype and unaltered in any other subtype. Intracellular 20

Cl1−others

Cl2−others

Cl3−others

Herpes simplex virus 1 infection Ribosome Oxidative phosphorylation Olfactory transduction Ribosome biogenesis in eukaryotes SNARE interactions in vesicular transport Peroxisome Primary bile acid biosynthesis Mitophagy − animal RNA polymerase Homologous recombination Systemic lupus erythematosus Metabolism of xenobiotics by cytochrome P450 Neuroactive ligand−receptor interaction Chemical carcinogenesis Graft−versus−host disease Cytokine−cytokine receptor interaction Alcoholism Retinol metabolism Tyrosine metabolism Fat digestion and absorption Glycolysis / Gluconeogenesis Amyotrophic lateral sclerosis (ALS) Kaposi sarcoma−associated herpesvirus infection Osteoclast differentiation Hepatitis B Human T−cell leukemia virus 1 infection TNF signaling pathway Human cytomegalovirus infection Epstein−Barr virus infection NOD−like receptor signaling pathway Apoptosis Protein processing in endoplasmic reticulum Focal adhesion PD−L1 expression and PD−1 checkpoint pathway in cancer Spliceosome Fanconi anemia pathway Arrhythmogenic right ventricular cardiomyopathy (ARVC) HIF−1 signaling pathway cAMP signaling pathway Fructose and mannose metabolism Biosynthesis of amino acids MAPK signaling pathway Non−homologous end−joining Protein digestion and absorption Mismatch repair Aminoacyl−tRNA biosynthesis

GeneRatio 0.05 0.10 0.15 0.20

p.adjust 0.03 0.02 0.01

down

up

down

up

down

up

Figure 10: KEGG pathways over-representation analysis for the proposed methodology on colon dataset. The gray shade of each point gives the relevance of the corresponding pathway in terms of p-values. The size of the point is related to the ratio of genes in that pathway. Since a pathway can result due to down or up regulated genes, the dots are placed in the appropriator column for each cluster.

levels of cAMP have been shown to be related with proliferation of colon cancer cells [65]. MAPK signaling pathway is another potential target for treatment of patients in the first subtype as comes form up-regulated 21

genes only in the first subtype. Similarly, herpes simplex virus one infection is comes form up-regulated genes only in the second subtype and TNF signaling pathway and Epstein-Barr virus pathways are from down-regulated genes only in the third subtype. Breast Cancer: For breast cancer pathway over-representation analysis resulted in 367. Out of these 24, 81, 58, 39 and 39 pathways were from down-regulated genes and 57, 24, 35, 4 and 6 pathways were from up-regulated genes in subtypes 1 to 5 respectively. Figure 11 shows the most significant over-represented pathways for breast cancer. Some of the altered pathways in first cluster are Herpes simplex virus 1 (HSV 1) (from up-regulated genes), thermogenesis (from down-regulated genes in clusters 1,4 and 5 and from up-regulated genes in cluster 2), cytokine-cytokine receptor interaction (from up-regulated genes in cluster 1 and from from down-regulated genes in clusters 2 and 3), systematic lupus erythematosus, endocytosis and circadian rhythm are unique only to the third cluster (from up-regulated genes). In the fourth cluster there are four over-represented pathways which have a gene ration of 0.5. They are protein digestion and absorption, ECM-receptor interaction, apoptosis and p53 signaling pathways (all from up-regulated genes). Shigellosis and TGF-beta signaling pathways are over-represented from up-regulated genes only in the fifth cluster. These pathways can be potential targets for treatment for the subtypes in which they are overrepresented. A discussion relating the discovered subtypes with genetic pathways in breast cancer which have been established in literature has been given below. HSV 1 can be orally or sexually transmitted and breast cancer cells are found to be sensitive to HSV1 and has been detected in human breast cancer tissues [66, 67]. Cytokine-cytokine receptor interaction is identified as a significant pathway underlying breast cancer cell line MCF-7 while treated with 17-estradiol (E2) [68] and genes in cytokine-cytokine receptor interaction are found to be significantly enriched in basal like breast cancer. [69]. Circadian rhythm is a 24-hour cycle in the biochemical, physiological, or behavioral processes of all living things. It is related to regulating the expression of genes in the breast and is well known to promote cancer if the rhythm is altered [70, 71]. In fact, apart from breast cancer, circadian rhythm is related to cancer of various types like colorectal cancer, osteosarcoma, pancreatic adenocarcinoma [72]. Protein digestion and absorption has been found to be related to over-expressed IGFBP5 which is a prominent regulatory protein in breast cancer progression [73]. The high gene ratio (0.5) of the ECM receptor interaction pathway is consistent with findings in [74], wherein all the genes in this pathway are found to be altered in breast cancer. p53 signaling pathway is often altered in cancer. It is believed that p53 prevents tumor development by the triggering of apoptosis [75, 76] and any alteration of p53 activity can affect apoptosis which can in turn lead to tumor growth. One of the pathways over-expressed only in fifth subtype and so can be a potential treatment target for patients of this subtype is the TGS-beta signaling pathway. It is involved in process like cell growth, cell differentiation, apoptosis and cellular homeostasis and has been related to human breast cancer [77]. TGF-beta is known to act as a tumor suppressor at early stages of cancer, however the role is changed to a promoter as the cancer progresses [78, 79, 80]. 22

Cl1−others

Cl2−others

Cl3−others

Cl4−others

Cl5−others

Herpes simplex virus 1 infection Thermogenesis Oxidative phosphorylation Huntington disease Alcoholism Autophagy − animal Parkinson disease Retrograde endocannabinoid signaling Synaptic vesicle cycle Insulin signaling pathway Fanconi anemia pathway Nucleotide excision repair Glycosylphosphatidylinositol (GPI)−anchor biosynthesis Longevity regulating pathway Cytokine−cytokine receptor interaction Viral protein interaction with cytokine and cytokine receptor Viral myocarditis Malaria Graft−versus−host disease Osteoclast differentiation Cell adhesion molecules (CAMs) Staphylococcus aureus infection Natural killer cell mediated cytotoxicity Allograft rejection Focal adhesion Inflammatory bowel disease (IBD) TNF signaling pathway Amoebiasis Type I diabetes mellitus Protein digestion and absorption Rheumatoid arthritis NF−kappa B signaling pathway Antigen processing and presentation Complement and coagulation cascades ECM−receptor interaction Human T−cell leukemia virus 1 infection RNA transport DNA replication Cell cycle Olfactory transduction Protein processing in endoplasmic reticulum Aminoacyl−tRNA biosynthesis RNA polymerase Ribosome biogenesis in eukaryotes Mismatch repair Oocyte meiosis Endocrine resistance Pyrimidine metabolism HIF−1 signaling pathway MicroRNAs in cancer mTOR signaling pathway Signaling pathways regulating pluripotency of stem cells Apoptosis Alanine, aspartate and glutamate metabolism p53 signaling pathway Systemic lupus erythematosus Endocytosis Circadian rhythm Shigellosis TGF−beta signaling pathway

p.adjust 0.04 0.03 0.02 0.01

GeneRatio 0.1 0.2 0.3 0.4 0.5

down

up

down

up

down

up

down

up

down

up

Figure 11: KEGG pathways over-representation analysis for the proposed methodology on breast dataset. The gray shade of each point gives the relevance of the pathway in terms of p-values. The size of the point gives the ratio of genes in that pathway. Since a pathway can result due to down or up regulated genes, the dots are placed in the appropriate column for each cluster.

23

6. Conclusion We propose a novel methodology for achieving clinically and statistically significant patient subtypes from gene expression data. It depends on the mapper algorithm - a tool from mathematical topology which analysis data by studying the connected components and the voids in the shape of data. The mapper is combined with the manifold learning technique, Barne’s Hut implementation of t-SNE which has been used to define the filter function on which the mapper algorithm heavily depends. The methodology has been validated through comparisons with SNF and RSC-Otrimle methodologies on five standard datasets from TCGA database. The comparisons were done by evaluating cluster separation with minimum restricted life expectancy difference (RLEDmin ) and minimum hazard ratio (HRmin ). It was found that the proposed methodology performs better in terms of RLEDmin in all datasets and better in terms of HRmin in four of the datasets. Visual inspection of the Kaplan-Meier plots also suggests that the overlap is lesser in case of proposed methodology. Pathway over-representation analysis was carried out to identify biological mechanisms characterizing each subtype. The significance of this work lies in individualized understanding of cancer from patient to patient which is the backbone of Precision Medicine. Acknowledgement: The authors thank National Institute of Technology Srinagar for supporting this work. References [1] L. Hood, S. H. Friend, Predictive, personalized, preventive, participatory (p4) cancer medicine, Nature reviews Clinical oncology 8 (3) (2011) 184. [2] S. Saria, A. Goldenberg, Subtyping: What it is and its role in precision medicine, IEEE Intelligent Systems 30 (4) (2015) 70–75. [3] O.-P. Kallioniemi, U. Wagner, J. Kononen, G. Sauter, Tissue microarray technology for high-throughput molecular profiling of cancer, Human molecular genetics 10 (7) (2001) 657–662. [4] A. Ben-Dor, N. Friedman, Z. Yakhini, Class discovery in gene expression data, in: Proceedings of the fifth annual international conference on Computational biology - RECOMB 01, ACMPress, 2001, ACM Press, 2001. [5] C. M. Perou, T. Sørlie, M. B. Eisen, M. Van De Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, et al., Molecular portraits of human breast tumours, nature 406 (6797) (2000) 747. [6] P. D’haeseleer, How does gene expression clustering work?, Nature biotechnology 23 (12) (2005) 1499. [7] A. Serra, P. Coretto, M. Fratello, R. Tagliaferri, Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data, Bioinformatics 34 (4) (2017) 625–634. [8] P. Chen, Y. Fan, T. kwong Man, Y. Hung, C. C. Lau, S. T. Wong, A gene signature based method for identifying subtypes and subtype-specific drivers in cancer with an application to medulloblastoma, BMC Bioinformatics 14 (Suppl 18) (2013) S1. [9] C. Yu, M. Arcos-Burgos, J. Licinio, M.-L. Wong, A latent genetic subtype of major depression identified by whole-exome genotyping data in a mexican-american cohort, Translational Psychiatry 7 (5) (2017) e1134. [10] J.-P. Brunet, P. Tamayo, T. R. Golub, J. P. Mesirov, Metagenes and molecular pattern discovery using matrix factorization, Proceedings of the National Academy of Sciences 101 (12) (2004) 4164–4169. [11] T. Handhayani, L. Hiryanto, Intelligent kernel k-means for clustering gene expression, Procedia Computer Science 59 (2015) 171–177.

24

[12] F.-X. Wu, Genetic weighted k-means algorithm for clustering large-scale gene expression data, BMC Bioinformatics 9 (S6). [13] N. Nidheesh, K. A. Nazeer, P. Ameer, An enhanced deterministic k-means clustering algorithm for cancer subtype prediction from gene expression data, Computers in Biology and Medicine 91 (2017) 213–221. [14] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proceedings of the National Academy of Sciences 95 (25) (1998) 14863–14868. [15] L. Jiang, Y. Xiao, Y. Ding, J. Tang, F. Guo, Discovering cancer subtypes via an accurate fusion strategy on multiple profile data, Frontiers in Genetics 10. [16] A. Serra, M. Fratello, V. Fortino, G. Raiconi, R. Tagliaferri, D. Greco, MVDA: a multi-view genomic data integration methodology, BMC Bioinformatics 16 (1). [17] B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, A. Goldenberg, Similarity network fusion for aggregating data types on a genomic scale, Nature Methods 11 (3) (2014) 333–337. [18] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, J. R. Nevins, Predicting the clinical status of human breast cancer by using gene expression profiles, Proceedings of the National Academy of Sciences 98 (20) (2001) 11462–11467. [19] S. Dudoit, J. Fridlyand, T. P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American statistical association 97 (457) (2002) 77–87. [20] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is nearest neighbor meaningful?, in: International conference on database theory, Springer, 1999, pp. 217–235. [21] C. C. Aggarwal, A. Hinneburg, D. A. Keim, On the surprising behavior of distance metrics in high dimensional space, in: International conference on database theory, Springer, 2001, pp. 420–434. [22] S. L. France, J. D. Carroll, H. Xiong, Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization, Information Sciences 184 (1) (2012) 92–110. [23] A. Serra, P. Coretto, R. Tagliaferri, On the noisy high-dimensional gene expression data analysis, in: Proceedings of the Conference of the Italian Statistical Society, Firenze University Press, 2017. [24] I. Jolliffe, Principal component analysis, Technometrics 45 (3) (2003) 276. [25] M. Ringn´ er, What is principal component analysis?, Nature biotechnology 26 (3) (2008) 303. [26] R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis (gpca), IEEE transactions on pattern analysis and machine intelligence 27 (12) (2005) 1945–1959. [27] H. Todorov, D. Fournier, S. Gerber, Principal components analysis: theory and application to gene expression data analysis, Genomics and Computational Biology 4 (2) (2018) e100041–e100041. [28] J. Shi, Z. Luo, Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples, Computers in biology and medicine 40 (8) (2010) 723–732. [29] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, science 290 (5500) (2000) 2323–2326. [30] J. B. Tenenbaum, V. De Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, science 290 (5500) (2000) 2319–2323. [31] J. Xu, H. Mu, Y. Wang, F. Huang, Feature genes selection using supervised locally linear embedding and correlation coefficient for microarray classification, Computational and Mathematical Methods in Medicine 2018 (5490513) (2018) 11. [32] S. Weng, C. Zhang, Z. Lin, X. Zhang, Mining the structural knowledge of high-dimensional medical data using isomap, Medical & Biological Engineering & Computing 43 (3) (2005) 410–412. [33] P. Coretto, A. Serra, R. Tagliaferri, Robust clustering of noisy high-dimensional gene expression data for patients subtyping, Bioinformatics. [34] E. L. Kaplan, P. Meier, Nonparametric estimation from incomplete observations, Journal of the American Statistical

25

Association 53 (282) (1958) 457–481. [35] P. Coretto, C. Hennig, Robust improper maximum likelihood: Tuning, computation, and a comparison with other methods for robust gaussian clustering, Journal of the American Statistical Association 111 (516) (2016) 1648–1659. [36] P. Coretto, C. Hennig, Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering, Journal of Machine Learning Research 18 (142) (2017) 1–39. [37] G. Singh, F. Memoli, G. Carlsson, Topological methods for the analysis of high dimensional data sets and 3d object recognition (2007). [38] P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, G. Carlsson, Extracting insights from the shape of complex data using topology, Scientific Reports 3 (1). [39] L. van der Maaten, G. Hinton, Visualizingdatausingt-sne, Journal of Machine Learning Research. [40] L. Van Der Maaten, Barnes-hut-sne, arXiv preprint arXiv:1301.3342. [41] M. Nicolau, A. J. Levine, G. Carlsson, Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival, Proceedings of the National Academy of Sciences 108 (17) (2011) 7265–7270. [42] M. Saggar, O. Sporns, J. Gonzalez-Castillo, P. A. Bandettini, G. Carlsson, G. Glover, A. L. Reiss, Towards a new approach to reveal dynamical organization of the brain using topological data analysis, Nature communications 9 (1) (2018) 1399. [43] J. Yang, A. I. Su, W.-H. Li, Gene expression evolves faster in narrowly than in broadly expressed mammalian genes, Molecular biology and evolution 22 (10) (2005) 2113–2118. [44] K. D. Makova, W.-H. Li, Divergence in the spatial pattern of gene expression between human duplicate genes, Genome research 13 (7) (2003) 1638–1645. [45] L. Huminiecki, K. H. Wolfe, Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse, Genome research 14 (10a) (2004) 1870–1879. [46] R. Sankaranarayanan, R. Swaminathan, H. Brenner, K. Chen, K. S. Chia, J. G. Chen, S. C. Law, Y.-O. Ahn, Y. B. Xiang, B. B. Yeole, H. R. Shin, V. Shanta, Z. H. Woo, N. Martin, Y. Sumitsawan, H. Sriplung, A. O. Barboza, S. Eser, B. M. Nene, K. Suwanrungruang, P. Jayalekshmi, R. Dikshit, H. Wabinga, D. B. Esteban, A. Laudico, Y. Bhurgri, E. Bah, N. Al-Hamdan, Cancer survival in africa, asia, and central america: a population-based study, The Lancet Oncology 11 (2) (2010) 165–173. [47] H. G. Welch, Are increasing 5-year survival rates evidence of success against cancer?, JAMA 283 (22) (2000) 2975. [48] F. M. Yosef E. M, Min Tang, On the validity of using increases in 5-year survival rates to measure success in the fight against cancer, 2014doi:10.1371/journal.pone.0083100.g001. [49] H. E. Karim-Kos, L. A. Kiemeney, M. W. Louwman, J. W. W. Coebergh, E. de Vries, Progress against cancer in the netherlands since the late 1980s: An epidemiological evaluation, International Journal of Cancer 130 (12) (2011) 2981–2989. [50] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, G. K. Smyth, limma powers differential expression analyses for RNA-sequencing and microarray studies, Nucleic Acids Research 43 (7) (2015) e47. doi:10.1093/nar/gkv007. [51] T. Tanaka, T. Ozawa, E. Oga, A. Maraguchi, H. Sakurai, Cisplatininduced noncanonical endocytosis of egfr via p38 phosphorylation of the cterminal region containing ser1015 in nonsmall cell lung cancer cells, Oncology Letters (2018) 9251–9256. [52] U. Jo, K. H. Park, Y. M. Whang, J. S. Sung, N. H. Won, J. K. Park, Y. H. Kim, Egfr endocytosis is a novel therapeutic target in lung cancer with wild-type egfr, Oncotarget 5 (5) (2014) 1265–1278. [53] I. Mellman, Y. Yarden, Endocytosis and cancer. [54] H. Yang, Q. Zhang, J. He, W. Lu, Regulation of calcium signaling in lung cancer, Journal of Thoracic Disease 2 (1) (2010) 52–56. [55] T. A. Stewart, K. T. Yapa, G. R. Monteith, Altered calcium signaling in cancer cells, Biochimica et Biophysica Acta (BBA) - Biomembranes 1848 (10) (2015) 2502–2511.

26

[56] A. L. V. Dyke, M. L. Cote, A. S. Wenzlaff, W. Chen, J. Abrams, S. Land, C. N. Giroux, A. G. Schwartz, Cytokine and cytokine receptor single-nucleotide polymorphisms predict risk for non-small cell lung cancer among women, Cancer Epidemiology Biomarkers & Prevention 18 (6) (2009) 1829–1840. [57] S. Saito, Y. Tanio, I. Tachibana, S. Hayashi, T. Kishimoto, I. Kawase, Complementary dna sequence encoding the major neural cell adhesion molecule isoform in a human small cell lung cancer cell line, Lung Cancer 10 (5-6) (1994) 307–318. [58] R. Meng, H.-Y. Tang, J. Westfall, D. London, J. H. Cao, S. A. Mousa, M. Luidens, A. Hercbergs, F. B. Davis, P. J. Davis, H.-Y. Lin, Crosstalk between integrin v3 and estrogen receptor -a is involved in thyroid hormone-induced proliferation in human lung carcinoma cells, PLoS ONE 6 (11) (2011) e27547. [59] A. Hercbergs, P. J. Davis, H.-Y. Lin, K. A. Keating, S. A. Mousa, Thyroid hormone replacement therapy in patients with various types of cancer, intechopen 6. [60] E. Krashin, A. Piekieko-Witkowska, M. Ellis, O. Ashur-Fabian, Thyroid hormones and cancer: A comprehensive review of preclinical and clinical studies, Frontiers in Endocrinology 10. [61] W. L. X. L. Q. X. F. L. P. X. Huan, J., L. Zhu, Insights into significant pathways and gene interaction networks underlying breast cancer cell line mcf-7 treated with 17-estradiol (e2)., Gene, 533(1) (2014) 346355. [62] Y. Chen, Z. Ma, L. Min, H. Li, B. Wang, J. Zhong, L. Dai, Biomarker identification and pathway analysis by serum metabolomics of lung cancer, BioMed Research International 2015 (2015) 1–9. [63] E. Bieberich, G. Wang, Sphingolipid in lung cancer pathogenesis and therapy, in: A Global Scientific Vision - Prevention, Diagnosis, and Treatment of Lung Cancer, InTech, 2017. [64] T. M. Ashton, W. G. McKenna, L. A. Kunz-Schughart, G. S. Higgins, Oxidative phosphorylation as an emerging target in cancer therapy, Clinical Cancer Research 24 (11) (2018) 2482–2490. [65] T. Tsukahara, Y. Matsuda, H. Haniu, Cyclic phosphatidic acid stimulates cAMP production and inhibits growth in human colon cancer cells, PLoS ONE 8 (11) (2013) e81139. [66] J.-H. Tsai, C.-H. Tsai, M.-H. Cheng, S.-J. Lin, F.-L. Xu, C.-C. Yang, Association of viral factors with non-familial breast cancer in taiwan by comparison with non-cancerous, fibroadenoma, and thyroid tumor tissues, Journal of Medical Virology 75 (2) (2004) 276–281. [67] C.-R. Hsu, T.-M. Lu, L. W. Chin, C.-C. Yang, Possible DNA viral factors of human breast cancer, Cancers 2 (2) (2010) 498–512. [68] L. X. X. Q. L. F. X. P. L. Z. Jinliang Huan, Lishan Wang, Insights into significant pathways and gene interaction networks underlying breast cancer cell line mcf-7 treated with 17 -estradiol (e2), Gene 533 (1) (2014) 346–355. [69] K. Yang, J. Gao, M. Luo, Identification of key pathways and hub genes in basal-like breast cancer using bioinformatics analysis, OncoTargets and Therapy Volume 12 (2019) 1319–1331. [70] H.-H. Lin, M. E. Farkas, Altered circadian rhythms and breast cancer: From the human to the molecular level, Frontiers in Endocrinology 9. [71] V. Blakeman, J. L. Williams, Q.-J. Meng, C. H. Streuli, Circadian clocks and breast cancer, Breast Cancer Research 18 (1). [72] E. Filipski, Host circadian clock as a control point in tumor progression, CancerSpectrum Knowledge Environment 94 (9) (2002) 690–697. ˙ Peker, T. Ozmen, ¨ ¨ [73] M. Akkiprik, I. G. Amuran, B. G¨ ull¨ uo˘ glu, H. Kaya, A. Ozer, Identification of differentially expressed IGFBP5-related genes in breast cancer tumor tissues using cDNA microarray experiments, Genes 6 (4) (2015) 1201–1214. [74] Y. Bao, L. Wang, L. Shi, F. Yun, X. Liu, Y. Chen, C. Chen, Y. Ren, Y. Jia, Transcriptome profiling revealed multiple genes and ECM-receptor interaction pathways that may be associated with breast cancer, Cellular & Molecular Biology Letters 24 (1). [75] S. Haupt, Apoptosis - the p53 network, Journal of Cell Science 116 (20) (2003) 4077–4085.

27

[76] J. Yu, L. Zhang, The transcriptional targets of p53 in apoptosis control, Biochemical and Biophysical Research Communications 331 (3) (2005) 851–858. [77] J. Donovan, J. Slingerland, Transforming growth factor-beta and breast cancer: Cell cycle arrest by transforming growth factor beta and its disruption in cancer, Breast Cancer Research 2 (2). [78] Y. Drabsch, S. He, L. Zhang, B. E. Snaar-Jagalska, P. ten Dijke, Transforming growth factor-beta signalling controls human breast cancer metastasis in a zebrafish xenograft model, Breast Cancer Research 15 (6). [79] D. Padua, J. Massagu´ e, Roles of tgf-beta in metastasis, Cell Research 19 (1) (2008) 89–102. [80] M. Sakaki-Yumoto, Y. Katsuno, R. Derynck, Tgf-beta family signaling in stem cells, Biochimica et Biophysica Acta (BBA) - General Subjects 1830 (2) (2013) 2280–2296.

28

• • • •

Novel methodology proposed for subtyping cancer patients from gene expression data. Aims to achieve maximum separation among identified subtypes for targeted treatment. The methodology relies on t-SNE and topological Mapper. Proposed a filter function for mapper and a way to measure separation among subtypes.