Combining multiple clusterings using fast simulated annealing

Pattern Recognition Letters 32 (2011) 1956–1961

Zhiwu Lu (a), Yuxin Peng (a,*), Horace H.S. Ip (b)

(a) Institute of Computer Science and Technology, Peking University, Beijing 100871, China
(b) Department of Computer Science, City University of Hong Kong, Hong Kong

Article history: Received 17 September 2009; available online 19 September 2011. Communicated by F. Roli.

Keywords: Clustering ensemble; Comparing clusterings; Simulated annealing

Abstract: This paper presents a fast simulated annealing framework for combining multiple clusterings based on agreement measures between partitions, measures originally used to evaluate clustering algorithms. Although a greedy strategy can optimize these measures as the objective functions of clustering ensemble, it may suffer from local convergence while incurring a large computational cost. To avoid local optima, we consider a simulated annealing optimization scheme that operates through single label changes. Moreover, for measures between partitions defined on the relationship (joined or separated) of pairs of objects, we can update the objective function incrementally for each label change, which keeps our optimization scheme computationally feasible. The experimental evaluations demonstrate that the proposed framework achieves promising results. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

Comparing clusterings plays an important role in the evaluation of clustering algorithms. A number of criteria have been proposed to measure how close an obtained clustering is to a ground truth clustering, such as mutual information (MI) (Strehl and Ghosh, 2002), the Rand index (Hubert and Arabie, 1985), the Jaccard index (Denoeud and Guénoche, 2006), and the Wallace index (Wallace, 1983). An important application of these measures is the objective evaluation of image segmentation algorithms (Lu et al., 2008; Unnikrishnan et al., 2007), because image segmentation can be considered a pixel clustering problem.

Since the major difficulty of clustering combination lies in finding a consensus partition from the ensemble of partitions, the measures for comparing clusterings can be adopted as the objective functions of clustering ensemble. Here, the key difference is that the consensus partition has to be compared to multiple partitions. Such consensus functions have been developed in (Strehl and Ghosh, 2002) based on MI. Though a greedy strategy can be used to maximize normalized MI via single label changes, it incurs a large computational cost. Hence, in this work, we focus on measures defined on the relationship (joined or separated) of pairs of objects, which can be updated incrementally for each single label change. Examples of these measures are the Rand index, the Jaccard index, and the Wallace index.


Moreover, in order to tackle the local convergence problem, we propose a simulated annealing optimization scheme that remains computationally feasible thanks to the incremental update of the objective function.

Concretely, the proposed simulated annealing framework for clustering ensemble has three main advantages compared with existing approaches: (1) it develops a series of consensus functions for clustering ensemble, not just one; (2) it avoids the local optima problem; (3) it has an overall low computational complexity, O(nkr) for n objects, k clusters in the target partition, and r clusterings in the ensemble. Hence, the proposed framework is readily applicable to large data sets, as opposed to consensus functions based on the co-association of objects in clusters from an ensemble, which typically have quadratic complexity O(n^2 kr). We have tested the proposed framework on the US Postal Service (USPS) handwritten digit data set (Hull, 1994), which contains thousands of samples, in our experiments. Moreover, unlike algorithms that search for a consensus partition via re-labeling and subsequent voting, the proposed framework can operate with arbitrary partitions with varying numbers of clusters, and is not constrained to a predetermined number of clusters in the ensemble partitions. These contributions provide a scalable solution to the clustering ensemble problem that can be applied in practice to large data sets.

The rest of this paper is organized as follows. Section 2 describes closely related work on clustering ensemble. In Section 3, we briefly introduce measures for comparing clusterings and describe three of them in detail. Section 4 then presents a fast simulated annealing framework for combining multiple clusterings based on the three measures. The experimental results on several data sets are shown in Section 5, followed by the conclusions drawn from these results in Section 6.

2. Related work

Approaches to combining multiple clusterings differ in two main respects, namely the way in which the contributing component clusterings are obtained and the method by which they are combined.

An important consensus function is proposed in (Fred and Jain, 2005) to summarize various clustering results in a co-association matrix. Co-association values represent the strength of association between objects, measured by how often each pair of objects appears in the same cluster. The co-association matrix then serves as a similarity matrix for the data items, and the final clustering is formed by linking the objects whose co-association value exceeds a certain threshold (see the sketch at the end of this section). One major drawback of the co-association consensus function is its quadratic computational complexity in the number of objects, O(n^2). Moreover, experiments in (Topchy et al., 2005) show that co-association methods are usually unreliable when the number of clusterings r < 50.

Hypergraph-based consensus functions have also been developed (Strehl and Ghosh, 2002). All the clusters in the ensemble partitions can be represented as hyperedges on a graph with n vertices, where each hyperedge describes a set of objects belonging to the same cluster. A consensus function can then be formulated as the solution to a k-way min-cut hypergraph partitioning problem. In particular, an efficient solution to this problem is the meta-clustering algorithm (MCLA), which uses hyperedge collapsing operations to determine soft cluster membership values for each object. It should be noted that hypergraph methods seem to work best for nearly balanced clusters.

A different consensus function has been developed in (Topchy et al., 2003) based on information-theoretic principles. An elegant solution can be obtained from a generalized definition of mutual information, namely quadratic mutual information (QMI), which can be effectively maximized by the k-means algorithm in the space of specially transformed cluster labels of the given ensemble. However, its limitation is sensitivity to initialization, due to the local optimization scheme of k-means.

In (Dudoit and Fridlyand, 2003; Fischer and Buhmann, 2003), a combination of multiple partitions by re-labeling and voting is implemented. These works pursue direct re-labeling approaches to the correspondence problem. A re-labeling can be done optimally between two clusterings using the Hungarian algorithm (Kuhn, 1955). After an overall consistent re-labeling, voting can be used to determine the cluster membership of each object. Voting is also used for clustering ensemble in (Frossyniotis et al., 2004; Tumer and Agogino, 2008). However, compared with other clustering ensemble methods, these voting approaches require a relatively large number of clusterings (i.e. r >> 1) to obtain reliable results.

A probabilistic model of consensus is presented in (Topchy et al., 2005) using a finite mixture of multinomial distributions in the space of cluster labels. A combined partition is found as a solution to the corresponding maximum likelihood problem using the EM algorithm. Since the EM consensus function needs to estimate many parameters, accuracy degradation inevitably occurs with an increasing number of partitions when the sample size is fixed.

To summarize, the existing consensus functions suffer from a number of drawbacks, including high computational complexity, the heuristic character of the objective function, and the uncertain statistical status of the consensus solution.
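To make the co-association idea above concrete, the following is a minimal sketch, not the implementation of (Fred and Jain, 2005); the threshold value and the use of simple connected-component linking are illustrative assumptions.

```python
import numpy as np

def co_association(labelings):
    """Fraction of partitions in which each pair of objects shares a cluster.

    labelings: array-like of shape (r, n), one row per partition in the ensemble.
    """
    labelings = np.asarray(labelings)
    r, n = labelings.shape
    co = np.zeros((n, n))
    for lab in labelings:
        co += (lab[:, None] == lab[None, :]).astype(float)
    return co / r

def threshold_clustering(co, threshold=0.5):
    """Link objects whose co-association exceeds a threshold (connected components)."""
    n = co.shape[0]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if co[i, j] > threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

# Example: three toy partitions of five objects.
ensemble = [[0, 0, 1, 1, 1],
            [0, 0, 0, 1, 1],
            [1, 1, 0, 0, 0]]
print(threshold_clustering(co_association(ensemble), threshold=0.5))
```

With these toy partitions, objects that co-occur in most partitions end up in the same consensus cluster; the quadratic cost in n mentioned above is visible in the double loop over object pairs.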

3. Measures for comparing clusterings

This section first presents the basic notation for comparing two clusterings, and then introduces three measures of agreement between partitions which will be used for combining multiple clusterings in the rest of the paper.

3.1. Notation and problem statement

Let \lambda^a and \lambda^b be two clusterings of the sample data set X = \{x_t\}_{t=1}^{n}, which contain k_a and k_b clusters, respectively. To compare these two clusterings, we first have to give a quantitative measure of agreement between them. In the case of evaluating a clustering algorithm, this means showing how close the obtained clustering is to a ground truth clustering. Since the measures will subsequently be used as the objective functions of clustering ensemble, it is very important that we can formulate them in a form that allows an incremental update for a single label change; that is, we aim for a highly efficient update of the objective function. Here, we focus on measures of the following general form:

S(\lambda^a, \lambda^b) = f\big(\{n_i^a\}_{i=1}^{k_a}, \{n_j^b\}_{j=1}^{k_b}, \{n_{ij}\}_{i,j}\big),   (1)

where n_i^a is the number of objects in cluster C_i for \lambda^a, n_j^b is the number of objects in cluster C_j for \lambda^b, and n_{ij} denotes the number of objects that are in cluster C_i for \lambda^a as well as in cluster C_j for \lambda^b. When an object (which is assumed to be in cluster C_j according to \lambda^b) moves from cluster C_i to cluster C_{i'} according to \lambda^a, only the following updates arise for this single label change:

\hat{n}_i^a = n_i^a - 1, \quad \hat{n}_{i'}^a = n_{i'}^a + 1,   (2)

\hat{n}_{ij} = n_{ij} - 1, \quad \hat{n}_{i'j} = n_{i'j} + 1,   (3)

where \hat{n}_i^a and \hat{n}_{ij} are the new versions of n_i^a and n_{ij}, respectively. According to (1), the measure S(\lambda^a, \lambda^b) can then be updated incrementally. Although many measures for comparing clusterings can be represented in the form shown in (1), in this paper we focus on one special type of measure defined on the relationship (joined or separated) of pairs of objects, called pairwise measures in the following.

3.2. Pairwise measures

The comparison of partitions for pairwise measures is based on the pairs of objects of X. Two partitions \lambda^a and \lambda^b agree on a pair of objects x_1 and x_2 if these objects are simultaneously joined or separated in them. On the other hand, there is a disagreement if x_1 and x_2 are joined in one of them and separated in the other. Let n_A be the number of pairs simultaneously joined, n_B the number of pairs joined in \lambda^a and separated in \lambda^b, n_C the number of pairs separated in \lambda^a and joined in \lambda^b, and n_D the number of pairs simultaneously separated. According to (Hubert and Arabie, 1985), we have

n_A = \sum_{i,j} \binom{n_{ij}}{2}, \quad n_B = \sum_i \binom{n_i^a}{2} - n_A, \quad n_C = \sum_j \binom{n_j^b}{2} - n_A.

Moreover, we can obtain n_D = \binom{n}{2} - n_A - n_B - n_C.

Although there exist many pairwise measures, we only consider the Rand index (Hubert and Arabie, 1985), the Jaccard index (Denoeud and Guénoche, 2006), and the Wallace index (Wallace, 1983). They are defined, respectively, as follows:

R(\lambda^a, \lambda^b) = \frac{n_A - h^a h^b / \binom{n}{2}}{\frac{1}{2}(h^a + h^b) - h^a h^b / \binom{n}{2}},   (4)


J(\lambda^a, \lambda^b) = \frac{n_A}{\binom{n}{2} - n_D} = \frac{n_A}{h^a + h^b - n_A},   (5)

W(\lambda^a, \lambda^b) = \frac{n_A}{\sqrt{h^a h^b}},   (6)

where h^a = \sum_i \binom{n_i^a}{2} and h^b = \sum_j \binom{n_j^b}{2}. The above three measures can all be considered as some kind of normalization of n_A.
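As a concrete illustration of Section 3.2, the following sketch computes n_A, h^a, h^b and the three measures (4)-(6) from the contingency table of two labelings. It assumes numpy and 0-based integer labels; it is illustrative code, not code from the paper.

```python
import numpy as np

def pairwise_measures(labels_a, labels_b):
    """Rand-type (4), Jaccard (5) and Wallace (6) indices between two labelings."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = labels_a.size
    # Contingency table n_ij: objects in cluster i of lambda^a and cluster j of lambda^b.
    ka, kb = labels_a.max() + 1, labels_b.max() + 1
    n_ij = np.zeros((ka, kb), dtype=np.int64)
    np.add.at(n_ij, (labels_a, labels_b), 1)

    comb2 = lambda x: x * (x - 1) // 2           # binomial coefficient C(x, 2)
    n_A = comb2(n_ij).sum()                      # pairs joined in both partitions
    h_a = comb2(n_ij.sum(axis=1)).sum()          # pairs joined in lambda^a
    h_b = comb2(n_ij.sum(axis=0)).sum()          # pairs joined in lambda^b
    n_pairs = comb2(n)

    expected = h_a * h_b / n_pairs
    rand = (n_A - expected) / (0.5 * (h_a + h_b) - expected)   # Eq. (4)
    jaccard = n_A / (h_a + h_b - n_A)                          # Eq. (5)
    wallace = n_A / np.sqrt(h_a * h_b)                         # Eq. (6)
    return rand, jaccard, wallace

# Example: two slightly different partitions of six objects.
print(pairwise_measures([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```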

4. The proposed framework

We use the above pairwise measures for comparing clusterings as our objective functions of clustering ensemble. In this section, we first describe the clustering ensemble problem, and then present a fast simulated annealing framework for combining multiple clusterings that operates through single label changes to optimize these measure-based objective functions efficiently.

4.1. The clustering ensemble problem

Given a set of r partitions \Lambda = \{\lambda^q \mid q = 1, \ldots, r\}, with the qth partition \lambda^q having k_q clusters, the consensus function \Gamma for combining multiple clusterings can be defined just as in (Strehl and Ghosh, 2002):

\Gamma : \Lambda \to \lambda,   (7)

which maps a set of clusterings to an integrated clustering. If there is no prior information about the relative importance of the individual groupings, then a reasonable goal for the consensus answer is to seek a clustering that shares the most information with the original clusterings. More precisely, based on the measure of agreement (i.e. shared information) between partitions, we can define a measure between a set of r partitions \Lambda and a single partition \lambda as the average shared information:

S(\lambda, \Lambda) = \frac{1}{r} \sum_{q=1}^{r} S(\lambda, \lambda^q).   (8)

Hence, the problem of clustering ensemble is to find a consensus partition \lambda^* of the data set X that maximizes the objective function S(\lambda, \Lambda) over the group of partitions \Lambda:

\lambda^* = \arg\max_{\lambda} \frac{1}{r} \sum_{q=1}^{r} S(\lambda, \lambda^q).   (9)

It is worth noting that the desired number of clusters k^* in the consensus clustering \lambda^* deserves a separate discussion, which is beyond the scope of this paper. Here, we assume that k^* is predetermined for the consensus clustering \lambda^*. More details about this model selection problem and its solutions can be found in (Lu et al., 2008; Figueiredo and Jain, 2002).

4.2. Clustering ensemble via simulated annealing

To update the objective function of clustering ensemble incrementally, we consider measures of the form (1). Although many measures for comparing clusterings can be represented as (1), we focus on the special type of measures defined on the relationship (joined or separated) of pairs of objects. Three measures, i.e. the Rand index, the Jaccard index, and the Wallace index, are used here as the objective functions of clustering ensemble. Moreover, to address the local convergence problem of the greedy optimization strategy, we propose a search scheme based on simulated annealing. This clustering ensemble scheme, with slight modification, can also be used for other types of objective functions that have the general form of (1). Hence, our solution can be considered as a general framework for clustering ensemble.

Given a set of r partitions \Lambda = \{\lambda^q \mid q = 1, \ldots, r\}, the objective function of clustering ensemble can be set as the measure between a single partition \lambda and \Lambda in (8). The measure S(\lambda, \lambda^q) between \lambda and \lambda^q can be the Rand index, the Jaccard index, or the Wallace index. According to (4)-(6), we can set S(\lambda, \lambda^q) as any one of the following three measures:

S(\lambda, \lambda^q) = \frac{h_0^q - h_1 h_2^q / \binom{n}{2}}{\frac{1}{2}(h_1 + h_2^q) - h_1 h_2^q / \binom{n}{2}},   (10)

S(\lambda, \lambda^q) = \frac{h_0^q}{h_1 + h_2^q - h_0^q},   (11)

S(\lambda, \lambda^q) = \frac{h_0^q}{\sqrt{h_1 h_2^q}},   (12)

where h_0^q = \sum_{i,j} \binom{n_{ij}^q}{2}, h_1 = \sum_i \binom{n_i}{2}, and h_2^q = \sum_j \binom{n_j^q}{2}. Here, the frequency counts are denoted a little differently from (1): n_i is the number of objects in cluster C_i according to \lambda, n_j^q is the number of objects in cluster C_j according to \lambda^q, and n_{ij}^q is the number of objects that are in cluster C_i according to \lambda and in cluster C_j according to \lambda^q. The corresponding algorithms based on these three measures, which follow the simulated annealing optimization scheme, are denoted as SA-RI, SA-JI, and SA-WI, respectively.

To find the consensus partition from the multiple clusterings \Lambda, we can maximize the objective function S(\lambda, \Lambda) by making single label changes. That is, we randomly select an object x_t from the data set X = \{x_t\}_{t=1}^{n}, and then change its label \lambda(x_t) = i to another randomly selected label i' \neq i according to \lambda, i.e., move it from the current cluster C_i to another cluster C_{i'}. Such a single label change only leads to the following updates:

\hat{n}_i = n_i - 1, \quad \hat{n}_{i'} = n_{i'} + 1,   (13)

\hat{n}_{ij}^q = n_{ij}^q - 1, \quad \hat{n}_{i'j}^q = n_{i'j}^q + 1,   (14)

where j = \lambda^q(x_t) (q = 1, \ldots, r), and \hat{n}_i (or \hat{n}_{ij}^q) is the new version of n_i (or n_{ij}^q). To update S(\lambda, \lambda^q), we can first calculate \hat{h}_1 and \hat{h}_0^q incrementally:

\hat{h}_1 = h_1 + n_{i'} - n_i + 1,   (15)

\hat{h}_0^q = h_0^q + n_{i'j}^q - n_{ij}^q + 1.   (16)

Here, h_2^q stays fixed for each label change. Hence, we can obtain the new \hat{S}(\lambda, \lambda^q) according to (10)-(12), and the new objective function \hat{S}(\lambda, \Lambda) is just the mean of \{\hat{S}(\lambda, \lambda^q)\}_{q=1}^{r}. It is worth pointing out that the update of the objective function has only linear time complexity O(r) for a single label change, which ensures that the simulated annealing scheme is computationally feasible for maximizing S(\lambda, \Lambda).

We further adopt a simplified simulated annealing scheme to determine whether to accept a single label change \lambda(x_t) : i \to i'. Given a temperature T, we define the following function with respect to this label change:

P(\lambda(x_t) : i \to i') = \begin{cases} 1 & \text{if } \Delta S > 0, \\ e^{\Delta S / T} & \text{otherwise}, \end{cases}   (17)

where \Delta S = \hat{S}(\lambda, \Lambda) - S(\lambda, \Lambda). This function can then be used to select a single label change. More concretely, we accept a single label change \lambda(x_t) : i \to i' if P(\lambda(x_t) : i \to i') is higher than a threshold P_0 (0 < P_0 < 1); otherwise, we discard it and try the next single label change.
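To make the incremental bookkeeping concrete, the sketch below maintains the counts n_i, n_{ij}^q, h_1, h_0^q, h_2^q and evaluates the change \Delta S of the Jaccard-type measure (11) for a candidate single label change via (15)-(16), without rebuilding the contingency tables. The class and method names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def comb2(x):
    # Binomial coefficient C(x, 2), elementwise.
    return x * (x - 1) // 2

class IncrementalState:
    """Counts n_i, n_ij^q, h_1, h_0^q, h_2^q for a candidate partition lam
    against an ensemble of reference partitions (0-based labels, length n each)."""

    def __init__(self, lam, ensemble, k_star):
        self.lam = np.array(lam)
        self.ensemble = [np.asarray(e) for e in ensemble]
        self.n_i = np.bincount(self.lam, minlength=k_star)
        self.n_ij = []
        for e in self.ensemble:
            table = np.zeros((k_star, e.max() + 1), dtype=np.int64)
            np.add.at(table, (self.lam, e), 1)
            self.n_ij.append(table)
        self.h1 = comb2(self.n_i).sum()
        self.h0 = np.array([comb2(t).sum() for t in self.n_ij])
        self.h2 = np.array([comb2(np.bincount(e)).sum() for e in self.ensemble])

    def delta_jaccard(self, t, new_label):
        """Change in the mean Jaccard-type measure (11) if object t moves to
        new_label (assumed different from its current label)."""
        i, i2 = self.lam[t], new_label
        old = (self.h0 / (self.h1 + self.h2 - self.h0)).mean()
        h1_new = self.h1 + self.n_i[i2] - self.n_i[i] + 1                # Eq. (15)
        h0_new = self.h0.copy()
        for q, e in enumerate(self.ensemble):
            j = e[t]
            h0_new[q] += self.n_ij[q][i2, j] - self.n_ij[q][i, j] + 1    # Eq. (16)
        new = (h0_new / (h1_new + self.h2 - h0_new)).mean()
        return new - old
```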


Table 1: A fast simulated annealing algorithm for clustering ensemble.

Input:
1. A set of r partitions \Lambda = \{\lambda^q \mid q = 1, \ldots, r\}
2. The desired number of clusters k^*
3. The threshold for selecting a label change, P_0
4. The cooling ratio c (0 < c < 1)
Output: The consensus clustering \lambda^*
Process:
1. Select a candidate clustering \lambda, store the current best solution with \lambda^* = \lambda and S^* = S, and set the temperature T = T_0.
2. Start a loop with all objects set unvisited (v(t) = 0, t = 1, \ldots, n). Randomly select an unvisited object x_t, and change its label \lambda(x_t) to each of the other k^* - 1 labels in turn. If a label change is accepted by (17), store the current best solution with \lambda^* = \lambda and S^* = S only if S > S^*, set v(t) = 1, and go to a new unvisited object. If all possible label changes have been tried for x_t, also set v(t) = 1 and go to a new unvisited object. The loop stops when all objects have been visited.
3. Set T = c \cdot T, and go to step 2. If there is no label change during two successive loops, stop the algorithm and output \lambda^*.

Fig. 1. The artificial data set: half-rings. It is difficult for any centroid-based clustering algorithm.

In theory, we should only choose the label changes that increase the objective function (i.e. P(\lambda(x_t) : i \to i') = 1). However, this simple optimization scheme would lead to local convergence. To avoid local optima, we should also choose label changes that slightly decrease the current objective function (i.e. P_0 < P(\lambda(x_t) : i \to i') < 1). This augmented optimization scheme is similar to other simulated annealing algorithms. Here, we cannot choose one of the neighboring clusters to form the label change, because it is too difficult to define neighboring clusters when only multiple clusterings are provided; moreover, such a choice could not ensure an increase of the objective function. It should be noted that the current solution may become worse than a candidate solution at the initial stage of simulated annealing. However, as the temperature T becomes smaller (suppose T tends to zero), the function given by (17) tends to zero whenever \Delta S < 0, and we will thus discard the corresponding label change. Since at this stage we only select label changes with \Delta S > 0, the current solution cannot become worse. The complete description of our simulated annealing framework for clustering ensemble is summarized in Table 1. It should be noted that the time complexity of this framework is O(nk^*r). A sketch of the whole procedure is given below.
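The following is a minimal sketch of the procedure in Table 1, reusing the hypothetical IncrementalState class from the previous sketch and the SA-JI objective; the random initialization (rather than the EM initialization used in Section 5.2) and all names are illustrative assumptions.

```python
import numpy as np

def sa_consensus(ensemble, k_star, p0=0.85, c=0.99, rng=None):
    """Simulated annealing consensus clustering in the spirit of Table 1 (SA-JI variant)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(ensemble[0])
    state = IncrementalState(rng.integers(0, k_star, size=n), ensemble, k_star)
    s = (state.h0 / (state.h1 + state.h2 - state.h0)).mean()   # objective (8) with measure (11)
    best_lam, best_s = state.lam.copy(), s
    temperature = 0.1 * s                                       # T_0 = 0.1 * S_0 (Section 5.2)
    idle_loops = 0
    while idle_loops < 2:                                       # stop after two loops with no change
        changed = 0
        for t in rng.permutation(n):                            # visit each object once per loop
            for i2 in rng.permutation(k_star):
                i = state.lam[t]
                if i2 == i:
                    continue
                delta = state.delta_jaccard(t, i2)
                accept = 1.0 if delta > 0 else np.exp(delta / temperature)   # Eq. (17)
                if accept > p0:
                    # Commit the move: apply Eqs. (13)-(16) in place.
                    state.h1 += state.n_i[i2] - state.n_i[i] + 1
                    for q, e in enumerate(state.ensemble):
                        j = e[t]
                        state.h0[q] += state.n_ij[q][i2, j] - state.n_ij[q][i, j] + 1
                        state.n_ij[q][i, j] -= 1
                        state.n_ij[q][i2, j] += 1
                    state.n_i[i] -= 1
                    state.n_i[i2] += 1
                    state.lam[t] = i2
                    s += delta
                    changed += 1
                    if s > best_s:
                        best_s, best_lam = s, state.lam.copy()
                    break                                       # move on to the next object
        temperature *= c                                        # cooling step
        idle_loops = idle_loops + 1 if changed == 0 else 0
    return best_lam

# Example: consensus = sa_consensus([[0,0,1,1,1], [0,0,0,1,1], [1,1,0,0,0]], k_star=2)
```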

5. Experimental results

The experiments are conducted on four data sets where the true natural clusters are known, to validate both the accuracy and the robustness of consensus obtained via our simulated annealing framework. We also explore the data sets using three other consensus functions.

5.1. Data sets

The details of the four data sets are summarized in Table 2. The artificial data set, half-rings, is shown in Fig. 1; it is difficult for any centroid-based clustering algorithm. We also use three real-life data sets: two (iris and wine) from the UCI benchmark repository and one from the US Postal Service (USPS). Since the last feature of the wine data is far larger than the others, we first normalize the features into an interval of [0, 10]. As for USPS, we only consider the digits 0-4, just as in (Lawrence, 2004; Walder and Schölkopf, 2009), and then reduce the number of features to 20 via principal component analysis. The other two data sets stay unchanged.

The average clustering errors of the k-means algorithm over 20 independent runs on the four data sets are listed in Table 2 as the baselines for the consensus functions. Owing to the normalization of the wine data, the average clustering error of k-means decreases from 36.3% to 8.4% over 20 independent runs. We evaluate the performance of a clustering algorithm by matching the detected and the known partitions, as done in (Topchy et al., 2005). The best possible matching of clusters provides a measure of performance expressed as the misassignment rate. To determine the clustering error, one needs to solve the correspondence problem between the labels of the known and the derived clusters. The optimal correspondence can be obtained using the Hungarian method (Kuhn, 1955) for the minimal-weight bipartite matching problem, as sketched below.
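A minimal sketch of this evaluation protocol, assuming scipy is available (the function and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misassignment_rate(true_labels, pred_labels):
    """Clustering error after optimally matching predicted to true clusters (Hungarian method)."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    k_true, k_pred = true_labels.max() + 1, pred_labels.max() + 1
    # Contingency table: overlap between each predicted cluster and each true cluster.
    overlap = np.zeros((k_pred, k_true), dtype=np.int64)
    np.add.at(overlap, (pred_labels, true_labels), 1)
    # The Hungarian method maximizes the total matched overlap (minimizes -overlap).
    rows, cols = linear_sum_assignment(-overlap)
    matched = overlap[rows, cols].sum()
    return 1.0 - matched / true_labels.size

# Example: one object out of six ends up in the wrong cluster -> error of 1/6.
print(misassignment_rate([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 1]))
```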

Table 2: Details of the four data sets. The average clustering error is obtained by k-means.

Data set    | # features | k* | n    | Avg. error (%)
Half-rings  | 2          | 2  | 500  | 26.4
Iris        | 4          | 3  | 150  | 21.7
Wine        | 13         | 3  | 178  | 8.4
USPS        | 20         | 5  | 5427 | 32.9

5.2. Selection of parameters and algorithms

To implement our simulated annealing framework for clustering ensemble, we must select two important parameters: the threshold P_0 for selecting a label change and the cooling ratio c (0 < c < 1). When c takes a larger value, we may obtain a better solution, but the algorithm converges more slowly. Conversely, when P_0 is larger, the algorithm converges faster, but local optima are less likely to be avoided. To achieve a tradeoff between clustering accuracy and speed, we empirically set P_0 = 0.85 and c = 0.99 in all the experiments. Moreover, the temperature T is initialized by T = 0.1 S_0, where S_0 is the initial value of the objective function.

Our three simulated annealing methods (i.e. SA-RI, SA-JI, and SA-WI) for clustering combination are compared to three other consensus functions:

1. The EM algorithm for consensus clustering based on the mixture model (Topchy et al., 2005).
2. The QMI approach described in (Topchy et al., 2003), which is implemented by k-means in the space of specially transformed cluster labels of the given ensemble.
3. MCLA, a hypergraph method introduced in (Strehl and Ghosh, 2002).


Here, our methods are initialized via EM because this algorithm runs very fast; other consensus functions can also be used for initialization in the same way. Since the co-association methods have complexity O(n^2) and may lead to severe computational limitations, our methods are not compared to these algorithms; the performance of the co-association methods has previously been analyzed in (Topchy et al., 2003).

The k-means algorithm is used to generate the partitions for combination. Diversity of the partitions is ensured by (1) initializing the algorithm randomly and (2) selecting the number of clusters k randomly. In the experiments, we actually give k a random value near the number of true natural clusters k^* (k >= k^*). We have found that this method of generating partitions leads to better results than random initialization alone. Moreover, we vary the number of combined clusterings r in the range [10, 50]. A sketch of this partition-generation procedure is given below.
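A minimal sketch of this partition-generation step, assuming scikit-learn's KMeans is available; the exact k ranges per data set are those given in the captions of Tables 3-6, and the seeding scheme is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, k_range, r, seed=0):
    """Generate r diverse k-means partitions: random initialization and a random k per run."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(r):
        k = rng.integers(k_range[0], k_range[1] + 1)        # e.g. k in [3, 5] for half-rings
        km = KMeans(n_clusters=int(k), n_init=1, init="random",
                    random_state=int(rng.integers(1_000_000)))
        ensemble.append(km.fit_predict(X))
    return ensemble

# Example (hypothetical data):
# X = np.random.default_rng(0).normal(size=(500, 2))
# partitions = generate_ensemble(X, k_range=(3, 5), r=30)
# consensus = sa_consensus(partitions, k_star=2)
```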

5.3. Comparison with other consensus functions

The main results for each of the four data sets are presented in Tables 3-6.

Table 3: Average error rate (%) on the half-rings data set. The k-means algorithm randomly selects k in [3, 5] to generate r partitions for the different combination methods.

r  | SA-RI | SA-JI | SA-WI | EM   | QMI  | MCLA
10 | 19.7  | 21.9  | 20.3  | 26.4 | 25.7 | 24.6
20 | 17.7  | 19.8  | 19.8  | 24.4 | 25.3 | 19.9
30 | 17.4  | 18.4  | 19.9  | 26.9 | 24.6 | 24.9
40 | 17.6  | 19.0  | 17.6  | 27.5 | 25.9 | 23.5
50 | 19.5  | 18.8  | 17.4  | 28.5 | 26.6 | 21.7

We have also initialized our simulated annealing methods using other consensus functions besides EM, and found that similar results are obtained. All the tables report the average error rates (%) of clustering combination over 20 independent runs of the combination methods. It can be observed that our simulated annealing methods generally perform better than the other consensus functions. Among our three methods, SA-RI and SA-WI generally outperform SA-JI. It is worth noting that co-association methods are usually unreliable with r < 50, which is exactly the regime in which our methods operate. Moreover, we can see that both EM and QMI suffer from the local convergence problem. Although EM is used to initialize our methods, our methods can escape local optima through the simulated annealing optimization. Since our methods lead to slightly higher errors than MCLA only in a few cases, we consider our methods preferable in an overall evaluation.

Moreover, it is also interesting to note that, as expected, the average error of the consensus clustering obtained by our simulated annealing methods is lower than the average error of the k-means clusterings in the ensemble (Table 2) when k is chosen equal to the true number of clusters k^*. This verifies the effectiveness of clustering ensemble.

Finally, the average time taken by our three methods (Matlab code) is less than 30 s per run on a 1 GHz PC in all cases. As reported in (Strehl and Ghosh, 2002), experiments with n = 400, k = 10, r = 8 take on average one hour using the greedy algorithm based on normalized MI, which is similar to our methods but does not update its objective function incrementally. In contrast, our methods take only about 10 s in this case, owing to the incremental update given by (15)-(16). Any similar algorithm without an incremental update of its objective function could be used for this efficiency comparison, and we consistently obtain similar results.

Table 4: Average error rate (%) on the iris data set. The k-means algorithm randomly selects k in [3, 5] to generate r partitions for the different combination methods.

r  | SA-RI | SA-JI | SA-WI | EM   | QMI  | MCLA
10 | 10.6  | 10.7  | 10.6  | 12.3 | 14.3 | 10.4
20 | 10.6  | 10.7  | 10.9  | 17.5 | 14.8 | 10.6
30 | 10.7  | 10.7  | 10.7  | 18.1 | 12.3 | 10.5
40 | 10.6  | 10.7  | 10.7  | 16.6 | 13.9 | 10.7
50 | 10.6  | 10.7  | 10.7  | 26.9 | 12.6 | 10.7

Table 5: Average error rate (%) on the wine data set. The k-means algorithm randomly selects k in [4, 6] to generate r partitions for the different combination methods.

r  | SA-RI | SA-JI | SA-WI | EM   | QMI  | MCLA
10 | 6.5   | 6.7   | 6.5   | 17.1 | 8.8  | 7.6
20 | 6.5   | 6.5   | 6.4   | 17.9 | 10.4 | 8.5
30 | 6.4   | 6.3   | 6.3   | 13.1 | 7.5  | 7.4
40 | 6.3   | 6.3   | 6.2   | 17.2 | 7.4  | 7.5
50 | 6.3   | 6.2   | 6.2   | 21.1 | 7.3  | 7.8

Table 6: Average error rate (%) on the USPS data set. The k-means algorithm randomly selects k in [5, 7] to generate r partitions for the different combination methods.

r  | SA-RI | SA-JI | SA-WI | EM   | QMI  | MCLA
10 | 17.1  | 19.4  | 20.9  | 26.3 | 25.3 | 19.1
20 | 15.6  | 14.5  | 15.4  | 21.9 | 24.9 | 23.3
30 | 13.6  | 18.8  | 13.4  | 24.9 | 22.8 | 16.7
40 | 15.6  | 19.5  | 15.7  | 27.5 | 24.5 | 15.8
50 | 12.7  | 16.7  | 12.5  | 30.1 | 20.3 | 15.6

6. Conclusions

We have proposed a fast simulated annealing framework for combining multiple clusterings based on several measures for comparing clusterings. When the objective functions of clustering ensemble are specified in terms of the relationship of pairs of objects in the data set, we can formulate them in a form that allows an incremental update for each single label change. This provides a highly efficient (linear-complexity) optimization scheme based on simulated annealing, which makes the framework a computationally feasible solution for large data sets. Experimental evaluations on four data sets demonstrate that the proposed framework achieves superior results in terms of accuracy and efficiency over existing techniques.

It should be noted that the performance of the proposed framework is affected by (i) the measure for comparing clusterings used to define the objective function, and (ii) the optimization scheme used to find the consensus solution. The measures and optimization scheme currently used by the proposed framework achieve only relatively small improvements in some complex cases. To further improve the framework and make it more robust for challenging applications, we will test other measures and optimization schemes in future work.

Acknowledgements

The work described in this paper was fully supported by the National Natural Science Foundation of China under Grants No. 60873154 and 61073084, and the National Development and Reform Commission High-tech Program of China under Grant No. [2010]3044.


References

Denoeud, L., Guénoche, A., 2006. Comparison of distance indices between partitions. In: Proc. IFCS'2006: Data Science and Classification, pp. 21–28.
Dudoit, S., Fridlyand, J., 2003. Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19 (9), 1090–1099.
Figueiredo, M., Jain, A.K., 2002. Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Machine Intell. 24 (3), 381–396.
Fischer, R.B., Buhmann, J.M., 2003. Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans. Pattern Anal. Machine Intell. 25 (4), 513–518.
Fred, A., Jain, A.K., 2005. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Machine Intell. 27 (6), 835–850.
Frossyniotis, D., Likas, A., Stafylopatis, A., 2004. A clustering method based on boosting. Pattern Recognition Lett. 25 (6), 641–654.
Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classif. 2, 193–218.
Hull, J., 1994. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Machine Intell. 16 (5), 550–554.
Kuhn, H.W., 1955. The Hungarian method for the assignment problem. Naval Res. Log. Quart. 2, 83–97.


Lawrence, N., 2004. Gaussian process latent variable models for visualisation of high dimensional data. In: Advances in Neural Information Processing Systems, vol. 16, pp. 329–336.
Lu, Z., Peng, Y., Xiao, J., 2008. Unsupervised learning of finite mixtures using entropy regularization and its application to image segmentation. In: Proc. CVPR.
Strehl, A., Ghosh, J., 2002. Cluster ensembles – a knowledge reuse framework for combining partitionings. In: Proc. AAAI, pp. 93–99.
Topchy, A., Jain, A.K., Punch, W., 2003. Combining multiple weak clusterings. In: Proc. ICDM, pp. 331–338.
Topchy, A., Jain, A.K., Punch, W., 2005. Clustering ensembles: Models of consensus and weak partitions. IEEE Trans. Pattern Anal. Machine Intell. 27 (12), 1866–1881.
Tumer, K., Agogino, A.K., 2008. Ensemble clustering with voting active clusters. Pattern Recognition Lett. 29 (14), 1947–1953.
Unnikrishnan, R., Pantofaru, C., Hebert, M., 2007. Toward objective evaluation of image segmentation algorithms. IEEE Trans. Pattern Anal. Machine Intell. 29 (6), 929–944.
Walder, C., Schölkopf, B., 2009. Diffeomorphic dimensionality reduction. In: Advances in Neural Information Processing Systems, vol. 21, pp. 1713–1720.
Wallace, D.L., 1983. Comment on a method for comparing two hierarchical clusterings. J. Amer. Statist. Assoc. 78, 569–576.