
Unsupervised fuzzy-rough set-based dimensionality reduction

Neil Mac Parthaláin ⁎, Richard Jensen

Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, SY23 3DB, Wales, UK

Article info

Article history: Received 18 May 2012; Received in revised form 20 August 2012; Accepted 2 December 2012; Available online 14 December 2012.

Keywords: Rough set; Fuzzy set; Attribute reduction; Unsupervised feature selection; Unsupervised learning.

Abstract

Each year worldwide, more and more data is collected; indeed, it is estimated that the amount of data collected and stored at least doubles every 2 years. Of this data, a large percentage is unlabelled, or has labels which are incomplete or missing. Because this data is so large, it becomes very difficult for humans to manually assign labels to data objects. Additionally, many real-world application datasets, such as those in gene expression analysis and text classification, are also of large dimensionality. This further frustrates the process of label assignment for domain experts, as not all of the features are relevant or necessary in order to assign a given label. Hence, unsupervised feature selection is required. For supervised learning, feature selection algorithms attempt to maximise a given function of predictive accuracy. This function typically considers the ability of feature vectors to reflect decision class labels. However, for the unsupervised learning task, decision class labels are not provided, which poses questions such as: which features should be retained? In fact, not all features are important: some are irrelevant, redundant or noisy. In this paper, several unsupervised FS approaches are presented which are based on fuzzy-rough sets. These approaches require no thresholding information, are domain-independent, and can operate on real-valued data without the need for discretisation. They offer a significant reduction in dimensionality whilst retaining the semantics of the data, and can even result in supersets of the subsets found by the supervised fuzzy-rough approaches. The approaches are compared with some supervised techniques and are shown to retain useful features. © 2012 Elsevier Inc. All rights reserved.

⁎ Corresponding author. Tel.: +44 (0) 1970 622869; fax: +44 (0) 1970 628536. E-mail addresses: [email protected] (N. Mac Parthaláin), [email protected] (R. Jensen). http://dx.doi.org/10.1016/j.ins.2012.12.001

1. Introduction

It is often desirable to present as large a number of features as possible for any given domain, such that every single aspect of that domain is represented. The problem is that this can often result in the inclusion of many redundant or irrelevant features, which in turn may lead to poor results when using data mining tools for knowledge discovery. Feature selection (FS) [23] is a process which chooses a subset of the original features present in a given dataset that provides the most useful information; the most important information of the dataset should still remain following selection. In fact, efficient FS techniques should be able to detect and ignore noisy and misleading features, so the quality of the dataset may even increase following feature selection. Classification accuracy may be increased as a result of feature selection through the removal of noisy, irrelevant or redundant features. Also, in domains where features correspond to measurements (the water treatment plant in [36] demonstrates this well), fewer features offer advantages such as minimising the expense and time consumed in recording such measurements. For datasets which are smaller in size, the runtimes of learning algorithms can be improved significantly; this is equally applicable to both training and application (e.g. classification) phases. Where there are fewer dimensions,

N. Mac Parthaláin, R. Jensen / Information Sciences 229 (2013) 106–121


identification of trends and correlations within the data becomes easier [9]. This is particularly evident where a small number of features have an influence on data outcomes. Methods which extract knowledge from data (e.g. rule induction) may also benefit from the use of FS, and show an improvement in the readability of the discovered knowledge. When induction algorithms are applied to reduced data, the resulting rules are more compact. A good feature selection algorithm will remove unnecessary attributes which may affect both rule comprehension and rule prediction performance. In conventional supervised learning, FS methods evaluate various feature subsets using an evaluation function or metric to select only those features which are related to, or lead to, the decision classes of the data under consideration. However, as mentioned previously, for many data mining applications decision class labels are often unknown or incomplete, indicating the significance of unsupervised feature selection. Historically, there have been two approaches to unsupervised FS: the first set of approaches aims to maximise clustering performance using an index function, whilst the second set considers features for selection on the basis of dependency or relevance to one another. Unsupervised FS approaches which maximise a clustering performance function aim to improve the ability to group the data by removing features which do not contribute to effective clustering. Such methods include the sequential unsupervised feature selection algorithm [7], the expectation maximisation unsupervised wrapper [11], and maximum entropy based methods [3]. Recent examples of supervised FS which rely on dependence or feature similarity include the use of correlation coefficients [13], statistical redundancy [33] and Markov blankets [22], and indeed some of these have been adapted for unsupervised FS.
It should also be noted that traditional statistical methods such as principal component analysis (PCA) and linear discriminant analysis (LDA) can be used to perform unsupervised dimensionality reduction using the information content of features. However, such methods return a transformed set of features which is not a subset of the original features, thus destroying the underlying semantics or meaning of the data. The work on rough set theory (RST) [32] offers a formal methodology that can be employed to reduce the dimensionality of datasets, as a preprocessing step to assist any chosen modelling method for learning from data. It assists in identifying and selecting the most information-rich features in a dataset. This is achieved without transforming the data, whilst simultaneously attempting to minimise information loss during the selection process. In terms of computational effort, this approach is highly efficient, and is based on simple set operations, which makes it suitable as a preprocessor for techniques that are much more complex. In contrast to statistical correlation-reduction approaches [12], RST requires no human input or domain knowledge other than the given datasets. Perhaps most importantly, it retains the underlying semantics of the data, which results in data models that are more transparent to human scrutiny. The vast majority of work carried out in the area of rough sets and fuzzy-rough sets for FS has concentrated on supervised approaches, i.e. where the class labels are known; very little work has been done in the area of unsupervised learning employing fuzzy-rough sets. Indeed, the methods described in this paper are motivated by the domain independence and data-driven foundation of fuzzy-rough sets [18], as well as the ability of some of the approaches presented here to return sets of features which are supersets of those found by the supervised approaches. The remainder of this paper is structured as follows.
Section 2 summarises the related work and background of unsupervised FS. Next, the theoretical background and ideas of rough sets, and fuzzy rough sets are explored, as this forms the basis for the new approaches. Section 3 demonstrates the main contribution of the described approaches with a description of the algorithms proposed, as well as an examination of the computational complexity. Section 4 documents the results of applying the techniques to nine real-valued benchmark datasets. These results compare the methods in terms of the level of dimensionality reduction, classiﬁcation accuracy (using three different classiﬁer learners), and statistical signiﬁcance. Section 5 concludes the paper along with some suggestions for further development, and a discussion of future work.

2. Background

2.1. Related work

The task of unsupervised FS is somewhat ill-defined when compared to its supervised counterpart, and there are a number of differing opinions as to what its true goals or aims are. In [11] the task of unsupervised FS is defined as that of 'finding the smallest feature subset that best uncovers "interesting natural" groupings (clusters) from data according to the chosen criteria'. This definition is the one held in much early research on unsupervised FS [11,14], but it carries with it the intrinsic assumption that 'interesting and natural groupings' exist in the data, which of course may not be the case. Furthermore, any definition of 'interesting' is subjective and may not be related to any inherent characteristics of the data. Additionally, clustering may not be the final application for any discovered feature set. In [16] the authors are of the opinion that 'A fundamental goal of unsupervised feature selection is denoising, which aims to identify and reduce noisy features that are not discriminative'. Once again, however, a subjective assumption is made about which features are 'discriminative'. In [26] the authors regard unsupervised FS as a data pre-processing step for dimensionality reduction in order to select the most useful 'representative features'. However, in [23] the author points out that the goal of unsupervised FS may not only lie in improving clustering ability, but also in the discovery of important features. This would suggest that unsupervised FS has evolved in recent years, and is no longer viewed simply as a method to improve cluster groupings for unlabelled data.


As mentioned previously, there are generally two approaches to unsupervised feature selection. The first of these attempts to select features by maximising a clustering function, such that features which do not contribute to the clustering function are removed or not included during selection. These methods include the expectation maximisation unsupervised wrapper proposed in [11], which employs an unsupervised FS algorithm that is 'wrapped' around a clustering method such that features are evaluated and selected based on how well they cluster the data. This process is also known as biclustering [27]. The expectation maximisation algorithm is based on that proposed in [8]. As noted above, however, the authors make some subjective assumptions about 'interesting and natural' groupings of the data, which may or may not be present. Nevertheless, this wrapper approach seems to work well, although its focus is centred on the clustering problem. The authors of [30] employ a neuro-fuzzy approach to select features in an unsupervised manner: a fuzzy feature evaluation measure is used to assess subsets, which are then transformed, and a neural network is then used to select features from the transformed set. This approach transforms the original set of features, however, and therefore does not return a subset of those original features. Maximum entropy and maximum likelihood measure-based approaches are discussed in [3], but no algorithm is presented. A number of other approaches are also employed to perform unsupervised FS for clustering optimisation, such as GAs [19,28], Self-Organising Maps [21], and Naive Bayes [37]. The other approach to unsupervised FS selects features based on redundancy: if a single feature or group of features carries no more information than that possessed by the rest of the dataset, it can be removed without any resulting loss of information.
There are a number of methods which adopt this strategy, and as it is not specific to the task of clustering it may be seen as a more general approach to unsupervised FS. Indeed, various measures such as [1,6,13,15] can be employed to discover features that are statistically redundant or irrelevant. The method proposed in [26] clusters features through the use of a pairwise feature metric which the authors term the 'maximal information compression index'. This metric is based on the linear dependence of a pair of features, and an algorithm is built around this comparison. As features are clustered using this comparison, the authors argue that, since there is no search involved, the complexity of the employed method is low. The algorithm works by partitioning the original feature set into distinct subsets or clusters so that the features within a subset are similar and those in different clusters are dissimilar. A single feature from each cluster is then selected as a prototype for inclusion in the reduced feature subset.

2.2. Rough set attribute reduction

The principal focus of this paper lies in unsupervised feature selection using fuzzy-rough sets. An in-depth appraisal of the rough set methodology is necessary, however, in order to appreciate the merits of the fuzzy-rough technique and its application to the unsupervised learning domain. At the heart of the rough set attribute reduction (RSAR) approach is the concept of indiscernibility [32]. Let I = (U, S) be an information system, where U is a non-empty set of finite objects (the universe of discourse) and S is a non-empty finite set of attributes such that a : U → V_a for every a ∈ S; V_a is the set of values that a can take. For any P ⊆ S, there exists an associated equivalence relation IND(P):

IND(P) = {(x, y) ∈ U² | ∀a ∈ P, a(x) = a(y)}   (1)

The partition generated by IND(P) is denoted U/IND(P) and is calculated as follows:

U/IND(P) = ⊗{U/IND({a}) : a ∈ P}   (2)

where

S ⊗ T = {X ∩ Y : ∀X ∈ S, ∀Y ∈ T, X ∩ Y ≠ ∅}   (3)

If (x, y) ∈ IND(P), then x and y are indiscernible by attributes from P. The equivalence classes of the P-indiscernibility relation are denoted [x]_P. Let X ⊆ U. X can be approximated using only the information contained in P by constructing the P-lower and P-upper approximations of X:

P̲X = {x : [x]_P ⊆ X}   (4)

P̄X = {x : [x]_P ∩ X ≠ ∅}   (5)
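The crisp approximations above reduce to simple set operations over the partition U/IND(P). A minimal Python sketch, assuming a toy information system stored as a dict (the helper names `partition`, `lower_approx` and `upper_approx` are illustrative, not from the paper):

```python
# A toy information system: object -> {attribute: value}.
# Helper names are illustrative, not from the paper.

def partition(objects, table, P):
    """Group objects into the equivalence classes of IND(P)."""
    classes = {}
    for x in objects:
        key = tuple(table[x][a] for a in P)
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

def lower_approx(objects, table, P, X):
    """P-lower approximation: objects whose class [x]_P lies inside X."""
    return {x for E in partition(objects, table, P) if E <= X for x in E}

def upper_approx(objects, table, P, X):
    """P-upper approximation: objects whose class [x]_P intersects X."""
    return {x for E in partition(objects, table, P) if E & X for x in E}

table = {0: {'a': 0, 'b': 1}, 1: {'a': 0, 'b': 1},
         2: {'a': 1, 'b': 0}, 3: {'a': 1, 'b': 1}}
U = set(table)
X = {0, 1, 2}
print(lower_approx(U, table, ['a'], X))  # {0, 1}: the class {2, 3} is not inside X
print(upper_approx(U, table, ['a'], X))  # {0, 1, 2, 3}: both classes touch X
```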

Let P and Q be equivalence relations over U, then the concepts of the positive, negative and boundary regions can be deﬁned:

POS_P(Q) = ⋃_{X∈U/Q} P̲X   (6)

NEG_P(Q) = U − ⋃_{X∈U/Q} P̄X   (7)

BND_P(Q) = ⋃_{X∈U/Q} P̄X − ⋃_{X∈U/Q} P̲X   (8)


By employing this definition of the positive region, it is possible to calculate the rough set degree of dependency of a set of attributes Q on a set of attributes P. For P, Q ⊆ S, it can be said that Q depends on P in a degree k (0 ≤ k ≤ 1); the higher the value of k, the more dependent Q is upon P. This is denoted P ⇒_k Q if:

k = γ_P(Q) = |POS_P(Q)| / |U|   (9)
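Equations (6) and (9) can be sketched in the same way: the positive region collects every P-class that falls wholly inside some Q-concept, and the dependency degree is its relative size. A hedged Python illustration on a toy table (names and data are illustrative):

```python
# Positive region and dependency degree on a toy decision table.
# Names and data are illustrative, not from the paper.

def partition(objects, table, P):
    classes = {}
    for x in objects:
        classes.setdefault(tuple(table[x][a] for a in P), set()).add(x)
    return list(classes.values())

def positive_region(objects, table, P, Q):
    """POS_P(Q): union of P-classes contained in some Q-concept (Eq. 6)."""
    pos = set()
    for X in partition(objects, table, Q):
        for E in partition(objects, table, P):
            if E <= X:
                pos |= E
    return pos

def gamma(objects, table, P, Q):
    """k = |POS_P(Q)| / |U|   (Eq. 9)."""
    return len(positive_region(objects, table, P, Q)) / len(objects)

table = {0: {'a': 0, 'b': 0, 'd': 'yes'}, 1: {'a': 0, 'b': 1, 'd': 'no'},
         2: {'a': 1, 'b': 0, 'd': 'no'}, 3: {'a': 1, 'b': 1, 'd': 'no'}}
U = set(table)
print(gamma(U, table, ['a'], ['d']))       # 0.5: only the class {2, 3} is pure
print(gamma(U, table, ['a', 'b'], ['d']))  # 1.0: {a, b} fully determines d
```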

The reduction of attributes can be achieved through the comparison of equivalence relations generated by sets of attributes. Attributes are removed such that the reduced set provides the same predictive capability for the decision feature or features as the original, unreduced set (assuming, of course, that the dataset is consistent). A reduct is a minimal set of attributes B ⊆ A such that IND(B) = IND(A). In other words, a reduct is a minimal set of attributes from A that preserves the partitioning of the universe, and hence the ability to perform classifications, as the whole attribute set A does. The QUICKREDUCT algorithm shown in Fig. 1 searches for a minimal subset without exhaustively generating all possible subsets. The search begins with an empty subset; attributes which result in the greatest increase in the rough set dependency value are added iteratively. This process continues until the search produces the maximum possible dependency value for that dataset (γ_C(D)). Note that this type of search does not guarantee a minimal subset and may only discover a local optimum.

2.2.1. Fuzzy-rough approaches

Other hybrid approaches, such as rough-fuzzy [31] and fuzzy-rough sets [17], have been proposed in order to improve the ability to deal with uncertainty and vagueness present in data. A fuzzy-rough set [10,40] is defined by two fuzzy sets, the fuzzy lower and upper approximations, obtained by extending the corresponding crisp rough set notions. In the crisp case, elements that belong to the lower approximation (i.e. have a membership of 1) are said to belong to the approximated set with absolute certainty. In the fuzzy-rough case, elements may have a membership in the range [0,1], allowing greater flexibility in handling uncertainty. Fuzzy-rough sets encapsulate the related but distinct concepts of vagueness (for fuzzy sets) and indiscernibility (for rough sets), both of which occur as a result of uncertainty in knowledge.
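The greedy QUICKREDUCT search of Fig. 1 can be sketched as follows, assuming some dependency function `gamma(subset) -> float` is supplied; the toy scores below are purely illustrative, not the paper's data:

```python
# Greedy hill-climbing in the style of QUICKREDUCT. `gamma` is any
# dependency function mapping a list of attributes to [0, 1]; the toy
# scores below are illustrative assumptions, not the paper's data.

def quickreduct(attributes, gamma):
    reduct, best = [], 0.0
    target = gamma(list(attributes))            # gamma_C(D), the maximum
    while best < target:
        gains = {a: gamma(reduct + [a]) for a in attributes if a not in reduct}
        a_best = max(gains, key=gains.get)
        if gains[a_best] <= best:               # no further improvement
            break
        reduct.append(a_best)
        best = gains[a_best]
    return reduct

# Toy monotonic dependency: attribute 'c' alone is fully predictive.
scores = {frozenset(): 0.0, frozenset('a'): 0.2,
          frozenset('b'): 0.3, frozenset('c'): 1.0}
g = lambda S: max(v for k, v in scores.items() if k <= frozenset(S))
print(quickreduct('abc', g))  # ['c']
```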
Definitions for the fuzzy lower and upper approximations can be found in [35], where a T-transitive fuzzy similarity relation is used to approximate a fuzzy concept X:

μ_{R_P↓X}(x) = inf_{y∈U} I(μ_{R_P}(x, y), μ_X(y))   (10)

μ_{R_P↑X}(x) = sup_{y∈U} T(μ_{R_P}(x, y), μ_X(y))   (11)

Here, I is a fuzzy implicator and T a t-norm. A fuzzy implicator is any [0, 1]² → [0, 1] mapping I satisfying I(0, 0) = 1 and I(1, x) = x for all x in [0, 1]. R_P is the fuzzy similarity relation induced by the subset of features P:

μ_{R_P}(x, y) = T_{a∈P} {μ_{R_a}(x, y)}   (12)

μ_{R_a}(x, y) is the degree to which objects x and y are similar for feature a, and may be defined in many ways, for example:

μ_{R_a}(x, y) = 1 − |a(x) − a(y)| / |a_max − a_min|   (13)

μ_{R_a}(x, y) = max(min((a(y) − (a(x) − σ_a)) / σ_a, ((a(x) + σ_a) − a(y)) / σ_a), 0)   (14)

Fig. 1. The QUICKREDUCT algorithm.

where σ_a² is the variance of feature a. As these relations do not necessarily display T-transitivity, the fuzzy transitive closure can be computed for each attribute. The choice of relation is largely determined by the intended application. For feature selection, a relation such as (14) may be appropriate as this permits only small differences between attribute values of differing objects. For classification tasks, a more gradual and inclusive relation such as (13) should be used. In a similar way to the original crisp rough set approach, the fuzzy positive region [18] can be defined as:

μ_{POS_P(D)}(x) = sup_{X∈U/D} μ_{R_P↓X}(x)   (15)

An important issue in data analysis is discovering dependencies between attributes. The fuzzy-rough degree of dependency of D on the attribute subset P can be defined in the following way:

γ′_P(D) = (Σ_{x∈U} μ_{POS_P(D)}(x)) / |U|   (16)
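A minimal sketch of Eqs. (10), (15) and (16), using the Łukasiewicz implicator I(x, y) = min(1 − x + y, 1); the similarity matrix and crisp decision classes below are illustrative toy values, not taken from the paper:

```python
# Fuzzy lower approximation, fuzzy positive region and dependency degree
# (Eqs. 10, 15 and 16) with the Lukasiewicz implicator. The similarity
# matrix R_P and crisp decision classes D are illustrative toy values.

def implicator(x, y):
    """Lukasiewicz implicator I(x, y) = min(1 - x + y, 1)."""
    return min(1.0 - x + y, 1.0)

def lower_approx(R, mu_X, x):
    """mu_{R_P down X}(x) = inf_y I(R(x, y), mu_X(y))   (Eq. 10)."""
    return min(implicator(R[x][y], mu_X[y]) for y in range(len(R)))

def dependency(R, decision_classes):
    """gamma'_P(D) = sum_x mu_POS(x) / |U|   (Eqs. 15 and 16)."""
    pos = [max(lower_approx(R, mu_X, x) for mu_X in decision_classes)
           for x in range(len(R))]
    return sum(pos) / len(R)

R_P = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
D = [[1.0, 1.0, 0.0],   # crisp decision class {0, 1}
     [0.0, 0.0, 1.0]]   # crisp decision class {2}
print(dependency(R_P, D))  # about 0.833
```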

A fuzzy-rough reduct R can be defined as a minimal subset of features that preserves the dependency degree of the entire dataset, i.e. γ′_R(D) = γ′_C(D). Based on this, a fuzzy-rough greedy hill-climbing algorithm can be constructed that uses Eq. (16) to gauge subset quality. In [18], it has been shown that the dependency function is monotonic and that fuzzy discernibility matrices may also be used to discover reducts.

3. Unsupervised fuzzy-rough feature selection

This section sets out the formal definitions, approaches and algorithms for unsupervised feature selection using fuzzy-rough sets. Two approaches are examined in this paper: one utilises a fuzzy-rough dependency measure to eliminate redundant features in a backward-elimination-type search strategy [24]; the other utilises fuzzy-rough discernibility to eliminate irrelevant features using a number of different search techniques.

3.1. Dependency measure

The discovery of dependencies between attributes is, in general, an important issue in data analysis. Intuitively, a set of attributes Q depends totally on a set of attributes P, denoted P → Q, if all attribute values from Q are uniquely determined by values of attributes from P. The motivation behind the approach described in this section is that, as with supervised fuzzy-rough FS [18], the fuzzy dependency measure can also be used to discover the interdependency of features. This can be achieved by substituting the decision feature(s) D of the supervised approach for any given feature or group of features Q such that

γ′_P(Q) = (Σ_{x∈U} μ_{POS_{R_P}(Q)}(x)) / |U|   (17)

where P ∩ Q = ∅ and

μ_{POS_{R_P}(Q)}(x) = sup_{z∈U} μ_{R_P↓R_Q z}(x)   (18)

Here, R_Q z indicates the fuzzy tolerance class (or fuzzy equivalence class) for object z. The lower approximation becomes:

μ_{R_P↓R_Q z}(x) = inf_{y∈U} I(μ_{R_P}(x, y), μ_{R_Q}(y, z))   (19)
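Combining Eqs. (17)–(19), the unsupervised dependency of one feature subset on another can be sketched directly from the two similarity matrices; a hedged Python illustration with toy matrices (not the paper's data):

```python
# Unsupervised dependency of subset P on subset Q (Eqs. 17-19): the
# decision concepts are replaced by the fuzzy tolerance classes R_Q z.
# Similarity matrices are illustrative toy values, not the paper's.

def implicator(x, y):
    return min(1.0 - x + y, 1.0)   # Lukasiewicz implicator

def u_dependency(R_P, R_Q):
    """gamma'_P(Q) = (1/|U|) sum_x sup_z inf_y I(R_P(x,y), R_Q(y,z))."""
    n = len(R_P)
    total = 0.0
    for x in range(n):
        total += max(                         # sup over classes R_Q z
            min(implicator(R_P[x][y], R_Q[y][z]) for y in range(n))
            for z in range(n))
    return total / n

R_a = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0]]
R_b = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.2], [0.0, 0.2, 1.0]]
print(u_dependency(R_a, R_b))  # how well the granules of a predict those of b
```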

3.1.1. Finding reductions

For the supervised approach, search is conducted within P(C), the set of all possible subsets of the conditional feature set. For the unsupervised approach, however, search is performed within P(C) × P(C), since any subset can be compared with any other subset when searching for reductions. This is a vastly more complex space in which to search. For the purposes of this paper, a linear backward search is employed that achieves reasonable reductions in a short space of time. In order to avoid local minima and the selection of features in the order in which they appear in the dataset (a problem which affected the results in [24]), an initial ordering of the features is conducted using their fuzzy entropy scores. The fuzzy entropy for a fuzzy subset F_i can be defined as:

H(F_i) = −Σ_{D∈U/D} p(D|F_i) log₂ p(D|F_i)   (20)

where p(D|F_i) is the relative frequency of the fuzzy subset F_i of attribute a with respect to the other features in the dataset Q, and is defined:

p(D|F_i) = |D ∩ F_i| / |F_i|   (21)

The cardinality of a fuzzy set is denoted by |·|. Based on these definitions, the fuzzy entropy for an attribute subset R is defined as follows:

E(R) = Σ_{F_i∈U/R} (|F_i| / Σ_{Y_i∈U/R} |Y_i|) H(F_i)   (22)
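Equations (20)–(22) can be illustrated with crisp granules, where the fuzzy cardinality |·| reduces to a plain set size (a simplifying assumption made only for this sketch):

```python
# Fuzzy entropy ordering (Eqs. 20-22), illustrated with crisp granules:
# len() stands in for the fuzzy cardinality |.| of the paper.

from math import log2

def granule_entropy(block, concepts):
    """H(F_i) = -sum_D p(D|F_i) log2 p(D|F_i)   (Eqs. 20 and 21)."""
    h = 0.0
    for D in concepts:
        p = len(D & block) / len(block)
        if p > 0:
            h -= p * log2(p)
    return h

def subset_entropy(blocks, concepts):
    """E(R) = sum_i (|F_i| / sum_j |Y_j|) H(F_i)   (Eq. 22)."""
    total = sum(len(b) for b in blocks)
    return sum(len(b) / total * granule_entropy(b, concepts) for b in blocks)

blocks = [{0, 1}, {2, 3}]      # U/R: granules induced by one feature
concepts = [{0, 2}, {1, 3}]    # granules of the remaining features
print(subset_entropy(blocks, concepts))                    # 1.0: maximally mixed
print(subset_entropy([{0, 1}, {2, 3}], [{0, 1}, {2, 3}]))  # 0.0: perfectly aligned
```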

The algorithm (Fig. 2) starts by considering all of the features contained in the dataset. It then reorders all of the features according to their individual fuzzy entropy values as described previously. The removal of each feature is then examined iteratively, and the corresponding measure M(R, {x}) is calculated. This measure is simply termed M here, as a number of different measures can be used – see Sections 3.2 and 3.3. If M is unaffected, then the feature can be removed (R ← R ∪ {x}). This process continues until all features in C have been examined. If no interdependency exists, the algorithm will return the full set of features. The complexity of the search in the worst case is O(n), where n is the number of original features.

3.1.2. Worked example

To illustrate the ideas discussed in previous sections, the small dataset shown in Table 1 is used. It is important to note that only the basic concepts are illustrated here; in the interest of brevity, tasks such as the ordering of the features are not described. For this particular example, the Łukasiewicz t-norm (max(x + y − 1, 0)) and the Łukasiewicz fuzzy implicator (min(1 − x + y, 1)) are adopted to implement the fuzzy connectives. Other interpretations may also be used. Using the fuzzy similarity measure defined in (13), the resulting relations for each feature in the dataset are shown in Table 2. Initially, the lower approximations of the concepts of a given feature must be computed for each of the other features in the dataset. These are then used to calculate the dependency degree. For the example dataset, consider the dependency of the feature b on the feature a:

μ_{R_{a}↓R_u b}(x) = inf_{y∈U} I(μ_{R_a}(x, y), μ_{R_u b}(y))   (23)

Thus, for a particular instance, where object x = 2 and u = 2, this is (as highlighted in Table 2):

μ_{R_{a}↓R_2 b}(2) = inf{I(0.83, 1), I(1, 1), I(0.33, 0), I(0.83, 0.5), I(1, 0)} = 0

and for the remaining objects regarding a (i.e. u ∈ {1, 3, 4, 5}) this is:

μ_{R_{a}↓R_1 b}(2) = inf{I(1, 1), I(0.83, 1), I(0, 0), I(0.5, 0.5), I(0.83, 0)} = 0.0
μ_{R_{a}↓R_3 b}(2) = inf{I(0, 1), I(0.33, 1), I(1, 0), I(0.5, 0.5), I(0.33, 0)} = 0.0
μ_{R_{a}↓R_4 b}(2) = inf{I(0.5, 1), I(0.83, 1), I(0.5, 0), I(1, 0.5), I(0.83, 0)} = 0.17
μ_{R_{a}↓R_5 b}(2) = inf{I(0.83, 1), I(1, 1), I(0.33, 0), I(0.83, 0.5), I(1, 0)} = 0.0

This process is repeated for every object regarding b in order to calculate the remaining lower approximations for each object. These values can then be used to calculate the positive regions:

μ_{POS_{R_{a}}({b})}(1) = 0.50
μ_{POS_{R_{a}}({b})}(2) = 0.50
μ_{POS_{R_{a}}({b})}(3) = 0.67
μ_{POS_{R_{a}}({b})}(4) = 0.67
μ_{POS_{R_{a}}({b})}(5) = 0.67

Therefore, the resulting dependency degree is:

γ′_{{a}}({b}) = (Σ_{x∈U} μ_{POS_{R_{a}}({b})}(x)) / |U| = 3.01 / 5 = 0.602

Fig. 2. The UFRQUICKREDUCT algorithm.

Table 1. Example dataset.

Object   a     b     c     d
1        0.0   0.1   0.1   0.5
2        0.2   0.1   0.6   0.9
3        0.6   0.7   0.3   0.9
4        0.3   0.4   0.8   0.6
5        0.2   0.7   0.9   0.2

Table 2. Fuzzy similarity relations.

R_a(x, y):
1.0   0.83  0.0   0.50  0.83
0.83  1.0   0.33  0.83  1.0
0.0   0.33  1.0   0.50  0.33
0.50  0.83  0.50  1.0   0.83
0.83  1.0   0.33  0.83  1.0

R_b(x, y):
1.0   1.0   0.0   0.50  0.0
1.0   1.0   0.0   0.50  0.0
0.0   0.0   1.0   0.50  1.0
0.50  0.50  0.50  1.0   0.50
0.0   0.0   1.0   0.50  1.0

R_c(x, y):
1.0    0.375  0.75   0.125  0.0
0.375  1.0    0.625  0.75   0.625
0.75   0.625  1.0    0.375  0.25
0.125  0.75   0.375  1.0    0.375
0.0    0.625  0.25   0.375  1.0

R_d(x, y):
1.0    0.429  0.429  0.857  0.572
0.429  1.0    1.0    0.572  0.0
0.429  1.0    1.0    0.572  0.0
0.857  0.572  0.572  1.0    0.429
0.572  0.0    0.0    0.429  1.0

In the interests of brevity, only the computation of the dependency of feature b upon feature a is illustrated here. However, in the actual implementation of the UFRFS algorithm, the first step is to consider the dependency of {a} on the subset {b, c, d}. For the example dataset this leads to the following result:

γ′_{{b,c,d}}({a}) = 1.0      (T = {b, c, d})
γ′_{{c,d}}({b}) = 0.9569     (T = {c, d})
γ′_{{b,d}}({c}) = 1.0        (T = {d})
γ′_{{b}}({d}) = 0.2          (T = ∅)

Note that each time γ′ = 1, the feature in question is eliminated, resulting in the final subset {b, d} after all features have been examined.
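The similarity relation (13) used throughout this example can be sketched as below for feature a of Table 1; note that, depending on the normalisation adopted for |a_max − a_min|, the resulting entries may differ from the values printed in Table 2, so this is an illustration of the formula rather than a reproduction of the paper's matrices:

```python
# Fuzzy similarity of Eq. (13) for feature a of Table 1. The entries
# depend on the normalisation chosen for |a_max - a_min|, so they may
# differ from the values shown in Table 2.

def similarity_matrix(values):
    rng = max(values) - min(values)
    return [[1.0 - abs(x - y) / rng for y in values] for x in values]

a = [0.0, 0.2, 0.6, 0.3, 0.2]   # feature a of Table 1
R_a = similarity_matrix(a)
print(R_a[0][0])   # 1.0: every object is fully similar to itself
print(R_a[1][4])   # 1.0: objects 2 and 5 share the value 0.2
```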

3.2. Boundary region measure

Most approaches to crisp rough set FS, and all approaches to fuzzy-rough FS, use only the lower approximation for the evaluation of feature subsets. The lower approximation contains information regarding the extent of certainty of object membership to a given concept. However, the upper approximation contains information regarding the degree of uncertainty of objects, and hence this information can be used to discriminate between subsets. For example, two subsets may result in the same lower approximation, but one subset may produce a smaller upper approximation. This subset will be more useful, as there is less uncertainty concerning objects within the boundary region (the difference between the upper and lower approximations). The fuzzy-rough boundary region for a fuzzy tolerance class R_Q z may thus be defined:

μ_{BND_P(R_Q z)}(x) = μ_{R_P↑R_Q z}(x) − μ_{R_P↓R_Q z}(x)   (24)

with the upper approximation deﬁned as:

μ_{R_P↑R_Q z}(x) = sup_{y∈U} T(μ_{R_P}(x, y), μ_{R_Q}(y, z))   (25)

As the search for an optimal subset progresses, the object memberships to the boundary region diminish until a minimum is achieved. From this, the total certainty degree given a feature subset P is deﬁned as:

λ_P(Q) = 1 − (Σ_{z∈U} Σ_{x∈U} μ_{BND_{R_P}(R_Q z)}(x)) / |U|²   (26)

It is this measure, λ, that can be used to guide an unsupervised subset selection process.
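The certainty measure of Eqs. (24)–(26) can be sketched by computing, for every tolerance class R_Q z, the gap between upper and lower approximations; Łukasiewicz connectives and toy similarity matrices are assumed here:

```python
# Boundary-region certainty measure of Eqs. (24)-(26): lambda penalises
# the total mass of the gap between upper and lower approximations.
# Lukasiewicz connectives; similarity matrices are illustrative.

def t_norm(x, y):
    return max(x + y - 1.0, 0.0)     # Lukasiewicz t-norm

def implicator(x, y):
    return min(1.0 - x + y, 1.0)     # Lukasiewicz implicator

def certainty(R_P, R_Q):
    """lambda_P(Q) = 1 - sum_z sum_x mu_BND(R_Q z)(x) / |U|^2   (Eq. 26)."""
    n = len(R_P)
    bnd = 0.0
    for z in range(n):
        for x in range(n):
            lower = min(implicator(R_P[x][y], R_Q[y][z]) for y in range(n))
            upper = max(t_norm(R_P[x][y], R_Q[y][z]) for y in range(n))
            bnd += upper - lower     # boundary membership (Eqs. 24 and 25)
    return 1.0 - bnd / (n * n)

R_a = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0]]
print(certainty(R_a, R_a))  # close to 1.0: a is almost fully self-certain
```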


3.3. Discernibility measure

There are two main branches of research in crisp rough set-based FS: those based on the dependency degree, and those based on discernibility matrices and functions (as mentioned previously). It is therefore a natural progression to extend concepts in the latter branch to the fuzzy-rough domain [5]. The fuzzy tolerance relations that represent objects' approximate equality can be used to extend the classical discernibility function. For each combination of features P, a value is obtained indicating how well these attributes maintain the discernibility, relative to another subset of features Q, between all objects.

f(P, Q) = T_{1≤i<j≤|U|} (c_ij(P, Q))   (27)

with

c_ij(P, Q) = I(T_{a∈P} (μ_{R_a}(x_i, x_j)), μ_{R_Q}(x_i, x_j))   (28)

Alternatively, rather than taking a minimum operation in Eq. (27), one can also consider the average over all object pairs, i.e.

g(P, Q) = (2 Σ_{1≤i<j≤|U|} c_ij(P, Q)) / (|U|(|U| − 1))   (29)
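A hedged sketch of Eqs. (27)–(29): c_ij compares the similarity retained by subset P with that of subset Q for each object pair, f combines the c_ij with a t-norm, and g averages them; the per-feature similarity matrices are illustrative assumptions:

```python
# Discernibility measures of Eqs. (27)-(29). c_ij compares the
# similarity kept by P against that of Q for each object pair; f
# combines pairs with a t-norm, g averages them. Toy matrices.

from itertools import combinations
from functools import reduce

def t_norm(x, y):
    return max(x + y - 1.0, 0.0)

def implicator(x, y):
    return min(1.0 - x + y, 1.0)

def c_ij(P_rels, R_Q, i, j):
    """c_ij(P, Q) = I(T_{a in P} mu_Ra(xi, xj), mu_RQ(xi, xj))   (Eq. 28)."""
    sim_P = reduce(t_norm, (R[i][j] for R in P_rels))
    return implicator(sim_P, R_Q[i][j])

def f_measure(P_rels, R_Q, n):
    """f(P, Q): t-norm over all object pairs   (Eq. 27)."""
    return reduce(t_norm, (c_ij(P_rels, R_Q, i, j)
                           for i, j in combinations(range(n), 2)))

def g_measure(P_rels, R_Q, n):
    """g(P, Q): average over all object pairs   (Eq. 29)."""
    pairs = list(combinations(range(n), 2))
    return sum(c_ij(P_rels, R_Q, i, j) for i, j in pairs) / len(pairs)

R_a = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0]]
R_b = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]]
print(f_measure([R_a], R_b, 3))  # about 0.9
print(g_measure([R_a], R_b, 3))  # about 0.967: less rigid than f
```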

This measure is less rigid than Eq. (27), which produces the value 0 as soon as one of the c_ij equals 0.

3.4. Alternative fuzzy discernibility approach

In contrast to the previously described method for unsupervised fuzzy-rough feature selection, the method presented in this section (abbreviated to dUFRFS) is based on fuzzy indiscernibility. This unsupervised FS approach essentially views each data object as belonging to its own unique decision class (i.e. no other objects belong to this class). This can be seen as the extreme case of the approach that supervised methods take, which aim to find a subset of features such that objects in different decision classes are maximally discernible. Hence, (super) reducts found using this approach will also be fuzzy-rough reducts. More formally, the fuzzy-rough positive region becomes:

μ_{POS_P}(x) = inf_{y∈U, y≠x} I(μ_{R_P}(x, y), 0)   (30)

As each x belongs to its own individual decision class, all μ_X(y) are zero in the original definition of the fuzzy lower approximation. The resulting dependency function is:

γ′_P = (Σ_{x∈U} μ_{POS_P}(x)) / |U|   (31)

Equivalently, there is a normalised version that is maximised when each object belongs to the positive region to the same extent as it belongs to the positive region when considering all conditional features:

γ′_P = (1/|U|) Σ_{x∈U} μ_{POS_P}(x) / μ_{POS_C}(x)   (32)
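The discernibility-based positive region and its normalised dependency, Eqs. (30)–(32), can be sketched as follows; the 1/|U| normalisation and the toy similarity matrices are assumptions of this illustration:

```python
# Discernibility-based positive region (Eq. 30) and a normalised
# dependency in the spirit of Eq. (32); each object forms its own
# decision class. The 1/|U| normalisation and the toy matrices are
# assumptions of this sketch.

def implicator(x, y):
    return min(1.0 - x + y, 1.0)     # Lukasiewicz implicator

def pos(R, x):
    """mu_POS_P(x) = inf_{y != x} I(R_P(x, y), 0)   (Eq. 30)."""
    return min(implicator(R[x][y], 0.0) for y in range(len(R)) if y != x)

def norm_dependency(R_P, R_C):
    """Ratio of each object's positive-region membership under P to
    that under the full conditional feature set C, averaged over U."""
    n = len(R_P)
    return sum(pos(R_P, x) / pos(R_C, x) for x in range(n)) / n

R_P = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]]
R_C = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.3], [0.1, 0.3, 1.0]]
print(norm_dependency(R_C, R_C))  # 1.0: P discerns exactly as well as C
print(norm_dependency(R_P, R_C))  # below 1.0: P separates objects less sharply
```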

Note that the complexity of this measure is less than that of the one considered in the previous section, as this method does not consider each instance as a representative fuzzy decision class, only that each object should be discernible from all other objects.

3.4.1. Finding reductions

In contrast to the previous unsupervised approach, a number of search methods can be employed for the discovery of subsets. In fact, any of the methods used for finding fuzzy-rough reducts can be used for this approach. The dependency measure is monotonic and, using the normalised dependency, will reach the value 1.0 when a (super) reduct has been discovered. For the work discussed in this paper, a greedy-type forward selection is adopted, although other types of search are equally applicable.

3.5. Relation to supervised fuzzy-rough feature selection

It is interesting to note that the fuzzy discernibility approach attempts only to preserve the consistency of the data. However, if the supervised fuzzy-rough FS approach is examined in detail, it can be seen that there are some parallels between


these two approaches, both conceptually and formally. Indeed, it was discovered by empirical evaluation that the subsets found by the unsupervised fuzzy-discernibility approach were fuzzy-rough reducts of the supervised approach. This prompted a more formal investigation into the properties of the approach, which is outlined in the following proof. In the supervised fuzzy-rough FS case it can be shown that γ′_P(Q) ≤ g(P, Q).

Proof.

γ′_P(Q) = (Σ_{y∈X} (R_P↓R_d y)(y)) / n
        = (Σ_{y∈X} inf_{x∈X} I(R_P(x, y), R_d(x, y))) / n
        = (Σ_{1≤j≤n} inf_{x∈X} I(R_P(x, x_j), R_d(x, x_j))) / n
        ≤ (2 Σ_{1≤i<j≤n} I(R_P(x_i, x_j), R_d(x_i, x_j))) / (n(n − 1)) = g(P, Q)

Therefore, when gðP; Q Þ is maximised, c0P ðQ Þ is maximised and P is a fuzzy-rough reduct. In the unsupervised case, all Rd ðx; yÞ are zero when x – y and the above still holds. 4. Experimental evaluation and setup The experimentation section has been divided into three sections; the ﬁrst considers the results generated by applying the unsupervised fuzzy-rough approaches (UFRFS and dUFRFS) to nine different benchmark datasets. The second compares the unsupervised approaches to their supervised fuzzy-rough counterparts. It could be argued that to compare supervised and unsupervised FS methods would be unfair, since these tasks are very different. In addition, there is often much discriminative information contained in the class labels for the supervised method (which of course is absent in the unsupervised case). However, the comparison with such methods helps to demonstrate that useful and valuable features are indeed retained by the fuzzy-rough unsupervised feature selection methods described in this paper. The third part is an investigation which provides an insight into whether UFRFS and dURFS may be useful for the task of clustering. For this investigation, a further three datasets without labels are used. All of the benchmark datasets employed for this experimental evaluation were obtained from [29,2]. The experimental setup in the ﬁrst section consists of three main steps: (a) feature selection, (b) dataset reduction (using the selected subsets), and (c) classiﬁer learning. Note that class label removal is employed for the unsupervised approaches, so that FS is only performed on the unlabelled data. The benchmark datasets employed range from small to large in size with between 7 and 280 features and 120–452 data objects. 
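The greedy-type forward selection used throughout the experiments can be sketched generically. The code below is an illustrative assumption, not the authors' implementation: the search simply adds whichever attribute most increases a monotonic subset-quality measure, stopping when no attribute improves it or when the value of the full conditional attribute set (1.0 for the normalised dependency) is reached. A simple row-discernibility measure is included purely as a stand-in for the fuzzy-rough one.

```python
import numpy as np

def greedy_forward_select(X, measure, max_features=None):
    """Greedy forward selection: repeatedly add the single attribute
    that most increases measure(X, subset); stop when no attribute
    improves it, or when the value of the full attribute set
    (a (super)reduct) has been reached."""
    n_features = X.shape[1]
    subset, best = [], 0.0
    target = measure(X, list(range(n_features)))  # value of the full set C
    limit = max_features if max_features is not None else n_features
    while len(subset) < limit:
        candidates = [(measure(X, subset + [a]), a)
                      for a in range(n_features) if a not in subset]
        score, attr = max(candidates)
        if score <= best:          # no attribute improves the measure
            break
        subset.append(attr)
        best = score
        if best >= target:         # consistency of the full set reached
            break
    return subset, best

# Stand-in measure for demonstration only: the fraction of objects that
# are pairwise distinct (discernible) on the chosen attributes.
def distinct_measure(X, subset):
    if not subset:
        return 0.0
    return len({tuple(row) for row in X[:, subset]}) / X.shape[0]
```

With a crisp discernibility measure like the stand-in above, the search halts as soon as the chosen attributes discern every pair of objects, mirroring the behaviour of the normalised fuzzy dependency reaching 1.0.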
Additionally, for the subset size and runtime evaluations only, a gene expression dataset containing 2000 features and 62 data objects is also included, in order to demonstrate the scalability of the approaches for datasets of large dimensionality. It should also be noted that FS is carried out as part of the cross-validation, so that the selection is not optimistically biased with respect to the whole dataset. The subset sizes presented here are therefore average sizes.

As mentioned previously, the experimental evaluation is divided into three parts. The unsupervised approaches are assessed based on the results relating to (a) average subset size and the time taken to discover subsets, (b) classification accuracy using the selected classifier learners, and (c) statistical analysis, using a paired t-test, of the results generated by the unsupervised approaches. Since the performance of the supervised approaches is not the main focus of investigation, a statistical significance test is not applied to these methods. For the third part, the intra-cluster error is used as a metric to determine whether the reduced-dimensionality datasets offer an improved clustering over the whole dataset.

For classifier learning, three different learners are employed: J48 [34], JRip [4], and PART [38]. J48 creates decision trees by choosing the most informative features and recursively partitioning the data into subtables based on their values. Each node in the tree represents a feature, with branches from a node representing the alternative values this feature can take according to the current subtable. Partitioning stops when all data items in the subtable have the same classification; a leaf node is then created and this classification assigned. JRip learns propositional rules by repeatedly growing rules and pruning them. During the growth phase, antecedents are added greedily until a termination condition is satisfied.
Antecedents are then pruned in the next phase subject to a pruning metric. Once the ruleset is generated, a further optimisation is performed, where rules are evaluated and deleted based on their performance on randomized data. PART generates rules by repeatedly creating partial decision trees from the data. The algorithm adopts a divide-and-conquer strategy such that it removes instances covered by the current ruleset during processing. Essentially, a rule is created by building a pruned tree for the current set of instances; the leaf with the highest coverage is promoted to a rule.

Stratified 10 × 10-fold cross-validation (10-FCV) is employed for data validation. In 10-FCV, the original dataset is partitioned into 10 subsets of instances. Of these 10 subsets, a single subset is retained as the testing data for the model, and the remaining nine subsets are used for training. The cross-validation process is then repeated 10 times (the number of folds), with each of the 10 subsets used exactly once for testing. The 10 results from the folds are then combined to produce a single model estimation. The advantage of 10-FCV over random sub-sampling is that all objects are used for both training and testing, and each object is used for testing only once. The stratification of the data prior to its division into folds ensures that each class label (as far as possible) has equal representation in all folds, thus helping to reduce bias/variance problems [20].

For the approaches described in this paper (UFRFS and dUFRFS), the Łukasiewicz t-norm (max(x + y − 1, 0)) and the Łukasiewicz fuzzy implicator (min(1 − x + y, 1)) are adopted to implement the fuzzy connectives. As well as the fuzzy-rough dependency measure approaches (labelled UFRFS and dUFRFS respectively), the same additional measures are used for both unsupervised approaches: the fuzzy-rough boundary region (B-UFRFS, B-dUFRFS), fuzzy entropy (FE-UFRFS, FE-dUFRFS) and the vaguely-quantified rough set measure (VQ-UFRFS, VQ-dUFRFS). These metrics have been shown to be useful for supervised FS [25,18]. For the comparison with supervised methods, the same metrics are employed so that a comprehensive view can be obtained with respect to the unsupervised approaches. In summary, the results presented in the following sections are for two different approaches, UFRFS and dUFRFS; for each of these, three additional metrics are employed based on (a) the fuzzy-rough set boundary region (B-UFRFS, B-dUFRFS), (b) fuzzy entropy (FE-UFRFS, FE-dUFRFS), and (c) vaguely quantified rough sets (VQ-UFRFS, VQ-dUFRFS).

5. Evaluation 1 – Benchmark data

5.1. Evaluation of subset size and time taken to discover subsets

The subset size results for the nine benchmark datasets are shown in Tables 3 and 4. The tables demonstrate that both approaches offer significant reduction when compared to the original data dimensionality.
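Because selection is performed inside each fold of the cross-validation, a sketch of that protocol may help. The selector, model and scorer below are deliberately trivial stand-ins (not UFRFS or J48, and the names are illustrative); only the stratified fold structure and the train-only selection mirror the setup described above.

```python
import numpy as np

def stratified_folds(y, k=10, seed=0):
    """Indices for stratified k-fold CV: each class is spread as
    evenly as possible across the k folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        for i, j in enumerate(rng.permutation(np.flatnonzero(y == cls))):
            folds[i % k].append(j)
    return [np.array(sorted(f)) for f in folds]

def cv_with_fs(X, y, select, fit, score, k=10):
    """k-fold CV with feature selection done *inside* each fold, so the
    selected subset is never influenced by the held-out test objects."""
    accs = []
    for test_idx in stratified_folds(y, k):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        subset = select(X[train_idx])   # unsupervised FS: training part only
        model = fit(X[train_idx][:, subset], y[train_idx])
        accs.append(score(model, X[test_idx][:, subset], y[test_idx]))
    return float(np.mean(accs))
```

The reported subset sizes are averages over folds precisely because `select` is re-run on each training partition.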
Considering the UFRFS approach, the use of different metrics seems to have little effect on subset size, with only the boundary measure (B-UFRFS) resulting in a marginally smaller average subset size for the ionosphere dataset. For dUFRFS, the results are more consistent, with less variation across the different metrics, suggesting that the choice of metric is less important in this particular case (runtimes for all approaches are reported in Table 5). It is interesting to note that for the UFRFS approach the fuzzy-rough dependency metric performs best in terms of average subset size, and also manages at least some reduction regardless of the dataset employed. B-UFRFS also manages some reduction (with the exception of the wisconsin dataset) but returns quite large average subset sizes for arrhythmia when compared with the other metrics. Both FE-UFRFS and VQ-UFRFS fail to offer any reduction for the glass dataset.

Among the dUFRFS approaches, dUFRFS itself offers the best reduction in terms of subset size, with the boundary region-based metric (B-dUFRFS) offering similar reductions (and in some cases outperforming it). As with the UFRFS approaches, two of the four metrics for dUFRFS fail to gain any reduction for the wisconsin dataset. The ionosphere dataset seems to pose a challenge for dUFRFS in much the same way that wisconsin does for UFRFS; this becomes clear when the subset sizes are compared to those obtained by UFRFS, which achieves an average size of just 9.0, while dUFRFS only manages 18.4.

5.2. Classification accuracy

The subset size results shown in Tables 3 and 4 show that both of the unsupervised approaches perform well. However, the level of dimensionality reduction is only one aspect which should be considered when employing FS; the classification accuracy of any model built from the reduced data is also an important consideration. In Tables 6–8, the results for classification accuracy are presented for UFRFS and dUFRFS using three classifier learners (J48, JRip, and PART).
From these results, it can be seen that the two approaches are comparable on the whole, with dUFRFS showing slightly better results in general for all three classifiers. The most notable exception is FE-UFRFS which, as well as finding a smaller subset size for the sonar dataset, also manages to obtain the best classification accuracy amongst all methods. It should be noted that, because no reduction was obtained for the wisconsin dataset, the classification accuracies for B-UFRFS and FE-UFRFS presented in Tables 6–8 are those for the unreduced dataset.
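The rule learners used above (JRip and PART) both follow a separate-and-conquer pattern: learn one good rule, remove the instances it covers, and repeat. The toy sketch below illustrates only that loop, using single-condition threshold rules; it omits the growing/pruning phases and is not Cohen's RIPPER or the actual PART algorithm, and all names are illustrative.

```python
import numpy as np

def learn_rules(X, y, max_rules=10):
    """Toy separate-and-conquer rule induction: repeatedly pick the
    single best (feature, threshold, side) -> class condition, ranked
    by majority-class purity then coverage, and remove the instances
    it covers until all data are covered."""
    rules = []
    idx = np.arange(len(y))
    while len(idx) and len(rules) < max_rules:
        best = None  # (purity, coverage, feature, threshold, side, class)
        for f in range(X.shape[1]):
            for t in np.unique(X[idx, f]):
                for side in ('<=', '>'):
                    mask = (X[idx, f] <= t) if side == '<=' else (X[idx, f] > t)
                    cov = idx[mask]
                    if len(cov) == 0:
                        continue
                    cls, cnt = np.unique(y[cov], return_counts=True)
                    cand = (cnt.max() / len(cov), len(cov),
                            f, t, side, cls[cnt.argmax()])
                    if best is None or cand[:2] > best[:2]:
                        best = cand
        _, _, f, t, side, cls = best
        rules.append((f, t, side, cls))
        covered = (X[idx, f] <= t) if side == '<=' else (X[idx, f] > t)
        idx = idx[~covered]        # separate-and-conquer: drop covered rows
    return rules
```

Even this toy version exhibits the behaviour described above: each learned rule peels off one coherent region of the data before the next rule is grown.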

Table 3. Average subset sizes for benchmark data using UFRFS approach.

Dataset        Features   Objects   UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS
Arrhythmia     280        452       7.8     21.3      7.8        14.3
Cleveland      13         297       10.4    11.4      11.0       10.4
Ecoli          7          336       6.0     6.0       6.0        6.0
Glass          9          214       7.1     7.1       9.0a       9.0a
Heart          13         270       10.3    10.3      10.3       10.3
Ionosphere     34         230       9.0     8.9       23         16.4
Olitos         25         120       6.1     6.1       6.8        7.0
Sonar          60         208       6.6     6.7       7.0        9.0
Wisconsin      9          683       8.1     9.0a      9.0a       8.1
Colon cancer   2000       63        5.25    2000a     2000a      6.75

a Denotes that no reduction was achieved.


Table 4. Average subset sizes for benchmark data using dUFRFS approach.

Dataset        Features   Objects   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia     280        452       8.0      7.0        8.0         9
Cleveland      13         297       10.2     10.2       11          10.3
Ecoli          7          336       6.0      6.0        7.0a        6.7
Glass          9          214       8.0      8.0        9.0a        9.0a
Heart          13         270       9.2      10         10.2        9.3
Ionosphere     34         230       18.4     18.7       22          21.1
Olitos         25         120       6.0      5.3        6.0         5.9
Sonar          60         208       6.7      6.4        8.0         7.9
Wisconsin      9          683       8.0      9.0a       9.0a        8.0
Colon cancer   2000       63        4.0      6.8        4.8         4.0

a Denotes that no reduction was achieved.

Table 5. Runtimes (s) for UFRFS and dUFRFS approaches.

Dataset        UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia     11.4    13.2      256.34     92.08      55.28    1871.6     114.40      69.53
Cleveland      0.03    0.12      24.80      2.6        1.3      75.50      15.50       1.03
Ecoli          0.01    0.70      19.1       1.00       0.60     28.36      17.01       0.86
Glass          0.01    0.68      6.01       0.12       0.40     16.0       7.3         0.93
Heart          0.01    0.50      19.80      0.93       0.80     57.40      13.10       1.00
Ionosphere     0.70    1.03      31.30      0.81       3.80     215.02     44.29       4.30
Olitos         0.04    0.38      3.60       0.09       0.06     10.80      3.20        0.08
Sonar          0.09    0.90      37.62      1.26       1.84     91.16      42.63       2.80
Wisconsin      0.70    0.91      210.20     1.10       1.03     218.30     93.26       1.43
Colon cancer   1.64    2.02      41.27      1.60       5.56     109.64     32.26       6.60

5.3. Comparison with PCA

The majority of early approaches to unsupervised FS focused on improving the clustering or grouping of data [11,14]. Many of these approaches rely on the prior specification of the number of clusters, etc. The work in [26], once again, relies on thresholding values and is evaluated with regard to the entropy of pre-defined subset sizes. In contrast, the methods proposed in this paper are data-driven and free from subjective tuning parameters or threshold specification. This fact, combined with the specific and focused nature of many of the previous approaches, makes direct comparison very difficult. However, in order to provide a comparison with a well-known and often-used technique for unsupervised dimensionality reduction, an experimental evaluation of PCA [12] is provided here. Since PCA is a rank-based method for dimensionality reduction, the rounded average subset size achieved using both UFRFS and dUFRFS (for each dataset) is utilised when specifying the number of components to return.

As can be seen from Table 9, the classification accuracies for UFRFS/dUFRFS are comparable to or better than those for PCA. The only exceptions are the Olitos and Arrhythmia datasets, where PCA does particularly well. However, the proposed methods outperform PCA for the Heart dataset across all measures. It should be noted that, regardless of any minor increase in performance, the set of features returned by PCA is not a subset of the original features but rather a set of transformed dimensions. This means that the original meaning or semantics of the data have been lost. Indeed, one of the main reasons for performing feature selection is to maintain the interpretability of the data; with PCA this aspect is destroyed.
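For reference, the PCA baseline can be reproduced in a few lines of numpy (a sketch, not the evaluation code used in the paper); `k` would be set to the rounded average subset size found by UFRFS/dUFRFS for the dataset in question. Note how the returned columns are linear combinations of all original attributes, which is precisely why the attribute semantics are lost.

```python
import numpy as np

def pca_reduce(X, k):
    """Project data onto the top-k principal components.
    Unlike feature selection, each returned column mixes *all*
    original features, so attribute semantics are not preserved."""
    Xc = X - X.mean(axis=0)
    # principal directions via SVD of the centred data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

The projected components come out mean-centred, mutually uncorrelated, and ordered by decreasing variance.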

Table 6. Classification accuracies: J48 (%).

Dataset      Unred.   UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia   65.78    49.26   55.69     65.64      55.29      52.15    49.73      53.53       53.99
Cleveland    53.39    54.03   54.03     53.65      52.54      52.29    53.69      53.65       53.69
Ecoli        82.83    82.83   82.83     82.83      82.83      83.54    83.54      82.83a      83.54
Glass        68.08    69.16   68.58     68.08      68.08      70.40    70.40      68.08a      70.40
Heart        78.15    79.15   79.15     78.59      78.67      76.89    76.89      78.59       76.89
Ionosphere   86.13    83.00   86.04     87.22      87.83      86.17    87.83      87.83       85.09
Olitos       65.75    64.83   64.83     65.75      64.83      66.17    61.50      65.50       64.58
Sonar        73.61    63.33   62.40     73.61      64.71      68.02    72.05      72.44       69.49
Wisconsin    95.44    95.44   95.44     95.44      95.44      95.52    95.52      95.44a      95.52
Mean accy.   –        71.23   72.11     74.53      72.25      72.35    72.35      73.10       72.58

a Denotes that no reduction was achieved.

Table 7. Classification accuracies: JRip (%).

Dataset      Unred.   UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia   70.97    56.48   59.98     70.62      61.04      58.36    57.75      60.36       60.29
Cleveland    54.16    53.04   53.04     54.70      54.29      53.99    53.11      54.70       53.11
Ecoli        81.41    80.93   80.93     80.93      80.93      80.99    80.99      81.41       80.99
Glass        67.05    66.14   67.96     67.05      67.05      67.75    67.75      67.05       67.75
Heart        79.19    80.04   80.04     79.33      79.00      77.30    77.30      79.33       77.30
Ionosphere   87.09    84.74   86.48     87.61      87.04      87.65    87.74      87.43       87.52
Olitos       68.83    67.58   67.58     68.83      67.58      63.83    62.33      62.00       65.00
Sonar        75.84    64.86   59.23     75.84      61.81      70.99    71.95      71.65       70.60
Wisconsin    96.08    96.08   96.08     96.08      96.08      95.94    95.54      96.08       95.94
Mean accy.   –        72.21   72.39     75.67      72.76      72.97    72.70      73.33       73.17

Table 8. Classification accuracies: PART (%).

Dataset      Unred.   UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia   65.15    46.34   48.65     64.23      50.61      48.65    44.36      49.78       48.01
Cleveland    52.44    51.74   51.74     52.07      54.20      51.50    54.04      52.07       54.04
Ecoli        81.79    81.79   81.79     81.79      81.79      84.65    84.65      81.79       84.65
Glass        69.12    67.82   69.62     69.12      69.12      68.80    68.80      69.12       68.80
Heart        77.33    79.63   79.63     79.44      77.93      78.00    78.00      79.44       78.00
Ionosphere   87.39    84.74   86.52     87.35      88.35      88.78    87.78      89.13       87.09
Olitos       67.00    61.17   61.17     67.00      61.17      68.25    63.25      64.00       64.75
Sonar        77.40    61.98   60.30     77.40      62.27      65.36    71.36      70.45       66.06
Wisconsin    95.68    95.68   95.68     95.68      95.68      95.59    95.59      95.68       95.59
Mean accy.   –        70.10   70.57     74.90      71.24      72.18    71.98      72.38       71.88

5.4. Statistical analysis

As part of the evaluation of the unsupervised fuzzy-rough techniques, a paired t-test with a significance level of 0.05 has been carried out over 10 runs. The t-test is carried out as part of the cross-validation and as such is a resampled t-test. The baseline reference for the tests is the unreduced data classification results. This test has been carried out in order to ensure that the results obtained have not been found by chance. The results are shown in Tables 10–12 for each of the classifiers. The symbols indicate whether the outcomes are statistically better (v), worse (*), or whether there is no statistical difference (–) between the results.

The results for this evaluation show that, for all of the classifier learners, only the Arrhythmia and Sonar datasets return results which are statistically worse than the unreduced data. In this respect, the UFRFS methods seem to perform less well when compared with dUFRFS, although it should be noted that only FE-UFRFS (amongst all of the UFRFS and dUFRFS methods) achieves a result for the Arrhythmia dataset which is statistically comparable to the accuracy for the unreduced data. Overall, UFRFS does not perform as well as dUFRFS, with results which are consistent across all three classifier learners: there are six cases where UFRFS is statistically worse than the unreduced data, in contrast to only four for dUFRFS.
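A sketch of the significance test is given below. The paper's test is a resampled t-test run within the cross-validation; this simplified version is a plain paired t-test over matched per-run accuracies, with the two-sided 0.05 critical value for df = 9 (2.262, for 10 paired runs) hardcoded as an assumption, and illustrative function names.

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic over matched samples, e.g. per-run accuracies
    of a reduced dataset versus the unreduced baseline."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

def significance_symbol(method, baseline, t_crit=2.262):
    """'v' = significantly better, '*' = significantly worse,
    '-' = no significant difference; t_crit is the two-sided 0.05
    critical value for df = len(method) - 1 (2.262 for 10 runs)."""
    t = paired_t(method, baseline)
    if t > t_crit:
        return 'v'
    if t < -t_crit:
        return '*'
    return '-'
```

The symbols produced this way correspond directly to the entries of Tables 10–12.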

Table 9. Classification accuracies for benchmark data following reduction using PCA (%).

Dataset      J48     JRip    PART
Arrhythmia   58.40   62.30   55.75
Cleveland    54.54   55.21   54.80
Ecoli        82.60   81.54   80.30
Glass        70.56   54.47   65.88
Heart        76.66   80.37   75.92
Ionosphere   83.04   85.21   83.81
Olitos       85.00   77.50   86.66
Sonar        76.44   75.00   70.19
Wisconsin    94.92   93.36   94.92

Table 10. Statistical significance using paired t-test: J48.

Dataset      UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia   *       *         –          *          *        *          *           *
Cleveland    –       –         –          –          –        –          –           –
Ecoli        –       –         –          –          –        –          –           –
Glass        –       –         –          –          –        –          –           –
Heart        –       –         –          –          –        –          v           –
Ionosphere   –       –         –          –          –        –          –           –
Olitos       –       –         –          –          –        –          –           –
Sonar        *       *         –          *          –        –          –           –
Wisconsin    –       –         –          –          –        –          –           –

Table 11. Statistical significance using paired t-test: JRip.

Dataset      UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia   *       *         –          *          *        *          *           *
Cleveland    –       –         –          –          –        –          –           –
Ecoli        –       –         –          –          –        –          –           –
Glass        –       –         –          –          –        –          –           –
Heart        –       –         –          –          –        –          –           –
Ionosphere   –       –         –          –          –        –          –           –
Olitos       –       –         –          –          –        –          –           –
Sonar        *       *         –          *          –        –          –           –
Wisconsin    –       –         –          –          –        –          –           –

Table 12. Statistical significance using paired t-test: PART.

Dataset      UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Arrhythmia   *       *         –          *          *        *          *           *
Cleveland    –       –         –          –          –        –          –           –
Ecoli        –       –         –          –          –        –          –           –
Glass        –       –         –          –          –        –          –           –
Heart        –       –         –          –          –        –          –           –
Ionosphere   –       –         –          –          –        –          –           –
Olitos       –       –         –          –          –        –          –           –
Sonar        *       *         –          *          –        –          –           –
Wisconsin    –       –         –          –          –        –          –           –

6. Evaluation 2 – Comparison with supervised methods

In the previous experimental evaluations, much effort was focused on demonstrating that the UFRFS and dUFRFS methods select useful feature subsets. In this section, both unsupervised methods are compared with some existing supervised methods, with a view to comparing the results obtained for both. As stated previously, it may not be fair to compare unsupervised and supervised methods in such a manner; however, it does offer a useful benchmark for judging the efficacy of the UFRFS and dUFRFS approaches.

From the results for average subset size in Table 13, it can be seen that the supervised approaches generally perform better, with lower average subset sizes in comparison to either the UFRFS or dUFRFS approaches. In particular for UFRFS, it is clear that the supervised approaches offer a greater reduction in subset size for all metrics. However, this is to be expected, since there is much discriminative information contained in the decision feature; an aspect which unsupervised approaches obviously cannot take advantage of. When considering the dUFRFS approaches, it is clear (as mentioned previously) that dUFRFS offers an improvement in performance over the UFRFS approaches. Whilst dUFRFS does not offer subsets which are as compact as those of supervised FRFS, once again it should be remembered that the approaches are applied to different areas of learning. The Ionosphere dataset would seem to be one of the most difficult for dUFRFS to deal with, consistently returning larger average subset sizes.

6.1. Classification accuracy

Whilst the results for the average subset sizes demonstrate a clear advantage for the supervised approaches, those for the classification accuracies shown in Tables 14–16 are quite different. One result in particular stands out from the rest: FE-UFRFS, which offers better (or comparable) results than its corresponding supervised counterpart (FE-FRFS), or indeed most of the other supervised metrics. The results for the dUFRFS methods offer (on the whole) comparable performance to their supervised counterparts across all classifiers. This is to be expected, since the dUFRFS methods generate super-reducts of the supervised approach, as demonstrated in Section 3.5.


Table 13. Average subset sizes for benchmark data using supervised FRFS approaches.

Dataset      Features   Objects   FRFS   B-FRFS   FE-FRFS   VQ-FRFS
Arrhythmia   280        452       6.9    6.4      6.3       7.0
Cleveland    13         297       7.6    7.8      8.5       7.8
Ecoli        7          336       6.0    6.0      6.0       6.0
Glass        9          214       9.0a   8.4      8.1       8.5
Heart        13         270       6.3    6.1      6.7       6.7
Ionosphere   34         230       6.9    6.9      7.0       6.9
Olitos       25         120       5.0    5.0      5.0       5.0
Sonar        60         208       5.4    5.4      5.0       5.2
Wisconsin    9          683       6.9    6.8      6.7       6.9

a Denotes that no reduction was achieved.

Table 14. Classification accuracies: J48 (%).

Dataset      Unred.   FRFS    B-FRFS   FE-FRFS   VQ-FRFS
Arrhythmia   65.78    53.53   63.27    62.38     63.71
Cleveland    53.39    55.89   55.89    51.51     56.22
Ecoli        82.83    83.92   83.92    83.92     83.92
Glass        68.08    68.08   68.69    67.75     68.69
Heart        78.15    78.51   78.51    76.29     71.11
Ionosphere   86.13    86.08   86.08    87.39     87.39
Olitos       65.75    66.66   58.33    57.50     62.50
Sonar        73.61    70.67   70.67    73.55     72.59
Wisconsin    95.44    95.02   95.02    94.87     94.72

Table 15. Classification accuracies: JRip (%).

Dataset      Unred.   FRFS     B-FRFS   FE-FRFS   VQ-FRFS
Arrhythmia   70.97    59.73    65.48    63.71     65.70
Cleveland    54.16    54.54    54.54    53.87     54.20
Ecoli        81.41    81.54    81.54    81.54     81.54
Glass        67.05    67.05a   66.82    64.01     66.82
Heart        79.19    75.18    75.18    77.40     67.77
Ionosphere   87.09    87.82    87.82    89.56     86.08
Olitos       68.83    65.00    63.33    61.66     64.16
Sonar        75.84    67.30    67.30    75.00     75.48
Wisconsin    96.08    95.75    95.75    95.16     95.60

Table 16. Classification accuracies: PART (%).

Dataset      Unred.   FRFS     B-FRFS   FE-FRFS   VQ-FRFS
Arrhythmia   65.15    48.45    58.40    56.19     60.17
Cleveland    52.44    49.15    49.15    50.84     47.81
Ecoli        81.79    82.14    82.14    82.14     82.14
Glass        69.12    69.12a   69.15    68.22     69.15
Heart        77.33    76.29    76.29    82.22     72.96
Ionosphere   87.39    86.52    86.52    87.39     87.39
Olitos       67.00    68.33    64.16    58.33     65.00
Sonar        77.40    68.26    68.26    75.96     73.55
Wisconsin    95.68    95.02    95.02    94.58     96.48

a Denotes that no reduction was achieved.

7. Evaluation 3 – Do UFRFS/dUFRFS improve clustering?

Whilst the primary focus of this paper is dimensionality reduction, an investigation of the applicability of UFRFS/dUFRFS to the task of improving clustering is carried out in order to determine the effects of the approaches for this task. Five datasets are used for this task (see Table 17), which are known to contain clusters of data objects [29]. A partitional approach, the k-means algorithm with Euclidean distance, is used to generate the clusters and assess the clustering. For each of the datasets, the number of clusters is known, and this is used for the number-of-clusters parameter (k) of the k-means algorithm. The squared error is used to measure the compactness of the clusters both prior to, and following, the application of the UFRFS and dUFRFS methods.

Table 17. Clustering data.

Dataset   Features   Objects   No. clusters
Glass     9          214       7
Wine      14         270       3
Feat32    32         1024      16
Feat64    64         1024      16
Libras    91         360       15

Table 18. Subset sizes for clustering data.

Dataset   UFRFS   B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Glass     7       8         –          –          –        –          –           –
Wine      7       8         –          9          8        8          –           –
Feat32    23      –         –          –          –        14         –           –
Feat64    12      –         –          –          –        –          –           –
Libras    4       5         –          41         16       17         17          16

Table 19. Intra-cluster squared error results for clustering data.

Dataset   Unred.   UFRFS    B-UFRFS   FE-UFRFS   VQ-UFRFS   dUFRFS   B-dUFRFS   FE-dUFRFS   VQ-dUFRFS
Glass     18.61    14.18    16.80     –          –          –        –          –           –
Wine      51.39    26.93    26.61     –          34.38      29.21    29.21      –           –
Feat32    627.16   587.62   –         –          –          –        403.25     –           –
Feat64    823.04   420.96   –         –          –          –        –          –           –
Libras    546.20   12.74    18.36     –          180.49     92.17    104.60     100.43      95.14

As can be seen in Table 18, not all metrics result in a reduced feature subset for all datasets. The dUFRFS methods tend to result in fewer reductions when compared with the UFRFS methods, with UFRFS itself achieving a reduction for all datasets. When k-means clustering is applied both before and after feature selection, it can be seen from Table 19 that feature selection results in a significant reduction in squared error for the generated clusters. Again, the UFRFS methods seem to do better in this respect, with the exception of VQ-UFRFS for the Libras dataset.

8. Conclusion

This paper has presented two different approaches for unsupervised feature selection, along with a number of different metrics. Both approaches use fuzzy-rough sets to select features for inclusion in, or removal from, the final candidate subset. At present, the UFRFS algorithm utilises a simple but nevertheless effective backwards elimination method for search, whilst dUFRFS uses a greedy forward selection method. The problem with such search techniques is that they often return results which are sub-optimal. The investigation of other search techniques, such as ant colony optimisation and particle swarm optimisation, may help to alleviate this problem and thus further improve the efficiency of both the UFRFS and dUFRFS approaches.

Another area of particular interest is the investigation of the commonality of subsets selected by UFRFS and dUFRFS. Such an investigation may provide an important insight into those subsets of features (and individual features) which are selected by each approach. A more complete comparison of UFRFS and dUFRFS with other unsupervised FS techniques in terms of clustering performance would also form the basis for a series of future investigations.
Other areas that warrant further attention include the fusion of the two techniques (UFRFS and dUFRFS), in order to investigate whether a hybrid approach is more effective for the task of unsupervised FS.

References

[1] M. Alibeigi, S. Hashemi, A. Hamzeh, Unsupervised feature selection using feature density functions, International Journal of Computer Science and Engineering 3 (3) (2009) 146–151.
[2] C. Armanino, R. Leardi, S. Lanteri, G. Modi, Chemometric analysis of Tuscan olive oils, Chemometrics and Intelligent Laboratory Systems 5 (1989) 343–354.


[3] S. Basu, C.A. Micchelli, P. Olsen, Maximum entropy and maximum likelihood criteria for feature selection from multi-variate data, in: Proc. IEEE Intl. Symp. Circuits and Systems, 2000, pp. 267–270.
[4] W.W. Cohen, Fast effective rule induction, in: Machine Learning: Proceedings of the 12th International Conference, 1995, pp. 115–123.
[5] C. Cornelis, G. Hurtado Martín, R. Jensen, D. Ślęzak, Feature selection with fuzzy decision reducts, in: 3rd Int. Conf. on Rough Sets and Knowledge Technology (RSKT'08), 2008, pp. 284–291.
[6] S.K. Das, Feature selection with a linear dependence measure, IEEE Transactions on Computers 20 (9) (1971) 1106–1109.
[7] M. Dash, H. Liu, Unsupervised feature selection, in: Proc. Pacific Asia Conf. Knowledge Discovery and Data Mining, 2000, pp. 110–121.
[8] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1) (1977) 1–38.
[9] E.P.M. de Sousa, C. Traina, A.J.M. Traina, L. Wu, C. Faloutsos, A fast and effective method to find correlations among attributes in databases, Data Mining and Knowledge Discovery 14 (2007) 367–407.
[10] D. Dubois, H. Prade, Putting rough sets and fuzzy sets together, in: R. Słowiński (Ed.), Intelligent Decision Support, Kluwer, 1992, pp. 203–232.
[11] J.G. Dy, C.E. Brodley, Feature subset selection and order identification for unsupervised learning, in: P. Langley (Ed.), Proceedings of the Seventeenth International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, 2000, pp. 247–254.
[12] P. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, 1982.
[13] M.A. Hall, Correlation-based feature selection for machine learning, Ph.D. Thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1998.
[14] J.A. Hartigan, Statistical theory in clustering, Journal of Classification 2 (1985) 63–76.
[15] R.P. Heydorn, Redundancy in feature selection, IEEE Transactions on Computers (1971) 1051–1054.
[16] Y. Hou, P. Zhang, T. Yan, W. Li, D. Song, Beyond redundancies: a metric-invariant method for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering (2010) 348–364.
[17] R. Jensen, Q. Shen, Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches, IEEE Transactions on Knowledge and Data Engineering 16 (12) (2004) 1457–1471.
[18] R. Jensen, Q. Shen, New approaches to fuzzy-rough feature selection, IEEE Transactions on Fuzzy Systems 17 (4) (2009) 824–838.
[19] Y. Kim, W.N. Street, F. Menczer, Evolutionary model selection in unsupervised learning, Intelligent Data Analysis 6 (6) (2002) 531–556.
[20] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'95), 1995, pp. 1137–1143.
[21] A. Köhler, M. Ohrnberger, F. Scherbaum, Unsupervised feature selection and general pattern discovery using Self-Organizing Maps for gaining insights into the nature of seismic wavefields, Computers & Geosciences 35 (9) (2009) 1757–1767.
[22] D. Koller, M. Sahami, Toward optimal feature selection, in: Proceedings of ICML, 1996, pp. 284–292.
[23] H. Liu, H. Motoda (Eds.), Computational Methods of Feature Selection, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2008.
[24] N. Mac Parthaláin, R. Jensen, Measures for unsupervised fuzzy-rough feature selection, International Journal of Hybrid Intelligent Systems 7 (2010) 249–259.
[25] N. Mac Parthaláin, R. Jensen, Q. Shen, Fuzzy entropy-assisted fuzzy-rough feature selection, in: Proceedings of the 15th International Conference on Fuzzy Systems (FUZZ-IEEE'06), 2006.
[26] P. Mitra, C.A. Murthy, S.K. Pal, Unsupervised feature selection using feature similarity, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (3) (2002) 301–312.
[27] B. Mirkin, Mathematical Classification and Clustering, Kluwer Academic Publishers, 1996.
[28] M. Morita, R. Sabourin, F. Bortolozzi, C.Y. Suen, Unsupervised feature selection using multi-objective genetic algorithms for handwritten word recognition, in: Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, Washington, DC, 2003.
[29] D.J. Newman, S. Hettich, C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, CA, 1998.