A GRASP method for building classification trees

Expert Systems with Applications 39 (2012) 3241–3248

Joaquín Pacheco a,*, Esteban Alfaro b, Silvia Casado a, Matías Gámez b, Noelia García b

a Department of Applied Economics, University of Burgos, Plaza Infanta Elena s/n, 09001 Burgos, Spain
b Faculty of Economic and Business Sciences, University of Castilla-La Mancha, Plaza de Universidad 1, 02071 Albacete, Spain
* Corresponding author: J. Pacheco. E-mail addresses: [email protected] (J. Pacheco), [email protected] (E. Alfaro), [email protected] (S. Casado), [email protected] (M. Gámez), [email protected] (N. García).

Keywords: Decision trees; Complexity; Metaheuristics; GRASP

Abstract

This paper proposes a new method for constructing binary classification trees. The aim is to build simple trees, i.e. trees that are as little complex as possible, thereby facilitating interpretation and favouring the balance between optimization and generalization in the test data sets. The proposed method is based on the metaheuristic strategy known as GRASP, used in conjunction with optimization tasks. Basically, the method modifies the criterion for selecting the attributes that determine the split in each node, incorporating a certain amount of randomisation in a controlled way. We compare our method with the traditional method by means of a set of computational experiments. We conclude that the GRASP method (for small levels of randomness) significantly reduces tree complexity without decreasing classification accuracy.

1. Introduction

In a classification problem we have a set of n cases, each characterized by m attributes and by a class label which is known a priori. The entire set, or part of it, is used as a training set to design a set of rules that enable classification of the cases with the highest possible accuracy. Various methodologies exist for tackling the classification task: classical discriminant analysis, logistic regression, decision trees, artificial neural networks, etc. Tree-based methods (Aitkenhead, 2008; Chen & Hung, 2009; Chen, Hu, & Tang, 2009) are characterized by easy representation and simple performance. Moreover, the information can easily be extracted from the tree as a group of decision rules, enabling a simple and direct interpretation. These methods have been applied to several areas of the business world, for example finance and management (Kirkos, Spathis, & Manolopoulos, 2007; Tsai & Chiou, 2009), marketing (Abrahams et al., 2009) and industrial production processes (Muller & Wiederhold, 2002).

A classification tree is represented graphically using nodes and branches. Each node symbolizes a question or a decision about one of the attributes. The initial node is usually called the root node. Two or more branches branch off each node, depending on whether the response is binary or not. Finally, a terminal or leaf node is reached and a decision about the class to be assigned is taken. When a new case is presented to the tree, it is passed through the tests in the nodes. Each test has mutually exclusive and exhaustive outputs, so each example is assigned to one and only one output: no example is labelled with two or more classes, and none remains without a class. One important property of decision trees is that they can work with both numerical and categorical attributes with two or more categories.

In order to construct a tree, a set of conditions over an attribute is established to make a split. In this way, the training data set is divided into subsets that satisfy each condition. The subsets are split again and new levels are added to the tree until a stop criterion is reached. The node splitting process is then complete and the final nodes are called terminal or leaf nodes. The cases belonging to each leaf should be homogeneous, so the node can be labelled with the most frequent class. If all the cases belong to the same class, the node is said to be pure. Among all the possible splits, the best one must be selected in terms of increasing the homogeneity or purity of the descendant nodes in comparison with the parent. All the available attributes are evaluated to choose the one that provides subsets which are as homogeneous as possible with respect to the target variable. In order to do so, it is necessary to define an impurity measure, which should be 0 if the node is pure and maximal if all the classes are present in the same proportion (Breiman, Friedman, Olshen, & Stone, 1984).

If the tree is allowed to grow until its leaves are pure, the risk of over-fitting arises, i.e. in the final splits the most appropriate one is selected on the basis of a very small set of examples. The consequence is a very high percentage of correctly classified cases in the training data set but a much lower one in the test set. In order to design good classification methods, it is important not only to obtain a high percentage of correctly classified cases in the training set but also to take into account the generalization capability. The main goal is to extract general rules for the correct classification of data sets which are different from the training data set. It has been observed that an excessive growth in tree complexity often implies low generalization capability (Weiss & Kulikowski, 1991, p. 126). It is therefore interesting to build simple but effective trees, i.e. trees with as few nodes as possible but with an acceptable percentage of correctly classified cases.

The aim of this article is to solve the following problem: given a fixed level of correctly classified cases in the training data set, obtain the tree with as few leaf nodes as possible that reaches that level of accuracy. We are consequently dealing with an optimization task. More specifically, we fix the level of correctly classified cases at 100%, but the method can easily be generalised to any other level or percentage of correctly classified cases. It should be mentioned that we consider binary trees because they are possibly the most popular (see Duda, Hart, & Stork, 2000, p. 397) and they can be used with both numerical and categorical (nominal or ordered) attributes.

In contrast to traditional methods, which choose the best possible split (attribute) according to the impurity reduction measure being used, in this paper the split chosen when a tree is constructed is not necessarily the best of all the feasible ones: we randomly select one of the best to obtain a tree, the process is repeated successively so as to obtain different trees, and finally the best tree in terms of minimum size is selected. This is how the algorithm proposed in this paper works. The algorithm is based on the strategy known as GRASP (Greedy Randomized Adaptive Search Procedure) for optimization problems. GRASP is a metaheuristic strategy initially proposed by Feo and Resende (1989) at the end of the eighties, which deals with search procedures based on greedy, random and adaptive functions.

The use of heuristics and metaheuristics in statistics and data mining is a relevant approach with a promising future (Winker & Gilli, 2004). There have been several works and applications in recent years, for example in classification (Belacel, Raval, & Punnen, 2007; Pendharkar, 2006), in variable selection for classification (García, García, Melián, Moreno, & Moreno, 2006; Pacheco, Casado, Núñez, & Gómez, 2006; Pacheco, Casado, & Núñez, 2009; Yang & Olafsson, 2006), in clustering (Belacel, Hansen, & Mladenović, 2003; Pacheco & Valencia, 2003; Paleologo, Elisseeff, & Antonini, 2010; Woodruff & Reiners, 2004), in principal components (Cadima, Cerdeira, & Minhoto, 2004), and in regression and prediction models (Baragona, Battaglia, & Cucina, 2004; Gatu & Kontoghiorghes, 2003, 2005; Hofmann, Gatu, & Kontoghiorghes, 2007; Kapetanios, 2007).

The second section of this paper describes traditional algorithms for growing trees. Section 3 presents a formulation of the problem and outlines the GRASP algorithm. Section 4 presents the computational experiments. Finally, the main conclusions are outlined in Section 5.

2. Description of traditional algorithms

Let A = {a_1, a_2, ..., a_n} be the training data set, V = {1, 2, ..., m} the set of attributes, and x_ij the value of the jth attribute for the ith case. Let K be the number of classes. Henceforth the attributes are treated as numerical, although the methods described here can easily be adapted to binary and ordered features.

Algorithms for constructing binary classification trees start from the root node, to which the entire training data set A is assigned, and divide it into two subsets. These two subsets are assigned to the branches which branch off. The process is repeated in each node and stops when a fixed stop criterion is reached (all the cases in the node belong to the same class, or at least a certain level of homogeneity is attained). In this case the node is said to be a terminal or leaf node and is labelled with the class with the highest relative frequency in its partition. Each node is partitioned according to an attribute j and a value θ: the subset T ⊆ A is divided into two groups T_1 and T_2 such that T_1 = T ∩ {a_i : x_ij ≤ θ} and T_2 = T ∩ {a_i : x_ij > θ}. Before partitioning takes place, it is necessary to find the attribute and the value (split point) that provide the highest reduction in the impurity measure.

As mentioned previously, without going into programming details and without excluding similar alternatives, a node can be considered as a structure or record defined by the following fields:

– Terminal: Boolean.
– If Terminal = TRUE, its class field is defined.
– If Terminal = FALSE, it has the fields attribute (an integer from 1 to m), split_point (a real number) and left_child_node, right_child_node (the nodes branching off).

It is therefore a recursive structure that ends in the terminal nodes.
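For concreteness, such a record can be sketched as a small recursive data structure. The following Python rendering is purely illustrative (the authors' implementation was written in Borland Delphi, so the field names and types used here are our own assumptions):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TreeNode:
        """Recursive node record: either a labelled leaf or an internal split."""
        terminal: bool = True
        class_label: Optional[int] = None        # defined only when terminal is True
        attribute: Optional[int] = None          # attribute index j (when terminal is False)
        split_point: Optional[float] = None      # threshold for that attribute
        left_child: Optional["TreeNode"] = None  # cases with x_ij <= split_point
        right_child: Optional["TreeNode"] = None # cases with x_ij > split_point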

Given a node x_node and its corresponding data subset T ⊆ A, traditional algorithms for building trees are based on the following recursive method:

Procedure Set_node_characteristics(T; var x_node)
If the stopping conditions are true, then:
    Terminal = TRUE and class = majority class in T
otherwise:
    (a) Terminal = FALSE
    (b) For j = 1, ..., m: determine the value θ_j that obtains the maximum increase in purity Δ_j among the partitions determined by attribute j
    (c) Determine j* = argmax{Δ_j : j = 1, ..., m}
    (d) Set attribute = j* and split_point = θ_j*
    (e) Set T_1 = T ∩ {a_i : x_ij* ≤ θ_j*} and T_2 = T ∩ {a_i : x_ij* > θ_j*}
    (f) Execute Set_node_characteristics(T_1, left_child_node) and Set_node_characteristics(T_2, right_child_node)

The tree is therefore grown by executing Set_node_characteristics(A, root_node).

The basic difference between the main published algorithms lies in the function used to measure the level of impurity and in the criteria used to determine the best attribute and values. Several alternatives have been developed according to how the feasible split points for each attribute are selected, and the chosen impurity measure determines different algorithms. Breiman et al. (1984) proposed the Gini index as the impurity measure in classification and regression trees (CART); alternatively, Quinlan proposed the entropy-based ID3 and C4.5 algorithms in 1986 (Quinlan, 1986) and 1993 (Quinlan, 1993), respectively. More specifically, ID3 uses Information Gain as the partitioning criterion and C4.5 uses the Gain Ratio. These functions and criteria are defined below. Let T ⊆ A be a set of n_T cases and n_k the number of cases of class k in T, for k = 1, ..., K. The Gini index is defined as

Gini(T) = GI(T) = 1 - \sum_{k=1}^{K} \left( \frac{n_k}{n_T} \right)^2,

and the Entropy index is defined as

Entropy(T) = E(T) = - \sum_{k=1}^{K} \frac{n_k}{n_T} \log_2 \left( \frac{n_k}{n_T} \right).
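As an illustration, both impurity measures can be computed directly from the class counts in a node. The short Python sketch below is our own illustrative code (not the authors' Delphi implementation) and mirrors the two formulas above:

    import math
    from collections import Counter

    def gini(labels):
        """Gini index GI(T) = 1 - sum_k (n_k / n_T)^2."""
        n_T = len(labels)
        counts = Counter(labels)
        return 1.0 - sum((n_k / n_T) ** 2 for n_k in counts.values())

    def entropy(labels):
        """Entropy E(T) = - sum_k (n_k / n_T) * log2(n_k / n_T)."""
        n_T = len(labels)
        counts = Counter(labels)
        return -sum((n_k / n_T) * math.log2(n_k / n_T) for n_k in counts.values())

    # Example: a pure node has impurity 0, a balanced two-class node is maximal.
    assert gini(["a", "a", "a"]) == 0.0
    assert entropy(["a", "b", "a", "b"]) == 1.0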

The most commonly used criteria for evaluating a partition are defined as follows. Let T ⊆ A be a set, T_1 and T_2 a partition of T, n_{T_1} = |T_1|, n_{T_2} = |T_2| and n_T = |T|. The Information Gain (G) and the Gain Ratio (R) of this partition are defined as

G = E(T) - \left[ \frac{n_{T_1}}{n_T} E(T_1) + \frac{n_{T_2}}{n_T} E(T_2) \right] \quad \text{and} \quad R = \frac{G}{I},

where

I = - \frac{n_{T_1}}{n_T} \log_2 \left( \frac{n_{T_1}}{n_T} \right) - \frac{n_{T_2}}{n_T} \log_2 \left( \frac{n_{T_2}}{n_T} \right).
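The following Python sketch (again our own illustrative code, with assumed function names) evaluates a candidate binary partition of a node with these two criteria; it assumes both parts of the partition are non-empty:

    import math
    from collections import Counter

    def _entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, left, right):
        """G = E(T) - (n_T1/n_T * E(T1) + n_T2/n_T * E(T2))."""
        n = len(parent)
        return _entropy(parent) - (len(left) / n) * _entropy(left) - (len(right) / n) * _entropy(right)

    def gain_ratio(parent, left, right):
        """R = G / I, where I is the split information of the partition (both parts non-empty)."""
        n, n1, n2 = len(parent), len(left), len(right)
        split_info = -(n1 / n) * math.log2(n1 / n) - (n2 / n) * math.log2(n2 / n)
        return information_gain(parent, left, right) / split_info

    # Example: splitting [a, a, b, b] into [a, a] and [b, b] gives G = 1 and R = 1.
    print(information_gain(list("aabb"), list("aa"), list("bb")))   # 1.0
    print(gain_ratio(list("aabb"), list("aa"), list("bb")))         # 1.0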

In terms of the general procedure for determining the best split point θ_j for each attribute, the instances are ordered by attribute value in ascending order and the average point between each pair of consecutive values is calculated. The impurity of the partition is then calculated for each of these average points so as to select the best one. Therefore, n - 1 average points are evaluated at the root node. This procedure, despite its good reputation, does not perform well with numerical attributes (Fayyad & Irani, 1993). An alternative selection method was proposed by Fayyad and Irani (1992), whereby only points with a change of class are considered as feasible. This method, known as the Fayyad method, has substantially improved the efficiency of the C4.5 algorithm in relation to split point selection. Other alternatives can be found in Quinlan (1996), Wu and Urpani (1999), and Yen and Chu (2007).
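A possible implementation of this candidate-generation step is sketched below in Python (illustrative code with assumed names): it returns the midpoints between consecutive distinct values and, following the Fayyad and Irani proposal, optionally keeps only the points where the class changes.

    def candidate_split_points(values, labels, fayyad=True):
        """Return candidate thresholds for one numerical attribute.

        values: attribute values of the cases in the node.
        labels: their class labels.
        With fayyad=True only midpoints between consecutive distinct values
        whose classes differ are kept.
        """
        pairs = sorted(zip(values, labels))
        candidates = []
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
            if v1 == v2:
                continue                      # no boundary between equal values
            if fayyad and c1 == c2:
                continue                      # skip points without a class change
            candidates.append((v1 + v2) / 2)  # midpoint between the two values
        return candidates

    # Example: only the boundary between the two classes is kept.
    print(candidate_split_points([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # [2.5]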

3. Problem formulation and description of the GRASP method

Let A be the training data set, V the set of attributes and K the number of classes defined in the previous section. The problem of finding classification trees that are as simple as possible can be formulated as follows:

minimize n_leaves(node_ini)                  (1)

subject to:

Accuracy_training(node_ini) = 1,             (2)

node_ini ∈ R,                                (3)

where R is the set of classification trees defined for V and K, n_leaves(node_ini) is the number of leaves contained in node_ini, and Accuracy_training(node_ini) is the ratio of correctly classified cases of A.

In order to find optimal, or at least good, solutions to this problem a GRASP method is proposed. GRASP (Greedy Randomized Adaptive Search Procedure) is a metaheuristic strategy that constructs solutions with controlled randomisation and a greedy function. GRASP was originally proposed in the context of a set covering problem (Feo & Resende, 1989). Details of the methodology and a survey of applications can be found in Feo and Resende (1995) and Pitsoulis and Resende (2002).

GRASP is a constructive strategy. Like almost all constructive methods, it is based on adding an element to the solution at every step and repeating this until the solution is complete. A guide function (a "greedy" function) is used to evaluate the possible elements and to decide which one is best. The underlying idea behind the GRASP strategy, and its main difference from other constructive methods, is that the apparently best choice does not always lead to the best final solution. The GRASP algorithm therefore proposes building a list of the best elements according to the guide function (a list of candidates) and then choosing an element from that list at random. Controlled randomisation is thus used, since the element is chosen randomly but only from among the best candidates. This process is repeated successively and a set of different solutions is obtained, from which the best one is selected. The final solution is often better than the one obtained in a deterministic way, i.e. strictly according to the guide function. A GRASP procedure used in classification, more specifically in variable selection, can be found in Pacheco et al. (2006).

In our case, given a node x_node and its corresponding data subset T ⊆ A, the recursive method for building trees is modified in the following way:

Procedure Set_node_characteristics_greedy_random(α, T; var x_node)
If all the cases in T belong to the same class, then:
    Terminal = TRUE and class = majority class in T          (4)
otherwise:
    (a) Terminal = FALSE
    (b) For j = 1, ..., m: determine the value θ_j that obtains the maximum increase in purity Δ_j among the partitions determined by attribute j
    (c) Determine Δ_max = max{Δ_j : j = 1, ..., m} and Δ_min = min{Δ_j : j = 1, ..., m}
    (d) Construct L = {j = 1, ..., m : Δ_j ≥ α·Δ_max + (1 - α)·Δ_min}
    (e) Choose j* ∈ L at random
    (f) Set attribute = j* and split_point = θ_j*
    (g) Set T_1 = T ∩ {a_i : x_ij* ≤ θ_j*} and T_2 = T ∩ {a_i : x_ij* > θ_j*}
    (h) Execute Set_node_characteristics_greedy_random(α, T_1, left_child_node) and Set_node_characteristics_greedy_random(α, T_2, right_child_node)

where α ∈ [0, 1]. It should be noted that if α = 0 the choice is totally random, while if α = 1 it is totally deterministic. The selection of an appropriate α is usually a matter of great concern. The operating scheme of our GRASP algorithm is as follows:

Method GRASP_for_Classification_Trees(α; var node_best; var n_leaves_best)
Set n_leaves_best = ∞
Repeat
    1. Execute Set_node_characteristics_greedy_random(α, A, node_initial)
    2. If n_leaves(node_initial) < n_leaves_best then:
        (2.a) n_leaves_best := n_leaves(node_initial)
        (2.b) node_best := node_initial
until a stop criterion is satisfied
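To make the whole scheme concrete, the sketch below puts the pieces together in Python: a greedy-random recursive builder that draws the split attribute from the restricted candidate list L, and an outer loop that keeps the tree with the fewest leaves. It is a simplified, illustrative reconstruction rather than the authors' code (which was written in Borland Delphi); the data representation (a list of numeric feature vectors X with labels y), the helper names and the use of entropy-based Information Gain as the guide function are our own assumptions.

    import math
    import random
    from collections import Counter

    def _entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def _best_split_for_attribute(X, y, j):
        """Best (information gain, threshold) over the class-boundary midpoints of attribute j."""
        pairs = sorted(zip((row[j] for row in X), y))
        best_gain, best_theta = 0.0, None
        base = _entropy(y)
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
            if v1 == v2 or c1 == c2:      # Fayyad-style: only midpoints with a class change
                continue
            theta = (v1 + v2) / 2
            left = [lab for row, lab in zip(X, y) if row[j] <= theta]
            right = [lab for row, lab in zip(X, y) if row[j] > theta]
            gain = base - len(left) / len(y) * _entropy(left) - len(right) / len(y) * _entropy(right)
            if gain > best_gain:
                best_gain, best_theta = gain, theta
        return best_gain, best_theta

    def build_tree_greedy_random(X, y, alpha, rng):
        """Recursive construction; alpha controls the restricted candidate list of attributes."""
        if len(set(y)) == 1:              # stopping condition: all cases belong to the same class
            return {"terminal": True, "class": y[0]}
        splits = [_best_split_for_attribute(X, y, j) for j in range(len(X[0]))]
        deltas = [gain for gain, _ in splits]
        cutoff = alpha * max(deltas) + (1 - alpha) * min(deltas)
        candidates = [j for j, (gain, theta) in enumerate(splits)
                      if theta is not None and gain >= cutoff]
        if not candidates:                # no valid split left: label with the majority class
            return {"terminal": True, "class": Counter(y).most_common(1)[0][0]}
        j_star = rng.choice(candidates)   # controlled randomisation among the best attributes
        theta = splits[j_star][1]
        left = [i for i, row in enumerate(X) if row[j_star] <= theta]
        right = [i for i, row in enumerate(X) if row[j_star] > theta]
        return {"terminal": False, "attribute": j_star, "split_point": theta,
                "left": build_tree_greedy_random([X[i] for i in left], [y[i] for i in left], alpha, rng),
                "right": build_tree_greedy_random([X[i] for i in right], [y[i] for i in right], alpha, rng)}

    def n_leaves(node):
        return 1 if node["terminal"] else n_leaves(node["left"]) + n_leaves(node["right"])

    def grasp_classification_tree(X, y, alpha=0.9, iterations=25, seed=0):
        """Repeat the randomised construction and keep the tree with the fewest leaves."""
        rng = random.Random(seed)
        best_tree, best_leaves = None, math.inf
        for _ in range(iterations):
            tree = build_tree_greedy_random(X, y, alpha, rng)
            leaves = n_leaves(tree)
            if leaves < best_leaves:
                best_tree, best_leaves = tree, leaves
        return best_tree, best_leaves

Under these assumptions, calling grasp_classification_tree(X, y, alpha=0.9, iterations=25) would mirror the α values and the 25-iteration stopping criterion used in the experiments of Section 4.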

We therefore build a classification tree in each iteration, and the final solution is the best solution over all the iterations. The stop criterion can be a maximum number of iterations, a maximum computational time, or a number of iterations with no improvement.

The following considerations and remarks are also worth making:

– From the formulation of the optimization model (1)–(3), and specifically from restriction (2), it follows that the trees built by our GRASP method attain 100% accuracy on the training set, so the method is not a pre-pruning method. It could, of course, be combined with a post-pruning method.
– However, the optimization model can be generalised by replacing restriction (2) with the following one:

Accuracy_training(node_ini) ≥ β,             (5)


for any value β ∈ [0, 1]. Our GRASP method can easily be adapted to this more general model by replacing sentence (4) of Set_node_characteristics_greedy_random with the following one: "If the ratio of cases of the majority class in T is ≥ β, then:

    Terminal = TRUE and class = majority class in T".        (6)

With this modification, and for suitable values of β < 1, our GRASP method can be considered a pre-pruning method.
– A bi-objective formulation could even be considered:

minimize n_leaves(node_ini)
maximize Accuracy_training(node_ini)

subject to:

node_ini ∈ R.

An initial approximation to the efficiency curve could be obtained by executing our adapted GRASP method (for the problem defined by (1), (5) and (3)) for a set of different values of β, and then choosing a solution from that approximation of the efficiency curve.
– These last models (the more general problem and the bi-objective problem) are very interesting fields of research for future work. Specifically, other methods based on metaheuristic strategies for mono-objective and bi-objective optimization can be used (Tabu Search, VNS, Memetic Algorithms, etc.). Other ways of obtaining good generalization/forecasting properties, beyond complexity alone, could also be considered.

4. Computational experiments

In order to check and compare the effectiveness of our GRASP algorithm for building classification trees, we have conducted trials with 17 data sets. These databases belong to the well-known UCI Machine Learning repository of the University of California, Irvine (see Murphy & Aha, 1994), which is available at http://archive.ics.uci.edu/ml/. Table 1 shows the databases used and comments about the changes performed. All the experiments have been run on a Pentium IV 2.4 GHz PC using the Borland Delphi compiler (ver. 5.0).

For each of the 17 databases considered, the traditional algorithm (i.e. choosing the best attribute at each step) and the GRASP method proposed in this paper have been run, using 0.2, 0.4, 0.6, 0.8, 0.9, 0.95 and 0.99 as values of α. In each database a 1 × 10 cross-validation has been carried out to measure and test the methods. Therefore, 170 runs of each method have been performed. For each method and each run the following values are recorded:

– Accuracy: ratio of hits in the test set.
– Complexity: number of leaves.

Moreover, for the GRASP method the following statistics are also calculated:

– N_best: number of trials where the obtained tree has fewer leaves than the tree grown with the traditional method.
– N_ties: number of trials where the obtained tree has the same number of leaves as the tree grown with the traditional method.

As mentioned before, the construction of each tree must ensure a 100% correct classification rate for the examples in A. In the traditional method, the criterion for choosing the best attribute and split point is Information Gain; this is also the guide function used by our GRASP method. In all cases, the Fayyad method is used to form the set of candidate split-point values for each feature. The stopping criterion of our GRASP method is a maximum of 25 iterations.

Table 2 shows, for every database, the statistics corresponding to Complexity: averages, standard deviations and the t statistic of the difference of averages between the GRASP and the traditional method. The t values indicating a significantly lower average Complexity for GRASP trees (at the 10% significance level in a one-tailed test) are shown in bold.

Table 1. Databases used (columns: number of classes, number of attributes, number of instances, instances selected; indented lines give the comments about the changes performed).

Database                  Classes   Attributes   Instances   Selected
SPAM                      2         57           4601        4601
Covertype                 8         54           >580000     6000
    6000 instances from the two first classes were selected randomly.
Financial                 2         93           3000        3000
Wave                      3         40           5000        3347
    All instances from the two first classes were selected.
Optical                   10        64           5620        5620
Conect-4                  3         42           67557       6000
    The 42 nominal attributes were transformed into 126 binary attributes; 6000 cases were selected randomly.
German-Data-Numeric       2         24           1000        1000
Contraceptive             3         9            1473        1473
Primary Tumor Domain      22        17           339         336
    The 22 classes are divided into two groups: the majority of the cases in one, and the remaining ones in the other. The two attributes with the greatest number of missing cases were deleted, and then the cases with missing values were also deleted. Categorical attributes were transformed into 17 binary attributes.
Vehicle                   2         18           946         435
    Only the instances corresponding to the classes bus and Saab in the training set were considered.
Ionosphere                2         34           351         351
Heart                     2         13           270         270
    After transforming categorical attributes into binary ones, 22 attributes were obtained.
Wisconsin Breast Cancer   2         30           569         569
Mushroom                  2         22           8100        5644
    The initial attributes were transformed into 121 binary attributes (one for each binary variable and one for each possible answer of the remaining attributes). The cases with missing values were deleted.
House-Votes-84            2         16           435         281
    The variable with the most missing cases was deleted, and then the cases with missing values were also deleted.
SPECT                     2         22           267         267
Dermatology               6         34           366         366
    The variable "Age", with 8 missing values, was deleted.


Table 2. Statistics of Complexity (number of leaves): mean, standard deviation and t statistic of the difference of means against the traditional method for each value of α; mean and standard deviation for the traditional method.

                          α=0.2     α=0.4     α=0.6     α=0.8     α=0.9     α=0.95    α=0.99    Tradition
SPAM
  mean                    416       291.7     238       215.3     214.2     215       215.2     222.9
  std                     13.904    8.782     4.570     8.274     6.286     6.236     6.812     8.333
  t                       37.670    17.972    5.024     2.047     2.636     2.400     2.262
Covertype
  mean                    1244.4    929.7     765.7     704.8     696.2     700.7     703.7     713.3
  std                     16.985    12.988    9.068     7.525     8.867     8.420     9.487     11.363
  t                       82.185    39.655    11.398    1.972     3.752     2.817     2.051
Financial
  mean                    247.1     184.2     141.5     122.8     120.5     120.3     122.6     125.8
  std                     2.923     2.860     2.461     2.251     2.321     2.541     1.838     2.700
  t                       96.399    46.959    13.591    2.699     4.707     4.691     3.098
Wave
  mean                    287.9     201.7     155.9     142.4     139.8     140.3     143       145.4
  std                     6.839     3.592     3.247     2.221     3.011     2.359     3.859     4.195
  t                       56.167    32.237    6.259     1.999     3.429     3.351     1.332
Optical
  mean                    999.9     602.1     401.6     298.3     288.5     284.5     287.9     293
  std                     21.825    23.019    12.094    5.458     4.428     3.064     5.301     3.859
  t                       100.861   41.879    27.052    2.507     2.423     5.455     2.460
Conect-4
  mean                    1690.8    1234.3    1053.5    1000.5    985.2     988.2     1020.4    1044.7
  std                     26.055    16.984    22.510    14.370    13.653    19.481    24.816    25.254
  t                       56.308    19.700    0.823     4.810     6.554     5.602     2.170
German-Data-Numeric
  mean                    289.4     213.4     184.1     175.4     177.8     179.9     181.7     183
  std                     10.002    4.949     4.332     5.254     5.138     6.590     5.945     5.354
  t                       29.657    13.186    0.505     3.204     2.216     1.155     0.514
Contraceptive
  mean                    731.2     648.7     617.1     606.9     610.3     611.7     614.4     621.8
  std                     6.730     10.812    3.665     7.824     7.334     10.285    9.252     8.715
  t                       31.419    6.125     1.572     4.023     3.193     2.369     1.841
Primary Tumor Domain
  mean                    74.4      71.6      70        70.4      73.2      73.2      73.7      74.6
  std                     4.006     4.671     4.295     4.526     5.534     5.432     5.539     5.967
  t                       0.088     1.252     1.979     1.773     0.544     0.549     0.350
Vehicle
  mean                    26.2      19.8      18.6      17.8      19.2      19.8      21.2      21.3
  std                     1.549     1.476     1.075     0.632     1.619     1.874     2.251     2.312
  t                       5.568     1.729     3.349     4.618     2.353     1.594     0.098
Ionosphere
  mean                    31.2      22.2      17.1      15.7      15.5      16        16.9      17.8
  std                     2.201     1.751     1.197     1.160     1.269     1.054     2.025     1.549
  t                       15.744    5.951     1.131     3.432     3.632     3.038     1.116
Heart
  mean                    61.7      44.4      37.4      36.4      36.8      38.3      38.8      40
  std                     3.335     2.591     2.011     1.578     1.932     2.452     2.860     3.091
  t                       15.091    3.450     2.229     3.280     2.776     1.363     0.901
Wisconsin Breast Cancer
  mean                    27        20.6      17.1      15.6      15.7      16.5      16.9      17.9
  std                     2.160     2.011     1.287     1.174     1.252     0.972     1.449     1.370
  t                       11.249    3.508     1.346     4.031     3.749     2.635     1.586
Mushroom
  mean                    12.7      8.6       7.3       7.3       8.8       9         9         9
  std                     2.163     0.843     0.675     0.675     0.422     0.000     0.000     0.000
  t                       5.410     1.500     7.965     7.965     1.500     0.000     0.000
House-Votes-84
  mean                    21.4      17.1      17.6      18.7      19.3      19.3      19.3      19.3
  std                     2.413     1.449     2.319     1.829     1.829     1.829     1.829     1.829
  t                       2.193     2.982     1.820     0.734     0.000     0.000     0.000
SPECT
  mean                    61.1      51.1      49.2      50.1      51.8      52.5      53.5      54.4
  std                     3.035     2.470     3.521     1.663     2.781     2.224     2.877     2.875
  t                       5.068     2.753     3.617     4.094     2.055     1.653     0.700
Dermatology
  mean                    33.9      23.1      19.8      18.1      18        18        19.3      20
  std                     4.999     3.213     2.098     1.449     1.491     1.155     1.494     1.491
  t                       8.426     2.768     0.246     2.890     3.000     3.354     1.049

Table 3 shows, for every database, the values of N_best, N_ties and N_best + N_ties/2. This last quantity is used in the sign test. This test is recommended because it does not assume any commensurability of scores or differences, nor does it assume a normal distribution, and it is thus applicable to any data set (Demsar, 2006). The values indicating a significantly lower number of terminal leaves for GRASP trees (10% significance level, two-tailed test) are shown in bold.

Table 4 shows, for every database, the statistics corresponding to Accuracy: averages, standard deviations and the t statistic of the difference of averages between the GRASP and the traditional method. The t values indicating a significantly better average Accuracy for GRASP trees (at the 10% significance level in a one-tailed test) are shown in bold.

From Table 2 the following comments can be highlighted:

– In general, the least complex trees are obtained for high values of α (α = 0.8, 0.9 and 0.95). For these values of α, in almost all databases the trees obtained are significantly less complex than those obtained by the traditional method.
– In the databases corresponding to more complex trees (high values of Complexity), good results are also obtained with α = 0.99, for example in SPAM, Covertype, Wave and Optical.


Table 3. N_best, N_ties and N_best + N_ties/2.

                          α=0.2   α=0.4   α=0.6   α=0.8   α=0.9   α=0.95  α=0.99
SPAM
  N_best                  0       0       0       8       9       8       7
  N_ties                  0       0       0       0       0       0       3
  N_best + N_ties/2       0       0       0       8       9       8       8.5
Covertype
  N_best                  0       0       0       7       10      10      10
  N_ties                  0       0       0       0       0       0       0
  N_best + N_ties/2       0       0       0       7       10      10      10
Financial
  N_best                  0       0       0       8       10      10      7
  N_ties                  0       0       0       2       0       0       2
  N_best + N_ties/2       0       0       0       9       10      10      8
Wave
  N_best                  0       0       0       6       7       10      7
  N_ties                  0       0       0       2       3       0       2
  N_best + N_ties/2       0       0       0       7       8.5     10      8
Optical
  N_best                  0       0       0       3       8       10      9
  N_ties                  0       0       0       0       0       0       1
  N_best + N_ties/2       0       0       0       3       8       10      9.5
Conect-4
  N_best                  0       0       4       10      10      10      10
  N_ties                  0       0       0       0       0       0       0
  N_best + N_ties/2       0       0       4       10      10      10      10
German-Data-Numeric
  N_best                  0       0       4       10      9       9       7
  N_ties                  0       0       1       0       0       1       2
  N_best + N_ties/2       0       0       4.5     10      9       9.5     8
Contraceptive
  N_best                  0       0       7       9       10      10      10
  N_ties                  0       0       0       0       0       0       0
  N_best + N_ties/2       0       0       7       9       10      10      10
Primary Tumor Domain
  N_best                  5       8       9       10      8       7       6
  N_ties                  1       1       0       0       1       2       4
  N_best + N_ties/2       5.5     8.5     9       10      8.5     8       8
Vehicle
  N_best                  0       6       10      9       8       7       1
  N_ties                  1       2       0       1       2       3       9
  N_best + N_ties/2       0       7       10      9.5     9       8.5     5.5
Ionosphere
  N_best                  0       0       7       9       8       7       4
  N_ties                  0       0       2       0       2       3       6
  N_best + N_ties/2       0       0       8       9       9       8.5     7
Heart
  N_best                  0       0       9       8       9       8       5
  N_ties                  0       1       0       1       1       2       5
  N_best + N_ties/2       0       0.5     9       8.5     9.5     9       7.5
Wisconsin Breast Cancer
  N_best                  0       0       7       10      10      8       7
  N_ties                  0       0       1       0       0       2       3
  N_best + N_ties/2       0       0       7.5     10      10      9       8.5
Mushroom
  N_best                  0       4       10      10      2       0       0
  N_ties                  0       5       0       0       8       10      10
  N_best + N_ties/2       0       6.5     10      10      6       5       5
House-Votes-84
  N_best                  2       9       8       7       0       0       0
  N_ties                  1       0       2       2       10      10      10
  N_best + N_ties/2       2.5     9       9       8       5       5       5
SPECT
  N_best                  0       9       9       9       9       10      7
  N_ties                  0       0       1       1       1       0       3
  N_best + N_ties/2       0       9       9.5     9.5     9.5     10      8.5
Dermatology
  N_best                  0       0       3       9       9       9       6
  N_ties                  0       1       4       0       1       1       4
  N_best + N_ties/2       0       0.5     5       9       9.5     9.5     8

– In the databases corresponding to less complex trees, good results are also obtained with lower values of α, specifically α = 0.6 and sometimes α = 0.4. This is particularly noticeable in the Mushroom and House-Votes-84 databases.
– In general we can conclude that the more "complex" the database, the higher the recommended value of α; and, overall, the values α = 0.8, 0.9 and 0.95 give the trees with the lowest complexity.

Similar conclusions are obtained from Table 3. From Table 4 it can be highlighted that the best results are obtained with high values of α (α ≥ 0.6), but in general significant differences with respect to the traditional method are very scarce. In order to give a more general view of accuracy, Table 5 shows the means and standard deviations of Accuracy and Relative_Acc over all the databases, where Relative_Acc = Accuracy/(Accuracy of the traditional algorithm). From Table 5 it seems that the best results are obtained, in general, for high values of α, and that for these values (except for α = 0.8) there are slight advantages of our GRASP method over the traditional one. It must be recognised, however, that these differences seem to be very small. We have avoided performing a statistical test because the observations come from different data sets. In our opinion this is because the GRASP method, like the traditional method, is designed to classify all cases in the training set correctly.


Table 4. Statistics of Accuracy: mean, standard deviation and t statistic of the difference of means against the traditional method for each value of α; mean and standard deviation for the traditional method.

                          α=0.2     α=0.4     α=0.6     α=0.8     α=0.9     α=0.95    α=0.99    Tradition
SPAM
  mean                    0.916     0.921     0.928     0.926     0.929     0.926     0.926     0.919
  std                     0.011     0.012     0.015     0.010     0.011     0.017     0.012     0.011
  t                       0.528     0.333     1.508     1.603     2.069     1.047     1.320
Covertype
  mean                    0.744     0.765     0.764     0.767     0.765     0.763     0.765     0.764
  std                     0.014     0.017     0.012     0.023     0.018     0.014     0.009     0.017
  t                       2.859     0.108     0.076     0.348     0.148     0.165     0.272
Financial
  mean                    0.856     0.860     0.858     0.850     0.853     0.853     0.848     0.854
  std                     0.017     0.024     0.027     0.013     0.013     0.026     0.026     0.021
  t                       0.234     0.616     0.432     0.465     0.085     0.031     0.498
Wave
  mean                    0.846     0.865     0.855     0.876     0.868     0.867     0.868     0.871
  std                     0.023     0.019     0.024     0.019     0.018     0.017     0.019     0.020
  t                       2.626     0.693     1.654     0.584     0.386     0.431     0.374
Optical
  mean                    0.808     0.866     0.888     0.904     0.909     0.909     0.913     0.909
  std                     0.025     0.016     0.011     0.011     0.010     0.011     0.014     0.012
  t                       11.445    6.873     4.093     1.026     0.071     0.070     0.653
Conect-4
  mean                    0.766     0.779     0.784     0.790     0.783     0.783     0.791     0.789
  std                     0.009     0.014     0.021     0.021     0.012     0.020     0.014     0.015
  t                       4.330     1.593     0.660     0.146     1.059     0.791     0.313
German-Data-Numeric
  mean                    0.672     0.667     0.676     0.699     0.692     0.696     0.705     0.705
  std                     0.064     0.065     0.050     0.053     0.033     0.046     0.027     0.016
  t                       1.577     1.797     1.737     0.346     1.116     0.588     0.000
Contraceptive
  mean                    0.474     0.459     0.469     0.461     0.472     0.472     0.462     0.462
  std                     0.046     0.026     0.047     0.032     0.036     0.029     0.037     0.039
  t                       0.636     0.228     0.352     0.084     0.602     0.660     0.000
Primary Tumor Domain
  mean                    0.794     0.782     0.812     0.803     0.809     0.809     0.815     0.806
  std                     0.087     0.077     0.070     0.070     0.065     0.068     0.072     0.083
  t                       0.302     0.656     0.184     0.086     0.093     0.089     0.262
Vehicle
  mean                    0.952     0.945     0.938     0.931     0.924     0.931     0.924     0.917
  std                     0.025     0.027     0.028     0.032     0.034     0.040     0.030     0.037
  t                       2.406     1.896     1.394     0.886     0.434     0.784     0.450
Ionosphere
  mean                    0.886     0.900     0.895     0.906     0.926     0.906     0.909     0.920
  std                     0.046     0.052     0.057     0.035     0.031     0.033     0.032     0.044
  t                       1.687     0.916     1.121     0.790     0.331     0.815     0.658
Heart
  mean                    0.719     0.737     0.781     0.730     0.744     0.744     0.756     0.759
  std                     0.094     0.064     0.059     0.089     0.083     0.066     0.047     0.059
  t                       1.160     0.810     0.845     0.879     0.462     0.529     0.156
Wisconsin Breast Cancer
  mean                    0.916     0.921     0.948     0.929     0.934     0.941     0.914     0.932
  std                     0.045     0.028     0.035     0.028     0.025     0.033     0.035     0.031
  t                       0.933     0.805     1.080     0.269     0.140     0.624     1.213
Mushroom
  mean                    1         1         1         1         1         1         1         1
  std                     0         0         0         0         0         0         0         0
  t                       0         0         0         0         0         0         0
House-Votes-84
  mean                    0.961     0.957     0.943     0.932     0.943     0.936     0.939     0.932
  std                     0.036     0.044     0.045     0.043     0.051     0.047     0.048     0.049
  t                       1.486     1.195     0.514     0.006     0.473     0.166     0.330
SPECT
  mean                    0.788     0.785     0.765     0.769     0.762     0.777     0.765     0.750
  std                     0.093     0.085     0.093     0.103     0.097     0.087     0.084     0.093
  t                       0.926     0.867     0.370     0.439     0.271     0.670     0.389
Dermatology
  mean                    0.919     0.956     0.947     0.942     0.942     0.942     0.944     0.947
  std                     0.040     0.027     0.036     0.042     0.046     0.044     0.035     0.028
  t                       1.799     0.684     0.000     0.348     0.326     0.336     0.198

Table 5. Accuracy and Relative_Acc over all databases.

                          α=0.2     α=0.4     α=0.6     α=0.8     α=0.9     α=0.95    α=0.99    Tradition
Accuracy
  mean                    0.825     0.833     0.838     0.836     0.838     0.838     0.838     0.837
  std                     0.134     0.136     0.132     0.133     0.132     0.130     0.129     0.131
Relative_Acc
  mean                    0.987     0.996     1.002     0.999     1.002     1.002     1.002
  std                     0.076     0.064     0.056     0.042     0.041     0.039     0.031


Overfitting is therefore assumed in both cases, and for that reason there are no significant improvements in forecasting. On the other hand, some specific ideas for adapting our GRASP method in order to obtain trees with less overfitting, and thus to improve its forecasting properties, have been outlined at the end of Section 3.

5. Conclusions

In this paper a new method has been proposed for building classification trees, specifically binary trees, with the lowest possible complexity. This method is based on the GRASP strategy and basically consists in modifying the way the feature determining the split at each node is selected. While traditionally the best feature is selected, with our methodology a list of the best attributes is built and one of them is chosen at random. In other words, our method combines randomness with the traditional criteria and functions for selecting the feature in a well-balanced way. The method can be used with different purity functions or criteria for selecting the feature (Information Gain, Gain Ratio, etc.), with different stopping criteria, with different ways of determining the set of split points, or with different post-pruning methods.

The results obtained show that, with a suitably sized candidate list, significantly less complex trees are achieved than with the traditional method. Interpretation is consequently easier and a balance between optimization and generalization in the test set is established. There is also a slight (but non-significant) improvement in test set accuracy. Our method is therefore as accurate as the traditional method but obtains less complex trees. Although, generally speaking, the best results are obtained with small lists (high α values, i.e. less randomness), higher randomness may be advisable for simple databases. Finally, with the modifications outlined at the end of Section 3, our method could be used as a pruning strategy. This last issue leaves open an interesting field for future research.

Acknowledgements

The authors are grateful for financial support from the Spanish Ministry of Education and Science and FEDER funds (National Plan of R&D, Project ECO2008-06159/ECON), from the Junta de Castilla y León (Project BU008A10-2) and from Caja de Burgos and the University of Burgos (Projects H04J.0G and H02.0I).

References

Abrahams, A. S., Becker, A. B., Sabido, D., D'Souza, R., Makriyiannis, G., & Krasnodebski, M. (2009). Inducing a marketing strategy for a new pet insurance company using decision trees. Expert Systems with Applications, 36(2), 1914–1921.
Aitkenhead, M. J. (2008). A co-evolving decision tree classification method. Expert Systems with Applications, 34(1), 18–25.
Baragona, R., Battaglia, F., & Cucina, D. (2004). Fitting piecewise linear threshold autoregressive models by means of genetic algorithms. Computational Statistics & Data Analysis, 47(2), 277–295.
Belacel, N., Hansen, P., & Mladenović, N. (2003). Fuzzy J-Means: A new heuristic for fuzzy clustering. Pattern Recognition, 35, 2193–2200.
Belacel, N., Raval, H. B., & Punnen, A. P. (2007). Learning multicriteria fuzzy classification method PROAFTN from data. Computers & Operations Research, 34(7), 1885–1898.
Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. J. (1984). Classification and regression trees. Belmont: Wadsworth International Group.
Cadima, J., Cerdeira, J. O., & Minhoto, M. (2004). Computational aspects of algorithms for variable selection in the context of principal components. Computational Statistics & Data Analysis, 47(2), 225–236.

Chen, Y. L., & Hung, L. T. H. (2009). Using decision trees to summarize associative classification rules. Expert Systems with Applications, 36(2), 2338–2351.
Chen, Y. L., Hu, H. W., & Tang, K. (2009). Constructing a decision tree from data with hierarchical class labels. Expert Systems with Applications, 36(3), 4838–4847.
Demsar, J. (2006). Statistical comparison of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). John Wiley.
Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87–102.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th international joint conference on artificial intelligence (pp. 1022–1027).
Feo, T. A., & Resende, M. G. C. (1989). A probabilistic heuristic for a computationally difficult set covering problem. Operations Research Letters, 8, 67–71.
Feo, T. A., & Resende, M. G. C. (1995). Greedy randomized adaptive search procedures. Journal of Global Optimization, 2, 1–27.
García, F. C., García, M., Melián, B., Moreno, J. A., & Moreno, M. (2006). Solving feature selection problem by a parallel scatter search. European Journal of Operational Research, 169(2), 477–489.
Gatu, C., & Kontoghiorghes, E. J. (2003). Parallel algorithms for computing all possible subset regression models using the QR decomposition. Parallel Computing, 29, 505–521.
Gatu, C., & Kontoghiorghes, E. J. (2005). Efficient strategies for deriving the subset VAR models. Computational Management Science, 2(4), 253–278.
Hofmann, M., Gatu, C., & Kontoghiorghes, E. J. (2007). Efficient algorithms for computing the best subset regression models for large-scale problems. Computational Statistics and Data Analysis, 52(1), 16–29.
Kapetanios, G. (2007). Variable selection in regression models using nonstandard optimisation of information criteria. Computational Statistics and Data Analysis, 52(1), 4–15.
Kirkos, E., Spathis, C., & Manolopoulos, Y. (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4), 995–1003.
Muller, W., & Wiederhold, E. (2002). Applying decision tree methodology for rules extraction under cognitive constraints. European Journal of Operational Research, 136(2), 282–289.
Murphy, P. M., & Aha, D. W. (1994). UCI repository of Machine Learning. University of California, Department of Information and Computer Science.
Pacheco, J., Casado, S., & Núñez, L. (2009). A variable selection method based in tabu search for logistic regression models. European Journal of Operational Research, 199(1), 506–511.
Pacheco, J., Casado, S., Núñez, L., & Gómez, O. (2006). Analysis of new variable selection methods for discriminant analysis. Computational Statistics and Data Analysis, 51(3), 1463–1478.
Pacheco, J., & Valencia, O. (2003). Design of hybrids for the minimum sum-of-squares clustering problem. Computational Statistics and Data Analysis, 43(2), 235–248.
Paleologo, G., Elisseeff, A., & Antonini, G. (2010). Subagging for credit scoring models. European Journal of Operational Research, 201(2), 490–499.
Pendharkar, P. C. (2006). A data mining-constraint satisfaction optimization problem for cost effective classification. Computers & Operations Research, 33(11), 3124–3135.
Pitsoulis, L. S., & Resende, M. G. C. (2002). Greedy randomized adaptive search procedures. In P. M. Pardalos & M. G. C. Resende (Eds.), Handbook of applied optimization (pp. 168–182). Oxford University Press.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90.
Tsai, C. F., & Chiou, Y. J. (2009). Earnings management prediction: A pilot study of combining neural networks and decision trees. Expert Systems with Applications, 36(3), 7183–7191.
Weiss, S. M., & Kulikowski, C. (1991). Computer systems that learn: Classification and prediction methods from statistics, neural nets, machine learning and expert systems. Morgan Kaufmann Publishers.
Winker, P., & Gilli, M. (2004). Applications of optimization heuristics to estimation and modelling problems. Computational Statistics & Data Analysis, 47(2), 211–223.
Woodruff, D. L., & Reiners, T. (2004). Experiments with, and on, algorithms for maximum likelihood clustering. Computational Statistics & Data Analysis, 47(2), 237–253.
Wu, X., & Urpani, D. (1999). Induction by attribute elimination. IEEE Transactions on Knowledge and Data Engineering, 11(5), 805–812.
Yang, J. Y., & Olafsson, S. (2006). Optimization-based feature selection with adaptive instance sampling. Computers & Operations Research, 33(11), 3088–3106.
Yen, E., & Chu, I. W. M. (2007). Relaxing instance boundaries for the search of splitting points of numerical attributes in classification trees. Information Sciences, 177, 1276–1289.