Chemometrics and Intelligent Laboratory Systems, 22 (1994) 17-23 Elsevier Science Publishers B.V., Amsterdam
Partial least squares and classification and regression trees

C.H. Yeh, The Procter & Gamble Company, Regulatory and Clinical Development, Sharon Woods Technical Center, 11370 Reed Hartman Highway, Cincinnati, OH 45241-2422 (USA)
C.H. Spiegelman, Texas A&M University, Department of Statistics, College Station, TX 77843-3143 (USA)

(Received 30 June 1992; revised manuscript received 13 July 1993)
Yeh, C.H. and Spiegelman, C.H., 1994. Partial least squares and classification and regression trees. Chemometrics and Intelligent Laboratory Systems, 22: 17-23.

High-dimensional data are found in many scientific fields. Difficulties arise when classical classification methods are applied to these high-dimensional data sets, because of multicollinearity. These problems can be overcome by reducing the dimensions of the data sets. The partial least squares (PLS) method is used here for the dimension reduction, and the classification and regression trees (CART) method is applied to the reduced data to solve the classification problems. A new stopping criterion for the PLS procedure is introduced.
Current classification methods are not easily applied to high-dimensional data sets, owing to the ill-posed structure of such data. However, these classification problems can be simplified by reducing the dimensions of the data sets and then applying a classification method to the reduced data. In this paper, we use the partial least squares (PLS) method for dimension reduction and classification and regression trees (CART) for solving the classification problems. Partial least squares methods are common chemometric techniques for dimension reduction. The PLS algorithm and its properties have been studied by several authors, such as Wold et al. [1], Naes and Martens [2], Höskuldsson [3], Otto and Wegscheider [4], Ståhle and Wold [5], Helland [6], Frank [7] and Thomas and Haaland [8]. It is found that partial least squares is often a reasonable choice for dimension reduction when model-based reduction is not feasible.

Correspondence to: C.H. Spiegelman, Texas A&M University, Department of Statistics, College Station, TX 77843-3143 (USA).
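To illustrate the kind of dimension reduction PLS performs, here is a minimal NIPALS-style PLS1 sketch (univariate response) in the setting of this paper: 20 samples and 150 wavelengths. This is a generic textbook implementation, not the authors' code, and the function name `pls1_scores` is ours; each component extracts a latent score t = Xw and deflates X and y before the next component is extracted.

```python
# Minimal PLS1 (NIPALS-style) sketch for dimension reduction.
# Generic illustration only; not the implementation used in the paper.
import numpy as np

def pls1_scores(X, y, n_components):
    """Return the n x n_components matrix T of PLS latent-variable scores."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    scores = []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)            # weight vector
        t = X @ w                         # latent scores
        load_p = X.T @ t / (t @ t)        # X loadings
        q = (y @ t) / (t @ t)             # y loading
        X -= np.outer(t, load_p)          # deflate X
        y -= q * t                        # deflate y
        scores.append(t)
    return np.column_stack(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 150))            # 20 samples, 150 wavelengths
y = X[:, 0] - X[:, 1]                     # toy response
T = pls1_scores(X, y, 2)                  # reduced 20 x 2 representation
print(T.shape)                            # -> (20, 2)
```

With this deflation the successive score vectors are mutually orthogonal, which is what makes the reduced representation well conditioned for a subsequent classifier.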
© 1994 Elsevier Science Publishers B.V. All rights reserved
The CART methodology was developed from the idea of 'partitioning'. It partitions the measurement space into several subspaces so that each subspace has a distinct character and contains samples with similar properties. Breiman et al. [9] extended previous work such as Belson [10], Henrichon and Fu [11], Meisel and Michalopoulos [12], Friedman [13], Breiman and Stone [14], Gordon and Olshen [15], and Mabbett et al. [16]. The CART methodology uses a binary classification tree and imposes very few assumptions on the structure of data sets. It gives insight into, and an understanding of, the structure of data sets. We feel that CART is a flexible nonparametric tool for analyzing classification problems.
The CART method is fully explained in ref. 9; we provide an abbreviated presentation below. A binary classification tree is constructed in three steps: the first step is 'growing' a tree, the second is 'pruning' the grown tree, and the last is 'assigning' classes to the terminal nodes of a right-sized tree.
Fig. 1. The six-class binary classification tree.
The detailed procedures for constructing a classification tree are described in ref. 9.

Step 1. Growing a binary tree

To give a more precise formulation for growing a binary tree, we need the following explanations. Arrange the set of variables for each observation and define the variables (x1, x2, ..., xp) as the measurement vector X corresponding to the observation. Define the measurement space as containing all possible measurement vectors. A classifier or classification rule is a partition of the measurement space into disjoint subsets A1, A2, ..., AJ, with X1 = ∪j Aj, such that for every measurement vector X ∈ Aj the predicted class is j. Binary tree structured classifiers are constructed by repeatedly splitting subsets of the measurement space into two descendant subsets. This procedure is displayed in Fig. 1. Here X2 and X3 are disjoint, with X1 = X2 ∪ X3. Similarly, X4 and X5 are disjoint with X2 = X4 ∪ X5, and X6 and X7 are disjoint with X3 = X6 ∪ X7. The subsets that are not split into two descendant subsets, for example X6, X8, X9, X10, X11, X12, X14 and X15, are called terminal subsets. Each
terminal subset is designated by a class label; there may be two or more terminal subsets with the same class label. The classification rule is obtained by putting together all terminal subsets with the same class label; for example, A1 = X10 ∪ X14.
The splits are decided by conditions on the components of X = (x1, x2, ..., xp). For example, split 1 of X1 could be of the form X2 = {X : x3 ≤ 5} and X3 = {X : x3 > 5} for a numerical variable, or of the form X2 = {X : x3 is female} and X3 = {X : x3 is male} for a categorical variable. After the split is found, it is used to determine whether X goes to X2 or X3: X goes to the left if x3 ≤ 5 and otherwise goes to the right. When X reaches a terminal subset, its predicted class is given by the class label attached to that terminal subset. Using the terminology of tree theory, we define a subset Xj of the measurement space as a node t, and X1 as the root node. In the process of growing trees, the important task is to find splits such that the data in the 'after-split' subsets (descendant nodes) are purer than the data in the 'before-split' subsets (parent nodes). The splits that give purer descendant nodes are found as follows:
(1) Define the node proportion p(j|t) as the proportion of observations in node t belonging to class j, where j = 1, 2, ..., J, J is the number of classes, and p(1|t) + p(2|t) + ... + p(J|t) = 1.
(2) Define a node impurity measure of node t, denoted i(t), as a non-negative function φ of p(1|t), p(2|t), ..., p(J|t). The function φ has the following properties: (i) φ is a maximum at the point (1/J, 1/J, ..., 1/J), and (ii) φ achieves its minimum at the points (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, 0, ..., 1). These two properties mean that the node impurity is largest when observations from different classes are equally mixed together in a node, and smallest when a
node contains only one class.
(3) Define a set of possible binary splits s at each node. Each split divides a node t into two descendant nodes, tL and tR, such that a proportion pL of the observations in node t go to tL and a proportion pR go to tR. The decrease in the node impurity measure is defined as

Δi(s, t) = i(t) − pL i(tL) − pR i(tR)

At each node, a search is made through all the possible splits to find the split s* that gives the largest decrease in node impurity measure, that is,

Δi(s*, t) = max_s Δi(s, t)
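The exhaustive search for the best split can be sketched as follows. This is a minimal illustration assuming the Gini diversity index i(t) = 1 − Σj p(j|t)² (equivalently Σ_{i≠j} p(i|t)p(j|t)) as the node impurity, a single numeric variable, and splits of the form x ≤ c; the data layout and function names are ours, not the paper's implementation.

```python
# Toy sketch of the best-split search in step 1, maximizing
# di(s, t) = i(t) - pL*i(tL) - pR*i(tR) over thresholds x <= c.
from collections import Counter

def gini(labels):
    """Gini diversity index of a node: 1 - sum_j p(j|t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, impurity decrease) of the best split x <= c."""
    parent = gini(ys)
    best = (None, 0.0)
    for c in sorted(set(xs))[:-1]:        # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        pl, pr = len(left) / len(ys), len(right) / len(ys)
        decrease = parent - pl * gini(left) - pr * gini(right)
        if decrease > best[1]:
            best = (c, decrease)
    return best

# Two well-separated classes: the best split falls between them.
xs = [1.0, 1.2, 1.4, 5.0, 5.2, 5.4]
ys = ['G', 'G', 'G', 'L', 'L', 'L']
print(best_split(xs, ys))   # -> (1.4, 0.5)
```

With p measurement variables, the same search simply runs over all variables as well as all thresholds, and the (variable, threshold) pair with the largest decrease becomes the splitting rule for the node.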
Then the node t is split into two descendant nodes by the split s*. The split s* is called the 'best' splitting rule because it gives the maximum decrease in node impurity. The same search procedure for the 'best' splitting rule is repeated on the descendant nodes, and continues until there are only a few observations in the descendant nodes. When a node reaches a point where there is no significant decrease in the node impurity measure, it becomes a terminal node. Denote the set of terminal nodes by T̃ and set I(t) = i(t)p(t), where p(t) is the proportion of observations in node t. The tree impurity measure I(T) is defined by

I(T) = Σ_{t∈T̃} I(t) = Σ_{t∈T̃} i(t)p(t)

Take any node t* ∈ T̃ and split it into tL and tR using a split s. The new tree T* has impurity measure

I(T*) = Σ_{t∈T̃−{t*}} I(t) + I(tL) + I(tR)

The decrease in the tree impurity is

ΔI(s, t*) = I(T) − I(T*) = I(t*) − I(tL) − I(tR)

Breiman et al. [9] showed that maximizing the decrease in the tree impurity is equivalent to maximizing the decrease in the node impurity. Thus, the split selection procedure can be viewed as a repeated procedure to minimize the overall tree impurity measure.

Step 2. Pruning a grown tree
If the splitting procedure described in step 1 is carried out until each terminal node contains only one observation, then each terminal node is classified by the observation it contains, and the misclassification rate on the training data is zero. In general, more splits (that is, a larger tree) result in a lower apparent misclassification rate. However, too large a
tree will have a higher true misclassification rate than a right-sized tree, just as entering additional variables causes the residual sum of squares to decrease in stepwise regression. Therefore, the tree grown in step 1 needs 'pruning upward' until an appropriately sized tree is found. The pruning procedure is based on the idea of weakest-link cutting: it prunes off the links that are not necessary for reducing the misclassification rate R(t). For an arbitrary node, Breiman et al. [9] showed that R(t) ≥ R(tL) + R(tR), where tL and tR are the two descendant nodes of node t. When R(t) = R(tL) + R(tR), the split of node t does not help reduce the misclassification rate, and one should prune off tL and tR. Let T be a tree with branch Tt, where the branch Tt is defined as the node t ∈ T together with all descendant nodes of t in T. The pruning procedure is implemented as follows. For any branch Tt, define a cost-complexity measure

Rα(Tt) = R(Tt) + α|T̃t|

where R(Tt) is the misclassification rate for branch Tt, α is a cost-complexity parameter and |T̃t| is the number of terminal nodes of branch Tt. For a node t, define Rα(t) = R(t) + α. As long as Rα(Tt) < Rα(t), the branch Tt has a smaller cost complexity than the single node t; this means that the split of node t does help reduce Rα(Tt), in other words the split is necessary. As the tree grows (more splits), the misclassification rate decreases while the cost-complexity parameter α increases. For any branch Tt there is a value of α for which Rα(Tt) and Rα(t) are equal, so that the node and its branch have the same cost complexity. At that point the split is unnecessary and the node t is considered the weakest link, so the branch Tt is pruned from tree T. The pruning procedure repeats the process on the pruned tree Tpruned = T − Tt, finding the weakest-link node until no more pruning is needed. The right-sized tree is thus chosen.
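Weakest-link selection can be sketched numerically. Setting Rα(t) = Rα(Tt) and solving gives the critical value α = (R(t) − R(Tt)) / (|T̃t| − 1) for each internal node; the node with the smallest critical α is the weakest link. The data layout below (a dict of hypothetical nodes with precomputed branch costs) is ours, for illustration only.

```python
# Illustrative weakest-link computation for cost-complexity pruning.
# Each entry stores (R(t), R(T_t), |T~_t|) for an internal node.
def weakest_link(nodes):
    """Return (name, alpha) of the node with the smallest critical alpha,
    where alpha = (R(t) - R(T_t)) / (|T~_t| - 1)."""
    best_name, best_alpha = None, float('inf')
    for name, (r_node, r_branch, n_term) in nodes.items():
        alpha = (r_node - r_branch) / (n_term - 1)
        if alpha < best_alpha:
            best_name, best_alpha = name, alpha
    return best_name, best_alpha

# Hypothetical branches: node 'b' gains little from its subtree,
# so it is pruned first (smallest critical alpha).
nodes = {
    'root': (0.50, 0.10, 5),   # R(t)=0.50, R(T_t)=0.10, 5 terminal nodes
    'a':    (0.30, 0.10, 3),
    'b':    (0.12, 0.10, 2),
}
print(weakest_link(nodes))     # weakest link is 'b'
```

Pruning at the weakest link, recomputing, and repeating produces the nested sequence of subtrees from which the right-sized tree is chosen.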
The right-sized tree, called the optimal tree, has the minimum cost-complexity measure.

Step 3. Assigning classes

The last step of constructing a classification tree is to assign classes to the terminal nodes of the optimal tree. A class assignment rule assigns a class j ∈ {1, 2, ..., J} to every terminal node t ∈ T̃. The class of a terminal node is determined by p(j*|t) = max_j p(j|t), called the plurality rule: a node t is assigned to class j* if the majority of the observations in node t are of class j*. In the case where there are equal numbers of observations from each class in a terminal node, the node can be arbitrarily assigned to one of those classes.
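The plurality rule amounts to a majority vote within each terminal node; a minimal sketch (function name ours):

```python
# Plurality rule: assign a terminal node the class with the largest
# node proportion p(j|t).  Ties are broken arbitrarily, as in the text
# (here, by whichever tied class Counter happens to return first).
from collections import Counter

def assign_class(labels_in_node):
    """Return the majority class of the observations in a terminal node."""
    return Counter(labels_in_node).most_common(1)[0][0]

print(assign_class(['G', 'L', 'G', 'G']))   # -> G
```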
IMPURITY FUNCTION AND THE NUMBER OF PLS COMPONENTS
There are several stopping criteria for deciding the number of PLS components. For example, the number of PLS components can be determined by the norm of the residuals of Y, ||F||, where ||F|| is defined as the square root of the sum of the squared elements of the residual matrix F. The PLS procedure stops extracting components when ||F|| reaches a specified threshold. Another criterion is to choose the number of PLS components that gives a minimum prediction residual sum of squares (PRESS). Höskuldsson [3] suggested a likelihood ratio test as a criterion for deciding the number of components. However, the criteria described above are appropriate for numerical response variables; for categorical response variables they are not suitable.

In CART, the tree impurity measure is used to indicate how well the observations of a data set are classified. Hence, we propose using it as a criterion for determining the number of PLS components when the response variables are categorical. Using an impurity measure to determine the number of PLS components is based on the concept that PLS extracts the most significant components in the X and Y spaces and CART uses the most significant (or most informative) variables to separate observations.

There are three kinds of impurity measures used in Breiman's CART: the Gini criterion, the Gini diversity index and the twoing criterion. The Gini diversity index,

i(t) = Σ_{i≠j} p(i|t)p(j|t)

is adopted here as the impurity measure of a node. The tree impurity measure is I(T) = Σ_{t∈T̃} I(t) = Σ_{t∈T̃} i(t)p(t), where T̃ is the set of terminal nodes of a tree T. The Gini diversity index is used here because it has an interesting interpretation: (i) i(t) = Σ_{i≠j} p(i|t)p(j|t) is the estimated probability of misclassification under a rule that assigns an observation in node t to class j with probability p(j|t); (ii) the Gini diversity index equals Σ_j p(j|t)[1 − p(j|t)], the sum of the sample variances, p(j|t)[1 − p(j|t)], of the class indicators of the observations in node t.

We propose the following stopping rule for adding PLS components: at each stage of the PLS procedure, construct the classification tree and calculate its impurity measure. Stop the PLS procedure when the addition of the next PLS component to the CART procedure does not produce a tree with a lower impurity measure than the current tree. The number of PLS components is thus determined. This procedure selects the number of PLS components that gives a minimum tree impurity measure; the tree impurity can thus be expressed as a function of A, I(T_A), where A is the number of PLS components.
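The proposed stopping rule can be sketched as a simple loop over the number of components. The function below is our illustration: in practice the list of impurities would come from growing a CART tree on the first A PLS scores for each A; here it is stubbed with the impurity values reported later for the rice example, and the minimum-decrease threshold of 0.1 used in the examples is exposed as a parameter.

```python
# Sketch of the impurity-based stopping rule for the number of
# PLS components A.  impurity_per_a[k] is the tree impurity I(T_A)
# obtained with A = k + 1 components.
def choose_n_components(impurity_per_a, min_decrease=0.1):
    """Add PLS components while the tree impurity keeps dropping
    by at least `min_decrease`; return the chosen A."""
    best_a, best_impurity = 1, impurity_per_a[0]
    for a, impurity in enumerate(impurity_per_a[1:], start=2):
        if best_impurity - impurity < min_decrease:
            break                      # no worthwhile improvement: stop
        best_a, best_impurity = a, impurity
    return best_a

# Impurities I(T_A) for A = 1, 2, 3 as reported for the rice example.
print(choose_n_components([0.29, 0.145, 0.145]))   # -> 2
```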
Two examples are used here to illustrate the proposed methods. The first data set is a rice data set provided by the US Department of Agriculture as an illustrative example, to which we apply PLS with CART. The data set consists of twenty observations on two kinds of rice, labeled 'G' and 'L'. The researchers are interested in analyzing the protein composition of the rice so that they can separate the two kinds. High-performance liquid chromatography (HPLC) was used to analyze the protein composition; there are 150 wavelengths for each observation.

Before the dimension reduction procedure was done, CART was applied to the original rice data set to examine the misclassification result. Since the dimension is much larger than the sample size, we would not expect the misclassification rate to be satisfactory. The result showed that eight out of ten rice L observations were misclassified; the 20-fold cross-validation misclassification rate was 0.55. Therefore, we reduced the dimension of the data set and iteratively applied CART to the reduced data, with the aim of better classifying the rice. The iteration was used to decide the number of PLS components. The classification tree for this example can be seen in Fig. 2, and the impurity measure and the result of deciding the number of components are shown in Fig. 3. The stopping criterion was: if the decrease in tree impurity is less than 0.1, stop adding components. The first latent variable was entered into CART and separated the observations into two descendant nodes based on the split t1 ≤ −0.628; the impurity measure was 0.29. Then the second latent variable was added into CART and separated the observations in the descendant nodes based on the splits t2 ≤ −0.075 and t2 < −0.455; the impurity measure was 0.145. When the third latent variable was entered, however, the impurity measure remained the same: the decrease in tree impurity was 0. Thus we stopped the procedure and set the number of components to two.

Fig. 2. The classification tree for the rice example.

Fig. 3. The number of PLS components for the rice example.

To compare the performance of the Fisher linear discriminant method and CART, we applied both to the reduced data set. The first and second PLS components were used to calculate the discriminant function and classify the observations, and the n-fold cross-validation misclassification rate was calculated. Table 1 displays the results of applying the Fisher linear discriminant method and CART to the rice example; in the table, R(cv) is the misclassification rate using 20-fold cross-validation. In this example CART is the better method, since it gives a lower 20-fold cross-validation misclassification rate: 0.20 for CART against 0.25 for Fisher.

The second example we use to illustrate the method is a detergent data set. This data set contains 29 spectra that were collected from two batch mixers, mixers 1 and 2. On a certain day of the week, liquid detergent samples were collected from the mixers for several consecutive weeks; each sample was then run through spectroscopy and its spectrum recorded. We wish to know which mixer an out-of-spec sample came from.

Before the dimension reduction procedure was done, CART was applied to the original detergent data set to examine the misclassification result. The result is not satisfactory: the misclassification rate is 0.205. Since the covariance matrix for this data set is not of full rank, the Fisher linear discriminant method cannot be applied to the original data. The PLS method was used to reduce the dimension, and CART was used to assign samples to one of the mixers. The stopping criterion was again: if the decrease in tree impurity is less than 0.1, stop the procedure. The first latent variable was entered into CART and separated the observations into two descendant nodes based on the split t1 ≥ −0.01; the impurity measure is 0.124. Then the second latent variable was added into CART and separated the observations into two descendant nodes based on the split t2 > 0.06; the impurity measure is 0.06. The classification tree for this example is shown in Fig. 4.

Fig. 4. Classification tree for the detergent data set.
TABLE 1
Results of applying the Fisher linear discriminant method and CART to the rice data

                                Observations are classified as
                                G       L       R(cv)
Fisher; observations are from   7       3       0.25
CART; observations are from     8       2       0.20
Fig. 5. The number of PLS components for the detergent data.
TABLE 2
Results of applying the Fisher linear discriminant method and CART to the detergent data

                                Observations are classified as
                                Mixer 1   Mixer 2   R(cv)
CART; observations are from                         0.06
Fisher; observations are from                       0.11
The impurity measure and the result of deciding the number of components are shown in Fig. 5. To compare the performance of the Fisher linear discriminant method and CART, we applied both to the reduced data set. The first and second PLS components were used to calculate the discriminant function and classify the observations, and the n-fold cross-validation misclassification rates were calculated. Table 2 displays the results of using the Fisher linear discriminant method and CART. In this example, CART also gave better results than the Fisher method: the 29-fold cross-validation misclassification rate for CART is 0.06, whereas with the Fisher method it is 0.11.
C.H. Spiegelman’s work was in part supported by a grant from the US National Science Foundation.
REFERENCES

1 S. Wold, H. Martens and H. Wold, The multivariate calibration problem in chemistry solved by the PLS method, in A. Ruhe and B. Kågström (Editors), Proceedings of the Conference on Matrix Pencils, March 1982 (Lecture Notes in Mathematics), Springer, Heidelberg, 1983, pp. 286-293.
2 T. Naes and H. Martens, Comparison of prediction methods for multicollinear data, Communications in Statistics, Simulation and Computation, 14 (1985) 546-576.
3 A. Höskuldsson, PLS regression methods, Journal of Chemometrics, 2 (1988) 211-228.
4 M. Otto and W. Wegscheider, Spectrophotometric multicomponent analysis applied to trace metal determinations, Analytical Chemistry, 57 (1985) 63-69.
5 L. Ståhle and S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study, Journal of Chemometrics, 1 (1987) 185-196.
6 I.S. Helland, On the structure of partial least squares regression, Communications in Statistics, Simulation and Computation, 17 (1988) 581-607.
7 I.E. Frank, Comparative Monte Carlo Study of Biased Regression Techniques, Technical Report No. 105, Department of Statistics, Stanford University, Palo Alto, CA, 1989.
8 E. Thomas and D.M. Haaland, Comparison of multivariate calibration methods for quantitative spectral analysis, Analytical Chemistry, 62 (1990) 1091-1099.
9 L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, Wadsworth, Pacific Grove, CA, 1984.
10 W.A. Belson, Matching and prediction on the principle of biological classification, Applied Statistics, 8 (1959) 65-75.
11 E.G. Henrichon and K.S. Fu, A nonparametric partitioning procedure for pattern classification, IEEE Transactions on Computers, C-18 (1969) 614-624.
12 W. Meisel and D.A. Michalopoulos, A partitioning algorithm with application in pattern classification and the optimization of decision trees, IEEE Transactions on Computers, C-22 (1973) 93-103.
13 J.H. Friedman, A recursive partitioning decision rule for nonparametric classification, IEEE Transactions on Computers, C-26 (1977) 404-408.
14 L. Breiman and C.J. Stone, Parsimonious Binary Classification Trees, Technical Report, Technology Service Corporation, Santa Monica, CA, 1978.
15 L. Gordon and R.A. Olshen, Consistent nonparametric regression from recursive partitioning schemes, Journal of Multivariate Analysis, 10 (1980) 611-627.
16 A. Mabbett, M. Stone and J. Washbrook, Cross-validatory selection of binary variables in differential diagnosis, Applied Statistics, 29 (1980) 198-204.
17 R.J. Light and B.H. Margolin, An analysis of variance for categorical data, Journal of the American Statistical Association, 66 (1971) 534-544.