Analytica Chimica Acta 477 (2003) 187–200

Comparison of supervised pattern recognition methods with McNemar's statistical test: application to qualitative analysis of sugar beet by near-infrared spectroscopy

Y. Roggo*, L. Duponchel, J.-P. Huvenne

Laboratoire de Spectrochimie Infrarouge et Raman, CNRS UMR 8516, Bâtiment C5, Université des Sciences et Technologies de Lille, 59655 Villeneuve d'Ascq Cedex, France

Received 11 June 2002; received in revised form 22 October 2002; accepted 30 October 2002

* Corresponding author. Tel.: +33-3-20-43-66-61; fax: +33-3-20-43-67-55. E-mail address: y [email protected] (Y. Roggo).

Abstract

The application of supervised pattern recognition methodology is becoming important within chemistry. The aim of this study is to compare the accuracies of classification methods by means of McNemar's statistical test. Three qualitative parameters of sugar beet are studied: disease resistance (DR), geographical origins and crop periods. Samples are analyzed by near-infrared spectroscopy (NIRS) and by wet chemical analysis (WCA). Firstly, the performances of eight well-known classification methods on NIRS data are compared: Linear Discriminant Analysis (LDA), the K-Nearest Neighbors (KNN) method, Soft Independent Modeling of Class Analogy (SIMCA), Discriminant Partial Least Squares (DPLS), Procrustes Discriminant Analysis (PDA), Classification And Regression Trees (CART), Probabilistic Neural Networks (PNN) and the Learning Vector Quantization (LVQ) neural network. Across the three data sets, SIMCA, DPLS and PDA have the highest classification accuracies. LDA and KNN are not significantly different. The non-linear neural methods give the least accurate results. The three most accurate methods are linear, parametric, modeling methods. Secondly, we want to emphasize the power of near-infrared reflectance data for sample discrimination. McNemar's tests compare classifications developed with WCA or with NIRS data. For two of the three data sets, the classification results are significantly improved by the use of NIRS data. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Classification; Supervised pattern recognition method; McNemar's test; NIRS; Sugar beet

1. Introduction

Supervised pattern recognition refers to techniques in which a priori knowledge about the category membership of samples is used for classification. The classification model is developed on a training set of samples with known categories [1]. The model performance is then evaluated on a validation set by comparing the classification predictions with the true categories. The application of pattern recognition methodology within chemistry [2–4], biology [5], pharmaceutical [6] and food sciences [7] is becoming more important. Near-infrared spectroscopy (NIRS) data are also used to solve classification problems [8,9].


Classification methods are numerous and the main problem is to choose the most accurate one. Previous studies use two types of criteria to compare methods: quantitative criteria [10], such as prediction rate or misclassification percentage, and qualitative criteria [11], such as memory requirement, training speed or ease of interpretation.

The aim is to optimize the determination of qualitative parameters of sugar beet. Three parameters are studied: disease resistance (DR), geographical origins and crop periods. In this study, a statistical test is proposed to compare quantitative criteria: McNemar's test [12] will be used to compare the misclassification rates obtained with different methods in terms of statistical significance. This test is already used in medicine [13,14]. It is chosen because it is the only statistical test with an acceptable type I error (α) and a low type II error (β) [15]. The first kind of error (α) corresponds to the probability of rejecting a member of a class as a non-member (false negative) and the second kind of error (β) is the probability of classifying a non-member as a member (false positive).

Firstly, the performance of different supervised pattern recognition methods will be discussed on the three NIRS data sets, which concern the quality of sugar beet. In this study, linear and non-linear methods [16] are used. Eight well-known algorithms [1] are computed: Linear Discriminant Analysis [17] (LDA), the K-Nearest Neighbors method [18,19] (KNN), Soft Independent Modeling of Class Analogy [20] (SIMCA), Discriminant Partial Least Squares [21] (DPLS), Procrustes Discriminant Analysis [22] (PDA), Classification And Regression Trees [23] (CART), Probabilistic Neural Networks [24] (PNN) and the Learning Vector Quantization [25,26] (LVQ) neural network. Secondly, McNemar's tests will compare classifications using NIR data or wet chemical data in order to underline the advantages of NIR spectroscopy.

2. Experimental

2.1. Near-infrared spectroscopy

A NIR reflectance spectrophotometer (Model 6500, Foss NIRsystem®, Silver Spring, MD, USA) with a large cup (Natural Product Sample Cup® IH 0314P) containing 100 g of beet brei is used. During 1 min, the reference (ceramic) is scanned 10 times and the beet sample is scanned 20 times, at wavelengths ranging from 400 to 2498 nm with a resolution of 2 nm. Between two samples, the cup is washed with distilled water at room temperature and dried; the washing and drying steps take 2 min. Just after the NIR measurement, the wet chemical analyses are carried out.

The NIR spectra (Fig. 1) are pre-treated first with the Standard Normal Variate (SNV) [27] algorithm and then derived twice [28,29] in order to enhance the spectral information and to reduce the baseline drift. SNV is a mathematical transformation of the log(1/R) spectra: the spectral data are reduced and centered by the following calculation:

$$\mathrm{SNV}_i = \frac{x_i - \bar{x}}{\sqrt{\sum_i (x_i - \bar{x})^2/(w-1)}}, \quad i \in [400, 2498\,\mathrm{nm}]$$

where $x_i$ is the log(1/R) value at wavelength i, w the number of wavelengths, $\bar{x}$ the mean of the log(1/R) values and $\mathrm{SNV}_i$ the corrected log(1/R) value at wavelength i.

2.2. Wet chemical analysis (WCA)

All samples are analyzed by the "Syndicat National des Fabricants de Sucre" laboratories (Villeneuve d'Ascq, France). The sucrose, glucose, amino nitrogen, sodium and potassium concentrations are determined for each sample. The samples are analyzed twice and mean values are used. To 26 g of sugar beet brei, 177 g of lead acetate solution is added. The solution is blended for 5 min and filtered on a simple filter paper [30]. The wet chemical analyses are done on the clarified juice. The sucrose content is determined by a polarization measurement [31]. Sodium and potassium contents are measured by flame photometry [32]. Amino nitrogen is estimated by a colorimetric method [32] using ninhydrine (Verbièse®, Wasquehal, France). Glucose is determined by an enzymatic test (GOD-PAP Method, Reference L94111) [33] provided by Hycel® (Pouilly en Auxois, France). Glucose, nitrogen, sodium and potassium are measured by an automatic instrument (LCA Instruments®, La Rochelle, France). The chemical composition of sugar beet (Table 1) shows the complexity of natural products.


Fig. 1. NIRS sugar beet spectra pre-treated by SNV and second derivative.
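As a rough illustration of this pre-treatment, the sketch below applies SNV and a second derivative to spectra stored row-wise in a NumPy array. The Savitzky–Golay filter (window and polynomial order chosen arbitrarily here) stands in for the derivative algorithm of the original software, which is not specified in the paper.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard Normal Variate: centre and scale each log(1/R) spectrum
    by its own mean and standard deviation (w - 1 denominator)."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def pretreat(spectra, window=11, polyorder=2):
    """SNV followed by a second derivative (Savitzky-Golay filter)."""
    return savgol_filter(snv(spectra), window_length=window,
                         polyorder=polyorder, deriv=2, axis=1)

if __name__ == "__main__":
    # Synthetic placeholder: 10 samples, 1050 wavelengths (400-2498 nm, 2 nm step)
    rng = np.random.default_rng(0)
    X = rng.random((10, 1050))
    print(pretreat(X).shape)  # (10, 1050)
```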

2.3. Data sets

Three qualitative parameters—disease resistance, sugar beet origins (SBO) and crop periods—are predicted with chemical or NIR data.

2.3.1. Disease resistance

Beets affected by rhizomania show a massive proliferation of feeder roots and an atrophied tap root. The disease resistance training set is composed of 83 beet samples: 42 samples are resistant (R) to rhizomania and 41 are not resistant (NR). The DR validation set is also composed of 83 samples (42 NR and 41 R samples). NR and R samples are randomly split between the training and validation sets.

2.3.2. Sugar beet origins

The origins are representative of different production areas of France and are coded by an integer between 0 and 7. For each origin, samples are randomly split into training and validation sets. The sugar beet origin training set is composed of 320 samples covering eight different geographical origins (40 samples for each origin).

Table 1
Chemical composition of sugar beet

                                     Minimum   Maximum   Mean    Standard deviation
Sucrose (g/100 g)                    14        21        17.6    1.0
Glucose (g/100 g)                    0.01      0.10      0.05    0.04
Potassium (10⁻³ mol/100 g)           2         5         3.5     0.5
Sodium (10⁻³ mol/100 g)              0.05      1.70      0.32    0.20
Amino nitrogen (10⁻³ mol/100 g)      0.2       2.3       0.6     0.3


The SBO validation set also contains 320 samples (40 samples for each origin).

2.3.3. Crop periods

The crop is divided into four periods from September to December (coded 0–3). For each period, samples are split into training and validation sets. The crop period (CP) training set contains 262 samples: 37 samples harvested in September, 75 in October, 75 in November and 75 in December. The CP validation set is also composed of 262 samples (37 for period 0, 75 for period 1, 75 for period 2 and 75 for period 3).

2.4. Software and hardware

All methods (except DPLS) are computed with Matlab® (Version 6.0, The MathWorks Inc., Natick, USA). The classification tree and PDA are implemented by the authors. SIMCA and LDA are available in the PLS Toolbox® (Eigenvector Research). PNN and LVQ are computed with the Neural Network Toolbox® (Version 4.0, The MathWorks Inc., Natick, USA). DPLS is computed with Winisi (Infrasoft®, Port Matilda, USA). The computer has an AMD Athlon™ processor (1.4 GHz) with 768 MB of RAM.
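The class-stratified random splits of Section 2.3 can be reproduced, for instance, with scikit-learn; the arrays below are synthetic placeholders, not the original data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# X: spectra or concentrations, y: class labels (e.g. origin codes 0-7)
rng = np.random.default_rng(1)
X = rng.random((640, 1050))
y = np.repeat(np.arange(8), 80)

# 50/50 split preserving the class proportions (stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_val))  # 40 samples per class in each set
```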

3. Method description

3.1. Classification methods

There are three main distinctions between supervised pattern recognition algorithms. The first is between methods focusing on discrimination, such as LDA, and those that put more emphasis on similarity within a class, such as SIMCA. The second concerns linear versus non-linear methods. The third distinction is between parametric and non-parametric methods: in parametric techniques such as LDA, statistical parameters of the normal distribution of the samples are used in the decision rules. The supervised pattern recognition algorithms used in this study are the following.

3.1.1. LDA

Linear Discriminant Analysis is a linear and parametric method with a discriminating character [17,35]. LDA focuses on finding optimal boundaries between classes. Like principal component analysis (PCA), LDA is a feature reduction method. However, while PCA selects the directions that retain maximal structure of the data in a lower dimension, LDA selects the directions that achieve maximum separation among the different classes [1]. LDA uses the Euclidean distance to classify unknown samples. A linear function

$$D_C = \sum_{j=1}^{m} v_j \bar{x}_{C,j}$$

is computed for each class [16], where $v_j$ is the weight given to variable j, $\bar{x}_{C,j}$ is the mean of variable j in class C and m is the number of variables. For an unknown sample u with variable values $x_{u,j}$, the same function $D_u = \sum_{j=1}^{m} v_j x_{u,j}$ is obtained; u is assigned to the class C whose $D_C$ is nearest to $D_u$. The only condition for applying LDA is that the number of variables (m) should be lower than the number of samples (n). When NIR data are used, m is the number of wavelengths, so a dimension reduction using PCA is needed: the validation set objects are projected onto the 20 principal component (PC) axes of the calibration set. When the WCA data sets are used, m corresponds to the number of components which are analyzed (m = 5).

3.1.2. KNN

K-Nearest Neighbors [18,19] is a non-parametric method. An unknown sample of the validation set is classified according to the majority of its K nearest neighbors in the training set [34]. KNN is simple to compute: the matrix of distances between the validation set samples and all the points of the training set is computed. The Euclidean distance $d_{i,j}$ between two samples i and j is

$$d_{i,j} = \sqrt{\sum_{k=1}^{m} (x_{i,k} - x_{j,k})^2}$$

where m is the number of variables and $x_{i,k}$ is the value of variable k for sample i. When the WCA sets are used, the variables are the concentrations (m = 5); when the NIRS data sets are used, the variables are the absorbances. The K neighbors of an unknown sample are the training samples with the lowest Euclidean distances, and the predicted class is the class having the largest number of objects among these K neighbors. The optimum value of K is determined by a cross-validation procedure [10], i.e. each object in the training set is taken out in turn and considered as a validation sample. The procedure is performed for K = 1 to n − 1 and the value of K which gives the lowest error rate is chosen.
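A minimal sketch of the two discriminant approaches described above, assuming pre-treated spectra X and integer class labels y (synthetic placeholders below); 5-fold cross-validation is used here for brevity where the study uses leave-one-out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((83, 1050)), rng.integers(0, 2, 83)
X_val = rng.random((83, 1050))

# LDA after projection onto 20 principal components of the training set
lda = make_pipeline(PCA(n_components=20), LinearDiscriminantAnalysis())
lda.fit(X_train, y_train)
y_pred_lda = lda.predict(X_val)

# KNN: choose K by cross-validation on the training set, then predict
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5).mean()
          for k in range(1, 10)}
best_k = max(scores, key=scores.get)
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
y_pred_knn = knn.predict(X_val)
```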


3.1.3. Modeling methods

3.1.3.1. SIMCA. Soft Independent Modeling of Class Analogy [20] uses the modeling properties of the principal component technique (PCA). It is a parametric method which considers each class separately. For each class C, a PCA is performed, which leads to a principal component model [35]:

$$X_C = T_C \cdot P_C' + E_C$$

where $X_C$ is the centered X matrix for class C, $P_C$ the loading matrix of class C, $P_C'$ its transpose, $T_C$ the score matrix of class C and $E_C$ the residual matrix of class C. As many models as there are classes are obtained. The validation set is confronted with all the class models, and an unknown sample is assigned to the class of the model that produces the smallest residual. SIMCA puts more emphasis on similarity within a class than on discrimination between classes.

3.1.3.2. DPLS. Discriminant Partial Least Squares [21] is a parametric and linear method in which regression is used for classification. PLS is a well-known regression method [36]: it finds latent variables in the feature space which have maximum covariance with the response (class membership) variables. Studies using DPLS for classification are, however, few [10]. The regression coefficient matrix B is calculated from the training set:

$$Y_{\mathrm{training}} = X_{\mathrm{training}} \cdot B$$

where Y is constructed with ones and zeros in each column. The PLS algorithm gives the score matrix T and the weight loading matrix P such that $T = X_{\mathrm{training}} \cdot P$ [37], and B is estimated as

$$\hat{B}_{\mathrm{PLS}} = P \, (T' \cdot T)^{-1} \, T' \cdot Y.$$

When two classes have to be discriminated, the PLS1 algorithm is applied; if there are more classes, PLS2 is used [16]. The suitable number of factors is selected by a cross-validation technique [1]. The prediction of the dependent variables for a new set of objects is done with the same formula: $Y_{\mathrm{validation}} = X_{\mathrm{validation}} \cdot \hat{B}_{\mathrm{PLS}}$. DPLS is computed with the Winisi (Version 1.04) software (Infrasoft International®).
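Minimal sketches of the two modeling approaches, assuming pre-treated spectra and integer labels. The SIMCA rule is reduced to the smallest-residual assignment described above, with a fixed number of PCs per class (the study optimizes it per class with the Coomans plot), and a PLS2 regression on a 0/1 membership matrix stands in for the Winisi DPLS implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

def simca_fit(X, y, n_components=5):
    """One PCA model per class, fitted on the class sub-matrix (class-mean centred)."""
    return {c: PCA(n_components=n_components).fit(X[y == c]) for c in np.unique(y)}

def simca_predict(models, X):
    """Assign each sample to the class whose model gives the smallest reconstruction residual."""
    classes = sorted(models)
    residuals = []
    for c in classes:
        pca = models[c]
        X_hat = pca.inverse_transform(pca.transform(X))
        residuals.append(np.sum((X - X_hat) ** 2, axis=1))
    return np.array(classes)[np.argmin(np.vstack(residuals), axis=0)]

def plsda_fit_predict(X_train, y_train, X_val, n_components=16):
    """PLS2 regression on a 0/1 class-membership matrix; predicted class = argmax."""
    classes = np.unique(y_train)
    Y = (y_train[:, None] == classes[None, :]).astype(float)   # one-hot targets
    pls = PLSRegression(n_components=n_components).fit(X_train, Y)
    return classes[np.argmax(pls.predict(X_val), axis=1)]
```

With two classes the study applies PLS1; the one-hot/argmax form shown here covers the general multi-class (PLS2) case.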


3.1.3.3. PDA. Procrustes Discriminant Analysis [22,38] may be considered equivalent to DPLS. The only difference is that in PDA the eigenvectors are obtained from the covariance matrix $Z' \cdot Z$, where $Z = Y' \cdot X$. The fundamental step in this technique is the Procrustes transformation of the scores onto the true target matrix Y according to a transformation matrix. The Procrustes transformation is a combination of rotation, translation and stretching of the score matrix of each class; it is a non-orthogonal, oblique transformation. As in DPLS, the original matrix is decomposed as $T = X_{\mathrm{training}} \cdot P$. The Procrustes transformation of the score matrix T onto the class membership matrix Y is the key of the classification:

$$Y = T \cdot W, \quad \hat{W} = (T' \cdot T)^{-1} \, T' \cdot Y.$$

An independent validation set is used to evaluate the accuracy of the classification: the $X_{\mathrm{validation}}$ matrix is decomposed like the calibration X matrix and the predicted classes are calculated as $Y_{\mathrm{validation}} = T_{\mathrm{validation}} \cdot \hat{W}$.

3.1.4. CART

Classification And Regression Trees [23] is a non-parametric, univariate rule induction method. The training set is recursively divided into subgroups according to a splitting criterion which splits each node into two daughter nodes [39]. The splitting condition is that the descendant nodes are more homogeneous in class content than their parent. First, every split on each predictor variable is examined at each node; the best splits are selected and executed, and the full tree is obtained when the stopping rule is satisfied. The choice of the final tree is based on the apparent error rate, an estimate of which is obtained through cross-validation [40].
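A sketch of the CART step with scikit-learn, where the tree size is chosen by cross-validation over the number of terminal nodes; this is a simplification of the cost-complexity pruning of [23], and the data below are synthetic placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((262, 1050)), rng.integers(0, 4, 262)

# Grow trees of increasing size and keep the one with the best cross-validated accuracy
candidate_leaves = range(2, 21)
cv_scores = [cross_val_score(DecisionTreeClassifier(max_leaf_nodes=n, random_state=0),
                             X_train, y_train, cv=5).mean()
             for n in candidate_leaves]
best_n = list(candidate_leaves)[int(np.argmax(cv_scores))]
tree = DecisionTreeClassifier(max_leaf_nodes=best_n, random_state=0).fit(X_train, y_train)

# The split variables (wavelengths) used by the fitted tree are easy to inspect
used_features = sorted(set(tree.tree_.feature[tree.tree_.feature >= 0]))
print(best_n, used_features[:10])
```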


Table 2
Principle of McNemar's contingency table

n00: number of samples misclassified by both algorithms A and B        n01: number of samples misclassified by algorithm A but not by B
n10: number of samples misclassified by algorithm B but not by A       n11: number of samples misclassified by neither algorithm A nor B

$$\mathrm{McNemar's\ value} = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}$$

n = n00 + n10 + n01 + n11 is the total number of samples in the validation set.

3.1.5. Neural networks

Artificial Neural Networks (ANN) are non-linear and non-parametric classification methods. An ANN is composed of several layers of neurons: input, hidden and output layers. A neuron is a processing unit whose inputs are transformed by an activation function into outputs [11]. In this study, two types of ANN are used: LVQ and PNN.

3.1.5.1. LVQ. Kohonen neural networks belong to the class of self-organizing feature maps and are usually used for unsupervised classification. However, the Kohonen network can be extended into a supervised pattern recognition method. The Learning Vector Quantization neural network [25,26] has three layers [41]: an input, a competitive and a linear layer. The LVQ1 learning rule is used. The competitive layer classifies the input vectors: the competitive transfer function accepts an input vector and returns an output of zero for all neurons except one, called the winner, whose output is one. The linear layer transforms the competitive layer's classes into the target classifications. LVQ uses the class membership of the training set samples for the fine adaptation of the network after the Kohonen map.

3.1.5.2. PNN. The Probabilistic Neural Network [24] is a feed-forward network with no backpropagation [42]. The input layer is used to store the new samples of the validation set. The activation function is

$$f(x) = e^{-(x \cdot v_i - 1)/\sigma^2}$$

where x is the normalized input vector, v the weight vector, i the class number and σ the smoothing factor. An adjustment of σ can improve the network performance. PNN training is accomplished by simply copying each pattern of the training set to the pattern units (hidden layer). PNNs are a class of neural networks which implement a Bayesian decision strategy. The summation layer consists of one neuron for each data class and sums the outputs of all hidden neurons of that class. The products of the summation layer are forwarded to the output layer, where the probability that a new sample belongs to a class is calculated [11].

3.2. Statistical method: McNemar's test

The prediction rate is the percentage of correctly predicted samples. The aim is to have a statistical procedure to decide whether two methods have the same accuracy. McNemar's test [12] is a particular case of Fisher's sign test [43]. Two algorithms A and B are trained and the same validation set is used. The null hypothesis is that the two algorithms A and B have the same error rate. McNemar's test is based on a χ²-test with one degree of freedom (if the number of samples is higher than 20).

Table 2 shows how the McNemar's value is calculated. The value −1 in the numerator is a "continuity" correction term which takes into account the fact that the statistic is discrete while the χ² distribution is continuous [15]. The χ² critical value at a 5% level of significance (α = type I error), written χ²(1, 0.95), is 3.8414. If the null hypothesis is true, the probability of obtaining a McNemar's value greater than χ²(1, 0.95) is less than 5%; if the McNemar's value is greater than χ²(1, 0.95), the null hypothesis is rejected and the two algorithms are significantly different. In practice, a contingency table is constructed with the samples of the validation set (Table 2). The McNemar's contingency tables and tests are computed with Matlab®.
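A sketch of the test as defined in Table 2, assuming two Boolean arrays that flag which validation samples each algorithm misclassifies; the continuity-corrected statistic is compared with χ²(1, 0.95) = 3.8414.

```python
import numpy as np

CHI2_CRITICAL = 3.8414  # chi-squared, 1 degree of freedom, alpha = 5%

def mcnemar_value(errors_a, errors_b):
    """Continuity-corrected McNemar statistic from two misclassification masks."""
    errors_a, errors_b = np.asarray(errors_a, bool), np.asarray(errors_b, bool)
    n01 = np.sum(errors_a & ~errors_b)   # misclassified by A only
    n10 = np.sum(~errors_a & errors_b)   # misclassified by B only
    if n01 + n10 == 0:
        return 0.0
    return (abs(int(n01) - int(n10)) - 1) ** 2 / (n01 + n10)

def significantly_different(errors_a, errors_b):
    """True if the null hypothesis of equal error rates is rejected at the 5% level."""
    return mcnemar_value(errors_a, errors_b) > CHI2_CRITICAL

# Toy usage: two classifiers evaluated on the same validation set
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
pred_a = np.array([0, 1, 1, 1, 1, 0, 1, 0])
pred_b = np.array([0, 1, 0, 0, 1, 1, 1, 0])
print(mcnemar_value(pred_a != y_true, pred_b != y_true))
```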

4. Results and discussion

4.1. Principal component analysis

PCA is applied to explore the data structure. For each set, plots of the first four principal components are shown in Fig. 2. No clear separations are apparent for the SBO and CP sets. For the DR set, a separation between NR and R samples is possible according to the third and fourth principal component plots (Fig. 2), but the boundaries are not well defined and the two groups overlap. Geometrical exploration based on PCA score plots does not necessarily reveal the clusters which are sought.

Fig. 2. PCA on NIRS data. Two graphs, (a) 1st PC vs. 2nd PC and (b) 3rd PC vs. 4th PC, are plotted for each data set. (I) 83 samples of the DR validation set, (II) 75 (randomly selected) samples of the SBO validation set, (III) 75 (randomly selected) samples of the CP validation set.
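A sketch of this exploratory PCA, assuming pre-treated spectra X and class labels y (placeholder data below); it reproduces the two score plots of Fig. 2 for one data set.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X, y = rng.random((83, 1050)), rng.integers(0, 2, 83)   # placeholder data

scores = PCA(n_components=4).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, (i, j) in zip(axes, [(0, 1), (2, 3)]):
    for c in np.unique(y):
        ax.scatter(scores[y == c, i], scores[y == c, j], label=f"class {c}", s=15)
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
axes[0].legend()
plt.tight_layout()
plt.show()
```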


4.2. Comparison of eight supervised pattern recognition methods

4.2.1. Results

4.2.1.1. Method optimization. For each method, the parameters have to be optimized in order to maximize the prediction rate. For PDA and DPLS, the numbers of principal components (PCs) are determined by cross-validation; both methods use 16 PCs for the three sets. The number of PCs influences the results: for example, a 12-PC DPLS model has a validation prediction rate of 63.9% and a 16-PC model of 80.2% on the DR validation set. The number of PCs is high, but in a previous study [30], 16 terms were used to determine, by Principal Component Regression, the sucrose content, which represents 20% of the sugar beet. With the SIMCA method, the number of PCs is determined for each class. The optimization of this method is longer but relatively simple with the Coomans plot [1], which determines whether the discrimination of a class is accurate.

For CART, the number of nodes is determined by cross-validation: 6, 12 and 14 terminal nodes are used for the DR, SBO and CP sets, respectively.

Parameters have a strong influence on the prediction rate for methods like KNN, PNN and the LVQ neural network. The number of neighbors K is important for the KNN method; several values of K are tested by a cross-validation procedure. The cross-validation recognition rates vary between 61% (K = 11) and 77% (K = 3) for the DR set, between 49% (K = 13) and 58% (K = 5) for the SBO set and between 61% (K = 9) and 67% (K = 3) for the CP set. The selected K values are 3 for the DR set, 5 for the SBO set and 3 for the CP set.

Concerning LDA, 20 principal components (m = 20) are extracted and the resulting score vectors are used as inputs. The 20 PCs represent 99.99% of the total variance of the DR, SBO and CP training sets.

PNN is deeply influenced by the smoothing factor (σ). If σ is near zero, the network acts as a nearest neighbor classifier. The prediction rate for the DR validation set is 48.8% if σ is 0.1, and the optimum value is 10 (prediction success: 75.6%).


For the SBO data set, σ is 20. The number of hidden neurons and the number of epochs are the two parameters which influence the LVQ results. The minimum number of epochs is 100. The optimum number of hidden neurons is 9 for the DR set, 20 for the SBO set and 9 for the CP set. The number of hidden neurons significantly influences the prediction success: for example, on the CP validation set, the prediction success is 14% with 2 neurons, 54% with 4, 62.5% with 9 and 59% with 10.

4.2.1.2. Method comparison. The method comparison is based on the results of the independent validation sets (i.e. 83, 320 and 262 samples for the DR, SBO and CP validation sets, respectively). The algorithms are generally able to classify the sugar beet samples; nevertheless, LVQ on the SBO set and PNN on the CP set are not able to generalize rules. The average prediction rates over the eight methods are as follows: 74.4% for the DR set, 53% for the SBO set and 68.4% for the CP set.

Tables 3–5 compare the eight supervised pattern recognition algorithms on the three data sets and summarize the prediction rates. All the methods are compared pairwise with McNemar's tests; the significant McNemar's values are marked with an asterisk in these three tables. The PDA result is not significantly different from the SIMCA, LDA and KNN results on the DR set. SIMCA and DPLS have similar results on the SBO set. Concerning the CP set, DPLS is significantly more accurate than SIMCA and PDA. PDA, SIMCA and DPLS are, respectively, the most accurate methods on the DR, SBO and CP sets.
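Tables 3–5 below report the pairwise values of the statistic; the sketch that follows shows how such a comparison can be assembled from per-method misclassification masks (the masks here are synthetic placeholders, not the study's results).

```python
import numpy as np

def mcnemar_value(errors_a, errors_b):
    """Continuity-corrected McNemar statistic (see Table 2)."""
    n01 = np.sum(errors_a & ~errors_b)
    n10 = np.sum(~errors_a & errors_b)
    return 0.0 if n01 + n10 == 0 else (abs(int(n01) - int(n10)) - 1) ** 2 / (n01 + n10)

# errors[name] = Boolean mask of misclassified validation samples for that method
rng = np.random.default_rng(0)
errors = {name: rng.random(83) < p
          for name, p in [("PDA", 0.15), ("SIMCA", 0.18), ("KNN", 0.24), ("LVQ", 0.30)]}

names = list(errors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        value = mcnemar_value(errors[a], errors[b])
        flag = "significant" if value > 3.8414 else "not significant"
        print(f"{a} vs {b}: {value:.2f} ({flag})")
```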

Table 3
Comparison of method efficiency on the DR validation set with McNemar's test

Methods (prediction rate)   PDA      SIMCA    LDA      KNN      PNN      LVQ      DPLS     CART
                            (85.5%)  (81.9%)  (80.7%)  (75.9%)  (69.8%)  (69.8%)  (66.2%)  (65.1%)
PDA                         NA       0.16     0.41     2.22     5.33*    7.58*    8.65*    7.75*
SIMCA                                NA       0        0.76     3.38     4.05*    3.69     5.63*
LDA                                           NA       0.37     3.22     2.78     5.28*    9.33*
KNN                                                    NA       1.56     0.64     2.7      3.7
PNN                                                             NA       0.03     0.13     0.26
LVQ                                                                      NA       0.45     0.45
DPLS                                                                              NA       0
CART                                                                                       NA

The numbers are the McNemar's values. When two methods are significantly different (McNemar's value > 3.8414), the value is marked with an asterisk. NA: not applicable.


Table 4
Comparison of method efficiency on the SBO validation set with McNemar's test

Methods (prediction rate)   SIMCA    DPLS     PDA      LDA      CART     KNN      PNN      LVQ
                            (69.0%)  (66.5%)  (61.5%)  (59.3%)  (53.4%)  (52.1%)  (48.1%)  (12.5%)
SIMCA                       NA       0.43     5.38*    8.37*    20.8*    23.02*   33.25*   155.02*
DPLS                                 NA       2.83     3.94*    11.83*   15.34*   26.27*   137.6*
PDA                                           NA       0.28     5.48*    7.37*    14.34*   121.04*
LDA                                                    NA       2.79     4.21*    10.37*   120.65*
CART                                                            NA       0.07     2.4      94.41*
KNN                                                                      NA       2.25     95.06*
PNN                                                                               NA       80.18*
LVQ                                                                                        NA

When two methods are significantly different (McNemar's value > 3.8414), the value is marked with an asterisk. NA: not applicable.

4.2.2. Discussion

4.2.2.1. Supervised pattern recognition methods

4.2.2.1.1. Accuracy comparison. The three most accurate methods on these data sets are SIMCA, PDA and DPLS. These methods have many points in common: they are linear, parametric, modeling methods. PDA, DPLS and SIMCA have the advantage of being able to handle collinear X-variables, missing data and noisy variables, and they can deal with overlapping classes.

LDA, KNN and CART are less accurate than the modeling methods. The reason why KNN with the Euclidean distance does not give an accurate prediction rate is that there are many redundant variables in NIR data and that the variables are highly correlated [44]. The disadvantage of LDA is that it cannot be applied to data sets having more variables than objects.

LDA is optimally applied when the dispersions of the classes are equal and have the same direction [1]. The non-linear methods (LVQ and PNN) do not give satisfactory results on our data sets. One of the LVQ requirements is that the competitive layer must have enough neurons: each class must be assigned enough competitive neurons.

4.2.2.1.2. Other criteria. Supervised pattern recognition methods are hard classification methods: samples are classified into one of the defined classes. Distance-based methods (all methods except CART) have advantages for outlier detection [11]. In general, if the distance between the sample being tested and the patterns of the training set is large, the test pattern can be rejected as an outlier; the choice of the rejection criterion is, however, not straightforward [11]. The memory requirement can also be an important criterion, especially when large data sets are used.

Table 5
Comparison of method efficiency on the CP validation set with McNemar's test

Methods (prediction rate)   DPLS     SIMCA    PDA      CART     LDA      KNN      LVQ      PNN
                            (94.2%)  (81.3%)  (77.1%)  (73.2%)  (69.8%)  (65.2%)  (62.5%)  (28.6%)
DPLS                        NA       21.875*  30.68*   37.87*   49.61*   62.5*    71.8*    168.14*
SIMCA                                NA       1.42     3.81     19.11*   14.24*   21.63*   132.17*
PDA                                           NA       0.9      3.02     10.58*   141.01*  85.81*
CART                                                   NA       0.57     4.21*    10.22*   78.66*
LDA                                                             NA       1.14     4.13*    102.42*
KNN                                                                      NA       1.23     58.8*
LVQ                                                                               NA       38.32*
PNN                                                                                        NA

When two methods are significantly different (McNemar's value > 3.8414), the value is marked with an asterisk. NA: not applicable.


KNN and PNN need more memory than the other methods: both store the information of the training set in order to classify the validation samples, and NIRS data require a high memory capacity. The main advantage of CART is that the interpretation of the model is relatively easy: the analysis of the tree is useful to determine which X-variables have an influence on the classification rules. For example, some wavelengths used to classify resistant samples are 2132, 714 and 792 nm; these wavelengths are assigned, respectively, to a combination of N–H stretch and C=O stretch, the C–H fourth overtone and the N–H third overtone [45]. The last qualitative criterion is the simplicity of the training. Except for the neural networks, all methods are relatively simple to train. All methods have similar training speeds except PNN, which requires less training time; that is why PNN networks are very efficient for solving real-time problems. Table 6 summarizes the method characteristics and selection criteria.

4.2.2.2. McNemar's test. McNemar's test is conservative on the small data set (about 100 samples). The significant differences are large for the DR set: the difference between two prediction rates needs to be greater than 10% to be significant. With larger data sets containing several classes, the significant difference between prediction rates is smaller: 5% for the SBO set and 4% for the CP set. The McNemar's value is compared to a critical value, and some tests are difficult to interpret because this critical value is a sharp threshold.


For example, on the DR set, SIMCA has a prediction success of 81.9% and is significantly different from LVQ, which has 69.8%, but SIMCA is not significantly different from DPLS (66.2% of good classification): the McNemar's value for DPLS versus SIMCA is 3.69, just under the critical value χ²(1, 0.95) = 3.8414. McNemar's test is a useful tool to determine whether two classification methods have significantly different prediction rates. If two algorithms have the same error rate, the choice of the method can be based on a qualitative criterion like training speed.

4.2.2.3. Determination of qualitative parameters of sugar beet. The three qualitative parameters of sugar beet are predicted with high prediction rates: 85.5% for the disease resistance, 69% for the geographical origin and 94.2% for the harvesting period. For the DR set, the non-resistant samples are well classified (Table 7); most of the errors come from resistant samples which are misclassified. For the SBO set, geographical regions 7 and 2 are better predicted than the others, and there is some confusion between classes. Classes 3 and 0 have specific characteristics, because only a few samples are misclassified into these two classes; on the contrary, class 2 overlaps the others and has no specific characteristics. For the CP set, the prediction rate is high, but the misclassifications are due to confusions between classes 0 and 1 and between classes 2 and 3. There seems to be a greater difference between the beginning of the harvest (classes 0 and 1) and its end (classes 2 and 3) than between classes 0 and 1 or between classes 2 and 3.

Table 6
Qualitative and quantitative comparison of the supervised pattern recognition algorithms

Method                                      SIMCA  DPLS  PDA   LDA  CART  KNN   PNN   LVQ

Method characteristics
  Linear                                    Yes    Yes   Yes   Yes  Yes   Yes   No    No
  Parametric                                Yes    Yes   Yes   Yes  No    No    No    No
  Discriminant (D) / modeling classes (MC)  MC     MC    MC    D    D     D     D     D

Qualitative comparison
  Simple and fast to train                  +      ++    ++    ++   ++    ++    +     +
  Memory requirement                        Low    Low   Low   Low  Low   High  High  Low

Quantitative comparison (accuracy)
  DR set                                    ++     +     ++    +    +     +     –     –
  CP and SBO sets                           ++     ++    ++    +    +     +     –     –


Table 7 Confusion matrix for the most accurate methods Results of PDA for the DR data set α

β

Predicted class

R

NR

Class R Class NR

30 1

11 41

Predicted class

0

1

2

3

4

5

6

7

α

β

Class Class Class Class Class Class Class Class

24 1 0 1 0 0 1 0

0 26 1 2 5 0 1 0

9 6 33 6 6 5 7 2

0 0 0 28 0 2 0 0

4 1 3 0 26 9 1 2

1 0 1 3 1 20 0 0

2 3 0 0 1 0 28 0

0 3 2 0 1 4 2 36

0.4 0.35 0.17 0.3 0.35 0.5 0.3 0.1

0.01 0.04 0.17 0.01 0.08 0.02 0.02 0.05

Predicted class

0

1

2

3

α

β

Class Class Class Class

33 2 0 0

4 72 1 0

1 0 70 3

0 0 4 72

0.27 0.02

0.02 0.27

Results of SIMCA for the SBO data set

0 1 2 3 4 5 6 7

Results of DPLS for the CP data set

0 1 2 3

0.05 0.04 0.07 0.04

0.01 0.03 0.02 0.02

Contingency tables represent the real classes in function of the predicted classes. When the two methods are significantly different, the value is written in bold. For a class i, α = 1 − ni /ni where ni is the number of sample ∈ i, ni is the number of samples ∈ i and classified as a / i and n¯ i is the number of samples ∈ / i and classified as non-i. member of class i, and β = 1− n¯ i /n¯ i where n¯ i is the number of samples ∈
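A sketch of the per-class α and β defined in the note above, applied to the PDA/DR confusion matrix (rows are true classes, columns predicted classes):

```python
import numpy as np

def class_errors(confusion):
    """Per-class alpha (missed members) and beta (accepted non-members)
    for a square confusion matrix with true classes in rows."""
    confusion = np.asarray(confusion, dtype=float)
    n_i = confusion.sum(axis=1)                          # samples in each true class
    alpha = 1.0 - np.diag(confusion) / n_i
    # non-members of class i correctly classified as non-i
    rejected = confusion.sum() - confusion.sum(axis=0) - n_i + np.diag(confusion)
    beta = 1.0 - rejected / (confusion.sum() - n_i)
    return alpha, beta

# PDA / DR confusion matrix from Table 7 (rows: R, NR; columns: predicted R, NR)
dr = [[30, 11],
      [1, 41]]
alpha, beta = class_errors(dr)
print(np.round(alpha, 2), np.round(beta, 2))  # ~[0.27 0.02] and ~[0.02 0.27], as in Table 7
```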

4.3. Comparison of classifications using NIRS or wet chemical data

In this second part, SIMCA is used with either the WCA data (five concentrations) or the NIRS data. Spectral data sets in which the classes are randomly assigned (RAN) are also used; the RAN data sets have the same size as the other sets. McNemar's tests compare the classifications.

4.3.1. Results

The prediction rates for the classifications using NIRS data are as follows: 82% for the DR set, 69% for the SBO set and 81.3% for the CP set. Concerning the classifications with WCA data, the percentages of good classification are 81% for the DR set, 30% for the SBO set and 60% for the CP set. The classification rates with random classes are 45.8% for the DR set, 12% for the SBO set and 22% for the CP set.

SIMCA models use 16 PCs with the NIRS data and 3 PCs with the WCA data. Table 8 compares the classifications using NIRS and WCA data. For two sets (SBO and CP), the NIRS data improve the classification results.

Table 8
Comparison of the classifications using WCA or NIRS data

DR samples    n00 = 2     n01 = 16    n10 = 15     n11 = 50    McNemar's value = 0 (NS)
SBO samples   n00 = 63    n01 = 28    n10 = 148    n11 = 81    McNemar's value = 80.46 (S)
CP samples    n00 = 41    n01 = 8     n10 = 114    n11 = 99    McNemar's value = 90.37 (S)

McNemar’s contingence tables. Method A is NIRS data and method B is WCA data (Table 2). S: significant at α = 5%, NS: not significant at α = 5%.


Concerning the determination of the disease resistance, there is no significant difference between the two classifications.

4.3.2. Discussion

The prediction rates obtained with NIRS data are higher than those obtained with WCA data, and the results obtained with WCA data are always higher than those obtained with the RAN sets: the SIMCA method models true class information. For the DR set, both data types give results which are not significantly different. The explanation is that sugar beets which are resistant to the disease have a lower sucrose content; since the sucrose content can be determined by NIR spectroscopy [30] as well as by WCA, the classification results are not different. For the two other sets, classifications with WCA give higher error rates than with NIRS data. The WCA data lack information: we can assume that components important for the classification are not analyzed. We can suppose, for instance, that the water content is useful to determine the origin and the crop period of the beet samples; the NIRS data contain this information through the water bands at 1450 and 1950 nm, but the WCA data do not. NIR spectra carry information about most of the beet components, and the use of the NIRS data improves the prediction: spectroscopic data contain more information than WCA data, which contain only five concentrations.

When NIRS data are used, the spectral pre-treatment improves the classifications. In our study, pre-treatment tests are carried out with the DPLS method; SNV and second derivative give accurate results. The prediction rates on the three data sets (DR, SBO, CP) are, respectively, 73.9, 62 and 85% without pre-treatment and 85.5, 69 and 94.2% with pre-treatment. McNemar's tests show that the pre-treatments significantly improve the classification results. This agrees with previous work [46], which shows the influence of spectral pre-processing in pattern recognition: this data transformation reduces the within-class variance due to particle size effects and avoids β errors.

5. Conclusion

Our study proposes a simple procedure, McNemar's test, to compare classifications.


Eight supervised pattern recognition methods are compared pairwise. On the three data sets, three methods are the most accurate: SIMCA, DPLS and PDA. Linear, parametric, modeling methods are the most appropriate for classifying sugar beet from spectral data. McNemar's tests also underline the advantage of using NIRS with supervised pattern recognition methods: NIR spectroscopy is faster and gives more information on the samples than WCA. The prediction of qualitative parameters is optimized by the use of NIRS data together with a supervised pattern recognition method like SIMCA. Finally, the possibility of determining qualitative parameters of sugar beet is demonstrated: the resistance to rhizomania, the geographical origin and the crop period of a sample can be determined with high prediction rates of 85, 69 and 95%, respectively.

Acknowledgements The authors wish to thank the “Syndicat National des Fabricants de Sucre” (23 rue d’Iena, 75016 Paris, France), M. Bruandet and M. Noé. References [1] M. Sharaf, D. Illman, B. Kowalski, Chemometrics, Wiley/Interscience, NY, 1986, p. 228. [2] M. Martin, F. Pablo, A. Gonzalez, Anal. Chim. Acta 350 (1996) 191. [3] D. Gonzalez-Arjona, V. Gonzalez-Gallero, F. Pablo, A. Gonzalez, Anal. Chim. Acta 381 (1999) 257. [4] C. Armanino, R. De Acutis, M. Festa, Anal. Chim. Acta 454 (2002) 315. [5] L. Simon, M. Karim, Biochem. Eng. J. 7 (2001) 41. [6] A. Candolfi, W. Wu, D. Massart, S. Heuerding, J. Pharma. Biomed. Anal. 16 (1998) 1229. [7] R. Bucci, A.D. Magri, A.L. Magri, D. Marini, F. Marini, J. Agric. Food Chem. 20 (2002) 413. [8] N. Smola, U. Urleb, Anal. Chim. Acta 410 (2000) 203. [9] K. Krämer, S. Ebel, Anal. Chim. Acta 420 (2000) 155. [10] B. Alsberg, R. Goodacre, J. Rowland, D. Kell, Anal. Chim. Acta 348 (1997) 389. [11] R. Shaffer, S. Rose-Pehrsson, R. McGill, Anal. Chim. Acta 384 (1999) 305. [12] B. Everitt, The Analysis of Contingency Tables, Chapman and Hall, London, 1977. [13] W. Chapman, M. Fizman, B. Chapman, P. Hary, J. Biomed. Inform. 34 (2001) 4.


[14] C. Tan, Y. Wang, C. Lee, Inform. Process. Manage. 38 (2002) 329. [15] T. Dietterich, Neural Comput. 10 (1998) 1895. [16] G. Vandeginste, D. Massart, L. Buydens, S. De Jong, P. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, NY, 1988, p. 207. [17] D. Coomans, M. Jonckheer, D. Masssart, I. Broeckaert, P. Blockx, Anal. Chim. Acta 103 (1978) 409. [18] T. Cover, IEEE Trans. Inform. Theor. 13 (1967) 21. [19] M. Derde, L. Buydens, D. Massart, P. Hopke, Anal. Chem. 59 (1987) 1868. [20] S. Wold, Pattern Recognition 8 (1976) 127. [21] L. Stahle, S. Wold, J. Chemometr. 1 (1987) 185. [22] D. Gonzalez-Arjona, G. Lopez-Perez, A. Gonzalez, Chem. Int. Lab. Syst. 57 (2001) 133. [23] L. Breiman, R. Friedman, R. Olsen, C. Stone, Classification and Regression Trees, Wadsworth, Pacific Grove, CA, 1984. [24] D. Specht, Neural Networks 3 (1990) 109. [25] T. Kohonen, Self-Organization and Associative Memory, Springer, Berlin, 1989. [26] T. Kohonen, Proc. IEEE 78 (1990) 1464. [27] R. Barnes, M. Dhanoa, S. Lister, Appl. Spectrosc. 43 (1989) 772. [28] H. Martens, T. Naes, Multivariate Calibration, Wiley, Chichester, USA, 1989. [29] FOSS, Manuel d’utilisation, FOSS France Support Application, Nanterre, France, 2001. [30] Y. Roggo, L. Duponchel, B. Noé, J.-P. Huvenne, J. Near Infrared Spectrosc. 10 (2002) 137.

[31] International Commission for Uniform Methods of Sugars Analysis (ICUMSA), ICUMSA Method GS 6-1, 1994. [32] Y. Pomeranz, C. Meloan, Food Analysis: Theory and Practice, Chapman and Hall, London, 1994. [33] D. Barham, P. Trinder, Analyst 97 (1972) 142. [34] D. Coomans, D. Massart, Anal. Chim. Acta 138 (1982) 153. [35] D. Massart, B. Vandeginste, S. Deming, Y. Michotte, L. Kaufman, Chemometrics: A Textbook, Elsevier, NY, 1988. [36] M. Tenenhaus, La Regression PLS. Technip, Paris, 1998. [37] A. Burnham, R. Viveros, J. Macgregor, J. Chemometr. 10 (1996) 31. [38] D. Gonzalez-Arjona, G. Loperez-Perez, A. Gonzalez, Talanta 49 (1999) 189. [39] C. Cappelli, F. Mola, R. Siciliano, Comput. Stat. Data Anal. 38 (2002) 285. [40] A. Pallara, Stat. Appl. 4 (1992) 255. [41] H. Demuth, M. Beale, Neural Network Toolbox® : User’s Guide, Version 4, The Math Works Inc., Natick, 2001. [42] L. Simon, M. Karim, Biochem. Eng. J. 7 (2001) 41. [43] R. Fisher, The Design of Experiments, Oxford University Press, Oxford, 1935. [44] W. Wu, D. Massart, Anal. Chim. Acta 349 (1997) 253. [45] B. Osborn, T. Fearn, P. Hindle, Practical NIR Spectroscopy with Applications in Food and Beverage Analysis, PrenticeHall, London, 1993. [46] A. Candolfi, R. De Maesschalck, D. Jouan-Rimbaud, P. Hailey, D. Massart, J. Pharma. Biomed. Anal. 21 (1999) 115.