Use of particle swarm optimization for machinery fault detection


ARTICLE IN PRESS Engineering Applications of Artificial Intelligence 22 (2009) 308–316


B. Samanta, C. Nataraj
Department of Mechanical Engineering, Villanova University, Villanova, PA 19085, USA

Abstract

A study is presented on the application of particle swarm optimization (PSO) combined with other computational intelligence (CI) techniques for bearing fault detection in machines. The performance of two CI based classifiers, namely artificial neural networks (ANNs) and support vector machines (SVMs), is compared. The time domain vibration signals of a rotating machine with normal and defective bearings are processed for feature extraction. The extracted features from original and preprocessed signals are used as inputs to the classifiers for detection of machine condition. The classifier parameters, e.g., the number of nodes in the hidden layer for ANNs and the kernel parameters for SVMs, are selected along with input features using PSO algorithms. The classifiers are trained with a subset of the experimental data for known machine conditions and are tested using the remaining set of data. The procedure is illustrated using the experimental vibration data of a rotating machine. The roles of the number of features, PSO parameters and CI classifiers on the detection success are investigated. Results are compared with other techniques such as genetic algorithm (GA) and principal component analysis (PCA). The PSO based approach gave a test classification success rate of 98.6–100%, which was comparable with GA and much better than with PCA. The results show the effectiveness of the selected features and the classifiers in the detection of the machine condition. © 2008 Elsevier Ltd. All rights reserved.

Article history: Received 13 December 2007; received in revised form 23 June 2008; accepted 28 July 2008; available online 14 October 2008

Keywords: Computational intelligence; Feature selection; Machinery condition monitoring; Swarm intelligence; Signal processing

1. Introduction

Particle swarm optimization (PSO) was proposed by Kennedy and Eberhart (1995) as a population based stochastic optimization technique inspired by the social behavior of bird flocking or fish schooling. PSO is an algorithm based on group (swarm) behavior. The algorithm searches for the optimal value by sharing cognitive and social information among the individuals (particles) in the global solution space. PSO has several advantages over other evolutionary computation techniques (for example, genetic algorithms (GAs)), such as simpler implementation, faster convergence rate and fewer parameters to adjust (Kennedy et al., 2001; Poli et al., 2007). The popularity of PSO is growing with applications in diverse fields of engineering, biomedical and social sciences (Poli, 2008). Some of the recent applications of PSO in engineering include machinery condition monitoring and diagnostics (Lin et al., 2008; Yuan and Chu, 2007). Condition based maintenance (CBM) is gaining importance in industry because of the need to increase machine availability. The use of vibration and acoustic emission (AE) signals is quite common in the field of condition monitoring of rotating machinery with potential applications of artificial neural

 Corresponding author. Tel.: +1 610 519 7018; fax: +1 610 519 7312.

E-mail address: [email protected] (B. Samanta). 0952-1976/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.engappai.2008.07.006

networks (ANNs) and statistical learning techniques like support vector machines (SVMs) in automated detection and diagnosis (McCormick and Nandi, 1997; Jack and Nandi, 2000a, b; Samanta and Al-Balushi, 2003). GAs have been used to make the classification process faster and accurate using the minimum number of features which primarily characterize the system conditions with optimized structure or parameters of ANNs and SVMs (Jack and Nandi, 2002; Samanta, 2004a, b; Samanta et al., 2006). In a recent paper (Samanta et al., 2003) results of GA based selection of characteristic features and classifier parameters were presented for fault detection of bearings using only time domain features of vibration signals. The features were extracted from finite segments of two signals: one with normal condition and the other with a defective bearing. Results were presented to compare the performance of GA based selection process using ANN and SVM. In this work, an application of PSO is presented combining it with other computational intelligence (CI) techniques like ANN and SVM for automated selection of features and detection of machinery fault. The present approach combines the advantages of both PSO and CI techniques. The effects of the number of selected features and variations in PSO models on the classification success have also been studied. Comparisons are made between the performance of ANN and SVM with PSO for the dataset of Samanta et al. (2003). Results are also compared with other techniques like GA and PCA. Fig. 1 shows flow diagram of the



Fig. 1. Schematic of PSO based machine condition detection system. [Flowchart: machine system with data acquisition; feature extraction; training and test datasets; initialize PSO parameters and population; PSO based CI (ANN/SVM) training; evaluate fitness function; check and update pbest, Pi; check and update gbest, Pg; update particle velocity and position, Vi, Xi; repeat until the termination criteria are reached; trained CI with optimal features and parameters; diagnosis of machine condition.]

proposed procedure. First the vibration signals are processed for extraction of statistical features in time domain. These features, namely, the mean, root mean square (rms), variance, skewness, kurtosis and normalized central moments of higher order (up to ninth) are used to distinguish between normal and defective bearings. Moments of order higher than nine are not considered in the present work to keep the input vector within a reasonable size without sacrificing the accuracy of diagnosis. Next, part of the dataset is used for training the combined PSO–ANN or PSO–SVM model to optimize the selection of suitable input features and the classifier parameters. The effectiveness of the trained model is tested using the remaining test dataset. The results show the effectiveness of the extracted features from the acquired and preprocessed signals in diagnosis of the machine condition. The procedure is illustrated using the vibration data of an experimental setup with normal and defective bearings. The paper is organized as follows. In Section 2, the experimental setup and the datasets are briefly discussed. Main features of PSO algorithms are described in Section 3. In Sections 4 and 5, two CI techniques, ANN and SVM, are briefly explained in the context of the present work. The implementation of PSO based selection process with ANN and SVM is discussed in Section 6. Results are presented in Section 7 followed with Conclusions in Section 8.

2. Vibration data and feature extraction

2.1. Experimental setup

Fig. 2 shows the schematic diagram of the experimental test rig. The rotor is supported on two ball bearings MB 204 with eight rolling elements. The rotor is driven with a three-phase AC induction motor through a flexible coupling. The motor can be run in the speed range of 0–10,000 rpm using a variable frequency drive (VFD) controller. For the present experiment, the motor speed was maintained at 600 rpm. Two accelerometers were mounted at 90° on the right hand side (RHS) bearing support to measure vibrations in the vertical and horizontal directions (x and y). Separate measurements were obtained for two conditions, one with normal bearings and the other with an induced fault on the outer race of the RHS bearing. The outer race fault was created as a small line using electro-discharge-machining (EDM) to simulate the initiation of a bearing defect. It should be mentioned that only one type of bearing fault has been considered in the present study to see the effectiveness of the proposed approach for two-class recognition. Diagnosis of different types and levels of bearing faults is important for optimal maintenance purposes but is outside the scope of the present work. Each accelerometer signal was connected through a



Fig. 2. Experimental test rig.

charge amplifier and an anti-aliasing filter to a channel of a personal computer based data acquisition system. One pulse per revolution of the shaft was sensed by a proximity sensor and the signal was used as a trigger to start the sampling process. The vibration signals were sampled simultaneously at a rate of 49,152 samples/s per channel. The lower- and higher-cut-off frequencies of each charge amplifier were set at 2 and 100 kHz, respectively. The cut-off frequency of each anti-aliasing filter was set at 24 kHz, almost half of the sampling rate. The number of samples collected for each channel was 49,152. In the present work, these time domain data were preprocessed to extract the features used as inputs to the classifiers (ANNs and SVMs).

2.2. Signal statistical characteristics

One set of experimental data each with normal and defective bearings was considered. For each set, two vibration signals consisting of 49,152 samples were obtained using accelerometers in vertical and horizontal directions to monitor the machine condition. The magnitude of vibration was constructed from the two component signals, $z = \sqrt{x^2 + y^2}$. These samples were divided into 48 bins of 1024 (n) samples each to extract the following features (1–9): mean, rms, variance, skewness, kurtosis and normalized fifth to ninth central moments. Similar features were extracted from the derivative and integral of the signals (10–27), and low- and high-pass filtered signals (28–45) (Samanta et al., 2003). Each of the feature vectors was normalized by dividing by its absolute maximum value for better speed and success of the network training. The total set of normalized features consists of a 45 × 144 × 2 array, where each row represents a feature and the columns represent the number of bins (48) per signal multiplied by the total number of signals (3) and two (2) bearing conditions: normal and defective.
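The binning and moment calculations described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the use of the population variance, and the dictionary layout are assumptions, and how the statistics map onto feature numbers 1–9 is not spelled out here.

```python
import numpy as np

def vibration_magnitude(x, y):
    """Combine the two accelerometer signals: z = sqrt(x^2 + y^2)."""
    return np.hypot(np.asarray(x, dtype=float), np.asarray(y, dtype=float))

def bin_statistics(signal, bin_size=1024):
    """Per-bin statistics of a 1-D signal, returned as a dict of arrays."""
    s = np.asarray(signal, dtype=float)
    n_bins = s.size // bin_size
    bins = s[: n_bins * bin_size].reshape(n_bins, bin_size)
    mean = bins.mean(axis=1)
    centered = bins - mean[:, None]
    var = (centered ** 2).mean(axis=1)      # population variance (assumption)
    std = np.sqrt(var)
    stats = {
        "mean": mean,
        "rms": np.sqrt((bins ** 2).mean(axis=1)),
        "variance": var,
    }
    # Normalized central moments of order k: E[(s - mean)^k] / std^k.
    # k = 3 is the skewness and k = 4 the kurtosis.
    for k in range(3, 10):
        stats[f"moment_{k}"] = (centered ** k).mean(axis=1) / std ** k
    return stats

def normalize_rows(features):
    """Divide each feature row by its absolute maximum value."""
    f = np.asarray(features, dtype=float)
    return f / np.abs(f).max(axis=1, keepdims=True)
```

A 49,152-sample signal yields 48 values per statistic, one per bin. A bin with zero variance would cause a division by zero in the normalized moments, which a production version would need to guard against.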

3. Particle swarm optimization (PSO)

In this section, a brief introduction to the PSO algorithm is presented; for details, a standard text (for example, Kennedy et al., 2001) can be consulted. Poli et al. (2007) also gave a recent overview of PSO. For a problem with n variables, each possible solution can be thought of as a particle with a position vector of dimension n. The population of m such individuals (particles) can be grouped as the swarm. Let $X_i$ and $V_i$ represent, respectively, the current position and the velocity of the ith particle ($i = 1, \ldots, m$). The fitness of a particle is assessed by calculating the value of the objective function at the current position of the particle. If the value of the objective function for the current position of the particle is better than its previous best value, then the current position is designated as the new best individual (personal) location pbest, $P_i$, of the particle. The best current positions of all particles are compared with the historical best position of the whole swarm (global or neighborhood) gbest, $P_g$, in terms of the fitness function, and the global best position is updated accordingly if any particle's individual best (pbest, $P_i$) is better than the previous global best (gbest, $P_g$). The current velocity and the position are updated to meet the desired objective. The process is repeated until any of the termination criteria is satisfied. The schematic of the PSO implementation is shown in the context of machine condition detection in Fig. 1. Though the basic structure of PSO is the same, there are some variations based on the parameters used. Here, four popularly used PSO versions are briefly discussed; in all cases, the vectors of current position $X_i$, velocity $V_i$, the personal best position (pbest) $P_i$ and the global best position (gbest) $P_g$ are defined as follows:

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{in}), \quad V_i = (v_{i1}, v_{i2}, \ldots, v_{in}), \quad P_i = (p_{i1}, p_{i2}, \ldots, p_{in}), \quad P_g = (p_{g1}, p_{g2}, \ldots, p_{gn}). \tag{1}$$


3.1. Original PSO

In the original version of PSO, the position and the velocity of the particles in the swarm at the next time step (k+1) are expressed in terms of the values at the current time step k as follows:

$$v_{ij}(k+1) = v_{ij}(k) + U(0, c_1)\,[p_{ij}(k) - x_{ij}(k)] + U(0, c_2)\,[p_{gj}(k) - x_{ij}(k)],$$

$$x_{ij}(k+1) = x_{ij}(k) + v_{ij}(k+1); \quad i = 1, \ldots, m; \; j = 1, \ldots, n \tag{2}$$


where $U(0, c_i)$ represents uniformly distributed random numbers in the range $[0, c_i]$, $i = 1, 2$. These random numbers represent the stochastic nature of the search algorithm. The constants $c_1$ and $c_2$ define the magnitudes of the influences on the particle velocity in the directions of the individual and the global optima, $P_i$ and $P_g$, respectively. These are also termed 'acceleration coefficients'. The velocity $v_{ij}$ is bounded within a range $[-V_{max}, V_{max}]$ to prevent the particles from flying out of the solution space. The choice of $V_{max}$ influences the balance between exploration (coverage of the entire solution space) and exploitation (fine adjustments near global optima) in the solution space. A larger value of $V_{max}$ enhances global exploration whereas a smaller value enables local exploitation. However, with a very high $V_{max}$, the particles may fly out of the solution space, whereas with a very low $V_{max}$, the particles may be trapped near local optima and may not reach the global optima. The parameters that need to be fixed in implementing a PSO algorithm include m, $c_1$, $c_2$ and $V_{max}$. The population size m is decided on the basis of the dimensionality and perceived difficulty of the problem, but typical values in the range of 20–50 are quite common. In early versions of PSO, the acceleration coefficients $c_1$ and $c_2$ were each chosen as 2.0 and $V_{max}$ as 4.0 (Poli et al., 2007).

3.2. PSO with inertia weight

Eq. (2) was modified by introducing an inertia term $\omega$ to reduce the dependence of the search process on the hard bounds of the velocity ($V_{max}$) (Shi and Eberhart, 1998):

$$v_{ij}(k+1) = \omega\, v_{ij}(k) + U(0, c_1)\,[p_{ij}(k) - x_{ij}(k)] + U(0, c_2)\,[p_{gj}(k) - x_{ij}(k)],$$

$$x_{ij}(k+1) = x_{ij}(k) + v_{ij}(k+1); \quad i = 1, \ldots, m; \; j = 1, \ldots, n \tag{3}$$
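One iteration of the inertia-weight update can be sketched in Python as below. This is only an illustration under stated assumptions: the function names, the array layout (one row per particle), the per-dimension random draws and the linear inertia schedule helper are not taken from the paper.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w, c1=2.0, c2=2.0, vmax=4.0, rng=None):
    """One velocity and position update for a swarm stored as (m, n) arrays."""
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.uniform(0.0, c1, size=x.shape)   # U(0, c1) per particle and dimension
    r2 = rng.uniform(0.0, c2, size=x.shape)   # U(0, c2) per particle and dimension
    v_new = w * v + r1 * (pbest - x) + r2 * (gbest - x)
    v_new = np.clip(v_new, -vmax, vmax)       # keep velocity within [-Vmax, Vmax]
    return x + v_new, v_new

def linear_inertia(k, n_iter, w_start=0.9, w_end=0.4):
    """Inertia weight decreased linearly from w_start to w_end over n_iter steps."""
    return w_start - (w_start - w_end) * k / max(n_iter - 1, 1)
```

Setting `w=1.0` recovers the original update of Section 3.1; `gbest` may be a single row of length n, which NumPy broadcasts across all particles.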


In many early applications, good results were obtained when the inertia term $\omega$ was decreased linearly from 0.9 to 0.4 over the solution process. The velocity equation can be rearranged to define the change in velocity as $\Delta v_{ij} = f_{ij} - (1-\omega)\, v_{ij}$, where $f_{ij}$ represents the random forcing function given by the last two terms of Eq. (3). The term $(1-\omega)$ can be interpreted as the viscous damping coefficient for the particle motion equation (Poli et al., 2007). The variation of $\omega$ from high (0.9) to low (0.4) can be thought of as changing the viscosity of the medium (in which the particles are moving) from low to high, enhancing exploration at the beginning and exploitation towards the end of the solution process. In this work, this version is chosen as the base PSO Model I. Trelea (2003) investigated the dynamic equations (3) for stability and, based on extensive simulation results, proposed two sets of parameters as follows: $\omega = 0.6$, $c_1 = 1.700$, $c_2 = 1.700$ and $\omega = 0.729$, $c_1 = 1.494$, $c_2 = 1.494$. In this work, these two versions are designated as PSO Models II and III, respectively.

3.3. PSO with constriction factor

The particle velocity equation was modified by introducing a 'constriction coefficient' $\chi$ to prevent particle explosion, control convergence and eliminate the use of $V_{max}$ (Clerc and Kennedy, 2002) as follows:

$$v_{ij}(k+1) = \chi \left( v_{ij}(k) + U(0, \phi_1)\,[p_{ij}(k) - x_{ij}(k)] + U(0, \phi_2)\,[p_{gj}(k) - x_{ij}(k)] \right),$$

$$x_{ij}(k+1) = x_{ij}(k) + v_{ij}(k+1); \quad i = 1, \ldots, m; \; j = 1, \ldots, n \tag{4}$$

where $\phi = \phi_1 + \phi_2 > 4$ and $\chi = 2 / \left( \phi - 2 + \sqrt{\phi^2 - 4\phi} \right)$. With $\phi_1 = \phi_2 = 2.05$, $\chi$ becomes 0.7298. Each of the multipliers for the last two terms of velocity Eq. (4) becomes a random number with an upper limit of 1.4962 (= 2.05 × 0.7298). In this work, this model is designated as PSO Model IV. Also, the present paper limits $V_{max}$ to $X_{max}$ (the dynamic range of each variable), thus leading to the PSO model without any problem-specific parameter as proposed by Eberhart and Shi (2000). In the present work, all four models of PSO (I–IV) were applied to examine their effects on the classification success of the present problem.
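The constriction coefficient quoted above can be checked numerically. The short Python helper below is a sketch; the function name is hypothetical.

```python
import math

def constriction(phi1=2.05, phi2=2.05):
    """Constriction coefficient chi = 2 / (phi - 2 + sqrt(phi^2 - 4*phi))."""
    phi = phi1 + phi2
    if phi <= 4.0:
        raise ValueError("phi1 + phi2 must exceed 4")
    return 2.0 / (phi - 2.0 + math.sqrt(phi * phi - 4.0 * phi))
```

With the default arguments, the value agrees with 0.7298 to four decimal places, and multiplying it by 2.05 gives the upper limit of about 1.4962 for the random multipliers.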

4. Artificial neural networks (ANNs) ANNs have been developed in the form of parallel distributed network models based on the biological learning process of the human brain. There are numerous applications of ANNs in data analysis, pattern recognition and control (Haykin, 1999). Among different types of ANNs, multi-layer perceptron (MLP) neural networks are quite popular and used for the present work. Here, a brief introduction to MLPs is given to explain the basic steps and different parameters involved. Readers are referred to text (Haykin, 1999) for details. MLPs consist of an input layer of source nodes, one or more hidden layers of computation nodes or ‘neurons’, and, an output layer. The number of nodes in the input and the output layers depend on the number of input and output variables, respectively. The number of hidden layers and the number of nodes in each hidden layer affect the generalization capability of the network. For smaller number of hidden layers and neurons, the performance may not be adequate, whereas with too many hidden nodes, it may have the risk of over-fitting the training data and poor generalization on the new data. There are various methods, both heuristic and systematic, to select the number of hidden layers and the nodes (Haykin, 1999). For illustration, a typical MLP


is considered consisting of three layers with l, N and M nodes for the input, hidden and output layers, respectively. The input vector $x = (x_1\; x_2\; \ldots\; x_l)^T$ is transformed to an intermediate vector of 'hidden' variables u using the activation function $\varphi_1$. The output $u_j$ of the jth node ($j = 1, \ldots, N$) in the hidden layer is obtained as follows:

$$u_j = \varphi_1\!\left( \sum_{i=1}^{l} w^1_{i,j}\, x_i + b^1_j \right) \tag{5}$$

where $b^1_j$ and $w^1_{i,j}$ represent, respectively, the bias and the weight of the connection between the jth node in the hidden layer and the ith input node. The superscript 1 represents the connection (first) between the input and the hidden layers. The output vector $y = (y_1\; y_2\; \ldots\; y_M)^T$ of the network is obtained from the vector of intermediate variables u through a similar transformation using the activation function $\varphi_2$ at the output layer. For example, the output of the neuron k ($k = 1, \ldots, M$) can be expressed as follows:

$$y_k = \varphi_2\!\left( \sum_{l=1}^{N} w^2_{l,k}\, u_l + b^2_k \right) \tag{6}$$
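Eqs. (5) and (6) amount to two matrix-vector products followed by activation functions. The NumPy sketch below illustrates this under assumptions: the weight matrices are stored with shapes (l, N) and (N, M), and the hyperbolic tangent is used as a stand-in sigmoidal activation.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, act1=np.tanh, act2=np.tanh):
    """Forward pass of a three-layer MLP.

    W1 has shape (l, N) and W2 has shape (N, M), so that W1[i, j] is the
    weight between input node i and hidden node j (assumed layout).
    """
    u = act1(W1.T @ x + b1)   # Eq. (5): hidden-layer outputs u_j
    y = act2(W2.T @ u + b2)   # Eq. (6): network outputs y_k
    return y
```

For example, with l = 3 inputs, N = 10 hidden nodes and M = 2 outputs, `W1` would be a 3 × 10 array and `W2` a 10 × 2 array.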

where the superscript 2 denotes the connection (second) between the neurons of the hidden and the output layers. There are several forms of activation functions $\varphi_1$ and $\varphi_2$, such as the logistic function, hyperbolic tangent and piece-wise linear functions. The training of an MLP network involves finding values of the connection weights and biases which minimize an error function between the actual network output and the corresponding target values in the training set. One of the widely used error functions is the mean square error (MSE), and the most commonly used training algorithms are based on back-propagation. In the present work, an MLP with one hidden layer was used. The input layer had nodes representing the normalized input features. The number of input nodes (l) was varied from 3 to 5 and the number of output nodes was two. The number of hidden nodes (N) was varied between 10 and 30 based on prior trials, and the actual value of N was selected automatically in PSO. The target value of the first output node was set to 1 and 0 for normal and failed bearings, respectively, and the values were interchanged (0 and 1) for the second output node. Sigmoidal activation functions were used in the hidden and the output layers to keep the outputs bounded (magnitude at most 1). The Levenberg–Marquardt training algorithm was used along with back-propagation. The ANN was trained iteratively using the training dataset to minimize the performance function of MSE between the network outputs and the corresponding target values. No validation data were used in the present work. The prediction performance of the trained MLP was assessed using the test dataset, which had no part in training. The gradient of the performance function (MSE) was used to adjust the network weights and biases. In this work, an MSE of $10^{-2}$, a minimum gradient of $10^{-6}$ and a maximum iteration number (epoch) of 500 were used. The training process would stop if any of these conditions were met.
The initial weights and biases of the network were generated automatically by the program. In the PSO based approach presented here, the actual features (i.e., their indices i, $1 \le i \le 45$) and the number of neurons (N, $10 \le N \le 30$) in the hidden layer were automatically selected by minimizing the classification error for the training dataset.

5. Support vector machines

SVMs are based on the statistical learning theory introduced by Vapnik in the late 1960s. However, since the middle of the 1990s, the algorithms used for SVMs started emerging with greater availability of computing power, paving the way for numerous practical applications (Vapnik, 1999, 2000; Burges, 1998; Gunn,



1998; Guyon and Christianini, 1999; Scholkopf, 1998). The basic SVM deals with two-class problems in which the data are separated by a hyperplane defined by a number of support vectors (SVs). A simple introduction to SVM is presented here for better understanding of the steps and the relevant parameters involved. Readers are referred to the tutorials on SVMs (Burges, 1998; Scholkopf, 1998) for details. The basic principles of SVM are explained through an example of a two-class dataset consisting of r points in l-dimensional real space as $S = \{x_i, y_i\}$, $i = 1, 2, \ldots, r$, $x_i \in R^l$. Each l-dimensional vector $x_i$ belongs to one class $y_i \in \{-1, 1\}$, also designated as $A^-$ and $A^+$, respectively. The full dataset is represented by the matrices $A = [x_1, x_2, \ldots, x_r]^T$ and $D = \mathrm{diag}(y_i)$ with sizes r × l and r × r, respectively. The aim of SVM is to find an optimal separating hyperplane $w^T x = \gamma$ maximizing the margin between two bounding planes ($w^T x = \gamma + 1$ and $w^T x = \gamma - 1$) which bound most of the data points (Fig. 3) and minimizing the error corresponding to misclassification. The orientation vector w ($w \in R^l$) is normal to the bounding planes and the bias $\gamma$ ($\gamma \in R$) denotes the location of the separating plane from the origin. The margin between the bounding planes is $2/\|w\|_2$. The standard SVM is formulated as a constrained quadratic minimization problem as follows (Vapnik, 1999):

$$\text{Minimize} \quad Q(w, \xi) = \tfrac{1}{2} w^T w + C e^T \xi \tag{7}$$

such that

$$D(Aw - e\gamma) + \xi \ge e \tag{8}$$

$$\xi \ge 0$$

where

$$\xi = (\xi_1, \xi_2, \ldots, \xi_r)^T, \qquad e = (1, 1, \ldots, 1)^T.$$

The $\xi_i$ are termed slack (error) variables and e represents an r × 1 vector of ones. The first part of the objective function (7) denotes the inverse of the margin between the bounding hyperplanes and the second part represents the contribution of error, with C as a user-defined positive parameter controlling the relative weight of error. The quadratic programming (QP) problem of (7)–(8) is

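transformed to a dual optimization problem as follows:

$$\text{Maximize} \quad L_D = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{r} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \tag{11}$$

subject to

$$0 \le \alpha_i \le C, \tag{12}$$

$$\sum_{i=1}^{r} \alpha_i y_i = 0 \tag{13}$$

where $\alpha_i$, $i = 1, \ldots, r$, represent the non-negative Lagrange multipliers. The solution of (11)–(13) is given as follows:

$$w = \sum_{i=1}^{r} \alpha_i y_i x_i \tag{14}$$

$$\gamma = \frac{1}{N_S} \sum_{i=1}^{N_S} \left( w^T x_i - y_i \right) \tag{15}$$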

where $N_S$ is the number of SVs corresponding to the non-zero Lagrange multipliers ($\alpha_i$). The SVs are the critical elements of the training set as these contain all the necessary information to determine the separating hyperplane, and all other data in the training set can be ignored for further analysis. The separating hyperplane acts as a classifier for any sample vector x (of length l) as follows:

$$x^T w - \gamma \begin{cases} > 0 & \text{then } x \in A^+ \\ < 0 & \text{then } x \in A^- \\ = 0 & \text{then } x \in A^+ \text{ or } x \in A^- \end{cases} \tag{16}$$

In cases where a linear boundary in the input space is not enough to separate the two classes properly, it is possible to create a hyperplane that allows linear separation in a higher dimension (corresponding to a curved surface in the lower-dimensional input space). In SVMs, this is achieved through the use of a transformation $\phi(x)$ that converts the data from an l-dimensional input space to a q-dimensional feature space:

$$s = \phi(x) \tag{17}$$

where $x \in R^l$ and $s \in R^q$. The nonlinear boundary in the input space gets transformed into a linear boundary in the feature space (Fig. 4). The objective function of the dual optimization problem of (11), with the same constraints, gets modified as follows:

$$\text{Maximize} \quad L_D = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{r} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j) \tag{18}$$


The transformation into the higher-dimensional feature space is relatively computation-intensive. A kernel can be used to perform this transformation and the product in a single step. This helps in reducing the computational load while retaining the effect of the higher-dimensional transformation. The kernel function $K(x_i, x_j)$ is defined as

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \tag{19}$$

The separating hyperplane is accordingly modified for classification of any sample data (x) as

$$\sum_{i=1}^{N_S} \alpha_i y_i K(x, x_i) - \gamma \begin{cases} > 0 & \text{then } x \in A^+ \\ < 0 & \text{then } x \in A^- \\ = 0 & \text{then } x \in A^+ \text{ or } x \in A^- \end{cases} \tag{20}$$

There are different kernel functions, such as polynomial, sigmoid and radial basis function (RBF), used in SVM. In the present work, two commonly used kernel functions, namely polynomial and RBF, are used:

$$K_{poly}(x_1, x_2) = (x_1^T x_2 + 1)^d \tag{21}$$

$$K_{RBF}(x_1, x_2) = \exp\!\left( -\|x_1 - x_2\|^2 / 2\sigma^2 \right) \tag{22}$$

Fig. 3. SVM.

Fig. 4. SVM nonlinear to linear mapping.


The exponent (d) and the width of the RBF kernel parameter ($\sigma$) can be determined in general by an iterative process selecting an optimum value based on the full feature set. In case there is an overlap between the classes with nonseparable data, the range of the parameters $\alpha_i$ can be limited to reduce the effect of outliers on the boundary defined by the SVs. For the nonseparable case, the constraint is modified ($0 < \alpha_i < C$). For the separable case, C is infinity, while for the nonseparable case it may be varied, depending on the number of allowable errors in the trained solution: few errors are permitted for high C while low C allows a higher proportion of errors in the solution. To control the generalization capability of SVM, there are a few free parameters such as the limiting term C and the kernel parameters d or $\sigma$. In the present work, C was chosen as 100 (Samanta et al., 2003), and the kernel parameter (d or $\sigma$) and l input features (i.e., their indices i, $1 \le i \le 45$) were selected automatically using the PSO based approach. The numerical implementation of SVM was mainly based on QP with options of decomposing a large-scale QP problem into a series of smaller-size QP problems. In the present work, the SVMs were trained using an adapted version of decomposition methods and working set selection strategies similar to that of Joachims (1999).
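The polynomial and RBF kernels used in this work can be written in a few lines of Python. This is a sketch for illustration; the function names are hypothetical.

```python
import numpy as np

def poly_kernel(x1, x2, d):
    """Polynomial kernel: (x1 . x2 + 1)^d."""
    return (np.dot(x1, x2) + 1.0) ** d

def rbf_kernel(x1, x2, sigma):
    """RBF kernel: exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

A non-integer exponent d is only well defined in this form when `x1 . x2 + 1` is non-negative, which is worth checking for the data at hand.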

6. PSO implementation

In the present work, PSO was used to select the most suitable l input features ($3 \le l \le 5$) from the pool of 45 statistical features extracted from the machine vibration signals for detection of the machine condition (Fig. 1). PSO was also used to select automatically one variable parameter related to the particular classifier, i.e., the number of neurons in the hidden layer (N) for ANNs, and the exponent (d) or the RBF kernel width ($\sigma$) for SVMs. All four variants of the PSO algorithm (I–IV) were implemented. The steps of the PSO implementation are discussed next in relation to Fig. 1.

6.1. Definition of particle position

For both ANN and SVM, each particle position (X) consists of n numbers, as shown in Eq. (23). The first (n−1) entries represent the row numbers (indices i, $1 \le i \le 45$) of the selected features from the total set of statistical features related to the machine condition, and the last one represents the classifier parameter. The first n−1 numbers ($x_j$, $j = 1, \ldots, n-1$) in X are constrained to be in the range $1 \le x_j \le 45$, whereas the last number $x_n$ has to be within the range $S_{min} \le x_n \le S_{max}$. The parameters $S_{min}$ and $S_{max}$ represent, respectively, the lower and the upper bounds on the classifier parameter. In the present work, the number of selected input features (l) was varied between 3 and 5 ($3 \le l \le 5$ within the range of 1–45).

$$X = \{x_1\; x_2\; \ldots\; x_{n-1}\; x_n\}^T \tag{23}$$

6.1.1. ANN training

For ANN, the last element ($x_n$) of the particle position vector X was taken as the number of neurons in the hidden layer (N) in the range of 10 ($S_{min}$) to 30 ($S_{max}$).

6.1.2. SVM training

For SVMs, the last element $x_n$ represents the exponent (d) of the polynomial kernel or the RBF kernel width ($\sigma$). In general, the value of the kernel parameter is expected to have some correspondence with the nature of the data of each class for good classification accuracy. However, through PSO based iterations the final selection can be achieved from an approximate range of values. For the present work, the range for d and $\sigma$ was taken between 0.1 and 2.0 with a step size of 0.1. The selection of the range and the step size was based on the range of the standard deviation of the features and the results of initial trials (Samanta et al., 2003).

6.2. Initialization

The basic parameters for PSO models I–IV (Eqs. (1)–(4)), namely $c_1$, $c_2$, $\omega$, $\chi$, the maximum number of generations ($N_{Gen} = 3000$) and the number of individuals in the population (m = 30), were assigned. Each individual (particle) of the population consists of n elements, with the first (n−1) elements corresponding to the selected input features and the last one corresponding to the classifier parameter (N for ANN and d or $\sigma$ for SVM). Each individual in the population represents a possible solution of the problem of minimizing the classification errors. The solution process would start with the position and velocity of each particle being randomly initialized, keeping the velocity in each direction within the given value ($V_{max}$).

6.3. Fitness function evaluation

The classification error for the training dataset (outputs of ANN or SVM) would be used as the fitness function, and each PSO model would be trained to minimize this fitness function. For each particle, the classification error of the training dataset is evaluated through the ANN or SVM to update the personal best, pbest $p_{ij}$, and the global best, gbest $p_{gj}$.
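The mapping from a particle position to a set of feature rows and a classifier parameter, together with the error-based fitness, might look as follows in Python. This is a hedged sketch: the function names are hypothetical, and rounding continuous coordinates to the nearest row index is an assumption, since the handling of non-integer positions is not described here.

```python
import numpy as np

def decode_particle(position, s_min, s_max, n_features_total=45):
    """Split a particle into 1-based feature indices and a classifier parameter."""
    pos = np.asarray(position, dtype=float)
    # Assumption: continuous coordinates are rounded to the nearest row index
    # and clipped to the valid range 1..n_features_total.
    idx = np.clip(np.rint(pos[:-1]), 1, n_features_total).astype(int)
    # The last coordinate is the classifier parameter, bounded by [s_min, s_max].
    param = float(np.clip(pos[-1], s_min, s_max))
    return idx, param

def classification_error(y_true, y_pred):
    """Fraction of misclassified samples, used as the fitness to minimize."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```

A full fitness evaluation would train the ANN or SVM on the rows selected by `idx` with the decoded parameter and pass its training-set predictions to `classification_error`. Nothing in this sketch prevents the same feature index from being selected twice.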

Table 2
Classification results of SVM with PSO Model I

Number of features   Kernel type   Kernel parameter (d/σ)   Iteration number   Test success (%)   Type I error (%)   Type II error (%)
3                    Poly          0.6                      23                 99.3               0                  1.4
4                    Poly          1.5                      11                 97.9               0                  4.2
5                    Poly          1.9                      12                 97.2               1.4                4.2
3                    RBF           0.6                      61                 98.6               2.8                0
4                    RBF           1.1                      31                 96.5               2.8                4.2
5                    RBF           1.9                      12                 97.9               2.8                1.4

6.4. Updating of velocity and position

At each iteration, the velocities and positions of the particles in the population are updated using Eqs. (1)–(4) for the different PSO models (I–IV).
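The update of Section 6.4 can be illustrated with a short sketch. Eqs. (1)–(4) are not reproduced in this section, so the code below uses the widely used inertia-weight form of the update (Shi and Eberhart, 1998), which may differ in detail from the four models of the paper. The values of `omega`, `c1`, `c2` and `vmax`, and the bounds of the last dimension, are hypothetical.

```python
import random


def pso_step(pos, vel, pbest, gbest, rng, omega=0.7, c1=2.0, c2=2.0,
             vmax=4.0, bounds=None):
    """One inertia-weight PSO update for a single particle.

    v <- omega*v + c1*r1*(pbest - x) + c2*r2*(gbest - x), clamped to +/-vmax
    x <- x + v, clipped to the per-dimension bounds if given
    """
    new_pos, new_vel = [], []
    for j, (x, v) in enumerate(zip(pos, vel)):
        r1, r2 = rng.random(), rng.random()
        v = omega * v + c1 * r1 * (pbest[j] - x) + c2 * r2 * (gbest[j] - x)
        v = max(-vmax, min(vmax, v))       # keep velocity within Vmax
        x = x + v
        if bounds is not None:
            lo, hi = bounds[j]
            x = max(lo, min(hi, x))        # keep position within its range
        new_pos.append(x)
        new_vel.append(v)
    return new_pos, new_vel


rng = random.Random(1)
bounds = [(1, 45)] * 3 + [(1.0, 30.0)]     # last range is hypothetical
pos, vel = [5.0, 20.0, 44.0, 10.0], [0.0, 1.0, 3.5, -2.0]
pbest, gbest = [7.0, 18.0, 45.0, 12.0], [9.0, 30.0, 40.0, 8.0]
pos, vel = pso_step(pos, vel, pbest, gbest, rng, bounds=bounds)
```

Clipping positions to their bounds after the move is one simple way to respect the constraints on the feature indices and the classifier parameter; other boundary treatments are possible.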

6.5. Termination criteria

Three termination criteria were used: (a) reaching the target goal of zero error (no misclassification), (b) rate of convergence (a change in the best fitness value of less than a preset value, 0.01, over a number of iterations, 100) and (c) reaching the maximum number of generations (epochs), which was set at 3000 in the present work. The solution process continues between steps 6.3 and 6.4 until any of the termination criteria of 6.5 is met, as depicted in Fig. 1.
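The three criteria can be checked with a small helper such as the one below. This is a sketch under assumptions: the function name `stop_reason` and the interpretation of criterion (b) as the absolute change in the best fitness between the current generation and the one 100 generations earlier are illustrative choices, and the fitness is assumed to be expressed in the same units as the 0.01 threshold.

```python
def stop_reason(best_errors, goal=0.0, tol=0.01, window=100, max_gen=3000):
    """Return why the search should stop, or None to keep iterating.

    best_errors holds the best fitness value found after each generation.
    """
    if not best_errors:
        return None
    if best_errors[-1] <= goal:
        return "goal"                      # (a) zero misclassification
    if len(best_errors) > window:
        change = abs(best_errors[-1 - window] - best_errors[-1])
        if change < tol:
            return "converged"             # (b) stalled improvement
    if len(best_errors) >= max_gen:
        return "max_generations"           # (c) generation budget exhausted
    return None
```

A PSO loop would append the current global best error to `best_errors` after each pass through steps 6.3 and 6.4 and stop as soon as the helper returns a reason.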

Table 3
Classification results of ANN with six features and different PSO Models (I–IV)

PSO model   Number of hidden layer neurons (N)   Iteration number   Test success (%)   Type I error (%)   Type II error (%)
I           10                                   5                  95.8               1.4                6.9
II          13                                   1                  97.9               1.4                2.8
III         24                                   12                 99.3               0                  1.4
IV          22                                   2                  94.4               0                  11.1

7. Simulation results

The dataset (45 × 144 × 2), consisting of forty-five (45) normalized features for each of the three signals (3), split into 48 bins of 1024 samples each, with two (2) bearing conditions, was divided into two subsets. The first 24 bins of each signal were used for training the ANNs and SVMs, giving a training set of 45 × 72 × 2, and the rest (45 × 72 × 2) was used for testing. For ANNs, the target value of the first output node was set to 1 and 0 for normal and failed bearings, respectively, and the values were interchanged (0 and 1) for the second output node. For SVMs, the target values were specified as 1 and −1, respectively, representing normal and faulty conditions. Results are presented to show the effects of the number of features, PSO models and classifiers (ANN or SVM) on the classification success. The training success for each case with PSO was 100%. The effectiveness of the automatic selection of parameters and features using PSO was also compared with that of a similar number of PCs (Jolliffe, 2002).
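The arithmetic behind the split and the reported percentages can be made explicit with a short sketch. It assumes that the faulty condition is the positive class, so that a type I error is a normal bin classified as faulty and a type II error is a faulty bin classified as normal, and that each rate is taken relative to the test bins of its own class. The counts of 5 and 3 misclassified bins are inferred from the reported percentages, not stated in the paper.

```python
def split_counts(n_signals=3, bins_per_signal=48, train_bins=24):
    """Training and test bins per bearing condition."""
    train = n_signals * train_bins
    test = n_signals * (bins_per_signal - train_bins)
    return train, test


def success_and_errors(n_normal, n_faulty, false_alarms, missed_faults):
    """Test success, type I and type II error rates, all in percent."""
    type1 = 100.0 * false_alarms / n_normal      # normal flagged as faulty
    type2 = 100.0 * missed_faults / n_faulty     # faulty passed as normal
    total = n_normal + n_faulty
    success = 100.0 * (total - false_alarms - missed_faults) / total
    return success, type1, type2


train_per_class, test_per_class = split_counts()
success, type1, type2 = success_and_errors(test_per_class, test_per_class, 5, 3)
print(train_per_class, test_per_class)
print(round(success, 1), round(type1, 1), round(type2, 1))
```

With 72 test bins per condition, 5 false alarms and 3 missed faults give 94.4% success with 6.9% type I and 4.2% type II errors, which matches the four-feature ANN case reported in Table 1. Because the two classes are of equal size, the success rate equals 100% minus the mean of the two error rates.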

7.1. Results with base PSO (Model I)

Table 1 presents the classification results of ANNs with feature selection using base PSO (Model I). The features and the number of neurons in the hidden layer (N) were selected automatically by the PSO algorithm. The test results in terms of correct classification, along with the errors of type I (false positive) and type II (false negative), are shown. The selected number of hidden layer neurons and the iteration numbers for training the ANNs are also presented for different numbers of selected features (3–5). The number of iterations increased as the number of features decreased. The test classification success was 100% with 3 and 5 features. Errors of type I and II were 6.9% and 4.2%, respectively, with four features.

Table 1
Classification results of ANN with PSO Model I

Number of features   Number of hidden layer neurons (N)   Iteration number   Test success (%)   Type I error (%)   Type II error (%)
3                    28                                   85                 100                0                  0
4                    21                                   20                 94.4               6.9                4.2
5                    10                                   7                  100                0                  0

Classification results of SVMs with base PSO (Model I) are presented in Table 2. Two types of kernel functions (polynomial and RBF) were considered. The number of iterations in training the SVMs increased with a decrease in the number of features. The classification success varied between 96.5% and 99.3%. Errors of type I and II were in the ranges of 0–2.8% and 0–4.2%, respectively. The difference in classification performance between ANN and SVM was not very significant. However, the training time for ANN was much higher than for SVM.

7.2. Results with different PSO models

Table 3 presents the classification results of ANN with different PSO algorithms (Models I–IV) for the same number of selected features (6) and a population size of 30. Test success was in the range of 94.4–99.3%. Misclassification rates were 0–4.2% (type I) and 1.4–11.1% (type II). There was no significant difference in classification success among the different PSO models. A lower population size was chosen for the higher number of features. The results showed no significant change when the population size was increased to 60. Classification results of SVMs with polynomial and RBF kernels are presented in Table 4 for different PSO models (I–IV) using six features and a population size of 30. Correct classification did not vary much (95–98.6%) with the PSO model. Misclassification errors were also reasonable, with 1.4–4.2% for type I and 0–5.6% for type II.

7.3. Separability of datasets

To investigate the separability of datasets with and without bearing faults, plots of three features selected by PSO in ANN and SVM are shown in Figs. 5(a) and (b), respectively. In the case of ANN, Fig. 5(a), the data clusters are separated quite clearly, explaining the 100% classification success even with three features. In Fig. 5(b) (using SVM), the datasets are separated with some amount of overlap. This would explain the lower classification success for SVMs, although it is quite close to 100% and not much different from that of ANN.

Fig. 5. Scatter plots of three selected features (a) ANN and (b) SVM.

Table 4
Classification results of SVM with six features and different PSO Models (I–IV)

PSO model   Kernel type   Kernel parameter (d/σ)   Iteration number   Success (%)   Type I error (%)   Type II error (%)
I           Poly          1.21                     76                 97.9          2.8                1.4
II          Poly          0.66                     2                  98.6          1.4                1.4
III         Poly          1.20                     3                  98.6          1.4                1.4
IV          Poly          1.93                     166                95.1          4.2                5.6
I           RBF           0.13                     16                 97.9          2.8                1.4
II          RBF           1.12                     1                  98.6          1.4                1.4
III         RBF           0.60                     9                  98.6          2.8                0
IV          RBF           0.88                     41                 98.6          2.8                0

7.4. Comparison with PCA

PCA is used to reduce the dimensionality of data by forming a new set of variables, known as PCs, representing the maximal variability in the data without loss of information (Jolliffe, 2002). Fig. 6 shows the plots of the first three PCs for the normalized feature sets. These PCs account for more than 60% of the variability of the datasets. The separation between the data clusters for the two classes is not very prominent. The classification success of using the first three to six PCs of the normalized data in ANNs and SVMs is presented in Table 5. Training success was 100% in some cases and quite low (78.5–87.5%) in others. Test success was found to be very unsatisfactory: 68–72.2% for ANNs, 54.2–68.8% for RBF SVM and 52.8–63.9% for polynomial SVM, compared to the feature selection procedure of PSO, which gave almost 100% classification success for both classifiers (ANN and SVM). The deterioration of test success with 3–6 PCs may be attributed to the insufficiency of the PCs in representing the machine conditions. This also shows the superiority of the present approach of PSO based feature selection over using the PCs.

Fig. 6. Scatter plot of first three principal components (PCs).

Table 5
Classification results with different number of PCs with ANN and SVM

Number of PCs   ANN training (%)   ANN test (%)   RBF SVM training (%)   RBF SVM test (%)   Polynomial SVM training (%)   Polynomial SVM test (%)
3               100                68.1           80.6                   68.8               73.6                          63.9
4               100                72.2           78.5                   68.1               75.7                          58.3
5               98.6               70.8           80.6                   66.7               78.5                          59.0
6               100                72.2           87.5                   54.2               86.1                          52.8

To summarize, the classification success of ANNs and SVMs with PSO based selection of features and classifier parameters was close to 100% for most of the test cases. The effects of different PSO models on the classification success were not significant. The test success of the PSO based approach was comparable to that of GA for the same dataset. Results were better than using the entire feature set, which gave a classification success of 85% for ANN and 99% for SVM (Samanta et al., 2003). The numbers of iterations for training the classifiers were substantially smaller than with the GA based approach. Computation time (on a PC with a 1.83 GHz Pentium processor and 1 GB RAM) for SVM training was in the range of 0.11–0.25 s, which was much lower than the training time of ANNs (0.86–222.95 s). However, direct comparison of computation time is difficult due to the difference in code efficiency.
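The statement in Section 7.4 that the first three PCs account for more than 60% of the variability refers to a cumulative explained-variance ratio, i.e., the sum of the largest covariance eigenvalues divided by the sum of all of them. The sketch below shows only this calculation; the eigenvalues are hypothetical and are not values from the paper.

```python
def cumulative_explained(eigenvalues):
    """Cumulative fraction of total variance, largest component first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, out = 0.0, []
    for lam in ordered:
        running += lam
        out.append(running / total)
    return out


# Hypothetical covariance eigenvalues, not values from the paper.
eigs = [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
ratios = cumulative_explained(eigs)
first_three = ratios[2]   # share of variance captured by the first three PCs
```

A high explained-variance ratio does not by itself imply class separability, which is consistent with the low test success reported for the PCs in Table 5.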

8. Conclusions

Results are presented for the diagnosis of bearing condition using two classifiers, namely ANNs and SVMs, with PSO based feature selection from time domain vibration signals. The input features and the appropriate classifier parameters were selected using the PSO based approach. The roles of the number of features, PSO models and the classifiers were investigated. The difference in classification results among the different PSO models was not significant. The performance of ANNs and SVMs was similar with feature selection. The results were substantially better than those obtained using the first few (3–6) PCs. The results using selected features were much better than those using the entire feature set. The use of PSO with only 3–6 features gave almost 100% classification success for both ANNs and SVMs. The training time was substantially less for SVMs than for ANNs. The results show the potential of PSO for the selection of features and classifier parameters in machine condition detection.

Acknowledgments

The work was carried out with partial support from Naval Sea Systems Command (NAVSEA) grant N00024-07-C-4212. This support of NAVSEA is gratefully acknowledged (Monitors: Marc Steinberg, R. Wagner and John Metzer).

References

Burges, C.J.C., 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 955–974.
Clerc, M., Kennedy, J., 2002. The particle swarm-explosion, stability, and convergence in a multidimensional complex space. IEEE Transactions on Evolutionary Computation 6, 58–73.
Eberhart, R., Shi, Y., 2000. Comparing inertia weights and constriction factors in particle swarm optimization. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC). IEEE, San Diego, CA, Piscataway, pp. 84–88.
Gunn, S.R., 1998. Support vector machines for classification and regression. Technical Report, Department of Electrical and Computer Science, University of Southampton.
Guyon, I., Christianini, N., 1999. Survey of support vector machine applications. In: Proceedings of NIPS’99 Special Workshop on Learning with Support Vector /
Haykin, S., 1999. Neural Networks: A Comprehensive Foundation, second ed. Prentice-Hall, Englewood Cliffs, NJ, USA.
Jack, L.B., Nandi, A.K., 2000a. Genetic algorithms for feature extraction in machine condition monitoring with vibration signals. IEE Proceedings - Vision Image and Signal Processing 147, 205–212.
Jack, L.B., Nandi, A.K., 2000b. Support vector machines for detection and characterisation of rolling element bearing faults. Proceedings of Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 215, 1065–1074.
Jack, L.B., Nandi, A.K., 2002. Fault detection using support vector machines and artificial neural networks, augmented by genetic algorithms. Mechanical Systems and Signal Processing 16, 373–390.
Joachims, T., 1999. Making large-scale SVM learning practical. In: Scholkopf, B., Burges, C.J., Simola, A. (Eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, USA, pp. 169–184.
Jolliffe, I.T., 2002. Principal Component Analysis, second ed. Springer, New York, USA.
Kennedy, J., Eberhart, R.C., 1995. Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, IV. IEEE Service Center, Piscataway, NJ, pp. 1942–1948.
Kennedy, J., Eberhart, R.C., Shi, Y., 2001. Swarm Intelligence. Morgan Kaufmann, San Francisco.
Lin, S.-W., Ying, K.-C., Chen, S.-C., Lee, Z.-J., 2008. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications 35, 1817–1824.
McCormick, A.C., Nandi, A.K., 1997. Classification of the rotating machine condition using artificial neural networks. Proceedings of IMechE, Part C: Journal of Mechanical Engineering Science 211, 439–450.
Poli, R., 2008. Analysis of the publications on the applications of particle swarm optimization. Journal of Artificial Evolution and Applications, doi: 10.1155/2008/685175, 10 p.
Poli, R., Kennedy, J., Blackwell, T., 2007. Particle swarm optimization: an overview. Swarm Intelligence 1, 33–57.
Samanta, B., 2004a. Gear fault detection using artificial neural networks and support vector machines with genetic algorithms. Mechanical Systems and Signal Processing 18, 625–644.
Samanta, B., 2004b. Artificial neural networks and genetic algorithms for gear fault detection. Mechanical Systems and Signal Processing 18, 1273–1282.
Samanta, B., Al-Balushi, K.R., 2003. Artificial neural network based fault diagnostics of rolling element bearings using time domain features. Mechanical Systems and Signal Processing 17, 317–328.
Samanta, B., Al-Balushi, K.R., Al-Araimi, S.A., 2003. Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection. Engineering Applications of Artificial Intelligence 16, 657–665.
Samanta, B., Al-Balushi, K.R., Al-Araimi, S.A., 2006. Artificial neural networks and genetic algorithm for bearing fault detection. Journal of Soft Computing 10, 264–271.
Scholkopf, B., 1998. SVMs - a practical consequence of learning theory. IEEE Intelligent Systems 13, 18–19.
Shi, Y., Eberhart, R.C., 1998. A modified particle swarm optimizer. In: Proceedings of IEEE International Conference on Evolutionary Computation, Piscataway, NJ, pp. 69–73.
Trelea, I.L., 2003. The particle swarm optimization algorithm: convergence analysis and parameter selection. Information Processing Letters 85, 317–325.
Vapnik, V.N., 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10, 988–999.
Vapnik, V.N., 2000. The Nature of Statistical Learning Theory, second ed. Springer, NY.
Yuan, S.-F., Chu, F.-L., 2007. Fault diagnostics based on particle swarm optimization and support vector machines. Mechanical Systems and Signal Processing 21, 1787–1798.