Design of local fuzzy models using evolutionary algorithms

Computational Statistics & Data Analysis 51 (2006) 398–416, www.elsevier.com/locate/csda

Piero P. Bonissone*, Anil Varma, Kareem S. Aggour, Feng Xue
General Electric Global Research, One Research Circle, Niskayuna, NY 12309, USA
Available online 11 May 2006

Abstract

The application of local fuzzy models to determine the remaining life of a unit in a fleet of vehicles is described. Instead of developing individual models based on the track history of each unit, or developing a global model based on the collective track history of the fleet, local fuzzy models are used based on clusters of peers: similar units with comparable utilization and performance characteristics. A local fuzzy performance model is created for each cluster of peers. This is combined with an evolutionary framework to maintain the models. A process has been defined to generate a collection of competing models, evaluate their performance in light of the currently available data, refine the best models using evolutionary search, and select the best one after a finite number of iterations. This process is repeated periodically to automatically update and improve the overall model. To illustrate this methodology, an asset selection problem has been identified: given a fleet of industrial vehicles (diesel electric locomotives), select the best subset for mission-critical utilization. To this end, the remaining life of each unit in the fleet is predicted. The fleet is then sorted using this prediction and the highest ranked units are selected. A series of experiments using data from locomotive operations was conducted and the results from an initial validation exercise are presented. The approach of constructing local predictive models using fuzzy similarity with neighboring points along appropriate dimensions is not specific to any asset type and may be applied to any problem where the premise of similarity along chosen attribute dimensions implies similarity in predicted future behavior.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Fuzzy models; Evolutionary algorithms; Instance-based models; Similarity measures; Prediction

1. Introduction

Soft computing (SC) literature is expanding at a rapid pace, as evidenced by the numerous congresses, books, and journals devoted to the topic. Its original definition, provided by Zadeh (1994), denotes systems that "…exploit the tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, low solution cost, and better rapport with reality." As discussed in previous articles (Bonissone, 1997; Bonissone et al., 1999), we view soft computing as the synergistic association of computing methodologies that includes as its principal members fuzzy logic, neurocomputing, evolutionary computing, and probabilistic computing. We also stress the synergy derived from hybrid SC systems that are based on a loose or tight integration of their constituent technologies. This integration provides complementary reasoning and search methods that allow us to combine domain knowledge and empirical data to develop flexible computing tools and solve complex problems.

* Corresponding author. Tel.: +1 518 387 5155; fax: +1 518 387 6845. E-mail address: [email protected] (P.P. Bonissone). doi:10.1016/j.csda.2006.04.013


1.1. Fuzzy and statistical models

From the breadth of SC components, we focus on the use of (local) fuzzy models in our case study, with the use of evolutionary computing as a methodology to design the local fuzzy models. It has been noted that analytical models capture well-understood systems and relations involving few variables, while statistical models operate at the other extreme, dealing with large amounts of randomness and large numbers of variables (Klir and Yuan, 1995). These two extremes correspond to the concepts of organized simplicity and disorganized complexity, originally proposed by Weaver (1948). Fuzzy models are commonly used outside of these two extremes, when there is enough vagueness in the relationships to prevent us from applying analytical models, and yet there is not enough data to allow for meaningful statistical analysis. In the literature we can find many hybrid fuzzy-statistical approaches, such as statistical decisions using fuzzy probabilities or fuzzy utilities, testing fuzzy hypotheses, and regression analysis with fuzzy data, to name a few. Readers interested in the intersection and synergy between fuzziness and statistics should refer to Corral et al. (1998) and Ralescu (1996). In our work we focus on the development of local fuzzy models, which are closely related to kernel regression and locally weighted regression. We will review the latter under related work in Section 3.2, while the former will be described in Section 4.2.

1.2. Characteristics of real-world applications

When addressing real-world problems, we are typically dealing with systems that are ill defined, difficult to model, and possess large solution spaces. Under these circumstances, precise models are usually impractical, too expensive, or non-existent. Therefore, we need to generate approximate solutions by leveraging the two types of resources that are generally available: problem domain knowledge of the process (or product) and field data that characterize the system's behavior. The relevant available domain knowledge is typically a combination of first principles and empirical knowledge, and is often incomplete and sometimes erroneous. The available data are typically a collection of input-output measurements representing instances of the system's behavior, and are generally incomplete and noisy. Soft computing is a flexible framework in which we can find a broad spectrum of design choices to perform the integration of knowledge and data in the construction of approximate models.

In real-world applications, before we can use a model in a production environment we must address the model's entire lifecycle, from its design and implementation to its validation, tuning, production testing, use, monitoring, and maintenance. By maintenance we mean all the steps required to keep the model vital (e.g., non-obsolete) and able to adapt to changes. Two reasons justify our focus on maintenance. First, over the lifecycle of the model, maintenance costs are the most expensive (just as maintenance is the most expensive phase in the life of a software system). Second, when dealing with mission-critical software we need to guarantee continuous operation, or at least fast recovery from system failures or model obsolescence, to avoid lost revenues and other business costs.

In the next section we describe a variety of soft computing models and some typical search methods that could be used to generate those models. Then we cover related work in case-based reasoning, statistics, and evolutionary search for model design. In Section 4 we describe the fuzzy instance based models, while in Section 5 we explain the use of evolutionary algorithms for their design. Section 6 illustrates a case study with a fleet of locomotives and describes the experiments performed to validate our approach, together with their results. In Section 7 we show the impact of model updates, and in Section 8 we present our conclusions.

2. Models and fuzzy models

2.1. Model generation

In general terms, we can consider a model to be characterized by its representation (structural and parametric information) and its associated reasoning mechanism, which is usually related to the representation. The generation (and updating) of an optimal model requires a search method to define the model's representation and to characterize its reasoning mechanism. We will illustrate this concept with examples taken from conventional and soft computing models.


2.1.1. Differential equations

Classical engineering models based on linear differential equations have a representation that can be decomposed into a structure (the order of the differential equation, which determines the number of required coefficients) and a set of parameters (the values of those coefficients). The reasoning mechanism is based on the (exact or approximated) solution of the differential equations. The search method used to generate the model could be based on energy-balancing approaches (e.g., bond graphs), least mean squared error minimization, etc.

2.1.2. Bayesian belief networks and neural networks

For graphical models we have a similar situation. In the case of Bayesian belief networks (BBNs), their structure (graph topology) and parameters (prior probabilities on the nodes and conditional probabilities on the links) define the BBN's representation. Their underlying reasoning mechanism is based on conditioning, and consists of calculating and propagating posterior probabilities along the network topology according to Bayes' rule. The search methods used to generate BBNs range from manual construction to evolutionary algorithms (EAs) (Larrañaga and Lozano, 2002) and more. Other graphical models, such as neural networks (NNs), have a similar representation: their structure is the network topology, while their parameters are the biases on the nodes and the weights on the links. Reasoning with NNs is a similar process, consisting of evaluating and propagating values along the network topology according to the operators defined in each neuron. The search methods used to generate the NN structure include manual design, forward or backward pruning, and evolutionary algorithms (Vonk et al., 1997; Yao, 1999). Both local search methods, such as backpropagation and conjugate gradient, and global search approaches, such as evolutionary algorithms, can be used to generate the NN parameters.

2.1.3. Fuzzy systems and radial basis functions

Many knowledge-based models, which on the surface might appear quite different from graphical models, can be easily transformed and represented graphically. For instance, the domain knowledge in a fuzzy system is usually represented by a set of if-then rules that approximate a mapping from a state space $\bar{X}$ to an output space $Y$. In a Mamdani-type fuzzy system (Mamdani and Assilian, 1975) the knowledge base (KB) is completely defined by a set of scaling factors that determine the ranges of values for the state and output variables, a term set defining the membership function of the values taken by each state and output variable, and a rule set characterizing a syntactic mapping of symbols from $\bar{X}$ to $Y$. The structure of the underlying model is the rule set, while the parameters are the scaling factors and term sets. The reasoning mechanism is based on generalized modus ponens, and consists of interpolating among the outputs of all relevant rules. A Takagi–Sugeno–Kang (TSK)-type fuzzy system (Takagi and Sugeno, 1985) increases its representational power by allowing the use of a first-order polynomial, defined on the state space, as the output of each rule in the rule set. This enhanced representational power, at the expense of local legibility (Babuska et al., 1994), results in a model that is equivalent to radial basis functions (RBFs) (Bersini et al., 1995). RBFs have a topology similar to classical NNs, with the sigmoid node replaced by one containing a localized distribution. The same model can be translated into a structured network, such as in adaptive neural fuzzy inference systems (ANFIS) (Jang, 1993). In ANFIS, the rule set determines the topology of the net (model structure), while dedicated nodes in the corresponding layers of the net (model parameters) define the term sets and the polynomial coefficients. In all of these cases, the search method for the structure is either manual or based on evolutionary algorithms, while the determination of the parameters is usually based on local (backpropagation) or global (EA) searches.

2.1.4. Instance-based models

Non-graphical models, such as instance-based models, follow a similar pattern. Their representation contains structural information (such as the dimensionality and definition of their attribute space) and parametric information (such as the weights assigned to each attribute). Their reasoning mechanism is based on the retrieval of instances, a similarity-driven ordering of the instances, the evaluation and potential modification of their outcomes, and an aggregation of the (modified) outputs. This mechanism might require the values of some parameters, such as the relevance of each attribute in computing similarity, the extent of the retrieval, the type of aggregation, etc. For simplicity, we include all these parameters in the model representation.


2.2. Search

Regardless of the underlying representation of the model, evolutionary algorithms can provide a robust, global search method to define the model's structure and parameters. Given a fixed model structure, EAs can be used to determine the inputs and pre-processors (attribute selection, weighting, and construction) as well as the parameter values required by the models. The appealing aspect of this approach is the ease of incorporating knowledge in the EA to control the search, as described by Bonissone et al. (2006). Furthermore, the same methodology can be applied after deployment to retune and/or reconfigure the model, extending its vitality. It is also possible to intertwine global search (more robust but less efficient) with local search approaches such as greedy induction, gradient-based techniques, etc., in the quest to design and maintain better models.

3. Related work

3.1. Case-based reasoning

In the maintenance example detailed in Section 6, we view an equipment fleet as a case base whose cases evolve with time, with extended time histories describing their state. The notion of a 'case' in a case-based system is often episodic in nature. A case traditionally captures a set of attributes that define an episode of interest at a particular instant in time (whether a customer call, a meal plan, or a failure requiring diagnosis), with an associated outcome. The episode, once captured as a case, is itself not expected to change (though its relevance calculation may be weighted by the age of the case). The notion of peers and peer-based reasoning is a useful specialization of the general case-based reasoning (CBR) approach when dealing with a case base of complex equipment. Representing and reasoning with time-extended cases has been discussed by a variety of authors in different domains. The notion that the state of each case in a case base can be a function of time was referred to as 'continuous case-based reasoning' and identified as distinct from discrete, symbolic representations by Ram and Santamaria (1997). They describe continuous CBR in the context of a driving task, where problem solving is incremental due to limited prior knowledge, and continuous adaptation and learning are essential to incorporate new experiences. An important issue raised by the authors is case representation: whether the entire experience to date is a single case, or whether there should be a criterion that defines the scope of a single case. In our application, a major maintenance or overhaul could define such a splitting criterion. Jaczynski (1997) focuses on the retrieval aspects of cases that reflect time-extended situations. He distinguishes between sampled numeric series and event-based time series. In our application, each unit is associated with a sequence of maintenance events, and to that extent meets the second definition. While we have not made an attempt to characterize the series of maintenance events in a richer context, this is a logical extension of our work. Schlaefer et al. (2001) and Fritsche et al. (2002) describe using CBR on a series of medical tests for kidney transplant recipients. Their data are described as a "series of infrequent measurements at irregular intervals." They utilize dynamic time warping to normalize multiple series for similarity computation and retrieval. In our application, events in a locomotive's history are not represented as a time series to be used for similarity matching.
We use the ‘odometer’ approach, where the entire history is summarized in a state vector. We feel that this is a limitation, since the type and sequence of maintenance events, upgrades, and missions are likely to have a significant impact on readiness. In our approach we also address the concept of model maintenance, including both adaptation and optimization, and its particular importance in peer-based reasoning. As time passes, the attributes of the object under consideration change, and so do its peers. This is in addition to the expected turnover in the case base as units are decommissioned and new units enter service. The learning task here is to update the ‘similarity metric’ or the criterion for peer selection. Research in the CBR community has focused extensively on case base maintenance. Leake and Wilson (1998), in their review of CBR maintenance dimensions, point out that the indexing scheme is an integral part of the case base along with the cases themselves. Smyth (1998) describes a pruning approach for case deletion with minimal impact on performance. Zhang and Yang (1998) also stress index maintenance to keep a CBR system current. They propose an iterative approach to weight refinement. An evolutionary algorithm handles the corresponding function in our application. Leake and Wilson (1999) explicitly address the situation when changing tasks and environment could render part of the case base obsolete or invalid. They identify problem-solution regularity as a basic premise of CBR and advocate monitoring performance over time to spot a decline in this measure.


Fig. 1. Locally weighted regression models.

In our application, unit-to-peer irregularity over time is almost a given, and the analogous issue becomes how often to update the 'peer selection' or similarity metric to maintain performance while incorporating new knowledge. In our experiments, similarity metric updates occur at each of three time slices. The results presented will highlight the need for continual adaptation and parameter optimization.

3.2. Related statistical models: locally weighted regression

The idea of avoiding pre-constructed models and creating a local model when needed can be traced back to memory-based approaches (Atkeson, 1992; Atkeson et al., 1997) and lazy learning (Bersini et al., 1998). As noted by Lejeune (1985), kernel regressions are equivalent to locally weighted regressions for data points located away from the boundaries within which they are distributed. Locally weighted regression is a further development (generalization) of kernel regression. In the early development of kernel regression, the local model is zero order (e.g., a weighted average), while locally weighted regression often uses higher-order local models. Linear models are often used due to the limited number of local data points and for better generalization. The construction of locally weighted regression is succinctly illustrated in Fig. 1. Given a query Q defined in the state space X by its coordinate-vector $\bar{X}_Q$, we compute a partial matching between the query and each stored point. Each stored data point $(\bar{X}_j, Y_j)$ is weighted according to its proximity to the query. Such proximity is measured by a distance $d(\bar{X}_j, \bar{X}_Q)$, which could be interpreted as the dissimilarity between the two points. The distance is smoothed by a parameter $h$ that defines the scale or range of the model. Usually $h$ is computed by minimizing a cross-validation error. Finally, the weights with which the points will contribute to the solution are computed as the square root of the kernel value of the smoothed distance, i.e.,

$$w_j = \sqrt{K\left(\frac{d\left(\bar{X}_j, \bar{X}_Q\right)}{h}\right)}. \tag{1}$$

With these weights we create a square diagonal matrix $W$ where $W_{i,i} = w_i$ and $W_{i,j} = 0$ for $i \neq j$. We use the weight matrix $W$ to pre-multiply the $X$ and $Y$ matrices containing the coordinates of our data points, i.e., $Z = WX$ and $V = WY$. Now, for every query defined in the X space by $\bar{X}_Q$, the corresponding output $\hat{y}_Q$ can be obtained as $\hat{y}_Q = \bar{X}_Q \left(Z^T Z\right)^{-1} Z^T V$. This solution assumes that the matrix $Z^T Z$ is not singular. When this assumption does not hold, well-known computational techniques, such as ridge regression, can be used. It should be noted that the target $Y_j$ for each stored data point $\bar{X}_j$ may need to be derived through some local models. The targets are known in the case illustrated in Fig. 1, and therefore such local models are simply identity matrices.
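To make the construction concrete, here is a minimal NumPy sketch (ours, not from the paper) of the weighting and weighted least-squares fit just described. The Gaussian kernel, the affine local model, and the toy data are illustrative assumptions; the paper leaves K(·) and h as design choices, and we use a least-squares solver rather than the explicit inverse so that a near-singular Z^T Z is handled gracefully (the role the text assigns to ridge regression).

```python
import numpy as np

def gaussian_kernel(u):
    # Illustrative kernel choice; the paper leaves K(.) as a design decision.
    return np.exp(-u ** 2)

def locally_weighted_prediction(X, Y, x_q, h):
    """Predict y at query x_q from stored points (X, Y) using Eq. (1) weights.

    X   : (m, n) matrix of stored coordinate vectors
    Y   : (m,) vector of stored targets
    x_q : (n,) query coordinate vector
    h   : smoothing (bandwidth) parameter
    """
    d = np.linalg.norm(X - x_q, axis=1)           # d(X_j, X_Q) for every point
    w = np.sqrt(gaussian_kernel(d / h))           # w_j = sqrt(K(d/h)), Eq. (1)
    Xa = np.hstack([X, np.ones((len(X), 1))])     # affine term for a local linear model
    Z = w[:, None] * Xa                           # Z = W X, with W = diag(w)
    V = w * Y                                     # V = W Y
    beta, *_ = np.linalg.lstsq(Z, V, rcond=None)  # solves Z^T Z beta = Z^T V robustly
    return float(np.append(x_q, 1.0) @ beta)      # y_hat = X_Q (Z^T Z)^{-1} Z^T V

# Toy usage: noisy sine, local linear fit at a single query point.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(locally_weighted_prediction(X, Y, np.array([2.0]), h=0.5))
```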


The construction of locally weighted regression models requires many critical design choices, such as the definition of a distance function, feature scaling, the smoothing parameter, and the kernel function (Atkeson et al., 1997). The distance function d(·, ·) could be global, associated with each query, or associated with each data point. Similarly, the smoothing or bandwidth parameter $h$ could be a fixed bandwidth, obtained from a global minimization of the cross-validation error, derived from k-nearest neighbors, or associated with the coordinate-vector $\bar{X}_Q$ of each query in the state space X. The choice of the kernel function K(·) will also impact the performance. Designing the distance function assumes that it is reasonable to compute a distance in the state space X. However, when the state variables are heterogeneous, such a distance is not always meaningful. In our approach, instead of computing an overall distance between the query and a data point and evaluating its kernel value, we consider the distance along each of the dimensions and evaluate a kernel value in each dimension. Without any need to scale the variables, we compute the similarity for each projection (characterized by a truncated generalized bell function), and the overall similarity of a data point with the query is computed by the intersection of all of these similarities. This will be described further in Section 4.2 and Figs. 6 and 7.

3.3. Evolutionary search to support model lifecycle

In Bonissone (2003, 2004a, b), we stressed how in real-world applications, before deploying a model in a production environment we had to address the model's entire lifecycle, from design and implementation to validation, tuning, production testing, use, monitoring, and maintenance. The objective of these efforts is to keep the model vital (e.g., non-obsolete) and adaptable. There are two main reasons to focus on maintenance. First, over the life of a model, maintenance costs are the most expensive. Second, when dealing with mission-critical software we need to guarantee continuous operation, or at least rapid recovery from failures, to avoid lost revenue and other business costs. Given the role played by model maintenance, it cannot afford to be an afterthought; it should be an integral part of the design process. For efficiency, we also want to minimize the amount of human intervention required. In Bonissone et al. (2002), Bonissone (2004b), and Patterson et al. (2005), we described the development and deployment of a robust method for automating the tuning and maintenance of fuzzy decision-making systems. A fuzzy rule-based classifier (FRC) was developed for insurance underwriting (UW), a complex decision-making task that is traditionally performed by trained individuals. An underwriter must evaluate each insurance application in terms of its potential risk for generating a claim, such as mortality in the case of term life insurance. An application is compared against standards derived from actuarial principles and adopted by the insurance company. Based on this comparison, the application is classified into one of the risk categories available for the type of insurance requested. The estimated risk, in conjunction with other factors such as gender, age, and policy face value, determines the appropriate price (premium) for the insurance policy.
We abstracted the UW process as a discrete classifier that maps an input vector $\bar{X}$ (containing discrete, continuous, and attribute variables that represent the applicant's relevant medical and demographic information) into a discrete decision space $\bar{Y}$ (containing an ordered list of rate classes). A configurable multi-stage mutation-based evolutionary algorithm tunes the decision thresholds and internal parameters of the fuzzy rule-based system that places the insurance applications into risk categories. The tunable parameters have a critical impact on the coverage and accuracy of the decision-making, and a reliable method to optimally tune these parameters is critical to the accuracy and maintainability of these systems over time. This application exemplifies a classification problem in an almost static environment, since UW policies do not vary quickly. In the example described in Section 6, we will address a prediction and classification problem in a dynamic environment.

4. Fuzzy instance based models (F-IBMs)

4.1. Contrasting (fuzzy) instance based models with (fuzzy) functional approximators

Many soft computing techniques can produce universal functional approximations for continuous and measurable functions on compact domains, as their proofs are based on the Stone–Weierstrass theorem (Rudin, 1976). Some of the SC techniques that have been proven to be functional approximators are neural networks (Barron, 1994), radial basis functions (Park and Sandberg, 1991), fuzzy systems (Wang and Mendel, 1992), and hybrid neural fuzzy systems (Buckley and Hayashi, 1994). In this section we will describe ANFIS, a widely used hybrid neural fuzzy system

Fig. 2. ANFIS: a structured network. Legend of layers: L0 (inputs): state variables. L1 (values): state-variable term sets, computing the membership values of the inputs. L2 (rules): fuzzy rules, using the product to compute the rule matching factor $w_i$. L3 (normalization): each $w_i$ is normalized (to add up to one), generating $\alpha_i$. L4 (function): the linear regressions $f_i$ are evaluated, generating rule outputs $y_i$ that are weighted by the normalized rule matching factors $\alpha_i$. L5 (output): sum of the weighted outputs, completing the weighted average of the outputs $y_i$.

(Jang, 1993) that can be used to generate functional approximations. We will analyze its mathematical representation and contrast it with that of the proposed fuzzy instance based models.

4.1.1. Fuzzy system (FS)

The use of linguistic variables to define complex dynamic systems was first proposed by Zadeh in a pioneering paper (Zadeh, 1973) and extended by Mamdani to synthesize the first fuzzy controller (Mamdani and Assilian, 1975). As noted in Section 2.1.3, in this controller the domain knowledge is represented by a set of fuzzy if-then rules that approximate a mapping from a state space $\bar{X}$ to an output space $Y$. Each rule's left-hand side describes a fuzzy partition of the state space, while each rule's right-hand side defines the value of the corresponding control action. In a Mamdani-type fuzzy system this value is a fuzzy set defined on the universe of control actions. In a TSK-type fuzzy system (Takagi and Sugeno, 1985) this value is represented by a linear polynomial in the state space.

4.1.2. Adaptive neural fuzzy inference system (ANFIS)

In ANFIS, the rule set determines the topology of the net (model structure), while dedicated nodes in the corresponding layers define the term sets and the polynomial coefficients (model parameters). ANFIS does not modify the net structure and instead tries to fine-tune the model parameters. ANFIS consists of a six-layer generalized network, as shown in Fig. 2. The first and sixth layers correspond to the system's inputs and outputs. The second layer defines the fuzzy partitions (term sets) on the input space, while the third layer performs a differentiable T-norm operation, such as the product or the soft minimum. The fourth layer normalizes the evaluation of the left-hand side of each rule, so that their degrees of applicability sum to one. The fifth layer computes the polynomial coefficients in the right-hand side of each Takagi–Sugeno rule. Jang's approach is based on a two-stroke optimization process. During the forward stroke, the term sets of the second layer are kept constant while the coefficients of the fifth layer are computed using a least mean squares method. The ANFIS output is compared with the training set output to produce an error. In the backward stroke, using a backpropagation-like algorithm, the error gradient drives the modification of the fuzzy partitions of the second layer. This process is continued until convergence is reached. This local search method produces efficient results, at the risk of getting entangled in local minima (similar to other hill-climbing methods). On the other hand, the initial values used by ANFIS are derived from the domain knowledge, and as such they should not be extremely distant from the desired parameter values.
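As an illustration of the forward pass that this six-layer network computes, the sketch below evaluates a small TSK system in the spirit of Fig. 2. It is a hedged example: the two-rule, two-input configuration, the generalized bell parameters, and the polynomial coefficients are invented for illustration, and the two-stroke learning procedure itself is omitted.

```python
import numpy as np

def gbell(x, a, b, c):
    # Generalized bell membership function used for the ANFIS term sets.
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

# Illustrative two-rule TSK system on inputs (x1, x2):
#   R1: IF x1 is LOW  AND x2 is LOW  THEN y1 = p1*x1 + q1*x2 + r1
#   R2: IF x1 is HIGH AND x2 is HIGH THEN y2 = p2*x1 + q2*x2 + r2
TERMS = [  # (a, b, c) per input, one row per rule -- Layer 1 parameters
    [(2.0, 2.0, 0.0), (2.0, 2.0, 0.0)],
    [(2.0, 2.0, 5.0), (2.0, 2.0, 5.0)],
]
COEFFS = np.array([[1.0, 0.5, 0.0],   # (p, q, r) per rule -- Layer 4 parameters
                   [0.2, 1.5, 1.0]])

def tsk_output(x1, x2):
    # Layer 2: rule matching factors w_i via the product T-norm
    w = np.array([gbell(x1, *t1) * gbell(x2, *t2) for t1, t2 in TERMS])
    alpha = w / w.sum()                   # Layer 3: normalization to alpha_i
    y = COEFFS @ np.array([x1, x2, 1.0])  # Layer 4: local linear models y_i
    return float(alpha @ y)               # Layer 5: weighted average of outputs

print(tsk_output(1.0, 4.0))
```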


Fig. 3. Geometric interpretation of functional approximation.

Fig. 4. ANFIS: analytical description.

Fig. 3 provides a geometric interpretation of a functional approximator like ANFIS. After the training and testing stages lead to parameter convergence, ANFIS defines an (n+1)-dimensional hyper-surface in the cross-product $\bar{X} \times Y$ of the state and output spaces, where $|\bar{X}| = n$. For any new point whose coordinates in the state space are denoted by the vector $\bar{X}_Q = (x_{1Q}, \ldots, x_{nQ})$, we can obtain its corresponding output $\hat{y}_Q$ by projecting onto the Y dimension the intersection of the hyper-surface with the cylindrical extension of $\bar{X}_Q$. Fig. 4 provides an analytical description of the same process. Given a set of k local rules (nodes in Fig. 2, Layer 2), each rule $R_j$ is composed of a left-hand side (LHS) representing the condition for its applicability, and a right-hand side (RHS) denoting its associated local model $f_j$. The box labelled Partial matching in Fig. 4 computes the degree of applicability of each rule's LHS, given an input $\bar{X}_Q$. The box labelled Local models in Fig. 4 computes the output of each regression evaluated in the state $\bar{X}_Q$ (nodes in Fig. 2, Layer 4). The box labelled Aggregation in Fig. 4 computes the normalized convex sum of the outputs (similar to the node in Fig. 2, Layer 5).


4.2. (Fuzzy) instance based models

While functional approximators require an off-line supervised learning stage to create the hyper-surface in $\bar{X} \times Y$, instance-based models (IBMs) only require accessing the data in a local neighbourhood of the new point defined by $\bar{X}_Q$. Thus, loosely speaking, we could say that the model is in the data. Instance-based models rely on a collection of previously experienced data stored in their raw representation. Unlike case-based reasoning, these data do not need to be refined, abstracted, and structured as cases. Like CBR, IBM is an analogical approach to reasoning, since it relies upon finding previous instances of similar problems and using them to create an ensemble of local models. Hence the definition of similarity plays a critical role in the performance of IBMs. Typically, similarity will be a dynamic concept and will change over the use of the IBM. Therefore, it is important to apply learning methodologies to define and adapt it. Furthermore, the concept of similarity is not crisply defined, creating the need to allow for some degree of vagueness in its evaluation. Hence, we propose the use of fuzzy IBMs.

4.3. F-IBM design specification

We address the issue of similarity by evolving the design of a similarity function in conjunction with the design of the attribute space in which the similarity is evaluated. Specifically, we use the following four steps:

(1) Retrieval of similar instances from the database (DB);
(2) Evaluation of similarity measures between the probe and the retrieved instances;
(3) Creation of local models using the most similar instances (weighted by their similarity measures);
(4) Aggregation of the outputs of the local models applied to the probe.

A description of the F-IBM model and some preliminary results were presented in Bonissone and Varma (2005). We will summarize the four steps and analyze a complete set of recent results.

4.3.1. Retrieval

We look for all instances in the database that have a behavior similar to the probe. These instances are the potential peers of the probe. The peers and the probe can be seen as points in an n-dimensional feature space. For instance, let us assume that a probe Q is characterized by an n-dimensional vector of feature values $\bar{X}_Q$, and that $O(Q) = \left(D_{1,Q}, D_{2,Q}, \ldots, D_{k(Q),Q}\right)$ represents the history of its operational availability durations:

$$Q = \left[\bar{X}_Q; O(Q)\right] = \left[x_{1,Q}, \ldots, x_{n,Q}; D_{1,Q}, \ldots, D_{k(Q),Q}\right]. \tag{2}$$

Each other unit $u_j$ in the fleet has a similar characterization:

$$u_j = \left[\bar{X}_j; O(u_j)\right] = \left[x_{1,j}, x_{2,j}, \ldots, x_{n,j}; D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}\right]. \tag{3}$$

For each dimension i we define a truncated generalized bell function, $\mathrm{TGBF}_i(x_i; a_i, b_i, c_i)$, centered at the value of the probe $c_i$, which represents the degree of similarity along that dimension. Specifically:

$$\mathrm{TGBF}_i(x_i; a_i, b_i, c_i) = \begin{cases} \left[1 + \left|\dfrac{x_i - c_i}{a_i}\right|^{2b_i}\right]^{-1} & \text{if } \left[1 + \left|\dfrac{x_i - c_i}{a_i}\right|^{2b_i}\right]^{-1} > \varepsilon, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $\varepsilon$ is the truncation parameter, e.g., $\varepsilon = 10^{-5}$ (Fig. 5). Since the parameters $c_i$ in each $\mathrm{TGBF}_i$ are determined by the values of the probe, each $\mathrm{TGBF}_i$ has only two free parameters, $a_i$ and $b_i$, to control its spread and curvature. In a coarse retrieval step, we extract an instance from the DB if all of its features are within the support of the TGBFs. We can now formalize the retrieval step. $P(Q)$, the set of potential peers of Q, is composed of all units within a range from the value of Q: $P(Q) = \left\{u_j,\ j = 1, \ldots, m \mid u_j \in N\left(\bar{X}_Q\right)\right\}$, where $N\left(\bar{X}_Q\right)$ is the neighborhood of Q in the state space $\bar{X}$, defined by the constraint $\left|x_{i,Q} - x_{i,j}\right| < R_i$ for all potential attributes i for which the corresponding weight is non-zero. $R_i$ is half of the support of the $\mathrm{TGBF}_i$, centered on the probe's coordinate $x_{i,Q}$.
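A small sketch of Eq. (4) and the coarse retrieval step may help; the fleet matrix, parameter values, and weight vector below are illustrative assumptions, not values from the case study.

```python
import numpy as np

EPS = 1e-5  # truncation parameter epsilon from Eq. (4)

def tgbf(x, a, b, c):
    # Truncated generalized bell function centered at the probe value c.
    v = 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))
    return np.where(v > EPS, v, 0.0)

def retrieve_peers(fleet_X, probe_x, params, weights):
    """Coarse retrieval: keep units whose every non-zero-weight attribute
    falls inside the truncated support of the corresponding TGBF.

    fleet_X : (m, n) attribute matrix, one row per unit
    probe_x : (n,) probe attribute vector
    params  : list of (a_i, b_i) pairs, one per attribute
    weights : (n,) attribute weights; zero-weight attributes are ignored
    """
    keep = np.ones(len(fleet_X), dtype=bool)
    for i, (a, b) in enumerate(params):
        if weights[i] == 0:
            continue  # attribute deselected, so it imposes no constraint
        keep &= tgbf(fleet_X[:, i], a, b, probe_x[i]) > 0
    return np.where(keep)[0]  # indices of the potential peers P(Q)

# Toy usage: the third unit is far from the probe on attribute 0 and drops out.
fleet = np.array([[1.0, 10.0], [1.2, 30.0], [50.0, 12.0]])
print(retrieve_peers(fleet, np.array([1.1, 11.0]),
                     params=[(0.5, 2.0), (4.0, 2.0)], weights=[1.0, 0.5]))
```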


Fig. 5. Retrieval phase of instance-based models.

4.3.2. Similarity evaluation

Each $\mathrm{TGBF}_i$ is a membership function representing the partial degree of satisfaction of the constraint $A_i(x_i)$. Thus, it represents the closeness of the instance to the probe for that particular attribute. For a given peer $P_j$, we evaluate the function $S_{i,j} = \mathrm{TGBF}_i\left(x_{i,j}; a_i, b_i, x_{i,Q}\right)$ along each potential attribute i. The values $(a_i, b_i)$ are design choices, manually initialized and later refined by the EA. Since we want the most similar instances to be the closest to the probe along all n attributes, we use a similarity measure defined as the intersection of the constraint-satisfaction values. Furthermore, to represent the different relevance that each criterion should have in the evaluation of the similarity, we attach a weight $w_i$ to each attribute $A_i$. Therefore, we extend the notion of a similarity measure between $P_j$ and the probe Q to a weighted minimum operator:

$$S_j = \mathop{\mathrm{Min}}_{i=1}^{n}\left[\mathrm{Max}\left((1 - w_i),\, S_{j,i}\right)\right] = \mathop{\mathrm{Min}}_{i=1}^{n}\left[\mathrm{Max}\left((1 - w_i),\, \mathrm{TGBF}_i\left(x_{i,j}; a_i, b_i, x_{i,Q}\right)\right)\right], \tag{5}$$

where $w_i \in [0, 1]$. The set of values for the weights $\{w_i\}$ and the parameters $\{(a_i, b_i)\}$ are critical design choices that impact the proper selection of peers. In this section we assume a manual setting of these values. In Section 5, we explain their selection using evolutionary search.
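The weighted-minimum similarity of Eq. (5) is compact enough to state directly in code. The sketch below restates the TGBF of Eq. (4) in scalar form; all numeric values are again illustrative.

```python
def tgbf(x, a, b, c, eps=1e-5):
    # Scalar TGBF of Eq. (4).
    v = 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))
    return v if v > eps else 0.0

def similarity(peer_x, probe_x, params, weights):
    """Weighted-minimum similarity of Eq. (5):
    S_j = min_i max(1 - w_i, TGBF_i(x_{i,j}; a_i, b_i, x_{i,Q}))."""
    return min(max(1.0 - w, tgbf(xj, a, b, xq))
               for xj, xq, (a, b), w in zip(peer_x, probe_x, params, weights))

# With w_i = 0, max(1 - w_i, .) = 1 and the attribute never constrains the
# minimum, so attribute weighting subsumes crisp attribute selection.
print(similarity([1.0, 30.0], [1.1, 11.0],
                 params=[(0.5, 2.0), (4.0, 2.0)], weights=[1.0, 0.5]))
```

This subsumption is the property that lets the EA in Section 5 treat attribute selection and attribute weighting with a single encoding.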

4.3.3. Local models

Within the scope of this paper, we will focus on the creation of local predictive models used to forecast each unit's remaining life. First, we use each local model to generate an estimated value of the predicted variable for each peer of the query. We then use an aggregation mechanism based on the similarities of the peers to determine the final output. The generation of local models can vary in complexity, depending on the task difficulty. The differences among the local models are related to the type of experiments performed in our case study. In the first two experiments we used the median operator as the local model, hence we did not need to define any parameter:

$$y_j = \mathrm{Median}\left(D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}\right). \tag{6}$$

In the third experiment we used an exponential forecast, requiring the definition of a forgetting factor $\lambda$:

$$y_j = D_{k(j)+1,j} = \bar{D}_{k(j),j} = \lambda \times D_{k(j),j} + (1 - \lambda) \times \bar{D}_{k(j)-1,j}, \tag{7}$$

where $\bar{D}_{1,j} = D_{1,j}$.

4.3.4. Aggregation

4.3.4.1. Convex sum. We need to combine the individual outputs $y_j$ of the peers $P_j(Q)$ to generate the estimated output $y_Q$ for the probe Q. To this end, we compute the weighted average of the peers' individual outputs, using their normalized similarity to the probe as a weight, namely:

$$y_Q = \mathrm{Median}_Q = \frac{\sum_{j=1}^{m} S_j \times y_j}{\sum_{j=1}^{m} S_j}, \tag{8}$$

where $y_j = \mathrm{Median}\left(D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}\right)$ for experiments I and II, and

$$y_Q = D_{\mathrm{Next},Q} = \frac{\sum_{j=1}^{m} S_j \times y_j}{\sum_{j=1}^{m} S_j}, \tag{9}$$

where $y_j = D_{k(j)+1,j}$ for experiment III. The entire process is summarized in Fig. 6.

Fig. 6. Analytical description of fuzzy IBM with convex sum aggregation.

Fig. 7. Analytical description of fuzzy IBMs with similarity-weighted regressions.
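A sketch of the convex-sum aggregation of Eqs. (8)-(9) follows; the similarity values and peer outputs are placeholders.

```python
import numpy as np

def aggregate(similarities, peer_outputs):
    """Convex-sum aggregation of Eqs. (8)-(9): a similarity-weighted average
    of the peers' local-model outputs (medians in experiments I and II,
    exponential forecasts in experiment III)."""
    s = np.asarray(similarities, dtype=float)
    y = np.asarray(peer_outputs, dtype=float)
    return float(s @ y / s.sum())

# Toy usage with three retrieved peers: the most similar peer dominates.
print(aggregate([0.9, 0.6, 0.2], [120.0, 95.0, 60.0]))
```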


The structure of this aggregation is similar to the Nadaraya–Watson estimator for non-parametric regressions using locally weighted averages, where the weights are the values of a kernel function K:

$$y_Q = \mathrm{Median}_Q = \frac{\sum_{j=1}^{m} K\left(x - x_j\right) \times y_j}{\sum_{j=1}^{m} K\left(x - x_j\right)}, \tag{10}$$

where $K(x - x_j)$ is the evaluation of the kernel function K at the jth point. From this structural analogy, we observe that the similarity metric $S_j$ plays the same role in the aggregation as the kernel function $K(x - x_j)$. We then repeated the third experiment using the exponential average generated by the local models, followed by a kernel-based regression as an aggregation. This last method required the definition of $\lambda$ and the computation of the regression coefficients, as illustrated in Fig. 7. Given the critical design roles of the weights $\{w_i\}$, the parameters $\{(a_i, b_i)\}$, and the exponent $\lambda$, it was necessary to create a methodology that could identify the best values according to our desired objective: maximizing classification precision.

5. Evolution of F-IBM design structure and parameters

After testing several manual peer-based models, we decided to use evolutionary search to develop and maintain the fuzzy instance based classifier. In the wrapper methodology detailed in Bonissone et al. (2002) we defined the use of evolutionary algorithms to tune the parameters of a classifier used to underwrite insurance applications. In this application, we extend evolutionary search beyond parametric tuning to include structural search via attribute selection and weighting (Freitas, 2002). The EA is composed of a population of individuals ("chromosomes"), each of which contains a vector of elements that represent distinct tunable parameters within the F-IBM configuration. Examples of tunable parameters include the range of each parameter used to retrieve neighbor instances and the relative parameter weights used for similarity calculation. The EA used two types of mutation operators (Gaussian and uniform), and no crossover. Its population of 30 individuals was evolved over 200 generations. Each chromosome defines an instance of the attribute space used by the associated classifier by specifying a vector of weights $[w_1, w_2, \ldots, w_n]$. If $w_i \in \{0, 1\}$, we perform attribute selection, i.e., we select a crisp subset from the universe of potential attributes. If $w_i \in [0, 1]$, we perform attribute weighting, i.e., we define a fuzzy subset from the universe of potential attributes:

$$\left[w_1\ w_2\ \ldots\ w_n\right]\ \left[(a_1, b_1), (a_2, b_2), \ldots, (a_n, b_n)\right]\ \left[\lambda\right] \tag{11}$$

where $w_i \in [0, 1]$ for attribute weighting, $w_i \in \{0, 1\}$ for attribute selection, $n$ is the cardinality of the universe of features $U$ ($|U| = n$), $d = \sum_{i}^{n} w_i$ is the (fuzzy) cardinality of the selected features, $(a_i, b_i)$ are the parameters for $\mathrm{GBF}_i$, and $\lambda$ is the parameter for the exponential average. The first part of the chromosome, containing the weights vector $[w_1, w_2, \ldots, w_n]$, defines the attribute space (the F-IBM structure) and the relevance of each attribute in evaluating similarity. The second part, containing the vector of pairs $[(a_1, b_1), \ldots, (a_n, b_n)]$, defines the parameters for retrieval and similarity evaluation. The last part, containing the parameter $\lambda$, defines the forgetting factor for the local models.

The fitness function is computed using a wrapper approach (Freitas, 2002). We instantiate an F-IBM corresponding to each chromosome represented by Eq. (11). Following a leave-one-out approach we use the F-IBM to predict the expected life of the probe unit following the four steps described in the previous subsection. We repeat this process for all units in the fleet and sort them in decreasing order, using their predicted duration $D_{\mathrm{Next},Q}$. We then select the top 20%. The fitness function of the chromosome is the precision of the classifier, TP/(TP + FP), where TP is the count of true positives and FP is the count of false positives. This is illustrated in Fig. 8.
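The following sketch outlines one plausible arrangement of the chromosome of Eq. (11) and the two mutation operators. The attribute count, mutation rates, parameter ranges, and the simple elitist loop are our illustrative assumptions (the paper fixes only the population size of 30, the 200 generations, the Gaussian/uniform mutation pair with no crossover, and the precision fitness), and evaluate_fibm stands in for the leave-one-out F-IBM evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ATTR = 5  # illustrative attribute count; n is problem-dependent

def random_chromosome():
    # Eq. (11): [w_1 .. w_n | (a_1, b_1) .. (a_n, b_n) | lambda]
    return {
        "w": rng.uniform(0, 1, N_ATTR),          # attribute weights (fuzzy selection)
        "ab": rng.uniform(0.5, 5, (N_ATTR, 2)),  # TGBF spread/curvature per attribute
        "lam": rng.uniform(0, 1),                # forgetting factor of Eq. (7)
    }

def mutate(ch, p_uniform=0.1, sigma=0.1):
    # The paper's EA uses Gaussian and uniform mutation, and no crossover.
    out = {k: (np.copy(v) if isinstance(v, np.ndarray) else v) for k, v in ch.items()}
    out["w"] = np.clip(out["w"] + rng.normal(0, sigma, N_ATTR), 0, 1)  # Gaussian
    if rng.random() < p_uniform:                                       # uniform reset
        out["ab"][rng.integers(N_ATTR)] = rng.uniform(0.5, 5, 2)
    out["lam"] = float(np.clip(out["lam"] + rng.normal(0, sigma), 0, 1))
    return out

def precision(tp, fp):
    # Wrapper fitness: TP / (TP + FP) over the top-20% selection.
    return tp / (tp + fp)

def evolve(evaluate_fibm, pop_size=30, generations=200):
    # evaluate_fibm(chromosome) -> fitness; it stands in for instantiating an
    # F-IBM and running the leave-one-out prediction over the whole fleet.
    pop = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(pop, key=evaluate_fibm)              # elitist retention
        pop = [elite] + [mutate(elite) for _ in range(pop_size - 1)]
    return max(pop, key=evaluate_fibm)

# Dummy fitness, for illustration only.
best = evolve(lambda ch: -abs(float(ch["w"].sum()) - N_ATTR / 2),
              pop_size=10, generations=20)
print(best["w"].round(2), best["lam"])
```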

Fig. 8. Fuzzy IBM optimization using evolutionary algorithm. (Flowchart panels: chromosome decoder; XML config file; leave-one-out testing over the maintenance and utilization case base; retrieval, similarity, and local models; ranking of all cases by remaining life; prediction-based selection of the best units against ground truth; fitness = precision of the chromosome; Gaussian and uniform mutation with elitist selection.)

6. Case study: selecting the best units in a fleet

6.1. Problem description

We analyze the task of selecting the most reliable units in a fleet of locomotives and formulate it as a prediction and classification problem. The focus of our work is mission reliability. This can be broadly defined as: given a mission of duration X days, what percentage of units assigned to that mission are able to complete it without a critical failure. This application exemplifies a prediction and classification problem in a dynamic, complex environment. Locomotives, tanks, and aircraft vary considerably across different phases of their lifecycle. Assets that are identical at the time of manufacture 'evolve' into unique systems based on their usage and maintenance history. Utilizing these assets efficiently requires (a) the ability to create a model characterizing their expected performance, and (b) keeping the model updated as the behavior of the underlying assets changes.

6.1.1. Data sources and data segmentation

To train our models and validate our experiments we used maintenance and utilization data collected from four sources: (1) Locomotive Design & Engineering Data from GE Rail, (2) Locomotive Diagnostics Data from the GE Rail EOA™ remote monitoring and diagnostics service, (3) Locomotive Maintenance Data from repair shops, and (4) Locomotive Utilization Data from a selected railroad. We wanted to determine how environmental, operational, and maintenance changes could affect our experiments and whether our learning techniques could adapt to such changes. Furthermore, we needed to assess how incremental data acquisition could improve our learning techniques' performance. Thus, we decided to create three data slices, on May 22, 2002, November 1, 2002, and May 1, 2003. The size of the fleet increased over time (from 262 to 634 to 845 units, respectively), as new units were placed in service and as we collected more utilization and maintenance data on the existing units. As a result, the number in the top 20% increased with the size of the fleet to 52, 127, and 169 units, respectively. These three data slices were used to determine the target units for our selection. For each data slice we constructed three sets of targets (i.e., ground truth) for the experiments: (1) the best 20% past performers, (2) the best 52 past performers, and (3) the best 20% future performers.

Fig. 9. Three time slices for experiments (data slices on 22 May 2002, 1 November 2002, and 1 May 2003; fleet sizes of 262, 634, and 845 units, with 20% thresholds of 52, 127, and 169 units).

We identified the first two target sets by sorting the units in decreasing order using the median time-between-failure of past operational durations, until we identified the best 20% or the best 52 units. We identified the last target set by sorting the units in decreasing order using the duration of the first period of operation after the data slice (Fig. 9).

6.1.2. Performance metrics and baselines

The primary metric was the ability of a classifier to select the best N units for any given time slice, i.e., its precision. We investigated two approaches to define the top 'N' units:

Fixed percentage approach: the classifier selects the top 20% of the units. The actual number increased with the fleet size as we progressed from slice 1 to slice 3.

Fixed number approach: the classifier selects a constant number of units. We used 20% of the first slice, i.e., 52 units. As the size of the fleet increased, this selection task became more difficult, as 52 units represented 8% and 6% of the fleet size in slices 2 and 3, respectively.

We used two baselines to measure the increase in capability provided by the peer-based algorithms:

Random: the first baseline measured the expected performance if selection of the best 'N' units was done randomly. This obviously represented a worst-case scenario.

Heuristics: the second baseline calculated the best performance achieved by using a single (or multiple) heuristic to rank the fleet and pick the top N. In this approach, the best performance achieved from any one of all the heuristics was used as the baseline.

6.2. Definition of experiments I, II, and III

First experiment (top 20% past performers): we predict the best 20% of the fleet, based on their past performance, which is represented by their median time between failures. In this case a random selection would yield 20%.

Second experiment (top 52 units past performers): since the size of the fleet at each start-up time was different, we repeated the same experiments keeping the number of units constant (52 units) over the three start-up times instead of keeping a constant top 20%. Thus the random selection baseline changed from [20–20–20%] to [20–8–6%], i.e., 52/262 = 20%; 52/634 = 8%; 52/845 = 6%. Also, given the superior performance of the evolved peers over the


manually constructed peers, we decided to use the optimized peer design instead of the manual design in all subsequent experiments.

Third experiment (top 20% future performers): in the previous two experiments the targets for selection were the best-known units in the fleet based on past performance. Now we perform prediction and target the selection of the units with the best next-pulse duration. In this case a random selection would yield 20%.

6.3. Local models for first F-IBM version (addressing experiments I and II)

In the first two experiments we wanted to test the entire architecture and opted for a relatively simple problem, i.e., estimating the best current performers based on their median time between failures. Thus the first local model used with these experiments was the median of the previous availability intervals. Let us assume that for a given probe Q we have retrieved m peers, $P_j(Q)$, $j = 1, \ldots, m$. Each peer $P_j(Q)$ has a similarity measure $S_j$ with the probe. Furthermore, each peer $P_j$ has a track record of operational availability between failures $O(P_j) = \left(D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}\right)$, i.e., each peer $P_j(Q)$ will have $k(j)$ availability pulses in its history. In this case, for each peer $P_j$, the goal was to use its own median availability duration and combine these values over all peers $P_j$, $j = 1, \ldots, m$, to estimate the median availability duration for the probe Q.

6.3.1. Experiments I and II results: using updated models for each time slice

For each time slice we used an EA to generate an optimized weighted subset of attributes, search parameters, and forgetting factor (for experiment III) to define the peers of each unit. We used the evolved peer approach to run all three experiments.

First experiment (top 20% past performers):
Time slice 1 (fleet size = 262 units; top 20% = 52 units): evolved peers outperformed both the manually-designed peers and the heuristics/fleet-based approach: 48% vs. 41% vs. 32%.
Time slice 2 (fleet size = 634 units; top 20% = 127 units): evolved peers outperformed both the manually-designed peers and the heuristics/fleet-based approach: 56% vs. 55% vs. 46%.
Time slice 3 (fleet size = 845 units; top 20% = 169 units): evolved peers outperformed both the manually-designed peers and the heuristics/fleet-based approach: 60% vs. 54% vs. 50%.

Second experiment (top 52 units past performers):
Time slice 1 (fleet size = 262 units; top 52 units = 20%): evolved peers outperformed the heuristic/fleet-based approach: 48.1% vs. 32% (vs. 20% random selection).
Time slice 2 (fleet size = 634 units; top 52 units = 8%): evolved peers outperformed the heuristic/fleet-based approach: 55.8% vs. 37% (vs. 8% random selection).
Time slice 3 (fleet size = 845 units; top 52 units = 6%): evolved peers outperformed the heuristic/fleet-based approach: 63.5% vs. 37% (vs. 6% random selection) (Figs. 10 and 11).

6.4. Local models for second F-IBM version (addressing experiment III)

Once we were satisfied with the performance of the evolved peers, we attacked the real problem of predicting the best future performers. We also noted that the knowledge of the current best performers (in terms of knowing their median time between failures) was not at all useful for future prediction (as shown in Fig. 12). In this case, for each peer $P_j$, the goal was to determine the duration of the next availability pulse $D_{k(j)+1,j}$. Then we want to combine the predictions of all the peers, $D_{k(j)+1,j}$ ($j = 1, \ldots, m$), to estimate the availability duration for the probe Q. Since there were not enough historical data to reliably generate local regressions, we experimented with simpler models such as averages and medians. We determined that the most reliable way of generating

the next availability duration $D_{k(j)+1,j}$ from the operational availability vector $O(P_j) = \left(D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}\right)$ was to use an exponential average that gives more relevance to the most recent information, namely:

$$D_{k(j)+1,j} = \bar{D}_{k(j),j} = \lambda \times D_{k(j),j} + (1 - \lambda) \times \bar{D}_{k(j)-1,j} = (1 - \lambda)^{k(j)-1} D_{1,j} + \sum_{i=2}^{k(j)} (1 - \lambda)^{k(j)-i} \times \lambda \times D_{i,j}. \tag{12}$$
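Eq. (12) is just the recursive exponential average of Eq. (7) unrolled into closed form; the short sketch below checks the equivalence on a made-up pulse history (the durations and the value of λ are illustrative).

```python
def exp_forecast(durations, lam):
    """Recursive form of Eq. (12): Dbar_k = lam*D_k + (1 - lam)*Dbar_{k-1},
    seeded with Dbar_1 = D_1; recent pulses dominate as lam approaches 1."""
    dbar = durations[0]
    for d in durations[1:]:
        dbar = lam * d + (1 - lam) * dbar
    return dbar

def exp_forecast_closed(durations, lam):
    # Closed form of Eq. (12), obtained by unrolling the recursion.
    k = len(durations)
    out = (1 - lam) ** (k - 1) * durations[0]
    out += sum((1 - lam) ** (k - i) * lam * d
               for i, d in enumerate(durations[1:], start=2))
    return out

# Made-up pulse history: both forms agree.
pulses = [30.0, 45.0, 25.0, 60.0]
assert abs(exp_forecast(pulses, 0.4) - exp_forecast_closed(pulses, 0.4)) < 1e-9
print(exp_forecast(pulses, 0.4))
```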

Fig. 10. Experiment I results.

Fig. 11. Experiment II results.

Fig. 12. Experiment III results.

Fig. 13. Experiment III results: comparing static and dynamic models (the cost of not maintaining the models).

Again, critical to the performance of this model was the choice of the value of $\lambda$, and its best value was determined by evolutionary search.

6.4.1. Experiment III results: using updated models for each time slice

Third experiment (top 20% future performers):
Time slice 1 (fleet size = 262 units; top 20% = 52 units): evolved peers outperformed both the heuristic/fleet approach and the unit's own history-based approach: 43% vs. 32% vs. 27%.
Time slice 2 (fleet size = 634 units; top 20% = 127 units): evolved peers outperformed both the heuristic/fleet approach and the unit's own history-based approach: 42.2% vs. 42% vs. 34%.
Time slice 3 (fleet size = 845 units; top 20% = 169 units): evolved peers outperformed both the heuristic/fleet approach and the unit's own history-based approach: 55% vs. 36% vs. 34%.

6.4.2. Analysis of experiment III results

This was the most difficult experiment, as indicated by the fact that the median of the past performers was not a useful indicator for a single unit. From Fig. 12, we can see that it performed only as well as random selection. The use of an exponential average time series based on each unit's historical data, though better than random, performed roughly on par with the heuristic searches based on information content. The peer-based approach, designed by evolutionary algorithms, had a significantly better performance (Fig. 13).

7. Dynamic and static models: the benefit of automated updating

7.1. Repeating the experiments using static models

First experiment (top 20% past performers): the models developed in time slice 1 exhibited poor precision when applied to the remaining two time slices: 48.1% → 48.41% → 47.92%. As a comparison, the dynamic models had 48.1% → 55.60% → 60.40% precision.

Second experiment (top 52 units past performers): again, the time slice 1 models applied to the remaining two slices were inferior to the dynamic models: 48.1% → 50.00% → 46.15%, compared with the refreshed, dynamic models: 48.1% → 55.80% → 63.50%.

Third experiment (top 20% future performers): this was the most difficult experiment, and the original models showed significant deterioration over time: 42.85% → 25.86% → 24.81%. In contrast, the dynamic models exhibited more robust precision: 42.85% → 42.24% → 54.88%.


8. Conclusions The peers designed by the evolutionary algorithms provided the best accuracy overall: −60.3% = over 3x random, 1.2x heuristics on selection for top 20% (based on past performance). −63.5% = over 10x random, 1.7x heuristics on selection for top 52 units (based on past performance). −55.0% = over 2.5x random, 1.5x heuristics on selection for top 20% (based on future performance). Utilizing dynamic criteria for peer recognition (by evolving attribute weights) demonstrated the ability to adapt to changing operational and maintenance environments. The evolution of search parameters {(ai , bi )} provided an additional performance improvement, reducing the number of null queries while improving the overall precision of the classifier. The dynamic models performed much better than the static models, which showed minor performance deterioration in experiment 2 and extreme drops in performance in experiment 3. The updating of the models was achieved with minimal cost, as we automated the required search capabilities. These experiments show the applicability of evolutionary algorithms used in a wrapper approach to select the best attributes for representing peers and to define similarity measures for identifying the most similar peers for a given unit. By evolving the models over different time slices, we have shown our ability to dynamically adapt the neighborhood of peers using operational and maintenance data. References Atkeson, C.G., 1992. Memory-based approaches to approximating continuous functions. In: Casdagli, M., Eubank, S. (Eds.), Nonlinear Modeling and Forecasting. Addison-Wesley, Harlow, UK, pp. 503–521. Atkeson, C.G., Moore, A., Schaal, S., 1997. Locally weighted learning. Art. Intell. Rev. 11 (1–5), 11–73. Babuska, R., Jager, R., Verbruggen, H.B., 1994. Interpolation issues in Sugeno–Takagi reasoning. In: Proceedings of the Third IEEE International Conference on Fuzzy Systems, Orlando, FL, pp. 859–863. Barron, A.R., 1994. Universal approximation bound for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39, 115–133. Bersini, H., Bontempi, G., Decaestecker, C., 1995. Comparing RBF and fuzzy inference systems on theoretical and practical basis. In: Proceedings of the International Conference on Artificial Neural Networks, Paris, France, vol. 1, pp. 69–74. Bersini, H., Bontempi, G., Birattari, M., 1998. Is readability compatible with accuracy? From neuro-fuzzy to lazy learning. In: Freska, C. (Ed.), Proceedings in Artificial Intelligence, vol. 7. Infix/Aka, Berlin, Germany, pp. 10–25. Bonissone, P., 1997. Soft computing: the convergence of emerging reasoning technologies: soft computing—a fusion of foundations. Methodologies Appl. 1 (1), 6–18. Bonissone, P., Chen, Y-T., Goebel, K., Khedkar, P., 1999. Hybrid soft computing systems: industrial and commercial applications. Proc. IEEE 87 (9), 1641–1667. Bonissone, P., Subbu, R., Aggour, K., 2002. Evolutionary optimization of fuzzy decision systems for automated insurance underwriting. In: Proceedings of the IEEE International Conference on Fuzzy Systems.Honolulu, Hawaii, pp. 1003–1008. Bonissone, P., 2003. The life cycle of a fuzzy knowledge-based classifier. In: Proceedings of the 2003 North American Fuzzy Information Processing Society, Chicago, IL, pp. 488–494. Bonissone, P., Subbu, R., Eklund, N., Kiehl, T., 2006. Evolutionary algorithms + domain knowledge = real-world evolutionary computation. IEEE Trans. Evolut. Comput., to appear Bonissone, P., 2004a. 
References

Atkeson, C.G., 1992. Memory-based approaches to approximating continuous functions. In: Casdagli, M., Eubank, S. (Eds.), Nonlinear Modeling and Forecasting. Addison-Wesley, Harlow, UK, pp. 503–521.
Atkeson, C.G., Moore, A., Schaal, S., 1997. Locally weighted learning. Art. Intell. Rev. 11 (1–5), 11–73.
Babuska, R., Jager, R., Verbruggen, H.B., 1994. Interpolation issues in Sugeno–Takagi reasoning. In: Proceedings of the Third IEEE International Conference on Fuzzy Systems, Orlando, FL, pp. 859–863.
Barron, A.R., 1994. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39, 115–133.
Bersini, H., Bontempi, G., Decaestecker, C., 1995. Comparing RBF and fuzzy inference systems on theoretical and practical basis. In: Proceedings of the International Conference on Artificial Neural Networks, Paris, France, vol. 1, pp. 69–74.
Bersini, H., Bontempi, G., Birattari, M., 1998. Is readability compatible with accuracy? From neuro-fuzzy to lazy learning. In: Freska, C. (Ed.), Proceedings in Artificial Intelligence, vol. 7. Infix/Aka, Berlin, Germany, pp. 10–25.
Bonissone, P., 1997. Soft computing: the convergence of emerging reasoning technologies. Soft Computing – A Fusion of Foundations, Methodologies and Applications 1 (1), 6–18.
Bonissone, P., Chen, Y.-T., Goebel, K., Khedkar, P., 1999. Hybrid soft computing systems: industrial and commercial applications. Proc. IEEE 87 (9), 1641–1667.
Bonissone, P., Subbu, R., Aggour, K., 2002. Evolutionary optimization of fuzzy decision systems for automated insurance underwriting. In: Proceedings of the IEEE International Conference on Fuzzy Systems, Honolulu, HI, pp. 1003–1008.
Bonissone, P., 2003. The life cycle of a fuzzy knowledge-based classifier. In: Proceedings of the 2003 North American Fuzzy Information Processing Society, Chicago, IL, pp. 488–494.
Bonissone, P., 2004a. Automating the quality assurance of an on-line knowledge-based classifier by fusing multiple off-line classifiers. In: Proceedings of IPMU, Perugia, Italy.
Bonissone, P., 2004b. Development and maintenance of fuzzy models in financial applications. In: López-Díaz, M., Gil, M., Grzegorzewski, P., Hryniewicz, O., Lawry, J. (Eds.), Soft Methodology and Random Information Systems. Springer, Berlin, Germany, pp. 50–66.
Bonissone, P., Varma, A., 2005. Predicting the best units within a fleet: prognostic capabilities enabled by peer learning, fuzzy similarity, and evolutionary design process. In: Proceedings of FUZZ-IEEE, Reno, NV.
Bonissone, P., Subbu, R., Eklund, N., Kiehl, T., 2006. Evolutionary algorithms + domain knowledge = real-world evolutionary computation. IEEE Trans. Evolut. Comput., to appear.
Buckley, J.J., Hayashi, Y., 1994. Can fuzzy neural nets approximate continuous fuzzy functions? Fuzzy Sets and Systems 61, 41–51.
Corral, N., Gil, M.A., López, M.T., Salas, A., Bertoluzza, C., 1998. Statistical models. In: Ruspini, E., Bonissone, P., Pedrycz, W. (Eds.), Handbook of Fuzzy Computation, C2.3. IOP Publishing, pp. 1–18.
Freitas, A., 2002. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer, Berlin, Germany.
Fritsche, L., Schlaefer, A., Budde, K., Schröter, K., Neumayer, H.H., 2002. Recognition of critical situations from time series of laboratory results by case-based reasoning. J. Am. Med. Informat. Assoc. 9 (5), 520–528.
Jaczynski, M., 1997. A framework for the management of past experiences with time-extended situations. In: Proceedings of the Sixth International Conference on Information and Knowledge Management. ACM Press, New York, pp. 32–39.
Jang, J.-S.R., 1993. ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man, Cybern. 23 (3), 665–685.
Klir, G., Yuan, B., 1995. Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall PTR, Upper Saddle River, NJ.
Leake, D.B., Wilson, D.C., 1998. Categorizing case-base maintenance: dimensions and directions. In: Proceedings of the Fourth European Workshop on Case-Based Reasoning. Springer, pp. 196–207.
Leake, D.B., Wilson, D.C., 1999. When experience is wrong: examining CBR for changing tasks and environments. In: Proceedings of the Third International Conference on Case-Based Reasoning. Springer, Berlin, pp. 218–232.
Larrañaga, P., Lozano, J., 2002. Synergies between evolutionary computation and probabilistic graphical models. Int. J. Approx. Reas. 31 (3) (special issue).
Lejeune, M., 1985. Estimation non-paramétrique par noyaux: régression polynomiale mobile. Revue de Statistique Appliquée 23 (3), 43–67.
Mamdani, E.H., Assilian, S., 1975. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Man Mach. Stud. 7 (1), 1–13.
Park, J., Sandberg, I.W., 1991. Universal approximation using radial basis function networks. Neural Comput. 3 (2), 246–257.
Patterson, A., Bonissone, P., Pavese, M., 2005. Six sigma quality applied throughout the lifecycle of an automated decision system. Int. J. Qual. Reliab., 275–292.
Ralescu, D., 1996. Statistical decision-making without numbers. In: Proceedings of the 27th Iranian Mathematical Conference, Shiraz, pp. 403–417.
Ram, A., Santamaria, J.C., 1997. Continuous case-based reasoning. Art. Intell. 90, 25–77.
Rudin, W., 1976. Principles of Mathematical Analysis. McGraw-Hill, New York, NY.
Schlaefer, A., Schröter, K., Fritsche, L., 2001. A case-based approach for the classification of medical time series. In: Proceedings of the Second International Symposium on Medical Data Analysis, pp. 258–263.
Smyth, B., 1998. Case-base maintenance. In: Proceedings of the Eleventh International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, vol. 2, pp. 507–516.
Takagi, T., Sugeno, M., 1985. Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. Syst. Man, Cybern. 15, 116–132.
Vonk, E., Jain, L.C., Johnson, R.P., 1997. Automatic Generation of Neural Network Architecture Using Evolutionary Computation. World Scientific, Singapore.
Wang, L., Mendel, J.M., 1992. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Networks 3 (5), 807–814.
Weaver, W., 1948. Science and complexity. Amer. Scientist 36 (4), 536–544.
Yao, X., 1999. Evolving artificial neural networks. Proc. IEEE 87 (9), 1423–1447.
Zadeh, L.A., 1973. Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. Syst. Man, Cybern. 3, 28–44.
Zadeh, L.A., 1994. Fuzzy logic and soft computing: issues, contentions and perspectives. In: Proceedings of the Third International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan, pp. 1–2.
Zhang, Z., Yang, Q., 1998. Towards lifetime maintenance of case base indexes for continual case based reasoning. In: Proceedings of the Eighth International Conference on AI Methodologies, Systems and Applications, pp. 489–500.