- Email: [email protected]

Handbook of Statistics, Vol. 24 ISSN: 0169-7161 © 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24008-1

Pattern Recognition

David J. Hand

1. Background

The aim of statistical pattern recognition is to assign objects to one of a set of classes. In supervised pattern recognition, or supervised classification, we are given a sample of objects, for each of which we have observed a vector of descriptive measurements and for each of which we know the true class; the aim is to use this sample to construct a rule which will allow us to assign new objects, for which only the vector of measurements is known, to an appropriate class. In unsupervised pattern recognition, we are given an unclassified sample of objects and the aim is to divide them into (usually mutually exclusive) classes. This latter exercise also goes under names such as cluster analysis, segmentation analysis, or partitioning. This article focuses on supervised methods.

The problem of supervised pattern recognition is ubiquitous and has consequently been explored in a variety of data analytic disciplines as well as a range of application areas. Researchers in statistics, machine learning, computational learning theory, and data mining, as well as those working primarily in statistical pattern recognition, have all made very substantial contributions to the area. Although tackling a common problem, researchers from these distinct areas have brought their own particular interests and emphases to bear, so that there are differences in flavour between the methods they have developed. This will be illustrated below. The differences have also led to intellectual tensions, which have subsequently resulted in increased understanding and deeper theory. This theory has its counterparts in other areas of statistics and data analysis, so that statistical pattern recognition might be described as encapsulating, in microcosm, the main problems of wider statistical modelling and data analysis.

The areas of application of statistical pattern recognition are too many to list.
They include medical diagnosis and prognosis, genomics and bioinformatics, speech recognition, character recognition, biometrics (face, fingerprint, iris, and retinal recognition), credit scoring, stock market analysis, vehicle classification, target recognition, fault and anomaly detection, and fraud detection, among a huge variety of others. Some of the applications are exotic, while others have already surreptitiously entered our lives (a consequence of computers being increasingly embedded in our everyday environment). It will be clear from this illustrative list that


the vectors of descriptive measurements will depend very much on the application domain. In medical diagnosis these measurements will include symptoms (which may be continuous, binary, categorical, or even merely ordinal or nominal variables), while in speech recognition they will be descriptions of the waveform of the sound, and in credit scoring they will be characteristics of the person seeking a loan. The diversity of literatures which have explored the problem means that different names have been used for these descriptive measurements, including variables, features, characteristics, and covariates.

2. Basics

We begin by describing how we could tackle supervised classification if we had complete knowledge of the various distributions involved, rather than merely a sample from them. Later sections are concerned with how these basic ideas must be modified to allow for the fact that we have only a sample of data. Suppose we have N classes, ω_i, i = 1, …, N, and suppose that their relative sizes are P(ω_i), i = 1, …, N. These relative sizes are often called priors in the statistical pattern recognition literature, reflecting the fact that, before obtaining any other knowledge about an object (such as a vector of descriptive characteristics), they give the probabilities that the object comes from each class. On this basis, in the absence of further information, we would minimise the proportion of new objects misclassified if we assigned an object to class i whenever

$$P(\omega_i) > P(\omega_j) \quad \text{for all } j \neq i. \tag{1}$$

Ties can be broken arbitrarily. When we do have further information on an object, given by the vector of descriptive variables x, we can do better if we condition on x and use the rule: assign an object with measurement vector x to class i if

$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \text{for all } j \neq i. \tag{2}$$

As will be seen, many methods are based on direct estimation of the probabilities P(ω_i | x). This leads to the diagnostic paradigm, so called because of its focus on the (conditional) distinction between the classes. Contrasted with this is the sampling paradigm, which estimates the posterior probabilities in (2) indirectly, from estimates of the class conditional distributions, via Bayes theorem:

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}. \tag{3}$$

Note that this approach requires separate estimates of the class priors to be provided. Since the rule given in (2) minimises the proportion of objects with measurement vector x which are misclassified, for every x, it also (by integrating over the distribution of x) minimises the overall misclassification rate (also called the error rate). The simple rule in (2) (or its equivalent form via (3)) is all very well, but we may not want simply to minimise the proportion of new objects which are misclassified. Often,


certain types of misclassification are regarded as more severe than others. For example, misclassifying someone with an easily curable but otherwise fatal disease as healthy will generally be far more serious than the reverse. To cater for this, we need to take the relative costs of the different kinds of misclassification into account. So, let c_{ji} be the cost of assigning an object to class ω_i when it comes from class ω_j. Then the overall expected cost (also called the risk) is

$$r = \sum_{i=1}^{N} \int_{\Omega_i} \sum_{j=1}^{N} c_{ji}\, P(\omega_j \mid x)\, P(x)\, dx \tag{4}$$

where the classification rule assigns points in region Ω_i to class ω_i. The risk (4) is minimised by choosing the Ω_i such that x ∈ Ω_i if

$$\sum_{j=1}^{N} c_{ji}\, P(\omega_j \mid x) \;\leq\; \sum_{j=1}^{N} c_{jk}\, P(\omega_j \mid x), \quad k = 1, \ldots, N. \tag{5}$$

Note that, if c_{rs} = 0 for correct classifications (that is, when r = s) and c_{rs} = 1 for incorrect classifications (when r ≠ s), this reduces to the rule given in (2). The two class case is especially important, partly because it is the most common special case and partly because multiple class problems can often be reduced to a series of two class problems. In this case, the rule implied by (5) becomes: assign an object to class 1 if

$$\frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > \frac{c_{22} - c_{21}}{c_{11} - c_{12}} \tag{6}$$

(assuming correct classifications cost less than incorrect classifications), and this further specialises, when c_{11} = c_{22} = 0, to: assign an object to class 1 if

$$\frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > \frac{c_{21}}{c_{12}}. \tag{7}$$

Note that if the sampling approach is adopted, (7) can be expressed as

$$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \frac{c_{21}\, P(\omega_2)}{c_{12}\, P(\omega_1)}. \tag{8}$$

A key feature of (8) is that the prior probability for class i and the costs of misclassifying points from class i appear only when multiplied together. Thus changing costs has the same effect as changing class sizes. Risk, and its specialisation to misclassification rate, is not the only criterion which has been used. Others include the Neyman–Pearson criterion, which minimises one type of misclassification subject to a fixed level of the other, and the minimax criterion, which seeks to minimise the largest type of misclassification. Yet other measures appear in the context of assessing the performance of classification rules. At this point it is convenient to introduce some further terminology. The decision surface is the surface separating regions of the x space in which new objects are assigned to different classes. For example, in the two class case derived from rule (2) this is the set of x for which P (ω1 | x) = P (ω2 | x). In the two class case, a discriminant


function is any function which is monotonically related to the ratio P (ω1 | x)/P (ω2 | x) – such a function allows one to discriminate between the classes. To use a discriminant function to make a classification, we have to compare it with a classification threshold. For example, in rule (7), the classification threshold is the ratio c21 /c12 .
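To make the minimum-risk rule (5) and its two-class threshold form (7) concrete, here is a minimal sketch; the function name, the posterior values, and the cost matrix are invented for illustration, not taken from the chapter.

```python
# Illustrative sketch of the minimum-risk rule (5): assign to the class
# whose expected cost, under the posterior probabilities, is smallest.

def assign_min_risk(posterior, costs):
    """posterior: dict class -> P(class | x); costs: dict (true, assigned) -> cost.
    Returns the class label minimising expected cost, as in rule (5)."""
    classes = list(posterior)
    expected = {
        i: sum(costs[(j, i)] * posterior[j] for j in classes)
        for i in classes
    }
    return min(expected, key=expected.get)

# Two-class example with c11 = c22 = 0: rule (7) says assign to class 1
# whenever P(w1|x)/P(w2|x) > c21/c12.
posterior = {1: 0.30, 2: 0.70}
costs = {(1, 1): 0.0, (2, 2): 0.0,
         (1, 2): 10.0,   # missing a class 1 object is expensive (c12)
         (2, 1): 1.0}    # the reverse error is cheap (c21)
print(assign_min_risk(posterior, costs))  # → 1
```

With these costs, missing a class 1 object is ten times as expensive as the reverse error, so the rule assigns class 1 even though P(ω1 | x) is only 0.3: the ratio 0.3/0.7 exceeds c21/c12 = 0.1, exactly as (7) requires.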

3. Practical classification rules

Up to this point we have assumed that all the required probability distributions were known. Unfortunately, in practice we have to estimate these, and any other necessary functions, from a sample of data. These data will consist of the measurement vectors and known true classes for a sample of objects. The usual considerations apply about the representativeness of this sample: if it is a distorted sample, then any classification rule built using it has the potential to be misleading. Because the classification rule will be built from this sample, it is called the design set (in statistics) or the training set (in machine learning). And it is because the true classes are known, as if they were provided by a 'supervisor', that the methods are known as 'supervised' pattern recognition. The performance of classification rules is often evaluated on a separate independent data set, called a test set. Sometimes the choice between competing classifiers is made on the basis of yet a third data set, called a validation set.

In the vast majority of supervised pattern recognition problems, the support regions of the class distributions overlap. This means that no decision surface can give an error rate of zero. The minimal possible error rate is called the Bayes error rate. In a few rare problems the classes are perfectly separable: at each x, points from only one class arise, so that the Bayes error is zero. Since such problems are rare, generally occurring in artificial situations where the classes are defined symbolically by humans, we will not discuss them. This has little practical consequence, though such problems have led to interesting theoretical developments in computational learning theory. Our aim is to use the design data to build a model which can then be used to predict the likely class value from a new x vector.
At first glance, a sensible aim seems to be to construct our model (whether it is based on the P (ωi | x), the P (x | ωi ), or a discriminant function), so that it maximises classification accuracy on the design set: to set out deliberately to perform poorly on the design set would be perverse. However, our aim is not simply to classify the design set points (we already know their classes). Rather, our aim is to generalise to new points from the same distribution. Since the design data are merely a sample from this distribution, a model which has too much fidelity in fitting them will be unlikely to generalise to new points very well. A classic illustration of this is a perfect interpolation between points which have arisen from a true regression model of the form y = α + βx + ε. The perfect interpolation will be a wobbly line which may depart substantially from the true straight line. This means that our model should not be too flexible: too highly parameterised a model will risk fitting aspects of the data which may change substantially between different random design sets (called overfitting). Rather, we want to capture that aspect which is essentially common to different design sets. Put another way, the variance of a flexible, highly parameterised model is


large: its prediction at any particular value of x may change dramatically from design set to design set. In contrast to this, and working in the opposite direction, our model should not be too simple. Once again, a classic illustration is fitting a quadratic regression 'truth' with a straight line. Although the fitted straight line is likely to vary little between different design sets, it is possible that it will be substantially biased for some values of the predictor variable (i.e., the prediction from the linear model will differ substantially from the quadratic truth). These two aspects, the high variance arising from an over-complex model and the high bias arising from an over-simple model, work in opposite directions, and the aim of modelling (in general, not merely in pattern recognition) is to attain some acceptable or optimal compromise. The overfitting arising from too complex a model appears in various disguises, and will be discussed further below. Whether or not a model is 'too complex' will also depend on the size of the data set. A model with 100 free parameters will tend to overfit a sample of just 200 points, but will not overfit a sample of 100 000 points.

3.1. Linear discriminant analysis

The earliest method to have been formally developed appears to be Fisher's (1936) linear discriminant analysis (LDA). For the two class case, he considered linear combinations w'x of the components of x. He showed that the linear combination for which the difference between the class means, relative to the pooled within class variance, is largest is given by

$$w \propto S^{-1} (\bar{x}_1 - \bar{x}_2). \tag{9}$$

Here S is the within group pooled covariance matrix of the x, and x̄_i is the centroid of the ith group. A classification of a new point x is made by comparing w'x (with some constant of proportionality) with a classification threshold. This classification procedure is equivalent to using a hyperplanar decision surface in the original x space. Note that this is very much a discriminant function approach, and that no distributional assumptions have been made. However, since the formulation is entirely expressed in terms of first and second order statistics, it will probably come as no surprise to discover that this is also the optimal solution if one assumes that the two classes have multivariate normal distributions (which are defined in terms of their first and second order moments) with common covariance matrices. The equality of the covariance matrices is important, because it means that the quadratic terms $x' \Sigma^{-1} x$, which arise from the exponents $-(x - \mu_i)' \Sigma_i^{-1} (x - \mu_i)/2$ of the normal densities of the two classes, cancel out, leaving only the linear terms. From this perspective, the method clearly belongs to the sampling paradigm. This sampling paradigm perspective leads to an immediate generalisation to multiple classes, with each class being estimated by a multivariate normal distribution, $P(x \mid \omega_i) = \mathrm{MVN}(\mu_i, \Sigma)$, and classifications being made on the basis of which of the estimated $P(x \mid \omega_i) P(\omega_i)$ is the largest. One often reads that Fisher's LDA method requires the assumption of multivariate normality. As can be seen from the above, this is incorrect. Certainly the method is optimal for such distributions, but


it is more generally optimal under any ellipsoidal distributions (with equal covariance matrices). More generally still, however, the method has been proven by many decades of experience to perform well even with data which depart very substantially from this. Because it was one of the earliest methods to be developed, a solid software base has built up around the method, with a wide variety of diagnostic measures and supplementary devices, such as variable selection, readily available in standard software packages. The method continues to be widely used, especially in areas such as psychology. However, in such applications interest often lies more widely in interpreting the weights (‘which variables are most influential in discriminating between the groups?’) rather than merely in classification problems. Although we have described the linear discriminant analysis method as linear in the variables x = (x1 , . . . , xp ) which were originally measured, it is easy enough to extend it to include transformed xj (e.g., squared values, exp(xj ), etc.) and combinations of the xj (e.g., products). The resulting decision surface will then be linear in the new variables, and hence nonlinear in the original (x1 , . . . , xp ). This means a very flexible classifier can result, but two issues must be borne in mind. Firstly, of course, there is the danger of the method being too flexible and overfitting the design data. Secondly, there is the problem of deciding which of the transformed variables to include, since the range of possibilities is unlimited. With too many, overfitting will certainly be a problem. One extension of linear discriminant analysis is particularly important because it is a natural extension of the basic linear form. This is quadratic discriminant analysis (QDA). 
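Before turning to QDA, here is a hedged, dependency-free sketch of Fisher's rule (9) for two variables; the small data set and the helper names are invented for illustration.

```python
# A minimal two-variable sketch of Fisher's linear discriminant (9).

def mean(xs):
    n = len(xs)
    return [sum(v[i] for v in xs) / n for i in range(len(xs[0]))]

def pooled_cov(a, b):
    # within-group pooled covariance matrix S (2x2 case)
    ma, mb = mean(a), mean(b)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for xs, m in ((a, ma), (b, mb)):
        for v in xs:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    n = len(a) + len(b) - 2
    return [[s[i][j] / n for j in range(2)] for i in range(2)]

def fisher_w(a, b):
    # w proportional to S^{-1} (xbar_1 - xbar_2), equation (9)
    s = pooled_cov(a, b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    d = [mean(a)[0] - mean(b)[0], mean(a)[1] - mean(b)[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

class1 = [[2.0, 2.1], [2.5, 2.4], [3.0, 3.1]]
class2 = [[0.0, 0.2], [0.5, 0.3], [1.0, 1.1]]
w = fisher_w(class1, class2)

# Classify by comparing w'x with the threshold w'(xbar_1 + xbar_2)/2.
mid = [(mean(class1)[i] + mean(class2)[i]) / 2 for i in range(2)]
t = w[0] * mid[0] + w[1] * mid[1]
score = lambda x: w[0] * x[0] + w[1] * x[1]
print(score([2.6, 2.6]) > t)  # the point lies on the class 1 side
```

The midpoint threshold used here is one simple choice of the "classification threshold" mentioned in the text; in practice it would be adjusted for priors and costs as in (7).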
If, in the multivariate normal derivation in the two class case, the covariance matrices are not assumed to be equal, then the quadratic terms do not cancel out and a quadratic decision surface results. Clearly a quadratic decision surface is more flexible than a linear one, and so is less susceptible to bias. However, it also adds many more parameters and so risks overfitting. In fact, early empirical studies (before the overfitting issues were properly understood) on data sets of at most a few hundred points showed that LDA generally outperformed QDA, for this very reason. The development of theoretical understanding of overfitting, and of the bias/variance trade-off, opened up other ways of tackling the problem. In particular, one strategy is to develop a complex model and shrink it towards a simpler model. We will see illustrations of this below, but in the context of LDA, Friedman (1989) proposed shrinking a quadratic model towards a linear model. This is equivalent to shrinking the separate covariance matrix estimates towards a common form. Of course, if something is known about the likely pattern of covariances between the measurements then this can be used to good effect in constraining the number of parameters in $\Sigma$. For example, if the measurements are repeated measurements of the same variable, taken at different times (e.g., at 1 hour, 2 hours, 3 hours, etc., after ingestion of a medicine) then one might expect measurements which are closer in time to be more highly correlated.

3.2. Logistic discrimination

In some application domains (for example, medicine and credit scoring) linear discriminant analysis has been overtaken in popularity by logistic discrimination. This also yields a decision surface which is linear in the components of x, but is based on the diagnostic


paradigm. In particular, in the two class case, it estimates the probability P(ω_1 | x) from a linear combination w'x using a logistic transformation:

$$P(\omega_1 \mid x) = \frac{\exp(w'x)}{1 + \exp(w'x)}. \tag{10}$$

Inverting this transformation yields

$$\ln \frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} = w'x. \tag{11}$$
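As a small illustration of model (10), the following sketch fits an intercept and one slope by plain gradient ascent on the log-likelihood. The chapter's iteratively weighted least squares is the standard fitting method; gradient ascent is used here only to keep the sketch short, and the data are invented.

```python
import math

# Toy logistic discrimination: one predictor plus an intercept,
# fitted by gradient ascent on the log-likelihood (a simplified
# stand-in for iteratively weighted least squares).

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]   # 1 = class omega_1 (invented, overlapping data)

def p1(b0, b1, x):
    z = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-z))   # equation (10)

b0 = b1 = 0.0
for _ in range(5000):
    # gradient of the Bernoulli log-likelihood in (b0, b1)
    g0 = sum(y - p1(b0, b1, x) for x, y in zip(xs, ys))
    g1 = sum((y - p1(b0, b1, x)) * x for x, y in zip(xs, ys))
    b0 += 0.1 * g0
    b1 += 0.1 * g1

print(p1(b0, b1, 3.0) > 0.5, p1(b0, b1, 0.5) < 0.5)
```

The fitted log-odds b0 + b1 x is exactly the linear discriminant function of (11), and the point where it crosses zero is the classification threshold.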

This method originates in work of Cox (1966) and Day and Kerridge (1967), but was developed in detail by Anderson (reviewed in Anderson, 1982). It was subsequently extended to handle more than two classes, for example in the form

$$\ln \frac{P(\omega_i \mid x)}{P(\omega_N \mid x)} = w_i' x, \quad i = 1, \ldots, N - 1, \tag{12}$$

and is now conveniently described as a type of generalised linear model (McCullagh and Nelder, 1989). The parameters are generally estimated by maximum likelihood via an iteratively weighted least squares approach. Because of the log transformation in (11), it turns out that the model in (10) is optimal if the distributions of the two classes are multivariate normal with equal covariance matrices (the equality of these matrices again leads to the terms quadratic in x cancelling). This leads to an obvious question: since both LDA and logistic discrimination are optimal under the assumption of multivariate normal distributions with equal covariance matrices, in what ways do they differ? The answer is that, since LDA is based on the sampling paradigm, if the data really do arise from ellipsoidal distributions with equal covariance matrices, then LDA will use more information and will lead to more accurate estimates. On the other hand, the logistic approach is optimal under a wider variety of situations, so that if the data do not arise from ellipsoidal distributions it may yield less biased estimates.

3.3. The naive Bayes model

The logistic approach is closely related to a popular and very simple sampling paradigm model (called, variously, the naive Bayes model, the independence Bayes model, or idiot's Bayes) which makes the assumption that the variables are conditionally independent within each class:

$$P(x \mid \omega_i) = \prod_{j=1}^{p} P(x_j \mid \omega_i), \quad i = 1, \ldots, N. \tag{13}$$
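A hedged sketch of the independence model (13) for categorical variables follows; the tiny data set is invented, and smoothing of zero counts is omitted for brevity.

```python
from collections import defaultdict

# Naive Bayes by counting: estimate the prior P(class) and each
# marginal P(x_j | class), then combine them as in (13) and (3).

def fit_naive_bayes(rows, labels):
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    marg = defaultdict(lambda: defaultdict(float))  # (class, j) -> value -> prob
    for c in classes:
        members = [r for r, l in zip(rows, labels) if l == c]
        for j in range(len(rows[0])):
            for r in members:
                marg[(c, j)][r[j]] += 1.0 / len(members)
    return prior, marg

def classify(x, prior, marg):
    # score(c) = P(c) * prod_j P(x_j | c); the largest score wins
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= marg[(c, j)].get(v, 0.0)
        return s
    return max(prior, key=score)

rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = [1, 1, 1, 2]
prior, marg = fit_naive_bayes(rows, labels)
print(classify(("high", "yes"), prior, marg))  # → 1
```

Only univariate marginal counts are needed, which is exactly the simplicity of construction the text emphasises.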

Clearly this assumption is generally unrealistic, and yet the model has often been found to perform surprisingly well in real applications. Several explanations have been put forward for this, including the fact that, in many studies, the variables have been preselected to remove those highly correlated with variables which have already been selected, that there are fewer parameters to be estimated (the overfitting, bias/variance point mentioned above), and that bias in the probability estimates may not matter in a


classification rule, where all that counts is whether a predicted probability is above the classification threshold, not by how much. Such matters are discussed in Hand and Yu (2001). By virtue of the independence assumption, construction of such classification rules is very simple, being restricted to the construction of univariate marginal distributions. The relationship between the logistic method and the naive Bayes model can be seen in the two class case by considering the ratio P(ω_1 | x)/P(ω_2 | x). In general, making no simplifying assumptions at all, we have

$$\frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} = \frac{P(x \mid \omega_1)\, P(\omega_1)}{P(x \mid \omega_2)\, P(\omega_2)}. \tag{14}$$

The naive Bayes model then makes the stringent independence assumption $P(x \mid \omega_i) = \prod_j P(x_j \mid \omega_i)$, yielding

$$\frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} = \frac{\prod_j P(x_j \mid \omega_1)\, P(\omega_1)}{\prod_j P(x_j \mid \omega_2)\, P(\omega_2)}. \tag{15}$$

However, a weaker assumption would be that

$$P(x \mid \omega_i) = H(x) \prod_{j=1}^{p} Q(x_j \mid \omega_i), \quad i = 1, 2, \tag{16}$$

where H is an arbitrary function common to the two classes. The H function subsumes the dependencies between the components of x, and its commonality means that these dependencies, while not having to be zero, are the same in the two classes. The common H assumption in (16) means that

$$\frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} = \frac{\prod_j Q(x_j \mid \omega_1)\, P(\omega_1)}{\prod_j Q(x_j \mid \omega_2)\, P(\omega_2)}, \tag{17}$$

the same structure as in (15), though the Q functions are not (necessarily) the marginal distributions P(x_j | ω_i), so that this model is more flexible. By taking logs of (17) we see that it reduces to a function linear in the ln[Q(x_j | ω_1)/Q(x_j | ω_2)]. That is, the model in (17) is equivalent to the logistic model based on simple univariate transformations of the x_j. Since the Q are not the marginal distributions, estimation is more complicated than the iteratively weighted least squares of logistic discrimination.

3.4. The perceptron

Linear discriminant analysis has its origins in the statistical research community. Another classifier linear in the x_j has its origins in the machine learning and early pattern recognition communities: the perceptron. The LDA method described above seeks weights w which maximise a measure of separability between the projections of the data points on w, and finds an explicit solution. In contrast, the perceptron is based on an iterative updating (error-correcting) procedure, one version of which cycles through the design set points sequentially, adjusting the estimate of w whenever a point is misclassified. An alternative procedure classifies all the design set points, updates w using all those which have been misclassified in one pass, and then repeats


this. The details are as follows. Our aim is to find a vector w which assigns as many as possible of the design set elements to the correct class. For any vector of measurements x define y = (x', 1)', and define v = (w', −t)'. Now the classification rule 'assign x to class ω_1 if w'x > t and to class ω_2 otherwise' becomes the rule 'assign x to class ω_1 if v'y > 0 and to class ω_2 otherwise', and we seek a v which maximises the number of design set points correctly classified using this rule. The exercise is simplified yet further if, for the design set, we define z = y for design set objects in class ω_1 and z = −y for design set objects in class ω_2. With this, we need to find a vector v for which v'z > 0 for all design set objects, if we can, since this would mean that v correctly classifies all design set points. With this as the aim, the perceptron algorithm proceeds as follows. If v_t is the current estimate of v, classify all the design set points using v_t. Let M be the set of those which are misclassified. Then update v_t to

$$v_{t+1} = v_t + \rho \sum_{z_i \in M} z_i \tag{18}$$

where ρ is the iterative step size. In fact, a little algebra shows that this is the iterative step in a gradient minimisation procedure using the perceptron criterion function

$$C_P(v) = \sum_{z_k \in M} \left( -v' z_k \right). \tag{19}$$
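The batch update (18) can be sketched directly; the toy design set below is invented and chosen to be linearly separable.

```python
# Perceptron with the batch update (18): the z vectors absorb the class
# labels, so we simply seek a v with v'z > 0 for every design point.

def train_perceptron(zs, rho=1.0, max_iter=100):
    v = [0.0] * len(zs[0])
    for _ in range(max_iter):
        # M: the currently misclassified design points (v'z <= 0)
        mis = [z for z in zs if sum(vi * zi for vi, zi in zip(v, z)) <= 0]
        if not mis:
            return v          # every design point correctly classified
        for z in mis:         # v_{t+1} = v_t + rho * sum of M
            for i in range(len(v)):
                v[i] += rho * z[i]
    return v

# y = (x', 1)' augments each point; z = y for class 1, z = -y for class 2.
class1 = [[2.0, 2.0], [3.0, 1.0]]
class2 = [[0.0, 0.0], [-1.0, 1.0]]
zs = ([x + [1.0] for x in class1]
      + [[-a for a in x + [1.0]] for x in class2])
v = train_perceptron(zs)
print(all(sum(vi * zi for vi, zi in zip(v, z)) > 0 for z in zs))  # → True
```

Because this design set is linearly separable, the loop terminates with every v'z strictly positive, as the perceptron convergence theorem mentioned in the text guarantees.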

If there does exist a hyperplanar surface which can separate the two design set classes, then this algorithm is guaranteed to find it. There are various extensions to this algorithm to guarantee convergence when there is no true linear separating surface, such as letting ρ decrease at each step, or letting ρ depend on the size of the errors v'z. Another variant, important in the linearly separable case, maximises the distance between the decision surface and the closest design set point. This is important because the idea has been taken up in support vector machines, about which more later. A point about perceptron methods worth noting is that they do not require global models to be assumed. In particular, they make no assumption about the forms of the distributions far from the decision surface. If one has made such an assumption (e.g., multivariate normality) and this assumption is incorrect, then it will distort the fit of the decision surface in the region where it matters.

3.5. Tree classifiers

Linear decision surfaces have the attraction of simplicity, at least if one regards a weighted sum as simple. This is one reason that they are very widely used in commercial applications. Another type of classifier, the tree or recursive partitioning classifier, is also simple, though in a very different way. Given a design set, we might observe that we can classify many of the points correctly merely by assigning all those with x_1 value greater than some threshold t_1 to class 1, and all the others to class 2. We might further observe that, if we then classified all those with x_1 ≤ t_1 which also had x_2 > t_2 to class 1 and all those with x_1 ≤ t_1 which also had x_2 ≤ t_2 to class 2, then even better design set classification accuracy resulted. We could repeat this splitting exercise as long as we liked.
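One step of the splitting search just described might be sketched as follows; Gini impurity is used here as the purity measure (the chapter does not commit to any particular one), and the four-point data set is invented.

```python
# Exhaustive search for the single best split: try every threshold on
# every variable and keep the one that most reduces Gini impurity.
# A real tree grower repeats this on each resulting subset, then prunes.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(points, labels):
    base = gini(labels)
    best = None   # (gain, variable index, threshold)
    for j in range(len(points[0])):
        for t in sorted({p[j] for p in points}):
            left = [l for p, l in zip(points, labels) if p[j] <= t]
            right = [l for p, l in zip(points, labels) if p[j] > t]
            if not left or not right:
                continue
            n = len(labels)
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = base - child
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

points = [[1.0, 5.0], [2.0, 4.0], [7.0, 5.0], [8.0, 4.0]]
labels = [1, 1, 2, 2]
gain, var, thr = best_split(points, labels)
print(var, thr)   # splits on variable 0, which separates the classes perfectly
```

Each recursive application of `best_split` to a subset adds one internal node to the tree; classification then amounts to the sequence of threshold comparisons described in the text.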


The basic idea here is to take a data set and find a single split, on a single variable, which increases overall purity when the data are divided into subsets, where a set is purer the more it is dominated by objects from a single class. Each such subset is then further split, again leading to increased purity of its component subsets. Splitting the design set in this way leads to a partition of the x space. This exercise has an obvious graphical representation as a tree graph, with the entire x space as the root node, edges leading to nodes representing the cells into which it is split, edges leading from these to nodes representing the cells into which they are in turn split, and so on. In practical implementations, each split is chosen by an extensive search, over all possible splits on all of the variables, to find the one which leads to the maximum improvement in overall purity. Various decisions have to be made in building such a tree. Should each split be binary? Binary trees are the most common, though some have argued for the merits of shallower trees in which each split is into multiple cells. For how long should the process continue? If it is continued for too long, there is a real danger of overfitting. The most popular modern method for tackling this is to grow a large tree (many splits) and prune it back, measuring classification accuracy using cross-validation or some similar approach. How should one measure purity? That is, how should one decide how effective a split is? Various impurity measures have been proposed, all tapping into the decrease in heterogeneity of classes within each node which results when a split is made. New points are classified by dropping them through the tree, deciding which branch to follow simply by making the appropriate comparisons with the thresholds at each node.
When a leaf node is reached, the new object is classified as belonging to the majority class in that node (appropriately modified by relative costs, if necessary). The simplicity of the sequential process is especially attractive in some disciplines, such as medicine, where it can be written down as a protocol to be followed in making a decision, such as a diagnosis.

3.6. Local nonparametric methods

All of the methods discussed so far have found parametric models for the decision surface (that is, decision surfaces which are described in terms of relatively few parameters). An alternative strategy is to use notions of nonparametric function estimation. Broadly speaking, there are two approaches (though, of course, as with the other methods described in this article, they have been extended in many directions). The first, based on the sampling paradigm, uses kernel density estimation to provide estimates of the class conditional distributions P(x | ω_i), which are then inverted via Bayes theorem to give classifications. The second is the k-nearest neighbour method, based on the diagnostic paradigm. At its simplest, this finds the k design set points nearest to the point to be classified, in terms of distance in the x space, and assigns the new point to the majority class amongst these k. Of the two, the nearest neighbour approach seems to be the more widely used.

Nearest neighbour methods are attractive for a variety of reasons. They produce a highly flexible decision surface, with the degree of flexibility being controlled by the


choice of k. They can very easily be updated, simply by adding new cases to the database (and deleting old ones, if that is felt necessary) – no parameters need to be recalculated. In general, they do not require extensive preprocessing, since no parameters need to be estimated (and are therefore a type of ‘lazy learning’, in machine learning terminology). Of course, there are downsides. One is that they involve an extensive search through the design set to locate the k nearest neighbours. If classification speed is important, this can be an issue. This problem has been tackled by accelerated search procedures (such as branch and bound) and also by reduced, condensed, and edited nearest neighbour procedures, which preprocess the design set to eliminate points which do not influence the classification. For example, a single isolated point from class ω1 in a region dense with points from class ω2 will be irrelevant if k is not too small. And in regions which are very densely populated with points from a single class the classification results would be the same even if some of these points were removed. Of course, in order to use nearest neighbour methods one has to choose a value for k and a metric through which to define ‘near’. There is early theoretical work relating k to things such as design set size and number of variables, but a more modern approach is to use cross-validation. For the two class case, an optimal metric would be distance orthogonal to the contours of constant probability of P (ω1 | x). Since, of course, we do not know these contours (indeed, it is these probabilities which we are trying to estimate) iterative procedures have been proposed. Since, moreover, we will not know the contours exactly, one should relax the notion of strict distance from the contours. Thus, for example, suppose that the direction orthogonal to the contours at x is estimated to be w. Then one can define distance between points in the vicinity of x as

d(x, y) = [(x − y)^T (I + λww^T)(x − y)]^{1/2},    (20)

where λ is a parameter which can be chosen by fit to the data (e.g., by cross-validation). The larger λ is, the more strictly the metric measures deviation from the estimated contour. More sophisticated variants let the orientation of these contours depend on the position in the x space.

3.7. Neural networks

Neural networks were originally proposed as models of biological nervous systems, which consist of large numbers of relatively simple processing units connected in a dense network of links. However, they have subsequently been intensively developed as tools for pattern recognition and classification. Their earliest manifestation is the perceptron, outlined above. However, the perceptron fits a linear decision surface, and such a decision surface has some strong limitations. In particular, it cannot perfectly solve the 'exclusive-or' problem. For example, if, in a bivariate x space, the points (0, 0) and (1, 1) belong to one class and the points (0, 1) and (1, 0) to another, then no linear decision surface can perfectly classify them. This problem can only be solved by introducing nonlinear aspects. Neural networks, or multilayer perceptrons, solve this problem by combining several simple weighted sums of the components of x (i.e., several simple perceptrons), but nonlinearly transforming them before combining them.
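The exclusive-or example is easy to verify numerically. In the sketch below (my own illustration; the weights are hand-chosen for clarity, not fitted values), two nonlinear threshold hidden units, combined by a third, classify all four points correctly, which no single linear surface can do:

```python
# Two nonlinear (threshold) hidden units solving exclusive-or.
# All weights are hand-chosen illustrative values, not fitted ones.

def step(z):
    """Heaviside threshold activation."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h2 = step(x1 + x2 - 1.5)    # fires only when both inputs are 1
    return step(h1 - h2 - 0.5)  # fires when exactly one input is 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", xor_net(*x))  # -> 0, 1, 1, 0
```

The nonlinearity of the step function is doing all the work here: replacing it with the identity collapses the network to a single linear surface, which fails on these four points.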


The final model has the form

f(x) = Σ_{k=1}^{m} v_k φ_k(w_k^T x).    (21)
Here w_k is the weight vector for the kth linear combination of the raw variables, φ_k is a nonlinear transformation, and v = (v_1, ..., v_m) is a weight vector used to produce the final result. The model in (21) has a graphical representation in which the x_j correspond to input nodes, linked by edges weighted by the w_{jk} to hidden nodes corresponding to the φ_k(w_k^T x), which are in turn linked by edges weighted by the v_k to the output node f. In the case of multiple classes, there will be several output functions (and nodes), f_r, defined via different weight vectors. Note that, without the nonlinear transformations φ_k, (21) would reduce to a linear combination of linear combinations, and this is simply a linear combination. That is, the nonlinear transformations are essential: without them, we are back to a simple linear separating surface. Various nonlinear transformations are in common use, one of the most popular being the logistic transformation. This means that logistic discrimination, outlined above, can be regarded as a particularly simple neural network with no hidden node (since it has only one φ_k(w_k^T x), which then equals f).

The basic model in (21) has been extended in a wide variety of ways, most importantly by including further hidden layers. Thus the f_r may themselves not be directly observed, but (nonlinear transformations of) linear combinations of them may contribute to a weighted sum which is the final output. The result of all this is a highly flexible (and highly parameterised) model, which has great power to fit arbitrarily complex decision surfaces. Of course, as has been noted above, such power carries its own danger, and various smoothing methods have been developed to overcome the risk of overfitting. A further complication arises in fitting such a model, again because of the large number of parameters.
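Evaluating model (21) takes only a few lines of code. The sketch below is my own illustration (logistic φ_k, arbitrary rather than fitted weights) of the forward pass for a network with m = 3 hidden nodes and p = 2 inputs:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_output(x, W, v):
    """Evaluate f(x) = sum_k v_k * phi_k(w_k' x), as in (21), with each
    phi_k taken to be the logistic transformation.
    W is an (m, p) array whose rows are the weight vectors w_k;
    v is the (m,) array of output weights v_k."""
    hidden = logistic(W @ x)  # the m hidden-node activations phi_k(w_k' x)
    return float(v @ hidden)

# Arbitrary illustrative weights: m = 3 hidden nodes, p = 2 inputs.
W = np.array([[1.0, -1.0],
              [0.5, 2.0],
              [-1.5, 0.3]])
v = np.array([0.8, -0.4, 1.1])
print(mlp_output(np.array([0.2, 0.7]), W, v))
```

A quick sanity check on any such implementation: at x = 0 every hidden activation is logistic(0) = 0.5, so the output must equal half the sum of the v_k.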
In fact, the idea of generalising perceptrons in this way arose in the 1960s, but was not pursued because of the computational difficulties involved in the estimation. It was not until a couple of decades had elapsed, and the power of computers had progressed substantially, that effective fitting algorithms arose. The first of these was a steepest descent algorithm, called back propagation, which sought to minimise the discrepancies between the target class labels and the predicted class labels. Since then, however, a wide variety of other algorithms have been developed.

3.8. Support vector machines

If we have three distinct points, belonging to two classes, lying in a bivariate space, then (apart from cases where the three points lie in a univariate subspace) we can always guarantee being able to separate the classes by a linear surface. This idea generalises: if we have 100 points from two classes lying in a 99-dimensional space (again ignoring pathological special cases) we can guarantee being able to find a hyperplane which will separate the classes. The basic principle behind support vector machines is to take the original design set, defined in terms of (say) a p-dimensional measurement vector x, and increase the dimensionality dramatically by adding combinations and transformations


of the original variables, so that perfect separation by a hyperplane is guaranteed in this new high-dimensional space. Once one is in the situation of guaranteed separability by a linear function, one can apply the error correcting ideas of perceptrons. We see from this that the perceptron was the intellectual progenitor of both neural networks and support vector machines, although they are quite different kinds of offspring. In mathematical terms, the function which arises from this process can be expressed as

f(x) = Σ_{k=1}^{K} v_k φ_k(x),    (22)

where K is the (typically very large) number of derived variables φ_k(x) obtained by transforming x, and the v_k are the weights in the perfectly separating hyperplane. Of course, we immediately see a problem: if K is large (even infinite), then actually estimating the v_k may be difficult. This is overcome by an elegant mathematical trick, and it was this which opened up such models to practical use. It is possible to show that (22) can be re-written in the equivalent, 'dual' formulation

f(x) = Σ_{i=1}^{N} α_i t_i φ(x_i)^T φ(x) = Σ_{i=1}^{N} α_i t_i K(x_i, x),    (23)

where the t_i are the true classes, the α_i are weights to be estimated,

φ(x) = (φ_1(x), ..., φ_K(x))^T,    (24)

K(x_i, x) = φ(x_i)^T φ(x) is the kernel function (a term inherited from integral operator theory), and N is the design set size (with x_i being the ith design set point). The key fact here is that we do not need to make the separate transformations φ_k(x). All we need is the kernel function in order to be able to evaluate (23), and hence (22). The choice of kernel function is equivalent to a choice of features, and we can derive the prediction f(x) without explicitly evaluating the φ_k(x). Models of this kind have also been substantially extended, for example to allow for the case where the classes cannot be perfectly separated, even in the transformed space. Support vector machines are a relatively recent innovation, and although they have shown promise on a number of practical problems, more experience needs to be accumulated.

3.9. Other approaches

Many other types of classifiers have been developed, typically variants of those described above, and many of these are described in the general references listed below. However, a particular advance which is widely applicable should be mentioned: the process of combining classifiers into multiple classifier systems or ensemble classifiers. This can be achieved in various ways (and, indeed, with various theoretical objectives).
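As an aside on the kernel trick of Section 3.8: the identity K(x_i, x) = φ(x_i)^T φ(x) can be checked numerically. A minimal sketch, assuming (my choice, purely for illustration) the degree-2 polynomial kernel K(x, y) = (x^T y)^2 in two dimensions, whose explicit feature map is φ(x) = (x_1^2, √2 x_1 x_2, x_2^2):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-d:
    (x' y)^2 = phi(x)' phi(y)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def kernel(x, y):
    return float(x @ y) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
explicit = float(phi(x) @ phi(y))  # transform first, as in (22)
implicit = kernel(x, y)            # kernel evaluation only, as in (23)
print(abs(explicit - implicit))    # agreement to rounding error
```

The same pattern scales to kernels whose implicit feature spaces are enormous or even infinite-dimensional, which is precisely why the dual formulation makes such models practical.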


As has been seen, flexible classifiers are susceptible to overfitting problems. This can be tackled by various methods, including adding complexity measures to the criterion which is optimised to fit such models (so that they are encouraged not to be too flexible), stopping optimisation algorithms early (a rather ad hoc strategy), or overfitting a model and then smoothing it (this is the pruning strategy often adopted with tree classifiers). Another approach is to build many slightly different classification models (for example, on randomly selected subsets of the training data) and to calculate average predictions from these models. This idea of using an ensemble of classifiers, called bagging (short for bootstrap aggregating), is closely related to Bayesian ideas, and can be regarded as an attempt to allow for uncertainty in the parameter estimates. Another important strategy for combining classifiers to yield an overall improved model is boosting. This weights points according to how difficult they are to classify correctly, and has been elegantly described in terms of generalised additive models (Friedman et al., 2000).
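A bagged classifier of the kind just described can be sketched in a few lines. The example below is my own toy illustration (1-nearest-neighbour base classifiers, a hypothetical six-point design set), not a production method:

```python
import numpy as np

def nn_predict(Xtr, ytr, x):
    """1-nearest-neighbour prediction: a deliberately simple base classifier."""
    d = np.sum((Xtr - x) ** 2, axis=1)
    return ytr[np.argmin(d)]

def bagged_predict(X, y, x, n_models=25, seed=0):
    """Bagging (bootstrap aggregating): fit each base classifier on a
    bootstrap resample of the design set, then take a majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
        votes.append(nn_predict(X[idx], y[idx], x))
    return int(np.round(np.mean(votes)))  # majority vote for 0/1 labels

# Hypothetical toy design set: two well-separated classes.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(bagged_predict(X, y, np.array([0.1, 0.1])))  # near class 0
print(bagged_predict(X, y, np.array([1.0, 0.9])))  # near class 1
```

Averaging over resamples is what stabilises the prediction: each bootstrap model sees slightly different data, and the vote smooths out the quirks of any single fit.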

4. Other issues

This article has primarily focused on describing some of the most important kinds of classification rule, but there are other issues which are also central to constructing an effective pattern recognition system. Two important ones are feature selection and performance assessment.

The key to an accurate pattern recognition system is that the predictor variables contain sufficient information to separate the classes well. One might think, therefore, that the more variables one had, the better. However, as we have seen, too many variables can lead to overfitting the design data. One is then often faced with having to select from the proposed variables. Choosing a subset also has the merit that only these need be measured in the future: not important in automated data collection systems, such as microarray data, but vital in social problems, where each variable may correspond to a question in a questionnaire. Feature selection algorithms are generally based on simple measures of separability between the classes, but, of course, feature selection and classifier construction are really parts of an integrated whole.

Of fundamental importance in building and choosing a pattern recognition system is being able to estimate its likely future performance accurately. 'Performance', however, can be measured in various ways. Even for the two class case, there are two different kinds of misclassification and, as noted above, these may not be regarded as equally serious. If they are, then error rate is an appropriate measure. But even this is not the end of the story, since estimating likely future error rate on the basis of the available design sample involves subtle issues. For example, simply calculating the proportion of the design set cases which are misclassified by the classification rule is likely to underestimate the future error rate on new objects. Many methods for estimating error rates have been developed.
A review of recent developments is given in Schiavo and Hand (2000). If the two types of misclassification costs are not equal, but are still known, then cost weighted error rate can be used. However, all too often the costs are not known, and then a more general measure of separability between the distributions of the estimates P̂(ω_1 | x) for the two classes is required. The most popular of these is the 'area under the ROC curve', equivalent to the test statistic used in the Mann–Whitney–Wilcoxon two sample test.

This article has given a lightning review of generic pattern recognition systems, but particular problems will require their own particular refinements. Sometimes one class is much smaller than the other (e.g., in fraud detection in personal banking, less than 0.1% of transactions will typically be suspect). Often there are problems of missing or distorted data (sometimes even the class labels are incorrect). Modern automated data collection has led to some special problems (e.g., microarray data) in which there may be tens of thousands of variables but only a handful of cases, so that the design data lie in a subspace of the x space. Many situations are susceptible to population drift, in which the distributions P(x | ω_i) evolve over time (for example, predicting the classes of people in customer relationship management systems, when changes in the economy induce changes in spending behaviour). Some applications place a great emphasis on ease of interpretability of a classifier (e.g., the legal requirement that a bank must be able to explain the reason for rejecting an applicant for a loan). Some applications require very rapid classifications (e.g., target recognition). And so on. Given this, it is not surprising that specialist areas of pattern recognition have developed certain classes of methods to a great level of sophistication (e.g., speech recognition has developed highly sophisticated hidden Markov models).
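The equivalence between the area under the ROC curve and the Mann–Whitney statistic mentioned above is easy to exploit in code. A small sketch (the scores are toy values of my own invention): the AUC is the proportion of (class-0, class-1) pairs in which the class-1 case receives the higher estimated probability, counting ties as one half.

```python
import numpy as np

def auc_mann_whitney(scores0, scores1):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    proportion of (class-0, class-1) pairs ranked correctly, ties = 1/2."""
    s0 = np.asarray(scores0)[:, None]  # estimated P(w1|x) for class-0 cases
    s1 = np.asarray(scores1)[None, :]  # estimated P(w1|x) for class-1 cases
    return float(np.mean((s1 > s0) + 0.5 * (s1 == s0)))

# Hypothetical estimated probabilities for three cases from each class.
print(auc_mann_whitney([0.1, 0.3, 0.4], [0.35, 0.8, 0.9]))  # 8 of 9 pairs correct
```

Because the statistic depends only on the ranking of the scores, it is unaffected by any monotone transformation of the estimated probabilities, which is one reason for its popularity when misclassification costs are unknown.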

5. Further reading

There are now many books giving overall reviews of statistical pattern recognition. For example, Ripley (1996) presents a theoretically deep overview of the area. Hastie et al. (2001) is an excellent overall review of the ideas and methods. Webb (2002) gives a highly accessible introduction to the subject. McLachlan (1992) gives a detailed discussion from the statistical perspective (though not including neural networks or support vector machines, which indicates the speed with which the discipline has developed). Hand (1997) reviews the different approaches and discusses performance assessment in detail. Fine (1999) gives a thorough mathematical introduction to feed-forward neural networks. Cristianini and Shawe-Taylor (2000) provide a first class introduction to support vector machines.

References

Anderson, J.A. (1982). Logistic discrimination. In: Krishnaiah, P.R., Kanal, L.N. (Eds.), Classification, Pattern Recognition and Reduction of Dimensionality. Handbook of Statistics, vol. 2. North-Holland, Amsterdam, pp. 169–191.
Cox, D.R. (1966). Some procedures associated with the logistic qualitative response curve. In: David, F. (Ed.), Research Papers on Statistics: Festschrift for J. Neyman. Wiley, New York, pp. 55–71.
Cristianini, N., Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge.
Day, N.E., Kerridge, D.F. (1967). A general maximum likelihood discriminant. Biometrics 23, 313–324.


Fine, T.L. (1999). Feedforward Neural Network Methodology. Springer-Verlag, New York.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179–184.
Friedman, J. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc. 84, 165–175.
Friedman, J., Hastie, T., Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion). Ann. Statist. 28, 307–337.
Hand, D.J. (1997). Construction and Assessment of Classification Rules. Wiley, Chichester.
Hand, D.J., Yu, K. (2001). Idiot's Bayes – not so stupid after all? Internat. Statist. Rev. 69, 385–398.
Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.
McCullagh, P., Nelder, J.A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Schiavo, R., Hand, D.J. (2000). Ten more years of error rate research. Internat. Statist. Rev. 68, 295–310.
Webb, A. (2002). Statistical Pattern Recognition, 2nd ed. Wiley, Chichester.