Optical pattern recognition using bayesian classification

Optical pattern recognition using bayesian classification

Pergamon Pattern Recognition, Vol. 27, No. 4, pp. 587 606, 1994 Elsevier Science Ltd Copyright © 1994 Pattern Recognition Society Printed in Great Br...

2MB Sizes 0 Downloads 38 Views


Pattern Recognition, Vol. 27, No. 4, pp. 587 606, 1994 Elsevier Science Ltd Copyright © 1994 Pattern Recognition Society Printed in Great Britain. All rights reserved 0031 3203/94 $7.00+.00

0031-3203(93) E0016-Z

OPTICAL PATTERN RECOGNITION USING BAYESIAN CLASSIFICATION GARY W. CARHART, BRET F. DRAAYERand MICHAEL K. GILESt Department of Electrical and Computer Engineering, Box 30001/Dept. 3-0, New Mexico State University Las Cruces, NM 88003, U.S.A.

(Received for publication 2 November 1993) Abstract--Two recognition systems which classify optical correlation data are described and experimentally compared. Both require a priori estimates of multivariate distributions generated from their training images (Tls). One system uses a quadratic classifier to partition the signal space into regions associated with single TIs and then defines object regions by forming a union of the appropriate TI regions. The other system uses a composite Bayes' classifier to partition the signal space directly into regions associated with single objects. Accordingly, it requires object class distributions which it approximates with composite algebraic functions constructed from the TI distribution estimates. Experimental results demonstrate that the composite Bayes' classifier consistently outperforms the modified quadratic classifier, albeit marginally, when non-TIs are processed. Pattern recognition Bayes' likelihood

Optical correlation

Composite filter

1. INTRODUCTION Human vision, when considered as a pattern recognition system, is remarkably successful. Yet, despite our skill at recognizing patterns that exist in visual data, our understanding of this ability is incomplete and we cannot yet rival the performance of biological systems with autonomous pattern recognition machines. It is likely that the brains of living organisms, due to the massive parallelism afforded by brain cell interconnections, can handle the processing demands of a high dimensional feature space better than our computers. Furthermore, due to self-organization, brains are able to exploit more relevant features than those that recognition system designers use. Nonetheless, even though we cannot yet rival nature in terms of adaptability and overall processing power, we do know-given sufficient a priori information--the optimum solution to the other part of the problem: how to separate features into classes. A particularly useful feature for pattern recognition is mathematical correlation. Whether implemented optically or digitally, this operation provides a good indication of the similarity between two images; however, to build a real-time recognition system, it is helpful to exploit the speed-of-light processing afforded by optical correlators. Unfortunately, although faster than their digital counterparts, optical correlators are not superior for every application because they suffer from drawbacks which are nonexistent in digital systems. For one, they produce unstable correlation patterns I Author to whom all correspondence should be addressed.

Composite distribution

which vary due to stochastic noise, i.e. laser power fluctuations, inconsistent image representations, trucks on the freeway, etc. Secondly, aberrations in the optical system can degrade correlation patterns, particularly for off-centered inputs, if the system is not precisely aligned. In light of these problems, a robust classifier of correlation data is especially important for optical, correlation-based, recognition systems. Compounding the need for a robust classifier is the necessity of using composite filters for all but the simplest correlation-based recognition systems. Despite their fast processing speed, optical correlators communicate with the rest of the world through much slower interface devices, i.e. spatial light modulators (SLMs) and cameras. To minimize the number of correlations, composite filters that represent many images have come into widespread use. Unfortunately, composite filters suffer from a serious problem when they attempt to represent too many images; namely, they lose their specificity, i.e. they respond strongly to "unmatched" objects. Due to this loss of specificity and the variability in optical correlation patterns, it is advantageous to classify unknown inputs by basing decisions on sets of correlation patterns instead of just one pattern at a time. Processing sets of correlation patterns allows identification routines to exploit interrelationships that exist between individual set members, thereby improving classification accuracy. Of course, the utility of such Set-processing procedures depends upon the accuracy and convenience with which we can determine the aforementioned interrelationships. If correlation data conform to a known distribution, Bayes' likelihood 587

G.W. CARHARTet al.


ratio test can be used to generate discriminant functions that optimally partition a classifier's signal space. Moreover, if the distributions can be characterized with low-order moments, it is possible to obtain the discriminant functions from a calibration process which estimates the necessary statistics. In this paper, we present two likelihood-based classifiers and experimentally investigate their performances using a set of toy airplanes and a miniature airport to simulate a realistic operating environment. 2. B A C K G R O U N D

How we characterize a pattern depends upon the medium in which the pattern exists. For visual patterns, all information is contained in a bounded, two-dimensional scalar function of position referred to as an ima#e. Although an image is commonly regarded as a real function (i.e. an intensity distribution), the optical field giving rise to any intensity distribution is, of course, a time-harmonic electromagnetic field. In general, wave fields have two independent degrees of freedom; however, if a wave field is linearly polarized, it may be described in the conventional way using a single complex function with no time dependence, i.e. by using the complex amplitude. F o r our purpose then, an image is a bounded two-dimensional complex function. Since continuous images have an infinite number of degrees of freedom (an infinity of pixels), the pattern space associated with these images is of infinite dimension. Furthermore, because the energy flux density at any point in an image is proportional to the square of the complex amplitude at that point, the total energy in the image is obtained by integrating the squared amplitude over the bounded region, or, equivalently, by taking the magnitude squared of the vector in the pattern space associated with the image . Since the energy in an image associated with any physically realizable field is finite, all realistic image vectors have finite length. Thus we see that our pattern space, which we also refer to as the continuous image space, is the well-known Hilbert vector space. As noted earlier, a computation with images which is useful for optical pattern recognition is mathematical correlation. Given two complex amplitude functions, f(x, y) and #(x, y), mathematical correlation is defined as +o0

c(x,y)= ~ ~f(~,~)a(a-x,/3-y)d~d[3 -


equation (1) can be implemented to high precision optically using a simple two lens coherent imaging system and two transparencies (refer to Fig. 1, assume the transparencies are film). In this system, the lenses function as Fourier transform operators, and the transparencies function as continuous spatial light modulators (SLMs); i.e. they modulate the amplitude, phase, or both (independently) at each point of an incident wave field. Coherent optical correlators perform correlation in the spatial frequency domain by multiplying the twodimensional spatial Fourier transform of one of the input images by the complex conjugate of the Fourier transform of the other to produce an inverse Fourier transform of the correlation pattern. In other words, if F(~, r/) is the transform of the input f(x, y); i.e. F(~, ~7)= ~{f(x,y)} where ~ and r/ are spatial frequency coordinates with units of inverse length, and G(~, ~/), the filter function, is the conjugated transform ofg(x, y), i.e. G(~, t/) = ( ~ {g(x, y) })*, where the asterisk denotes complex conjugation, then the correlation pattern is given by

c ( - x , - y ) = ~ ' ~ - {F(~,r/)G{~, r/)).


Note that negated coordinates are required in the correlation pattern because lenses can only implement forward Fourier transforms. Furthermore, the plane containing the input (the one defined by the first SLM) is called the input plane. The plane where the multiplication occurs (the one defined by the second SLM) is called either the Fourier plane or the filter plane, and the plane where the correlation pattern forms (the one defined by the detector) is called the correlation plane (see Fig. 1). Mathematical correlation is of interest because when fix, y) and g(x, y) are normalized to unit energy, c(x, y) has a unique maximum at the center of the correlation pattern only for an autocorrelation. Normalizing ensures that the peak autocorrelation value associated with each template is unique. To see why this is so, first note that the scalar value at the origin of a correlation pattern is equal to the standard inner product on the input vector pair: +o0

c(0,0)= S Sf(ct, fl)g(ct, fl)dctdfl = f ' g



where f =f(x, y) and g -= g(x, y) are Hilbert vectors. But, given any two vectors in an inner product space, f and g, the Cauchy-Swartz inequality states that


which is easily understood as a shift, multiply, and integrate procedure evaluated at each point in the correlation pattern, c(x, y). Whereas an autocorrelation results when f(x,y) is identical to g(x,y), all nonidentical correlations are called cross correlations. Although it is possible to recast equation (1) in terms of Hilbert vectors, infinite dimensional vectors do not fit into digital computers. However, we can perform some relevant computations with these unwieldy objects using optical techniques. F o r example, it is well known that

If'gl < Ilfll IIgl[


with the equality holding only when f and g are identical. So if we make the demand that the lengths of f and g are equal, i.e. IIf[[= [[gll, we have that If" gl < gZ V image vectors fequal in length to g


i.e. the autocorrelation origin value, g2, is the largest possible correlation plane value and will occur only when l a n d g are identical. Because of this (and the shift invariant nature of the linear system), we will define a

Optical pattern recognition using Bayesian classification Input Transducer SF





Transducer P2 L2

Burle CCD Camera

~ r




20 Milliwatt HeNe


Fig. 1. Hybrid correlator with an Epson LCTV at the input plane, a Semetex SIGHT-MOD at the filter plane, and a Burle CCD camera at the correlation plane. correlation response as the peak value found in a correlation pattern. In principle, we can exploit the uniqueness of autocorrelation peaks to recognize patterns by keeping a collection of templates, gi(x, y), for all the images to be recognized, correlating an unknown image, f(x,y), with each template gi(x, y), and then checking for the unique maximum associated with the autocorrelation responses. Although examining correlation origin values provides a way to achieve recognition via the equality test, this approach is impractical because every image requires its own template (hundreds or even thousands to characterize a single arbitrarily viewed object). Fortunately mathematical correlation of appropriately normalized images can also be regarded as a similarity metric by considering the extent to which a correlation origin value and g2 differ. Thus we do not need all possible templates of an object to be able to recognize it. Instead, it is sufficient to have a set of templates such that at least one is adequately "nearby" any view of an object we wish to recognize. Unfortunately, even with this "neighborhood" property, problems involving arbitrary orientations of objects still require a large number of templates to characterize all the viewing possibilities. Furthermore, adopting mathematical correlation as a measurement of similarity does not preserve the rigorous nature of inference from an observed correlation origin value. Although detecting an autocorrelation peak, g2, still implies unambiguously that f and g are identical, correlation origin values near g2 may be generated by an assortment of possible image vectors, f, which are not all equally similar to the template, g, when judged by humans. Therefore, unless noise is completely absent and we keep an exact copy of every

image we wish to recognize (all possible scales and orientations), a statistical assessment of recognition is the best we can do with mathematical correlation. 3. OPTICALHARDWARE--SPATIALLIGHTMODULATORS F o r real-time pattern recognition applications we must manipulate the SLMs quickly, which limits the number of useful devices (e.g. mechanically switching slides in Fig. 1 is too slow). Two types of high speed electrically addressed SLMs were developed in the early 1980s: magneto-optic devices (MODs) and deformable mirror devices (DMDs). Either of these devices can be used as the active transducer in the input or filter plane of an appropriately designed coherent optical correlator. (1'2) Since pixels in M O D s use the Faraday effect to rotate the vibration plane of incident (linearly polarized) light either left or right through a fixed angle, they must be used in conjunction with a polarizer to modulate the amplitude or phase of an input signal. The orientation of the polarizer selects either the amplitude or the phase of the incident wave field to be modulated, but not both (see Figs 2(a) and (b)). Since M O D s are commonly used as binary phase modulators, their arrival in the mid 1980s focused optical correlator research on methods of implementing binary phase-only filters (BPOFs) and improving their signal to noise performance, t3-7) Pixels in D M D s are tiny mirrors, each of which can be positioned independently. In principle, the range of available pixel positions allows a D M D to perform continuous modulation of phase, amplitude, or both. The earliest D M D used in an optical correlator was a membrane mirror driven by an array of tiny actuators, tat Cantilever and torsion beam D M D s that use hinges to


G.W. CARHARTet al.

Poi~izafioa Direction |

Rotations for

_ \ .... i Vnbl0Amp

Rotation for



....... Gray


Rotation for


.... \(a)


Output ~ " Polarization Direction

I Rotation for

l '* I 1,






Opposite Phase Outputs Co)

Fig. 2. Operation of input and filter spatial light modulators in the hybrid correlator: (a) depicts the amplitude mode of operation for the Epson LCTV display (input device), and (b) depicts the phase mode of operation for the Semetex SIGHT-MOD (filter device).

permit tilt can perform amplitude modulation, and flexure-beam D M D s that allow the tiny mirrors to move like pistons can perform phase-mostly modulation, t9'1°) Devices that can manipulate both phase and amplitude independently are not yet available; however, complex modulation can be implemented by using the flexure-beam mirrors in pairs) 11) In addition to D M D s and MODs, the development of liquid crystal technology is providing new devices which can function as real-time SLMs. Liquid crystal television sets (LCTVs) are now available from many sources and in several resolutions. In the usual operating mode, pixels in LCTVs, like M O D pixels, rotate the incident vibration plane; but, unlike MODs, they allow a continuum of angles, and can therefore provide (with a polarizer) continuous amplitude or binary phase modulations (see Fig. 2). Pixels in LCTVs can also be operated in another mode in which tunable birefringence permits continuous phase manipulation directly.(t2) Unfortunately, when television sets are designed, this mode is undesirable and is therefore suppressed. Not surprisingly then, the available television sets are difficult to use to perform continuous phase modulation. Although not yet demonstrated, cascading two LCTVs, one in continuous phase mode and one in continuous amplitude mode, promises to allow complete complex modulation.(~ 3~ Although optical techniques do permit computations with actual Hilbert vectors, we find, from our quick survey of SLMs, that the available devices which permit real-time manipulation are pixelated. Because they have a finite number of pixels, the images these devices can implement are not continuous, but require only a finite number of dimensions. While this circumstance seems disappointing, it does have some fortuitous consequences. Not only is finite linear algebra easier to comprehend, but, more important, the filter-finding, iterative, "hill-climbing" technique discussed in the next section cannot be executed in an infinite dimensional space.

4. COMPOSITE FILTERS Soon after the advent of BPOF-based, optical recognition systems, considerable attention was paid to composite filter design methods, t4-7) largely because a simple BPOF, derived from a single training image (or TI), tolerates very little variation of geometric parameters such as the rotation, aspect, and/or scale of its reference template. Consequently, a recognition system based on simple BPOFs requires a large number of filters if it is to process scenes for which the detectorto-object geometry can change significantly; so, it is not surprising that composite filters, which are generated from two or more TIs and thus represent multiple objects and/or multiple views of a single object, have come into widespread use. Because every composite filter is generated from its TI set, choosing a good TI set is critical. For example, to achieve a rotation-invariant composite filter, i.e. a filter whose peak response is ideally the same for all in-plane rotations of its reference template, the TI set contains multiple, rotated views of a single object, all at the same scale and aspect. Although the images that compose the TI set need not be related--geometrically or otherwise--to employ composite filter design techniques, the more similar the TIs are, the easier it is to find a filter which performs satisfactorily. Probably the simplest composite filter to construct is the average filter--its design amounts to nothing more than adding together all TIs and taking the conjugate of the Fourier transform of the resulting image. Unfortunately, the output of an average filter to each of its constituent TIs is not only unplanned, but varies considerably, depending upon which TI is input. To overcome this problem, synthetic discriminant function (SDF) filter design methods have been developed. Ideally, SDF filters produce output origin values that exactly meet designer specifications, and with digital correlation systems, ideal behavior is realizable. However, due to noise sources, actual SDF response origin

Optical pattern recognition using Bayesian classification values oscillate about their design specifications for optical correlations. Due to the pixelated nature of real-time SLMs, it is possible to derive the generalized SDF solution by casting output correlation constraints in matrix form, and then solving for the filter vector that satisfies the resulting system of equations. If both the input and filter transducers have the same number of pixels, say N (implying all image vectors are elements of the complex vector space c~N), and if M is the total number of constituent TIs used to generate an SDF, then let A be the N by M matrix whose columns are the TIs. If we are solving for an SDF filter, denoted h, then the equation ATh * = C


defines an M-dimensional column vector, c, whose ith component is the inner product between h* and the ith TI, i.e. the origin value of a correlation between the ith TI and the image represented by h. The SDF problem is to find values for the N components of h which satisfy the vector of correlation plane origin constraints that the designer specifies via vector e. Since the number of SLM pixels usually far exceeds the number of training images, the system of equations represented by (5) above is highly underdetermined, meaning that an infinity of solutions exist if the TIs are linearly independent. ~16) Solving for h in terms of its homogeneous and particular solutions yields the generalized SDF solution as h = A(ATA)c* -k [I N -- A(ATA) - IAT1v


where IN is the N by N identity matrix and v is an arbitrary vector ~,N. Having an infinity of solutions makes it possible to design the filter for other desirable traits, e.g. phase-only, binary phase-only, minimum variance, minimum correlation energy, etc. SDFs with only binary phase (BPOSDFs) are particularly desirable because many popular SLMs can only implement BPOFs. Since it is not known how to find B P O S D F solutions from equation (6), iterative algorithms for generating pseudo-BPOSDFs (pseudo because their TI responses are not identical) that behave almost the same as their analytical counterparts have been developed. In the relaxation algorithm of Jared and Ennis t l s ) the algorithm used to generate some of the filters for this p a p e r - - t h e designer initiates the B P O S D F design procedure by selecting an initial weight set. Next, the Fourier transforms of all TIs, each multiplied by its appropriate weight, are superposed to produce a continuous amplitude, continuous phase filter. Binarizing this filter produces a candidate B P O S D F which is correlated with each of its constituent TIs to determine how much each correlation origin value differs from its desired value. If a large difference exists, the corresponding TI is weighted more heavily in the next iteration. Iterations continue until each TI produces a correlation origin value which is deemed acceptable. Although it is unlikely that one can use the relaxation


algorithm to produce BPOSDFs that meet design objectives exactly, it can produce filters that come very close, assuming of course that the design objectives are not physically unreasonable. One useful feature of this algorithm is that the needed correlation patterns for filter design can be obtained directly from the correlator itself, thereby incorporating correlation system idiosyncrasies directly into the filter design process. Presumably then, actual fdter performance will have a greater chance of meeting design constraints. As will be shown in Section 7.4, the relaxation algorithm can be driven by the separation of TI response distributions instead of by how well TI responses agree with designer specifications, providing a powerful means of manipulating and improving object classifiability. 5. CLASSIFICATIONSTRATEGIES In Fig. 3, a block diagram depicting a generic statistical recognition scheme for identifying objects in video scenes, a feature extraction procedure operates on each input to produce a signal vector, x, which is then sent to the classifier. By assessing the distance between the observed signal vector and the vector, or vector set, which is representative of each object class, ¢~i, the classifier garners the information it needs to assign an object in the input scene to one of the object classes. Under most circumstances, the classifier assigns the input object to the object class for which the a posteriori membership probability, given the signal vector, is maximum. The variety of assumptions under which this membership probability is assessed gives rise to the various classifier possibilities discussed in what follows.

5.1. Statistical correlation One method of classifying correlation system inputs is based on collecting off-line estimates of the autocorrelation probability density functions (pdfs) associated with each TI. Consider a TI set consisting of N T TIs and an associated filter bank of N r filters. If TI response distributions are normal, we can characterize each of the NT autocorrelation pdfs in terms of its mean and variance. Note that if the filters are simple, NT equals N F and each filter is responsible for generating one pdf. On the other hand, if the filters are composite, NT is greater than NF and each filter is responsible for generating NT/NF pdfs, assuming each filter represents an equal number of TIs. Input

Signal Vector


i. . . . . . . . . . . . . . . . . . . . . .

f(x,y) J Featurel x ~ J "qE x t r a c t o r ~


Distance Between

,. . . . . . . . . . . . . . . . . . . . . .

Fig. 3. Statistical-based recognition system.



G.W. CARHARTet al.


Fig. 4. Statistical correlation assesses the "probability", denoted ~, that signal x o is part of the distribution p(x) by integrating the distribution over the hatched region. This region is defined by all signals farther from the mean than the sample signal. After characterizing the pdfs, we can assign an input to an object class by correlating it against each filter in the filter bank and evaluating the "probability" that the resulting response belongs to each filter's autocorrelation distribution--a technique known as statistical correlation. In the context of statistical correlation, the probability a sample signal belongs to a distribution is found by integrating over all points of the distribution that are farther from the mean than the signal point (see Fig. 4). Note that the integrated region, referred to as the ~-level significance region, t19) is not literally the probability the signal belongs to the distribution; instead, it is the fraction of distribution samples found in the integrated region. Like most threshold classifiers, a statistical correlation classifier (or SC classifier) assigns the input to the object class of the filter which produces the "best" autocorrelation response, assuming that the probability associated with this response exceeds a recognition threshold for the system. A classifier which uses distribution means to locate classification boundaries but does not assess variance was presented by Casasent and Chang in one of the early S D F papers, t2°) Furthermore, a classifier that not only attempts to locate classification boundaries optimally, but also attempts to position distributions optimally using an SDF design procedure, was developed by Sudharsanan et al. ~21~Exploiting the assumption that correlation response distributions are normal, their method determines the be~t values for the vector of origin value constraints (vector e in equation (5)) in terms of maximizing the square of the distance between distribution means, normalized to response variance. For small numbers of TIs and object classes, their method offers a practical way of significantly enhancing classifier performance. However, since it assumes inputs are to be classified based on the response of a single filter, the advantages derived from a multivariate approach are not available. 5.2. Walsh-Giles correlation plane filter In an alternate approach, Walsh and Giles showed that it is possible to use knowledge of response variance

to recognize objects; t22) they presented a correlation plane filter (CPF) which estimates response variance on-line by obtaining multiple data sets from the correlator without a change in the input scene. Their CPF, using a statistical technique called analysis of variance, classifies inputs by partitioning observed variance into between-class and within-class components. This scheme thus assesses signal variance membership in a variance distribution. In practice, the Walsh-Giles C P F is more accurate than any of the potential threshold classifiers (the SC version, included); it can tolerate high noise and clutter densities and still maintain over 95~o accuracyJ TM However, the volume of data required by the on-line calibration process (a time history) makes this approach inherently slow; moreover, any attempt to increase speed by decreasing the filter bank size through the use of composite filters has the negative effect of degrading the on-line statistical estimates. 5.3. Vector siffnals in an n-dimensional signal space The Walsh-Giles C P F operates on vectors in a high dimensional space defined by the number and organization of filters used. Statistical correlation, also, is readily extended to higher dimensional distributions by operating in a vector space instead of on the real number line. A vector SC classifier exploits cross correlation data by regarding the signal set as a vector in the appropriate vector space. Using cross correlation responses as a classification tool, however, requires estimating multivariate pdfs for the TIs from both autocorrelation and cross correlation samples. Although estimating multivariate pdfs requires more samples and thus takes longer, the additional information gleaned from the data is often worth the investment in calibration time. Instead of classifying a signal by determining the probability that each of the responses in a signal set belongs to its corresponding autocorrelation distribution, we can use the responses as components in a vector and determine the probability this vector belongs to each of the candidate multivariate pdfs representing the object classes. A vector classifier's signal space, then, is an n-dimensional Euclidean vector space, where n is the number of responses collected for each decision. 5.4. Classification via 2-dimensional statistical correlation To illustrate how a multivariate case works, consider a 2-dimensional signal space; i.e. an unknown input is processed by correlating it with two filters, and the two responses are regarded as an ordered pair in a plane. If we assume the TI distributions residing in this space are bivariate normal, an assumption that turns out to be quite reasonable, ~24) their equal density contours are concentric ellipses, and each distribution can be characterized as a pdf in terms of a two element mean vector and a 2 × 2 covariance matrix. In this circumstance, to find the probability that a signal belongs to one of the distributions, it is expedient to perform a coordinate transformation in the 2-dimensional signal

Optical pattern recognition using Bayesian classification


so% ~ , b u a v

X i



J )~

X t

Fig. 5. Equal probability contours ofa bivariate normal distribution are concentric ellipses, albeit arbitrarily oriented. The eigenvectors of the covariance matrix characterizing the distribution define an alternate coordinate system (x't, x~) which is aligned with the principal axes of the ellipses. space such that the appropriate pdf is centered on the origin of the new coordinates and its elliptical contours of equal density are described by equations of circles in the new coordinate system (Fig. 5). Unit size in the new system is chosen independently for each axis so that the contour enclosing 68.3% of the distribution samples is at a radius of one standard deviation (definition). The required coordinate transformation, a "whitening transform", is composed of a shift of the origin, a rotation of the axes, and an expansion (or dilation) of the units; furthermore, it is easily obtained by performing a spectral decomposition of the covariance matrix. Expressed in the normalized coordinates associated with any of the distributions, the magnitude of a signal vector is called the Mahalanobis distance (to the distribution mean) and is a 1-dimensional metric of the signal's deviation from the mean in standard units.~TM 5.5. Bayesian likelihood A relevant feature of the statistical correlation technique is that it does not account for possible differences in distribution spread. To illustrate a potential pitfall of this method, suppose we want to assign a signal to one of two candidate distributions and the s-level significance (probability) associated with each distribution is the same at the signal point, i.e. the Mahalanobis distances from the signal point to the centroids of the two distributions are identical. A statistical correlation classifier would, in this case, produce an indeterminate result--either distribution is equally valid according to its classification criterion. Note, however, that if the contours defining the two integration regions are not identical densities (see Fig. 6), the chances that samples from each distribution are found at that sional point are not equal. One of the pdfs necessarily represents more of these signals than the other, and if we classify accordingly when we see this signal, we must be correct better than half the t i m e - - t h e performance the statistical correlation classifier expects to achieve for this situation. F r o m this observation we can see that the density contours carry more valuable information than

51 <


Fig. 6. Statistical correlation vs. likelihood. Although sample point x 0 lies on the 50~o contours of both distributions, the density, 62, on the pdf 2 contour is greater than the corresponding density, 6t, of pdf I because the area enclosed by the pdf 1 contour is larger. The likelihood x 0 is part of either pdf is the ratio of the pdrs density to the sum of both. Thus, by Bayes' likelihood ratio test, x 0 is correctly assigned to pdf 2 because 62/(61 + 62) > 31/(31 +62). the probability contours. In fact, when there are multiple distributions that can produce the same signal, the likelihood that a signal was generated by one of the distributions is given by the fraction of the overall pdf density (renormalized sum of all pdf densities) that the density in question represents. Of course we do not have access to the real overall pdf, so a sum of the TI pdfs is used as an approximation. In comparing statistical correlation to likelihood we find that the main differences pertain to the number of distributions considered when membership probability is assessed. Statistical correlation interprets an integral over a region of one distribution, i.e. some fraction of one distribution's samples, as a membership probability. Likelihood, assuming one has the genuine overall pdf, considers information about all distributions when assessing a membership probability. In practice, since we only have an approximate overall pdf, we only account for the TI distributions rigorously. Still, this is a major improvement over statistical correlation which does not acknowledge any other signal sources.


A classifier based on comparing likelihoods will perform better than a SC classifier because it partitions the signal space into classification regions with better boundaries. These "new and improved" likelihood boundaries between any two classification regions are defined by equal densities rather than probabilities (see Fig. 7), and are also guaranteed to be quadratic for multivariate normal distributions. Although both statistical correlation and likelihood classifiers partition the signal space with quadratic boundaries, the likelihood classifier is the one commonly known as the quadratic classifier. This is, no doubt, because it is optimal for an appropriately constrained problem, i.e.


G.W. CARHARTet al.

[-- contours(~-.5)

[ /---.,.~


Decision Criteria QU:

',\ t ",', ,,',



Fig. 7. Quadratic boundaries vs. SC boundaries. Classifying according to equal density (quadratic classifier) rather than equal probability (SC classifier)results in a TI assignment to pdf2 instead of pdf 1 for signals in the hatched region.

one with no signal sources other than those produced by Tls. Bayes likelihood ratio test, which includes any asymmetry in the a priori frequency of occurrence of the TIs, is a familiar way to generate discriminant functions which classify according to likelihood. Although the quadratic classifier is optimal for assigning TI inputs to distributions in the presence of Gaussian signal vectors, it is not optimal for making object class assignments when using composite filters in a correlation-based recognition system. In fact, it is not uncommon for multiple TIs of the same object class to be very similar to one another to ensure that the TI set adequately characterizes geometric distortions; consequently, TI distributions associated with the same object class tend to be clustered tightly together. Thus the quadratic classifier, although optimal, is not particularly good at assigning TIs to their correct distribution under realistic conditions. However, since we are usually only interested in the object class of inputs, the quadratic classifier is valuable if--instead of reporting TI decisions--it simply reports the object class of its TI decisions. In the signal space, this assignment is equivalent to defining object class regions as a union of TI regions for all TIs representing the same object class. Thus we will refer to this classifier as a quadratic union classifier (or QU classifier). Unfortunately, assigning inputs using a union of quadratic classifier regions is not equivalent to classifying by likelihood. To regain the optimality of the quadratic classifier for object class decisions, we must assign inputs directly to an object class instead of indirectly by reporting the class of TI assignments. What this amounts to is applying Bayes' rule to composite TI pdfs obtained by summing all TI pdfs associated with the same object class and then renormalizing to unit volume (see Fig. 8). Although this procedure can be understood as constructing an appropriate composite version of the usual quadratic classifier, we will refer to the resulting classifier as a composite Bayesian classifier (or CB classifier) because there are no longer any quadratic boundaries involved. To illustrate the

Fig. 8. QU object regions(dark outlines)produced by a union of the TI regions.Given the relative sizes of the density values, 81, 82, '~3,and 6a, the QU classifierassigns xo directly to pdf 2 and, consequently,indirectlyto the objectclasscharacterized by pdfs l and 2. In contrast, the CB classifierassignsXoto the object class characterized by pdfs 3 and 4 because the density of the 3 + 4 composite distribution is greater than the density of the I + 2 composite distribution. Thus for xo, the proximity of pdf 3 influencesthe migration of the classificationboundary more strongly than the proximity of pdf l. practical issues involved in implementing either a QU classifier or a CB classifier, as well as to demonstrate the differences between them, we will implement both and experimentallyevaluate their non-TI performances. 6.1. Q U classifier discriminant functions

As was noted previously, the q.uadratic classifier is superior to any SC classifier because it bases its decisions on likelihood. Therefore, when trying to assign an input signal to one of many TI distributions, the quadratic classifier chooses the distribution for which the probability density value at the signal point weighted by the TIs a priori frequency of occurrence is highest. If each TI distribution is multivariate normal, the function N(x) -



-~(x -

m)tC - l(x - m)

(7) characterizes each distribution as a pdf in terms of its mean vector, m, and associated covariance matrix, C. With a flat filter bank architecture, the dimension of a signal vector, x, is equal to the number of filters in the filter bank. Using the quadratic classifier is equivalent to partitioning the signal space into regions with quadratic boundaries, each one representing a single TI. If the TI set, denoted T, contains N T elements, each one associated with an indexed symbol ti, i = 1, 2.... , NT, then the quadratic classifier partitions the signal space into N t + 1 regions--one for each TI plus an additional non-TI region composed of signals that are too far removed from any of the TI distributions to be assigned to one. Let the region in the signal space corresponding to t i be denoted R i. Furthermore, let Pr[tl] represents the a priori probability that t; is the input, and let m~

Optical pattern recognition using Bayesian classification and Ci be the mean vector and covariance matrix, respectively, associated with % The quadratic classifier assigns a signal, x, to the TI associated with the maxim u m likelihood. Thus the expressions.


fk(x) = - -




7 ( x - mk)TCk l(X -- ink)J,

To develop the discriminant functions of the composite Bayesian classifier, it is useful to first define a composite pdf, p(xltk), which is associated with the object classification region O)k,by forming the renormalized sum


define a set of discriminant functions that optimally assign TI inputs to the correct distribution, and the classification rule is choose TI z~ if f~(x) = maxfk(x).



p(X[tg) k = 1,2 . . . . . N~



~ p(xlzjk) Pr [Zik ]

where Pr [tk] is the a priori probability that an arbitrary element of t k is input, i.e. Nk

Pr[-tk]= ~ Pr[-rjk ].





(1 1)

Pr [tk] j=-i

Substituting the expression above into Bayes' rule

Taking the natural log of (8) and multiplying by two yields an equivalent set of discriminant functions, 9k(x), given by 9k(X) = -- (X -- mk)TCk I(X -- ink) -- In ICkl + 2Pr

Pr [tklx]

p(xltk) Pr [tk] P(Xl T)



Nk ~:1 P(Xlr Yk) Pr [r jk]

p(xl T) j :


[-'Ck] ,

k = 1,2 . . . . . N~ yR(X) = -- dE(x) -- In ICk[ + 2 Pr [Zk], k = 1 , 2 .... ,N~


where p(x[ T) is the pdf of x given that a TI is input and Pr [tklx ] is the conditional a posteriori probability that an element of tk was input given that x was observed, yields a starting point for finding the discriminant functions of the CB classifier. To minimize classification error, the CB classifier assigns an input to object class u2~ if

where dk(x) is the Mahalanobis distance from the signal x to the mean vector associated with ZR. Note that the middle terms on the right-hand side of equations (10) represent the difference between statistical correlation and likelihood partitions; these terms contain the information about the spread of each distribution in the signal space.

Carrying out the partitioning of the signal space implied by (14) reveals that the discriminant functions of the CB classifier are given by ~24~

6.2. CB classifier discriminant functions

N~ Pr[Zjk ] [- 1 gk(X) = j ~ ~ - - e x p / - - ( x - - m j k )

Recall that the set T, given by T = {z~li = l, 2 ..... NT}, denotes the TI set. Now consider a recognition system which must identify N o object classes, where each class is identified by the indexed symbol, COg,k = 1, 2 . . . . . No. Associated with each object class is a subset of T, denoted by t~, k = l, 2 . . . . . No, which contains the TIs that characterize the class. Of course the number of Tls in each of the tk, called N k , need not be the same, and no z~ should be associated with more than one object class. Using this notation, we have an alternate name, zjk, for each of the TIs so that the classification subsets can be represented by tk = {Zjkl j = 1, 2 . . . . . N k }, and the training set, T, can have any of the following representations:

T = {zili= 1,2 . . . . . Nx} = {tkl k - - 1,2 . . . . . No} = { r j k l J = 1,2 . . . . .

Nk; k = 1,2 . . . . . No}.

As noted, the quadratic classifier partitions the signal space into classification regions, called Ri, i = 1, 2..... NT, where each R~ is associated with only one z~. However, because optical recognition systems require many images of an object to characterize an object class, each classification region is associated with multiple TIs. Therefore, we wish to find the classification regions, denoted by Rk, k = 1,2 ..... No, that directly identify an input as belonging to a specific object class cok.

PrEGJx]>Pr[tklX ]

Vk:~ct, k = l , 2 . . . . . No.




C~ (x--m~k)



(15) and the resulting decision rule is choose class to~ if ff~(x) = max fiR(X).


k 7. P R A C T I C A L


When constructing discriminant functions for both the Q U and CB classifiers, we expressed the parameters that define each TI distribution as a mean vector and a covariance matrix. Since we have no detailed knowledge of the mechanisms generating response variations from our optical correlator, we cannot construct values for the required mean vectors and covariance matrices theoretically. Instead, an off-line calibration process provides data for estimates of these distribution parameters. How much time is an acceptable investment in calibration is an issue which has significant design conseq uences. 7.1. Calibration constraints on signal space dimension The n u m b e r of sample signals required to obtain an accurate estimate of a multivariate distribution is the product of the n u m b e r of samples required to estimate each of the marginal univariate distributions associated with the original multivariate d i s t r i b u t i o n - - a property


G.W. CARHARTet al.

known as Bellman's curse of dimensionality,t26~ Consequently, time required for the calibration procedure grows geometrically with the dimension of the signal space. In practice, the necessary sample count is approximated by the function Ns = A n, where n is the dimension of the signal space, and A the number of samples required to produce an accurate estimate of the most widely dispersed univariate distribution. Obviously, the dimension of the signal space cannot be too large or the volume of calibration data and/or the time required to obtain it will become prohibitive. Since the practicality of vector classifiers is dictated by this limit, we must determine the constraints it imposes. Suppose, for example, the time required to estimate one distribution cannot exceed one month, and we can get sample signals at the rate of 30 Hz (the maximum rate our current hardware can run). In a month we will have Ns = (30 samples s - 1) (60 s rain - 1)(60 min h - 1) (24 h d a y - l) (30 days) = 7.776 x 107 samples. If 10 samples are adequate to estimate the most widely dispersed marginal univariate pdf (a realistic value), the maximum number of dimensions of the signal space is given by 10x = N s ~ x ~ 8. And that was only one p d f - - n o t an encouraging result. Of course, complete calibration of an 8-dimensional space requires one month for each of the TIs. Furthermore, since typical response variances for our cotrelator do not permit accurate mean estimates with fewer than 9 samples (see Section 8.4.), this limit on dimension looks like a solid brick wall. To avoid ending this section on such a depressing note, consider calibrating a 3-dimensional signal space instead; with our video hardware (30Hz), each distribution will only require about 30 s to estimate.

7.2. Making more signal space--filter trees Evidently, a practical system employing a vector classifier requires a signal space with a small number of dimensions, implying composite filters are required for all but the simplest of systems. For typical systems, these composites will have to represent many TIs since we expect N T to be relatively large. Unfortunately, as noted earlier, highly composite filters have poor specificity. In fact, if we decompose a composite's performance statistics into a hit rate, i.e. the percentage of correct "object is present" decisions, and a false alarm rate, i.e. the percentage of incorrect "object is present" decisions, we find composite performance degrades with increasing TI count primarily in the false alarm rate. Thus, highly composite filters, although able to respond to their TIs within design parameters, are also likely to respond to many wrong inputs equally well, i.e. they are not good for determining whether a given input is a TI. In statistical terms, the problem with these filters is that the overall pdf, which is the renormalized sum of all the simple pdfs, does not have one term (component distribution) which makes a dominant contribution to the overall density, i.e. these filters do not produce high likelihood ratios. But do not give up on them yet--since rejections can be relied upon to be accurate, we can use composites to tell us what some input is not. With proper planning, such filters can indicate which more specific (less composite) filter should be tested to ensure a TI has been detected. To use this idea, we must design a composite filter to identify TI groups rather than TIs; then its response can bootstrap us into a less congested signal space associated only with the TI subset identified. Exploiting this "zoom lens" effect by arranging a hierarchy of composite filters in a tree structure provides an efficient means of locating the best candidate of many simple filters (see Fig. 9). Such an arrangement makes it possible to use a simple filter to perform recognition

Fig. 9. Binary tree filter bank architecture. Simple filters compose the bottom level (or leaves) of the tree, and higher levels are comprised of composites, each of which represents a subset of the simple filters.

Optical pattern recognition using Bayesian classification without testing all the simple filters in the bank. Thus, if a system must accommodate more decision regions than can fit in the signal space (the most likely case), a treefilter bank architecture which connects a collection of less congested signal spaces can provide a solution. With a tree filter bank, each decision point or node in the tree should be treated as an independent recognition system. Then if composite filters are used to represent TI groups corresponding to different tree paths, the dimension of the signal space for a node is equal to the number of tree branches emanating from that node, i.e. possible decisions are paths to subsequent nodes. Since the number of decisions at each node must be kept small intentionally to minimize the number of correlations, the dimension of each signal space should be manageable. Furthermore, because the earliest decisions in a tree search are critical, these decisions must be made as accurately as possible, implying optimal signal space partitions should be used. Hence the CB classifier is ideally suited to the decision making required in a tree search. When there is adequate signal space the individual pdfs characterizing object classes should be positioned in their associated classification regions to yield maximum advantage; i.e. filters should be designed with the intent to cluster pdfs characterizing the same object class while separating pdfs of different classes. Placing all the distributions via SDF design, however, may well be time poorly invested because, if we construct average filters instead, many will generate well placed distributions without further work. Thus, repositioning many of the distributions is pointless. Section 7.4 presents an iterative BPOSDF design procedure which, instead of trying to place each distribution at a specific location, attempts only to relocate the distributions that are poorly positioned. 7.3. Characterizing object classes with Tls Before we can design filters, we must select a training set of representative images to characterize anticipated input distortions. To identify an acceptable TI set, first note that all non-TI input images belong to one of two categories: (1) they are images from a known (but undetermined) object class we want to recognize or (2) they are images from an unknown object class which we want to reject. In terms of these categories we can now clarify the meaning of the phrase "a set of TIs that adequately characterizes an object class". It is equivalent to claiming all the non-TIs meet one of two conditions depending upon which category they are in. All first category non-TIs, i.e. unknown views of known objects, must be images adequately nearby a TI so that the associated distributions significantly overlap each other; then the TI and non-TI are indistinguishable (no holes in the TI set). Alternatively, all second category non-TIs, i.e. all images of "wrong" objects, should generate distributions which do not significantly overlap any TI distribution (no TI impostors). To select an initial TI set, a designer must make an educated PR 27:4-F


guess about which distortions are most relevant and map them with a collection of TIs that are representative of the distortion spans, while simultaneously avoiding such high resolution that the number of TIs becomes impractical. Although there may be two problems with initial guesses, possible holes or impostors, we know of no way to detect either problem until it occurs. While any holes which turn up obviously must be filled, handling impostors is trickier. For example, if an image of a ladder is a TI, and an image of the same ladder with one rung missing is a n o n - T I - - a n d should be classified as such--the second objective (no impostors) would be violated and the system would claim a high classification confidence for the non-TI. Of course this is undesirable, but, if detected, it does provide information which can be used to improve the design of the training set. After recognizing the impostor, we can correct the situation merely by including it in the TI set. Although we did not originally want to recognize this impostor, the fact that its pdf significantly overlaps one or more TI pdf(s) forces us to include it as a TI when designing filters so that we can move its distribution to minimize classification difficulty. 7.4. Designing filter groups for tree nodes Constructing a tree filter bank requires dividing the training set at each node into subsets which define the "next nodes". How to choose which TIs should belong to which groups is a relevant question since we can see ways to do it that are clearly wrong. Recall the ladder example: once we have realized we need both TIs (all rungs present and one rung missing) we can still fail to distinguish which is which if we place them in different groups at any of the nodes involved in a tree search. Because these two TIs are so similar, signals from simple filters for each will be necessary to distinguish them. But, if these two TIs are not part of the same group for all nodes in which they occur, signals from the two simple filters will never be compared to each other. In fact, the similarity between these two TIs is so great that a composite which rejects certain TIs, i.e. a transform ratio filter, may be required to adequately separate their distributions in the signal space, t27~Such filters can replace the simple filters as a tree's "leaves" when the simple filters produce distributions which should be better separated (such as the ladder TIs). It would seem then, TIs which are hardest to distinguish from each other should be grouped together when designing composites, implying overlap of the associated distributions in the signal space is an important measure of similarity between TIs. Unfortunately, measuring the overlap of the distributions requires locating them in a signal space which, in turn, implies filters, and therefore groups, already exist. Although we have not deduced how to untie this Gordian knot, it should be obvious that an iterative procedure can improve any major mistakes made in initial choices. To determine reasonable initial groups,


G.W. CARHARTet al.

some similarity metric in the image space is necessary. Intuitively, orthogonal vectors (images) are the least similar and parallel vectors are the most similar, which suggests the angle between vectors is a suitable metric. Although in a general complex manifold, orthogonal and parallel are defined (from the standard inner product and real scalar multiplication, respectively), real angles, unfortunately, are not. A complex angle can be obtained, but complex angles, like complex numbers, cannot be ordered, and an alternative approach is required. A viable alternative is to examine the difference between each pair of TIs in the spatial frequency domain. Physically, this makes sense because it reveals the differences in energy between corresponding spatial frequency components. If the spatial frequency spectra of two images differ significantly, it is highly doubtful that the two TIs should be considered similar. 7.5. Designing SDFs from expected performance Once a set of filters is constructed (in whatever manner) and estimates of the resulting distributions have been obtained, the resulting classifier's theoretically expected TI performances can be obtained by integrating the appropriate regions of the signal space, t24~ Such information can be readily used to drive an iterative procedure similar to the relaxation algorithm mentioned in Section 4. Using TI performance results rather than response deflections to modulate filter construction weights, this algorithm attempts to advantageously relocate poorly positioned distributions (those near the boundaries of object classification regions). Of course, the iterative BPOSDF design procedure assumes initial TIs and weights exist. It is therefore convenient to make the initial weights uniform, and, even though no rigorous way to choose an initial TI set exists,~28J reasonable selections are not difficult to come by. Once initial choices have been made, it is easy to adapt the relaxation algorithm so that it is driven by expected TI performance. Each iteration in the modified algorithm consists of (1) constructing filters from the current weights, (2) calibrating the resulting vector classifier, (3) calculating the expected TI performances, and (4) modulating the weight(s) corresponding to the least classifiable TI(s). The exact form of this modulation is not very relevant; any monotonically increasing function with gradient less than that of the surface of interest will work. Such a repositioning maneuver has the advantage of improving the classifiabilityof signals, which is, presumably, the ultimate goal of any recognition system. Unfortunately, calculating TI performance requires a complete calibration which is quite time consuming. Therefore, the filter design procedure cannot run through too many iterations before time constraints become prohibitive. To shorten the time required, first note that estimates of distribution means converge faster than those of variance; furthermore, because variance estimates derived from an insufficient number of samples can reasonably be expected to be less than

the true values, the overlap detected by an expected performance analysis based on a partial calibration will correctly identify worst case distributions (the ones which should be moved). In other words, with an acceptable mean estimate and a variance estimate likely to be less that the true value, the most significantly overlapping distributions are readily identified. Avoiding a complete calibration at each iteration in the design algorithm saves much time during filter synthesis, but calculating all the TI confidences at each step (each of which requires a numerical integration over the whole signal space) still represents a significant bottleneck in the design process. Fortunately, we also have a short-cut to sidestep this bottleneck. A misplaced TI distribution is readily spotted if its mean produces a strong statistical correlation with any distribution associated with a Tl from a different object class. Thus, if this situation arises--where a distribution's mean is located on a high probability contour of another distribution associated with a different object class--the weight of each TI should be increased to move its distribution in the coordinate direction corresponding to the composite filter which represents that TI. Note that when the dimension of the signal space is equal to the number of decision regions (e.g. tree nodes), there are enough coordinate directions to ensure, in a global sense, that we are moving the distribution away from all wrong regions and toward the correct one. So, a mean-overlap analysis of the TI confidences can serve as the performance metric which modulates the weights. A not-so-rigorous but more practical version of the design iteration replaces steps (2) and (3) with the described short-cuts. 8. EXPERIMENTAL RESULTS

Suppose we wish to identify any one of nine possible airplanes from the vantage point of an adjustable skyhook directly over an airport (like a satellite in orbit). 8!:-::XI:X:K:I : i::i :.:' ii:i:!i.!i!~!:?:?~:!~::i:~:~:~ f!~?!i:~:?:i •. : :.: ] : i:i:::::::: .: ::::::::::::::::::::::::::::: :

:::::.::... . :::i

.....?}:~:::::: : } . ::::iiii::ii::~:i!:::': .....................:

u...... . ......... ,..... :;>5::.:;::


:::::::18::11:::: : .:.


}i:!iii:iiiii~ii:ii:':::: :.:.:.:.::.

dli :!.!.?i!# •:k}3: X:~~::: •...... :.~:

.5:-': ::-: :+

..! ..... :::u :¢ :::.:: ::::::::i


Fig. 10. The terrain board as seen by the low-resolution camera suspended above it.

Optical pattern recognition using Bayesian classification If we could simplify the problem further by acquiring a magic black box which converts potentially distorted (rotated, scaled, etc.) object views into undistorted or standardized versions, then nine Tls would be enough to characterize the object classes. Unfortunately, since such magical iblack boxes have yet to be devised, our solution to this problem will require a TI set with many elements to describe each object. Since we could not afford our own satellite (much less our own skyhook), we used a 4 x 4 foot scale model of E1 Paso International Airport--complete with proportionately scaled toy airplanes to serve as objects in need of autonomous recognition--to obtain experimental results (see Fig. I0). t29) Two computer controlled cameras operating at different resolutions (wide field, low resolution and narrow field, high resolution) constantly survey this terrain board to provide


input data for an assortment of image processing procedures. In a cueing process, location information about "objects of interest" found in a low resolution image (one that shows the entire airport) is used to acquire a high resolution image of the interesting object. Objects (presumably airplanes) in these high resolution images are the ones we are interested in identifying. 8.1. The T I set

Figure 11 shows the nine source images that were used to generate the TI set. To provide a good test of whether or not the classifiers would encounter misidentification difficulties, the TI set was intentionally constructed out of airplane images only. Mixing object types, e.g. using airplanes and tanks (which do drive around in the vicinity of E1 Paso airport), would not









a, TU26

Fig. 11. Source images used to generate the TI set.


G.W. CARHARTet al.

Terrain Board Object Distortions

l\ 1_\ I ~~

I • Rotation • Aspect

| II

\ \ \

Rings of constant aspect and scale I

Fig. 12. Terrain board geometry. A terrain board image should be a slightly distorted version of one of the TIs. Pertinent distortions include in-plane rotation, aspect, and scale; the magnitude of the scale and aspect distortions depends on the radial distance of the airplane from the center of the board. have provided as difficult a test because dissimilar objects could have easily been differentiated at the top of the tree, leaving the more difficult choices between similar objects for lower levels where the filters are more specific. Consisting of nothing but airplane images, the TI set forced the classifiers to decide between paths representing similar TIs in the upper levels of the tree. Because the system had to recognize images of model airplanes placed randomly on the terrain board, the filter set had to accommodate a range of geometric distortions for each object. Based on the size of the terrain board and the location of the camera (see Fig. 12), it was determined that nine different scales of each plane needed to be included in the TI set; furthermore, because it is relatively easy to determine the longest axis of an airplane and then rotate it to some standard orientation, it was deemed that only three images corresponding to different in-plane rotations of each object needed to be included in the TI set. These rotation TIs characterize a five degree RI span about the "standard orientation" Although terrain board images experience foreshortening due to aspect changes, this distortion was not mapped into the TI set; consequently, terrain board performance degraded a s the depression angle of the camera increased. Overall, there were 27 images-three in-plane rotations at nine scales--in the TI set for every object class. Although our standardizing black box is not exactly magical, it does significantly reduce the number of TIs required to characterize each object class. Since there were nine classes, the TI set consisted of a total of 243 images.

nodes of the tree so that the images can be differentiated using a simple filter. Since, for the TI set used, the scale distortion was felt to be the least significant, simple filters corresponding to TIs of the same class, but different scales, were grouped together at the bottom level of the tree. The filter tree was a five-level, ternary tree, where ternary indicates that each node had three branches emanating from it. Level five, the bottom level of the tree, was composed of simple filters that represented a single TI. Filters on level four were obtained by averaging three simple filters, each one corresponding to a different scale, to produce a single filter which was more or less scale invariant over the scale range of the simple filters. Since every level four filter was associated with three TIs, and every node was associated with three filters, every level four node was associated with nine TIs, each one a different scale of a given airplane. Level three filters were obtained by averaging three level four filters; thus, every level three node was associated with 27 TIs, implying that every path emanating from a level three node identified an airplane's rotation. Level two filters were obtained by averaging three level three filters, which means that level two filters represented 27 TIs; i.e. each level two filter was associated with a different object class, and choosing a path at level two in a tree search was equivalent to making an object class assignment. On level one, each filter represented 81 TIs, and therefore three different object classes. One important difference between level one filters and all other filters is that they were not obtained by averaging the filters beneath them. Since level one performance dictates overall tree performance, level one filters were made using the previously discussed SDF-like procedure intended to improve tree path classifiability.The level one filters required 20 iterations of about 3 h each. At each iteration, the least classifiable pdfs were moved by increasing their associated filter design weights. The initial average filters partitioned all signals at about 70% accuracy, and the final filters partitioned all signals at over 98~ accuracy for TI inputs. 8.3. Distribution parameter estimation Unfortunately, precise values of m and C for TI distributions are unobtainable because the noise processes generating the variations are unknown. Consequently, it is necessary to repeatedly correlate each TI against all filters (constituting a node) to collect N s sample signal vectors and estimate m and C using

~=-8.2. Filter bank As noted earlier, a tree architecture lends itself to performing an efficient search of the filter bank. However, the benefits derived from using a tree structure are strongly influenced by the arrangement of the filters. In general, filters corresponding to similar images in a TI set should be grouped together in bottom-level

1 Ns F, x, Ns i=1





NS E (x,x~-~)



Because one degree of freedom is lost in forming the mean, the summation in the expression for C is divided

Optical pattern recognition using Bayesian classification by Ns - 1 instead of N S. Also, the estimates for both m and C are unbiased and consistent--i.e, they converge to the true value and the variance of the estimates goes to zero as the number of samples increases without bound. (3°) After calibration, the values for m and C appearing in the discriminant functions for the QU and CB classifiers are replaced by their estimates. Because mean vector and covariance matrix estimates improve as the number of samples increases, the similarity between expected and actual performance is, in part, a function of the number of calibration samples taken. Until the number of samples used to estimate a distribution is sufficient, or-levelsignificances (each obtained by statistically correlating a signal with the current estimate of its distribution) rarely exceed 10~,,. Since SC values above 50~o should occur just as often as SC values below 50~o, the predominance of SC values below 10~o clearly indicated that the mean vectors and covariance matrices characterizing the TI distributions were not accurate. However, when the number of calibration samples for each TI at each node in a tree was close to 800, classifier behavior changed dramatically. When sample counts got near this threshold value, SC values above 50~, occurred almost as often as values below 50~o. Because the data was obtained from a ternary tree, implying signal distributions were 3-dimensional, this means, according to Bellman, that approximately nine samples are sufficient to characterize the marginal univariate TI distributions. As the number of samples increased past 800, SC values usually improved as well; however, if the correlator remained stable, a point of diminishing returns was reached for sample counts above approximately 2000--increasing the sample count above this number usually resulted in higher SC values, but the change was so slight that the extra calibration time was, for all practical purposes, not warranted. Nonetheless, since optical correlator behavior changes on a regular basis, there is a good reason for calibrating with large sample numbers over several different correlator states: doing so makes it much easier to alter system settings and/or realign the correlator without experiencing negative consequences on classifier performance. 8.4. Hybrid optical correlator The correlator used to gather the data for this paper was a hybrid system which used a 320 by 220 Epson LCTV for the input transducer, a 128 by 128 Semetex SIGHT-MOD for the filter transducer and a Burle CCD camera as the detector. Due to its greater speed capability and slightly closer pixel spacing, the Semetex device was the natural choice for the filter transducer. The LCTV permitted gray level images to be input to the correlator, and the M O D implemented BPOFs against which the inputs were correlated. A PC framegrabber digitized the resulting correlation patterns with seven bit resolution, implying pixel values ranged from 0 to 127. Since the peak value in each correlation pattern was sent to the classifier, responses were discrete


integer values; therefore, the signal space was finite and discrete. The optical design ramifications of using a hybrid correlator are discussed in detail in reference (31). Suffering from fluctuations in laser power, changes in polarization, and seismic disturbances, etc., our optical correlator is dynamic, making it impossible to get completely accurate estimates of distribution parameters. It was observed that the correlator typically remained in a fairly stable state for a few days or so; however, soon after this period of time, the correlator's behavior would invariably change and classifier performance would degrade, though usually not significantly. Calibrating over several correlator states caused both expected and observed classifier performance to decrease slightly since the calibration data was not matched to a single correlator state; however, the decrease in performance was almost negligible, usually less than 1~o. More importantly though, calibrating over several correlator states made it much easier to adjust and/or realign a correlator, and still be able to tune it in for good performance. In fact, we intentionally calibrated a tree over several correlator settings by slightly adjusting polarizer alignment, moving the position of the filter device over small distances, and then grabbing several hundred calibration samples for each correlator state. After obtaining over 5000 samples for every TI at every node, the system was much less sensitive to changes in correlator states. In fact, we were able to disassemble the correlator, move the laser, and then reassemble the correlator without adversely affecting classifier performance. 8.5. Terrain board performance Figure 12 depicts the geometry of the terrain board that was used to test the classifiers with true-class objects and with similar false-class objects. Figure 13 shows the set of six objects which were used to test classifier ability to reject similar objects which were not a part of the training set, i.e. they were used to test false alarm rates. Note that two of the images, the F16A and the F4, are almost identical to the T38A, while the F9F8 is very similar to the BI of Fig. 1l. Within a ring, where the radial distance from the center of the board falls within a narrow range of values, the distortions due to aspect and scale are theoretically less than a pixel wide; thus, the number of scale and aspect distorted images that need to be included in the T1 set is dictated by the number of rings that occupy the board. For the present system geometry, nine rings are necessary to tolerate the anticipated range of scale and aspect distortions. In the process that produces correlator inputs, the low-resolution camera identifies areas of interest by locating regions of high contrast on the approximate scale size of airplanes, and cues them for closer inspection by the high-resolution camera, Next, a digitally implemented segmentor attempts to isolate a potential airplane in the high resolution image using a k-nearest neighbor algorithm on samples of what are presumably


G.W. CARHARTet al.

4, F16A






Fig. 13. False-class source images. object and background vectors, t2°~ If the segmentor produces an image which is within the known range of object scales the system is designed to recognize, it rotates the image to a standard position by finding the longest axis, and then sends it to the optical processing laboratory for identification. For objects placed near a corner of the terrain board, the depression angle, fl in Fig. 12, is maximum and equal to about 23 deg; consequently, the object is foreshortened by the cosine of 23° ( ~ 0.92) along the line of sight. One major problem with this set-up is that the segmentor is very sensitive to shadows, which vary, depending on light intensity and subject location. Additionally, the segmentor is very sensitive to its choice of object and background vectors, which also varies, depending on the view angle of the high-resolution camera. Because BPOFs are higlily sensitive to edge inf o r m a t i o n - - a fact suggested by a BPOF's impulse response, which is just two edge enhanced versions of the original reference templatet~--variability in segmentor behavior, especially along the edges, can significantly degrade classifier performance. Compounding the problem of obtaining robust classifier performance from terrain board images is the difficulty in aligning images so that they lie within the rotational-invariant (RI) span of the filters in the filter bank. Locating the longest axis of an airplane image is relatively simple if the image is clean; however, if the image is corrupted with clutter, e.g. by airway stripes or other small objects, the longest axis search algorithm frequently misses the true longest axis by a few degrees. Consequently, during

data collection runs, it was not uncommon to receive images rotated slightly outside the RI span of the filters. To overcome this problem, images which were clearly outside the RI span of the filters were rejected before being processed by the optical system. Unfortunately, many images with questionable rotational offsets were processed, with the result that classifier performance suffered. Tables 1 and 2 list terrain board performance data for the CB and QU classifiers, respectively, on a classby-class basis. Obtaining images from the terrain board is a time consuming operation because the handshaking involved in sending segmented images across an ethernet connection to the optics laboratory is relatively slow; moreover, compiling performance statistics cannot be automated because a human must validate classifier decisions. Therefore, only 233 tests were performed on true-class images and 84 on falseclass images for each classifier. To obtain these test run tables, the correlator picked a TI at random, classified it via a tree search using the CB classifier, classified it via a tree search using the QU classifier, and then repeated the entire process. Every other image, the classifier order was reversed (QU first, CB second) since there is evidence that the filter device has a weak memory, allowing it to represent filters for the second tree search with more pixels in the correct state and thereby giving an unfair advantage to the second classifier. It is natural to wonder why both classifiers were not allowed to operate on the same tree search data simultaneously to provide a direct corn-

Optical pattern recognition using Bayesian classification


Table 1. CB terrain board performance for tree search decisions Class Overall T38A X29A MK6 Mig21 B1 TU26 RA5C Mig23

# Of tests

# Correct

# Aborted

233 24 56 28 24 28 28 20 25

189 24 51 23 23 16 17 15 20

43 0 4 5 1 12 11 5 5

# Missed Performance 1 0 1 0 0 0 0 0 0

81.16 100.0 91.07 82.14 95.83 57.14 60.71 75.0 80.0

Table 2. QU terrain board performance for tree search decisions Class Overall T38A X29A MK6 Mig21 B1 TU26 RA5C Mig23

# Of tests #Correct 233 24 56 28 24 28 28 20 25

187 23 50 25 16 17 18 18 20

parison between the two. Unfortunately, there is no simple way to handle cases where the two classifiers decide on a different path down the tree, so separate tree searches had to be used. Note that CB and QU performance is almost identical, with the CB classifier just edging out the QU classifier in the number of correct decisions. In both tables, the # aborted column refers to the number of times the classifier aborted its tree search due to signals occurring at locations of low density, even though a true-class image was present on the input device. The # missed column refers to misclassifications, i.e. the classifier asserted high confidence in a decision that was wrong. It is interesting that the QU classifier was more apt to make such a mistake--it had nine misidentifications, compared to only one for the CB classifier. The large difference in misidentifications between the two classifiers indicates that the abort threshold for the CB classifier was too high, relative to the abort threshold for the QU classifier. Because of this, the abort threshold for the CB classifier was lowered while the abort threshold for the QU classifier was raised to obtain the results in the next section--performance of computer generated non-TIs. In the last column, performance is simply the ratio of the number of correct assignments to the total number of inputs presented. It is surprising that the TU26 exhibited such poor performance since it has a rather distinctive silhouette; however, this can be attributed, in part, to the testing of several TU26 images whose rotational offsets were questionable. Finally, false alarm rates, although not listed in the tables, were the same, about 2.4~o, for both classifiers. To obtain these false alarm rates, the objects of Fig. 13




37 1 5 2 8 8 7 2 4

9 0 1 1 0 3 3 0 1

80.26 95.83 89.29 89.29 66.67 60.7t 64.29 90.0 75.0

were placed randomly, at several different locations, on the terrain board. In total, 84 false-class images were tested, and both classifiers twice identified a falseclass image, the F9F8, as a B1 bomber. It is interesting that no false alarms occurred with the F16A or the F4, since both of these objects bear a strong resemblance to the T38A (see Figs 11 and 13). It is important to note that about 3/4 of the total number of images used to generate Tables 1 and 2 were obtained from models which had been painted white so that they would show up better against the mostly black, green, and brown background of the terrain board. Using white models makes it more likely that the segmentor will produce a crisp image without edge distortion. It was felt that painting the models white was justified because the segmentor had a tendency to shrink the borders of input images, often producing segmented images which laid outside the scale-invariant span of the filters. Since about 1/4 of the images were of objects not painted white, the performances in Tables 1 and 2 are obviously a little higher for painted objects, somewhere in the middle to upper 80 percentile. For non-painted objects, the performance is somewhere near the mid 70 percentile. To provide a fair test of false alarm rates, painted objects were used roughly 75~o of the time for false-class images as well. 8.6. Performance of computer-generated non-Tls As the terrain board data gathering process cannot be automated, it is difficult to obtain statistically significant, non-TI performance results with terrain board images. Fortunately, it is fairly easy to generate nonTIs from TIs by mathematically simulating the geo-



Table 3. Computer-generated non-Tl performance for the CB classifier Class Overall T38A X29A MK6 Mig21 S3A BI TU26 RA5C Mig23

# Of tests #Correct 7646 874 846 884 817 855 877 840 821 832

6775 831 780 830 761 828 820 674 662 589




856 043 66 54 56 27 56 163 152 239

15 0 0 0 0 0 1 3 7 4

88.61 95.08 92.20 93.89 93.15 96.84 93.50 80.24 80.63 70.80

Table 4. Computer-generated non-TI performance for the QU classifier Class Overall T38A X29A MK6 Mig21 S3A B1 TU26 RA5C Mig23

# Of tests #Correct 7646 874 846 884 817 855 877 840 821 832

6568 802 739 803 724 809 795 667 649 580

metric distortions imparted to TIs by the terrain board's imaging system. Such an approach lends itself to an automated data collection process, and has the added advantage of giving the designer control over the distortion range the TIs are subjected to, making it possible to guarantee that only images which are mapped into the distortion-invariant span of the filters are tested while compiling performance statistics. Tables 3 and 4 are analogous to Tables 1 and 2, respectively, except they pertain to computer-generated non-TIs and involve thousands of tests, instead of hundreds. To obtain these results, only distortions which were mapped into the TI set were tolerated--inplane rotation angle was allowed to vary randomly between + 2.5 deg, and depression angle was allowed to vary randomly from zero to 25 deg. Both scale and foreshortening distortions were accounted for to simulate changes in the depression angle. To simulate segmentor edge distortion variability, binary TIs (target pixels at gray-level 127, background pixels at zero) were mathematically subjected to the geometric distortions described above to yield images that had many edge pixels with values other than 127 (due to interpolation in the distortion software). Rebinarizing these images with different thresholds produced edge variability--high thresholds produced less distorted images than low thresholds did. For the results below, the binarization threshold was allowed to vary randomly between 115 and 127. It should be pointed out that the results were stronoly dependent on binarization level. Permitting the binarization threshold range to vary only between 125 and 127




1077 72 107 81 93 46 82 172 172 252

1 0 0 0 0 0 0 1 0 0

85.90 91.76 87.35 90.84 88.62 94.62 90.65 79.40 79.04 69.71

resulted in significantly better performance; conversely, permitting the threshold range to vary between 50 and 127 resulted in significantly lower performance. What this seems to indicate is that classifier performance is much more strongly affected by segmentor variability than it is by geometric distortion when dealing with non-TI images. As was the case with all previous tests, the CB classifier's overall performance was slightly better than the QU's--88.61% as opposed to 85.90%. However, it is worth noting that the disparity between CB performance and QU performance more than doubled, relative to what it was for the terrain board results. Of course, changing the abort thresholds between the two data runs played a major role in producing this disparity. Also, as was expected, the increase in hit rate performance was partially offset by an increase in the CB classifier misclassification rate--the CB classifier misclassified 15 times, compared to only once for the QU classifier. To compare CB and QU results directly, the abort conditions should be identical. However, this is not possible because the CB classifier aborts when the signal density of a composite pdfis below an abort threshold, whereas the QU classifier bases its aborts on the density values of pdfs associated with a single TI. In a paper accepted for publication in Applied Optics, ~24~ we presented the distribution model from which the CB classifier is derived and evaluated the classifier's performance, both theoretically and experimentally, for the TI set. The TI data were collected for the express purpose of validating the composite model,

Optical pattern recognition using Bayesian classification and it was found to agree very closely with theoretically expected performance derived from calibration data. Not surprisingly, TI performance was considerably better than the non-TI performance presented h e r e - overall performance exceeded 97% with a theoretical and experimental discrepancy of less than one tenth

of 1%. 9. CONCLUSIONS Recognition systems based on mathematical correlation can be put to practical use. Relevant problems require many views of an object to characterize its class, however, so a system intended to recognize many objects will likely have a very large training set. In principle, a classifier which obtains data from a complete representation of each TI is the most reliable way to solve this problem. Unfortunately, a filter bank with each TI represented by one filter will lead to a slow feature extractor and very high dimensional signal v e c t o r s - - b o t h of which are undesired conditions. Each of these problems is caused by requiring too many correlations for each decision cycle, so some reduction of the filter bank size through the use of composite filters is necessary. Composite filters will, of course, minimize the number of required correlations, but there is a significant trade-off involved in such a filter bank shrinking procedure. Namely, as the number of Tls a filter represents increases, its ability to reject falseclass inputs decreases. A way out of this conundrum is provided by a tree filter bank which, after each composite filter says "yes", provides progressively more specific (less composite) filters which must concur for a positive identification to be produced. False alarm rates from a tree search system can thus be manipulated almost arbitrarily by choosing specificity (compositeness) of bottom level filters. Since composite filters will be required for most problems, whether a tree architecture is used or not, a robust way to classify the signals produced by composite filters is essential. We could reliably classify these signals with a union version of the traditional quadratic classifier, but we would not actually be using optimal partitions if we did. Although the performance improvement to be expected is slight, if we wish to classify these signals by likelihood we must construct distributions for each object class by forming a renormalized superposition of the simple multivariate distributions associated with the TIs that characterize the object. The optimal way to partition these distributions is to generate discriminant functions based on likelihood by using Bayes' likelihood ratio test on such composite pdfs.


1. D. Psaltis, E. G. Pack and S. S. Venkatesh, Optical image correlation with a binary spatial light modulator, Opt. Enyng 23, 698-704 (1984). 2, J. M. Florence and R. O. Gale, Coherent optical corre-


lator using a deformable mirror device spatial light modulator in the Fourier plane, Appl. Optics 27, 2091-2093 (1988). 3. M. W. Farn and J. W. Goodman, Optimal binary phaseonly matched filters, Appl. Optic 27, 4431-4437 (1988). 4. F. M. Dickey, T, K. Stalker and J. M. Mason, Bandwidth considerations for binary phase-only filters, Appl. Optics 27, 3811-3818 (1988). 5. D. U Flannery, J. S. Loomis and M. E. Milkovich, Design elements of binary phase-only correlation filters, Appl. Optics 27, 4231-4235 (1988). 6. B. A. Kast, M. K. Giles, S. D. Lindell and D. L. Flannery, Implementation of ternary phase amplitude filters using a magneto-optic spatial light modulator, Appl. Optics 28, 1044-1046 (1989). 7. J. Downie, B. Hine and M. Reid, Effects and correction of magneto-optic spatial light modulator phase errors in an optical correlator, AppL Optics 31, 636-643 (1992). 8. L.J. Hornbeck, 128 × 128 deformable mirror device, IEEE Trans. El. Dev. ED-30(5), 539-545 (1983). 9. D. A. Gregory, R. D. Juday, J. Sampsell, R. Gale and S, E, Monroe, Optical characteristics of a deformable mirror spatial light modulator, Optics Lett. 13, 10-12 (1988). 10. R. M. Boysel, J. M. Florence and W. Wu, Deformable mirror light modulators for image processing, Proc. SPIE 1151, 183-194 (1990). 11. J.M. Florence and R.D. Juday, Full complex spatial filtering with a phase mostly DMD, Proc. SPIE 1558, 487-498 (1991). 12. N. Clark, Using liquid crystal devices as input and filter SLMs, Proc. SPIE 1704, 237 247 (1991). 13. D. A. Gregory, J. C. Kirsch and E. C. Tam, Full complex modulation using liquid-crystal televisions, Appl. Optics 31, 163-165 (1992). 14. F. Hester and D. Casasent, Multivariant technique for multi-class pattern recognition, Appl. Optics 19, 17581761 (1980). 15. D. Jared and D. Ennis, Learned pattern recognition using synthetic discriminant functions, Proc. SPIE 638, 91 - 101 (1986). 16. Z. Bahri and B. V. K. Vijaya Kumar, Generalized synthetic discriminant functions, J. Opt. Soc. Am. A 5, 562571 (1988). 17. A. Mahalanobis, B. V. K. Vijaya Kumar and D. Casasent, Minimum average correlation energy filters, Appl. Optics 26, 3633-3640 (1987). 18. D. Jared and D. Ennis, Inclusion of filter modulation in synthetic discriminant function construction, Appl. Optics 28, 232-239 (1989). 19. R. Shiavi, Introduction to Applied Statistical Signal Analysis. Aksen Associates, Boston (1991). 20, D. Casasent and W. Chang, Correlation synthetic discriminant functions, Appl. Optics 25, 2343-2350 (1986). 21. S. I. Sudharsanan, A. Mahalanobis and M. K. Sundareshan, Selection of optimum output correlation values in synthetic discriminant function design, J. Opt. Soc. Am. A 7, 611-616 (1990). 22. T. R. Walsh, J. E. Cravatt, B. A. Kast and M. K. Giles, Time-sequenced rotation invariant optical correlator for multiple target recognition, Proc. SPIE 1098, 240-252 (1989). 23. T. Walsh and M. K. Giles, Statistical filtering of timesequenced peak correlation responses for distortion invariant recognition of multiple input objects, Opt. Enqn.q 29, 1052-1064 (1990). 24. B. F. Draayer, G. W. Carhart and M. K. Giles, Optimum classification of correlation plane data via Bayesian decision theory, submitted for publication, Appl. Optics. 25. C, W. Therrien, Decision Estimation and Classit~cation. Wiley, New York (1989). 26. D.J. Hand, Discrimination and Classification. Wiley, Chichester (1981). 27. D. Flannery, J. Loomis and M. Milkovich, Transform-


G.W. CARHARTet al.

ratio ternary p h a s e - a m p l i t u d e filter f o r m u l a t i o n for improved correlation discrimination, Appl. Optics 27, 4079-4083 (1988). 28. B. V. K. Vijaya Kumar, Tutorial survey of composite filter designs for optical correlators, Appl. Optics 31, 47734801 (1992). 29. P. Billings and T. Giles, Multisensor image analysis system demonstration, Proc. SPIE 1701, 83-91 (1992).

30. A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd edn. McGraw-Hill, New York (1984). 31. B. F. Draayer, A Bayesian classifier of composite distributions in a correlation-based optical recognition system, Doctoral Dissertation, New Mexico State University (1993).

About the Author--GARY W. CARHARTbegan his academic career in 1973 at the University of New Mexico. He acquired extensive experience as both a hardware and software engineer during four years at the Air Force Weapons Laboratory, Albuquerque, New Mexico. Since 1988 he has been associated with New Mexico State University where he has become an integral part of NMSU's Electro-Optic Research Laboratory, performing extensive research in the field of correlation-based pattern recognition.

About the Author--BRET F. DRAAYERhas a B.S. in electrical engineering (1986) from the University of Idaho and an M.S, in electrical engineering (1989) from New Mexico State University. He is presently working toward his Ph.D. in electrical engineering at New Mexico State University. His experience includes one year with the Atmospheric Sciences Laboratory at White Sands Missile Range, New Mexico, and one summer with the Naval Ocean Systems Center in Konheohe, Hawaii. Currently, he is a college instructor at New Mexico State University.

About the Author--MICHAEL K. GILES, Professor of Electrical Engineering, earned B.E.S. and M.S. degrees in electrical engineering from Brigham Young University (1971) and a Ph.D. in optical sciences from the Optical Sciences Center, University of Arizona (1976). His experience in electro-optics includes six years with the Naval Weapons Center, China Lake, four years with White Sands Missile Range, and two years with the Air Force Weapons Laboratory, Kirtland Air Force Base. Since 1982 he has performed research and taught courses in electro-optics at New Mexico State University in Las Cruces, New Mexico.