
ULTRASONIC IMAGING 5, 55-70 (1983)

BREAST TISSUE CLASSIFICATION USING DIAGNOSTIC ULTRASOUND AND PATTERN RECOGNITION TECHNIQUES: I. METHODS OF PATTERN RECOGNITION

Steven Finette,1,3 Alan Bleier,1,2 and William Swindell1,2

1Department of Radiology, Arizona Health Sciences Center, University of Arizona, Tucson, AZ 85724
2Optical Sciences Center, University of Arizona, Tucson, AZ 85721

This paper discusses the application of statistical pattern recognition techniques to problems in diagnostic ultrasound. Using our own system as an example, we describe the concepts and specific methods that we have applied to a problem involving the computer-aided classification of breast tissue in vivo. Topics include feature generation, feature selection and classification, as well as a method which estimates the probability of error in classifying future data. An accompanying paper applies these methods to the classification of backscattered RF signals from normal and diseased breast tissue.

Key words: Breast neoplasms or diseases; classification; pattern recognition; ultrasonics.

1. Introduction

This is the first of two articles that deal with the principles and application of pattern recognition to problems in diagnostic ultrasound. Using our own system design as an example, we summarize the methods used in applying pattern recognition to the ultrasonic characterization of normal and diseased tissue. In the second paper [1], referred to as II, these methods are applied to the classification of human breast tissue in vivo. The term "pattern recognition" encompasses a number of mathematical techniques that are used to classify data into two or more groups. For example, one may wish to classify data obtained from breast neoplasms into two groups: benign and malignant. A further classification of benign neoplasms into fibrocystic disease and fibroadenoma could also be considered.

In a simple system it is possible to classify objects by means of template matching. Square objects can be sorted from round objects by a simple comparison of each object with a template that matches one of the known classes. In our case we are trying to classify tissues into disease groups by studying the ultrasonic A-scan signal that is backscattered from the tissues of interest. When dealing with complicated and highly variable objects such as clinically derived ultrasound signals, it becomes hard, if not impossible, to define a single, satisfactory template with which to classify the data. There is simply too much variability among individuals, and even among scans from a given individual, for a simple scheme like this to work, and we must resort to more complicated methods.

3Present address: Department of Electrical Engineering, Rutgers University, Piscataway, NJ 08854.

0161-7346/83 $3.00 Copyright © 1983 by Academic Press, Inc. All rights of reproduction in any form reserved.

Pattern recognition is also an exercise in information reduction. Consider a data set (from a given patient) composed of twelve A-scans, each of 2048 sample points digitized to 8 bits. Approximately 2x10^5 bits of information are presented. The required output in the simplest case could be represented by just one bit, for example 1 = malignant, 0 = benign. Our task then is to compute an output number (of low dimensionality) that represents the tissue class using as input the 2x10^5 bits contained in the raw A-scan data. The classification scheme must be optimum in the sense that the errors associated with the classification are as small as possible. The probable magnitude of the errors should also be known. Several steps can be identified in the overall process: (1) For each tissue class of interest, samples of A-scan data are obtained. Since the class associated with each scan is known, these samples are called labelled (the labels may be determined from biopsy results). These samples of data will be used eventually to formulate and test the various classification schemes. (2) The first stage of data reduction is to calculate features from the sample data scans. Features are quantitative measures of information that condense the dimensionality of the initial data. For example, a histogram of amplitudes of a scan could be constructed. With (say) ten bins, the original 16,384 bits per A-scan can be described by 10 (bins) x 11 (max bits per bin) = 110 bits. Each A-scan in this example has yielded 10 features, the contents of the individual bins. At this stage it is not known which features will be useful for classifying purposes, so a large number (several hundred in our case) are computed. This step is called feature generation, and it may be preceded by some preprocessing of the raw data. (3) Statistical tests are applied to the computed features to determine which of them have the greatest potential for discriminating between disease classes.
This step is called feature selection. It represents another stage of dimensionality reduction. (4) A classification rule is created and applied to the previously selected labelled features. An estimate can then be made of the future performance of the rule when it is applied to unlabelled data. The block diagram in figure 1 illustrates these steps as they are incorporated in our system. Several researchers (see for example [2-8]) have applied variations of this approach to problems in tissue characterization. They have classified a number of pathologies in various organs including liver, kidney, spleen, prostate and breast. There are several reasons to suppose that this approach to diagnostic ultrasound will eventually prove to be beneficial: (1) If a high enough sensitivity and specificity of classification can be achieved, this will enable automated mass screening systems to be realized. (2) Diagnostic ultrasound presently relies on the interpretation of images by the human visual system. Information that could potentially be of diagnostic value may not actually be appreciated by the radiologist a) because it is 'lost' in the system in the processes of forming the video signal, or b) because it is encoded in certain statistical properties of the signal that are not visually appreciated by the human observer. Julesz [9] has demonstrated that certain statistical properties of images can be varied over wide limits with no difference being appreciated by eye although, of course, the differences are immediately evident using computation. Since the acoustical properties of biological tissue are well described by stochastic processes, it seems more than likely that different tissues will leave their classification imprint on the statistical properties of the ultrasound signal. Such features may not be discernable to the human interpreter, but they should be available for exploitation by computer analysis.
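Step (2) above can be sketched in code. The following is an illustrative Python/NumPy example of histogram feature generation, with simulated integer amplitudes standing in for real RF data; the array sizes follow the example in the text, and none of this is the authors' software:

```python
import numpy as np

rng = np.random.default_rng(0)

# Twelve simulated A-scans, 2048 samples each, digitized to 8 bits
# (signed amplitudes in [-128, 127]) -- a stand-in for real RF data.
scans = rng.integers(-128, 128, size=(12, 2048))

def histogram_features(scan, n_bins=10):
    """Condense one A-scan into n_bins histogram counts (step 2)."""
    counts, _ = np.histogram(scan, bins=n_bins, range=(-128, 127))
    return counts

features = np.array([histogram_features(s) for s in scans])
# Each of the 12 scans is now described by 10 features instead of
# 2048 samples; the bin counts for one scan sum to the sample count.
```

The 12 x 10 feature matrix is the input to the feature selection stage; in the authors' system several hundred candidate features per scan are generated at this point, not just ten.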

Fig. 1. Flowchart illustrating the general steps involved in our pattern recognition system. (Blocks shown include collection and digitization of RF data, A-scan editing, and classification.)

2. Concepts in Statistical Pattern Recognition

2.1 Feature generation

For our work a pattern is defined as the set of numbers obtained by digitizing an A-scan.

A feature can be any well-defined mathematical function determined from a pattern. For example, suppose the distribution of voltage levels in the RF waveform contains useful information for classification purposes. We could use as features each voltage level in the signal, or the values of the probability distribution g(a) of voltage levels. However, a more efficient approach would be to characterize the distribution by its moments and use the variance, skew and kurtosis as features. Since hundreds of sample patterns are involved in designing the classification rule, this data reduction represents an economical means of compressing information from the raw data into possibly useful subsets. We use labelled features to determine a decision rule which separates A-scan data into different disease classes. A label is an association between a feature value and the class to which it belongs. The label is determined by methods that are independent of those used in the pattern recognition system. For example, we use biopsy results to associate the label "benign" or "malignant" with all feature values derived from patterns that represent data from the biopsied region.
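The moment-based features mentioned above can be sketched as follows. This is an illustrative Python example, not the authors' code; the synthetic Gaussian waveform is an assumption standing in for a digitized RF signal:

```python
import numpy as np

def moment_features(waveform):
    """Characterize the amplitude distribution of an RF waveform by
    its variance, skew, and kurtosis (three features per pattern)."""
    x = np.asarray(waveform, dtype=float)
    mu = x.mean()
    m2 = ((x - mu) ** 2).mean()          # variance (2nd central moment)
    m3 = ((x - mu) ** 3).mean()
    m4 = ((x - mu) ** 4).mean()
    skew = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2              # equals 3 for a normal distribution
    return np.array([m2, skew, kurtosis])

rng = np.random.default_rng(1)
f = moment_features(rng.normal(0.0, 2.0, size=4096))
# For zero-mean Gaussian noise: variance near 4, skew near 0, kurtosis near 3.
```

Three numbers now summarize the whole amplitude distribution, which is the data-reduction point made in the text.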

Each type of feature, k = 1, ..., D, is interpreted as a component of a D-dimensional feature vector

\bar{x}^n = (x_1^n, x_2^n, \ldots, x_D^n)^t

located in a D-dimensional feature space, where n corresponds to the nth feature vector associated with class \omega_j. Here 1 \le j \le J, where J is the number of classes, and n = 1, \ldots, N_{\omega_j}, where N_{\omega_j} is the number of sample feature vectors in class \omega_j. These feature vectors are treated as multivariate random variables with probability distributions P(\bar{x}^n|\omega_j) in the feature space. These distributions summarize our knowledge of a feature's ability to discriminate between two or more tissue classes.

It is useful to distinguish between intraset and interset features [10]. Intraset features are those which characterize properties common to all members of a given class. Those intraset features which contain no information that permits discrimination between classes may be ignored. An example of an ignorable intraset feature would be the mean value of the digitized A-scan, since this would always be zero regardless of disease classification. On the other hand, interset features have values that permit differentiation between the tissue classes under study.

In some applications, the interset features may be determined from the physical properties of the biological system. For example, it is known that the linear attenuation coefficient for soft tissues increases with increasing frequency within the diagnostic range of frequencies. If we assume that diseased and normal tissue have different attenuative properties, then a feature such as the ratio of total energy in a diagnostic frequency band obtained at two different tissue depths may be a discriminating feature. Other potentially useful measures are related to absorption, velocity, scattering, tissue impedance, etc. Unfortunately, the physical interaction of ultrasound with tissue is complicated, and useful "physically motivated" features are often difficult to define and/or measure. We do know, however, that sonogram "texture" contains important diagnostic information, and we have emphasized a heuristic approach to feature generation by choosing features which are likely to measure important statistical properties of sonogram texture. These features are described in II.

2.2 Feature selection

Those interset features that discriminate best between different tissue classes are selected with statistical tests. This phase of the processing is termed feature selection, and it results in a small subset of "information-rich" features that are then used to design a classification rule. This selection procedure reduces the dimensionality of the feature space by discarding features that have negligible power of discrimination.

In general, feature selection is the problem of selecting the most discriminating subset of D features from a set of L features. It may be solved exactly by considering all possible combinations of features, but the cost is prohibitive for even small D and L. Much effort has gone into determining efficient suboptimal search procedures [11]. Our choice of feature selection algorithms is discussed in the next section. Dimensionality reduction has important consequences when attempting to determine classification error rates. By rejecting useless features, dimensionality reduction prevents overspecialization of the decision rule to the available data. Foley [12] showed that if the ratio of the number of samples in class \omega_j to dimensionality, N_{\omega_j}/D, is small, one can derive a decision rule to discriminate perfectly between any two sets of labelled samples, even when the sets have identical probability distributions. This rule would work well on the data used to derive it, but taking more samples from the same feature distribution and applying that decision rule would obviously be invalid. Computer simulations showed that N_{\omega_j}/D must be at least 3 to 5 to produce decision rules that will perform with acceptable error rates on future test samples. In practice, many investigators have used ratios as high as 10 to 20 (see also Young and Calvert [13], Meisel [14]).

We emphasize the difference between feature generation and feature selection. Feature generation is the process that provides many features as possible candidates for a classification scheme. Such features may or may not contain discriminatory information. Feature selection is the process of selecting, from this larger set, a subset that is useful in distinguishing between different disease states. This discriminatory information is conveniently summarized in a decision rule. The process of designing such a rule is now discussed.

2.3 Classification

Once a discriminating set of features has been selected, they can be used to define a decision rule. New (unlabelled) A-scans can then be classified directly by application of the rule after the appropriate features are generated. The selection process, which was needed to identify the interset features for use in the decision rule, is no longer needed as an explicit part of the classification process. We start in the reduced feature space that is defined by the selected features. The space is partitioned into regions, each of which (ideally) encloses feature vectors from only one of the classes. The decision rules are stated as inequalities associated with the boundaries separating these classes. In figure 2, the plane f(x_1, x_2, x_3) = 0 divides a three-dimensional feature space into two disjoint regions containing feature vectors associated with classes \omega_1 and \omega_2. This plane is also called a decision (or discriminant) function, and it may be calculated in a number of ways. The position and orientation of the plane are determined by estimating the parameters of a linear equation using representative labelled feature vectors from both classes and a suitable optimization rule. If f(x_1^n, x_2^n, x_3^n) > 0, then the nth unlabelled feature vector is associated with class \omega_1, and so on. If unlabelled feature vector p is such that f(x_1^p, x_2^p, x_3^p) = 0, then it can be associated with either group or left unclassified. This figure depicts an idealized case where there is no overlap between the class distributions, but this is rarely the case in practice. The process of defining the decision rule using labelled data is also referred to as supervised learning. The particular functional form chosen for a decision rule depends on a number of factors such as system complexity, economic constraints and the penalty for making errors of various kinds. The classification algorithms used in our system are discussed in Section 3.2.

2.4 Probability of error for unlabelled samples

The usefulness of a particular rule is determined by its ability to accurately discriminate among different pathologies in tissue samples not used in the design of the rule. Often, the amount of data available for designing and testing a classification rule is limited. Therefore, a question of considerable importance is how to estimate the reliability, or equivalently the probability of error P_e, for a given classification rule when it is applied to new, unlabelled feature vectors. A number of approaches have been suggested [15]. Since the underlying distributions of the features are usually unknown, we have taken a nonparametric approach to the solution of this problem.

Fig. 2. Illustration of a linear decision function f(x_1, x_2, x_3) = 0 shown as a plane in a three-dimensional feature space. The plane partitions the space into two regions, completely separating feature vectors associated with classes \omega_1 and \omega_2 in this ideal case.

We define P_e as the probability of misclassification on unlabelled feature vectors after the parameters of the classifier have been estimated on a separate labelled data set. Clearly, if two decision rules were designed using two different labelled sample sets from the same tissue population, the reliabilities of the two rules would differ because of sampling considerations. Therefore, we are faced with the problem of how to partition the given sample data between the design and test phases in order to obtain a good estimate of the probability of error. Consider the extreme case where the same data that are used to determine the coefficients of a decision rule are used to estimate P_e. Under this constraint the estimate is biased, and the direction of the bias is to indicate that the classifier has greater reliability than actually exists for discriminating between the true disease classes. This is intuitively clear, since a particular decision rule is "optimal" for the particular sample used to estimate its parameters. With a large data base an optimal partitioning of the groups can be derived theoretically [16], but in our work we are dealing with relatively small data sets and these theoretical results do not apply.

Fig. 3. The various names for the elements of the confusion matrix associated with a two-way classification scheme:

  Q11  true positive fraction; sensitivity; power of the test; 1 - \beta
  Q12  false negative fraction; Type II error (error of the second kind); \beta
  Q21  false positive fraction; Type I error (error of the first kind); significance level; \alpha
  Q22  true negative fraction; specificity; 1 - \alpha

  Accuracy = (1/2)(Q11 + Q22)
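The quantities named in figure 3 can be illustrated with a short sketch. This is a hypothetical Python example (labels 1 = malignant, 0 = benign, made-up decisions), not the authors' software:

```python
def confusion_fractions(true_labels, decisions):
    """Row-normalized 2x2 confusion matrix Qij: the probability that the
    classifier decides Hj given that Hi is actually true (1 = malignant)."""
    n_mal = sum(1 for t in true_labels if t == 1)
    n_ben = len(true_labels) - n_mal
    q11 = sum(1 for t, d in zip(true_labels, decisions)
              if t == 1 and d == 1) / n_mal          # true positive fraction
    q22 = sum(1 for t, d in zip(true_labels, decisions)
              if t == 0 and d == 0) / n_ben          # true negative fraction
    # constraints: Q11 + Q12 = 1 (Q12 = false negatives),
    #              Q21 + Q22 = 1 (Q21 = false positives)
    sensitivity, specificity = q11, q22
    accuracy = 0.5 * (q11 + q22)                     # as defined in figure 3
    return sensitivity, specificity, accuracy

true = [1, 1, 1, 1, 0, 0, 0, 0]      # biopsy-derived labels (hypothetical)
pred = [1, 1, 1, 0, 0, 0, 1, 0]      # classifier decisions (hypothetical)
sens, spec, acc = confusion_fractions(true, pred)
```

Here one malignant case is missed and one benign case is flagged, so sensitivity, specificity, and accuracy all come out to 0.75.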

Therefore, we use an alternative approach [17] to estimate P_e. The technique is known as the \pi, or rotation, method; it calculates the mean value E{P_e} of the estimate of P_e on future performance, over a distribution of training sets. The algorithm is described in Section 3.3. In general, classifier performance can be expressed in terms of a matrix of conditional probabilities Q_ij, often called a confusion matrix. The quantity Q_ij denotes the probability that the classifier decides in favor of hypothesis H_j given that H_i is actually true. For the extreme case of perfect discrimination, the confusion matrix reduces to the identity matrix. Let H_1 be the hypothesis that a feature vector calculated from a given A-scan is associated with malignant tissue. We will test this hypothesis against the alternative hypothesis that the feature vector is associated with either a lumped benign-normal data set or with only benign disease, depending on the classification scheme. For two-group discrimination the confusion matrix is 2x2, where Q11 is defined as the true positive fraction (sensitivity) and Q22 is the true negative fraction (specificity). Similarly, we define Q12 and Q21 as the false negative and false positive fractions, respectively. The Q_ij are subject to the constraints Q11 + Q12 = 1 and Q21 + Q22 = 1. In other contexts the Q_ij have different names; see figure 3.

3. Selection and Classification Methods for Breast Tissue Characterization

We discuss here the feature selection and classification methods that we have used in our system to classify breast tissue pathology from backscattered RF signals. They have proven their usefulness over a number of years in other problems [18]; we have implemented them in Fortran on a minicomputer. Since the preprocessing and feature generation algorithms used in breast tissue classification are more problem-specific than the selection and classification procedures, they are discussed in II along with the experimental results.

Fig. 4. Flowchart of feature selection algorithms. Three branches are shown which lead to discriminatory feature subsets. Different subsets may be obtained because each branch uses different assumptions or criteria for feature selection. (Blocks shown include: Kruskal-Wallis test; ambiguity function and receiver operating characteristic combined into a figure of merit, minimizing the average correlation; forward/backward stepwise selection minimizing Wilks' lambda; principal components analysis; and the resulting feature subsets.)

3.1 Feature selection algorithms

As mentioned in the previous section, practical considerations force us to use sub-optimal methods for determining an "information-rich" subset of features. We have incorporated the procedures diagrammed in figure 4, which illustrates three ways of obtaining a useful feature subset. In our case, several hundred features from each tissue class are initially determined, and in order to have a large ratio of sample size to feature space dimensionality, this number must be reduced.

One criterion a feature should possess for discrimination is a statistically significant difference in mean value for each class distribution. Therefore, all feature distributions are first tested using the Kruskal-Wallis nonparametric rank-order statistic, which considers the null hypothesis that two or more feature distributions have identical means. It is a procedure for testing the equality of means when the experimenter wishes to avoid the assumption that the features were selected from normal distributions [19]. Features determined from benign, malignant, or normal tissue data which reject the null hypothesis at a high confidence level are retained for further consideration.

We state the statistic for the two-class discrimination problem [20]. Let n_{\omega_j} be the number of feature values obtained from an arbitrary tissue class. All measurements are ranked for that feature, and

\bar{R}_{\omega_j} = \frac{1}{n_{\omega_j}} \sum_{i=1}^{n_{\omega_j}} R_{\omega_j}(i)

is the mean rank for each tissue class. The quantity \bar{N} = (n_{\omega_1} + n_{\omega_2} + 1)/2 is the mean rank that each tissue class would have if the ranks were randomly distributed between the classes. The Kruskal-Wallis statistic H, which is Chi-square distributed, is then given by

H = \frac{12}{(n_{\omega_1} + n_{\omega_2})(n_{\omega_1} + n_{\omega_2} + 1)} \sum_{j=1}^{2} n_{\omega_j} \left( \bar{R}_{\omega_j} - \bar{N} \right)^2    (1)
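A minimal sketch of the two-class statistic of eq. (1), assuming untied ranks; this illustrative Python version is not the authors' Fortran implementation:

```python
import numpy as np

def kruskal_wallis_h(class1, class2):
    """Two-class Kruskal-Wallis statistic, eq. (1): rank all feature
    values jointly, then measure how far each class's mean rank lies
    from the overall mean rank (n1 + n2 + 1)/2.  Assumes no ties."""
    values = np.concatenate([class1, class2])
    n1, n2 = len(class1), len(class2)
    n = n1 + n2
    ranks = np.argsort(np.argsort(values)) + 1   # 1-based ranks (no ties)
    mean_rank = (n + 1) / 2.0
    r1 = ranks[:n1].mean()                       # mean rank of class 1
    r2 = ranks[n1:].mean()                       # mean rank of class 2
    return (12.0 / (n * (n + 1))) * (n1 * (r1 - mean_rank) ** 2
                                     + n2 * (r2 - mean_rank) ** 2)

# Fully separated feature distributions give a large H (reject equal means);
# heavily interleaved ones give a small H.
h_sep = kruskal_wallis_h(np.arange(10.0), np.arange(10.0) + 100.0)
h_mix = kruskal_wallis_h(np.array([1.0, 3.0, 5.0, 7.0]),
                         np.array([2.0, 4.0, 6.0, 8.0]))
```

A production version would need the tie correction that standard implementations (e.g. scipy.stats.kruskal) apply.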

While differences in the means between classes are necessary, they may not be sufficient for reliable discrimination, because the dispersion about the means may be large and blur the distinction between the classes. Therefore, two additional measures that involve the overlap between the feature distributions are used [18,21]. These are the receiver operating characteristic (ROC) and the ambiguity function. An ROC curve describes the compromises one can make between true positives and false positives as a decision threshold is varied, indicating all possible combinations of the relative frequencies of the various kinds of correct and incorrect decisions [22]. A typical ROC curve is illustrated in figure 5. The general situation is shown in figure 5(a), where a given feature has overlapping distributions for benign and malignant classes. Depending on the positioning of a decision threshold, both false positive (\alpha) and false negative (\beta) errors may occur. The ROC curve is generated by moving the threshold over the entire range of both distributions (see Fig. 5(a)) and plotting the resulting values of (1 - \beta) versus \alpha (see Fig. 5(b)). The greater the area between the curve and the diagonal, the better the discrimination between the two classes. This area, d', is used as the measure of discrimination.

The ambiguity function is another measure of class overlap, generalized from concepts in information theory. It is given by the expression

A = - \sum_{A_i} \sum_{\omega_j} P(A_i) \, P(\omega_j \mid A_i) \log_J P(\omega_j \mid A_i)    (2)

where P(A_i) is the probability that a feature has a value in an interval A_i, and P(\omega_j \mid A_i) is the probability that a feature value in interval A_i belongs to class \omega_j. The logarithm base, J, corresponds to the number of classes.
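Both overlap measures can be sketched numerically. The following Python code is illustrative only: it estimates d' by sweeping a threshold, and the two-class ambiguity of eq. (2) from binned counts; the synthetic Gaussian feature samples are assumptions, not tissue data:

```python
import numpy as np

def roc_d_prime(benign, malignant, n_thresh=200):
    """Area between the ROC curve and the diagonal: sweep a decision
    threshold, plot (1 - beta) against alpha, integrate, subtract 1/2."""
    lo = min(benign.min(), malignant.min())
    hi = max(benign.max(), malignant.max())
    ts = np.linspace(lo, hi, n_thresh)
    alpha = np.array([(benign > t).mean() for t in ts])     # false positives
    power = np.array([(malignant > t).mean() for t in ts])  # 1 - beta
    # trapezoidal area under the ROC (alpha decreases as t increases)
    area = float(np.sum((alpha[:-1] - alpha[1:]) * (power[:-1] + power[1:]) / 2.0))
    return area - 0.5          # 0 = no discrimination, 0.5 = full

def ambiguity(benign, malignant, n_bins=10):
    """Eq. (2): class ambiguity of a feature over n_bins intervals,
    using base-2 logarithms for the two-class case (J = 2)."""
    lo = min(benign.min(), malignant.min())
    hi = max(benign.max(), malignant.max())
    cb, _ = np.histogram(benign, bins=n_bins, range=(lo, hi))
    cm, _ = np.histogram(malignant, bins=n_bins, range=(lo, hi))
    total = cb.sum() + cm.sum()
    a = 0.0
    for b, m in zip(cb, cm):
        if b + m == 0:
            continue
        p_int = (b + m) / total                  # P(A_i)
        for p in (b / (b + m), m / (b + m)):     # P(w_j | A_i)
            if p > 0:
                a -= p_int * p * np.log2(p)
    return a   # 0 = no overlap, 1 = complete overlap

rng = np.random.default_rng(2)
ben = rng.normal(0.0, 1.0, 2000)
mal = rng.normal(5.0, 1.0, 2000)   # well separated: d' near 0.5, A near 0
d = roc_d_prime(ben, mal)
a = ambiguity(ben, mal)
```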

Fig. 5. Relationship between the receiver operating characteristic d' and the probability distributions of a single feature for two-class discrimination. (a) General situation with overlap between two distributions, benign and malignant; as the decision threshold moves, the fractions \alpha (false positive), 1 - \alpha (true negative), \beta (false negative), and 1 - \beta (true positive) change. For clarity the malignant distribution is shown inverted. (b) d' is determined by moving the threshold through the range of both distributions shown in (a) and plotting 1 - \beta against \alpha. d' is the area between the curve and the diagonal; it varies between 0 for no discrimination and 0.5 for full discrimination.

When there is no overlap between the class distributions, A = 0 and perfect discrimination occurs, while for the case of complete overlap, A = 1 and the feature provides no discrimination.

These two measures are applied to individual features without regard to correlations between two or more features. Features which are highly correlated provide redundant information. In order to account for possible correlations, the ROC d' values and the ambiguity values are combined into an intermediate merit function given by \xi = [(1 - A) + 2d']/2. The feature with the largest value of \xi is then selected as the single best discriminator. Other features are chosen from those with large values of \xi subject to the constraint that the averaged correlation, \bar{\lambda}, between the new feature and the already selected feature set is small. The figure of merit function S_F, defined by

S_F = \frac{(1 - A) + 2d' + (1 - \bar{\lambda})}{3}    (3)

is computed for every unselected feature, and the feature with highest S_F is chosen. Then \bar{\lambda} and S_F are computed again for the unselected features, and the process is repeated until all features are ranked. This processing results in a feature subset that ranks the remaining features according to their ability to discriminate, and the top few members could be used to design a classification rule.

Another method of decoupling (decorrelating) the features is to use principal component analysis to form new features from linear combinations of the original set [23]. The approach is equivalent to finding lower dimensional vectors which may represent the original feature vectors with the least mean square error. The basis vectors of the transformed space are chosen to be eigenvectors associated with the largest eigenvalues of the covariance matrix formed by the original feature vectors. These eigenvectors are used to form the linear combinations. Features in the original space that have marginal ability to discriminate may have their capacity to discriminate enhanced by this change of basis. However, the method does not necessarily preserve the distinction between classes. Thus, the new features, which are linear combinations of the original features, must be recycled through the feature selection routines described above.

While the selection methods discussed so far are independent of the underlying probability distributions of the features, forward/backward stepwise selection assumes normality, though some deviation from this constraint may be tolerated [24]. An advantage of stepwise selection is that we have the provision of eliminating a feature already chosen at a previous iteration if it shares discriminatory information with other features chosen on subsequent steps. Tests of significance are applied at each step before adding or deleting a feature. Several significance tests are available. We use Wilks' lambda statistic, \Lambda, the multivariate analog of the univariate F ratio. It is defined as the ratio of two determinants:

\Lambda = \frac{|W|}{|A + W|}    (4)

The pooled within-groups covariance matrix, W, is defined by

W = \sum_{j=1}^{J} \sum_{i=1}^{N_{\omega_j}} (\bar{x}_i - \bar{x}_{\omega_j})(\bar{x}_i - \bar{x}_{\omega_j})^t    (5)

where J is the number of classes, N_{\omega_j} is the number of feature vectors for class \omega_j, \bar{x}_i is a feature vector, and \bar{x}_{\omega_j} is the centroid or mean feature vector for class \omega_j. The among-groups covariance matrix, A, is given by

A = \sum_{j=1}^{J} N_{\omega_j} (\bar{x}_{\omega_j} - \bar{u})(\bar{x}_{\omega_j} - \bar{u})^t    (6)

where \bar{u} is the grand centroid.

Features are chosen sequentially such that \Lambda is minimized. From the definition, \Lambda is a statistic which accounts for both the differences between groups and the cohesiveness (clustering) about the group centroids. Note that a feature which increases clustering without changing the separation between the centroids might be selected over a feature which increases separation without changing the cohesiveness, and vice versa.
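The greedy merit-function ranking described above can be sketched as follows. This is an illustrative Python version, assuming the per-feature d' and ambiguity values have already been computed; the example data are hypothetical:

```python
import numpy as np

def greedy_merit_ranking(X, d_prime, ambiguity):
    """Rank features by the figure of merit of eq. (3).  The first pick
    maximizes xi = ((1 - A) + 2 d') / 2; each later pick maximizes
    S_F = ((1 - A) + 2 d' + (1 - lam)) / 3, where lam is the mean
    absolute correlation with the features already selected.
    X: (n_samples, n_features) feature values, used for correlations."""
    n_feat = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    base = (1.0 - ambiguity) + 2.0 * d_prime
    order = [int(np.argmax(base / 2.0))]          # single best discriminator
    remaining = set(range(n_feat)) - set(order)
    while remaining:
        best, best_sf = None, -np.inf
        for f in remaining:
            lam = corr[f, order].mean()           # redundancy penalty
            sf = (base[f] + (1.0 - lam)) / 3.0
            if sf > best_sf:
                best, best_sf = f, sf
        order.append(best)
        remaining.discard(best)
    return order   # features ranked by discriminating ability

# Hypothetical example: feature 1 discriminates best; feature 2 is a
# near-copy of it, so the decorrelation term pushes feature 0 ahead of it.
rng = np.random.default_rng(3)
f1 = rng.normal(size=200)
X = np.column_stack([rng.normal(size=200), f1,
                     f1 + 0.01 * rng.normal(size=200)])
rank = greedy_merit_ranking(X,
                            d_prime=np.array([0.30, 0.45, 0.44]),
                            ambiguity=np.array([0.40, 0.10, 0.12]))
```

In this toy case the ranking comes out [1, 0, 2]: the redundant near-copy of the best feature is demoted below a weaker but uncorrelated one, which is exactly the behavior the (1 - \bar{\lambda}) term is meant to produce.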

3.2 Classification algorithms

Two decision rules have been used in our pattern recognition system. The first to be discussed is a Bayes classifier. This rule represents the optimal measure of performance from a statistical point of view since, when the feature distributions are known, it minimizes the total expected loss (or risk) with respect to all incorrect decisions [10]. In this context, the terms loss and risk are measures of the importance of making an incorrect classification. A Bayes decision rule requires knowledge of the functional form of the densities of the feature vectors for each tissue class. It assumes the vectors follow multivariate normal distributions P(\bar{x}^n|\omega_j), where the mean vectors may differ for each class and the covariance matrices are identical [25].

For the two-class problem, the algorithm uses estimates of the mean feature vectors and covariance matrix to determine the likelihood ratio \ell_{12} given by

\ell_{12} = \frac{P(\bar{x}^n \mid \omega_1)}{P(\bar{x}^n \mid \omega_2)}    (7)

where

P(\bar{x}^n \mid \omega_j) \propto \exp\left[ -\tfrac{1}{2} (\bar{x}^n - \bar{u}_{\omega_j})^t C^{-1} (\bar{x}^n - \bar{u}_{\omega_j}) \right]    (8)

The quantity C^{-1} is the inverse of the covariance matrix, and \bar{u}_{\omega_j} is the mean feature vector for class \omega_j. Let P(\omega_j) be the a priori probability for class \omega_j. The maximum likelihood decision rule assigns a feature vector to class \omega_1 if \ell_{12} > P(\omega_2)/P(\omega_1), or to class \omega_2 if \ell_{12} < P(\omega_2)/P(\omega_1). An arbitrary decision is made if \ell_{12} = P(\omega_2)/P(\omega_1).
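The maximum likelihood rule of eqs. (7)-(8) can be sketched as follows; the 2-D class means and identity covariance are hypothetical, and this is not the authors' implementation:

```python
import numpy as np

def ml_classify(x, mean1, mean2, cov, prior1=0.5, prior2=0.5):
    """Two-class maximum likelihood rule of eq. (7): with equal
    covariance matrices, assign x to class 1 if the likelihood ratio
    l12 = P(x|w1)/P(x|w2) exceeds P(w2)/P(w1), else to class 2.
    Works in log space to avoid underflow."""
    c_inv = np.linalg.inv(cov)

    def log_density(mu):      # log of eq. (8), dropping common factors
        d = x - mu
        return -0.5 * d @ c_inv @ d

    log_l12 = log_density(mean1) - log_density(mean2)
    return 1 if log_l12 > np.log(prior2 / prior1) else 2

# Hypothetical 2-D feature space: class means at (0,0) and (3,3),
# shared identity covariance, equal priors.
m1, m2 = np.zeros(2), np.array([3.0, 3.0])
cov = np.eye(2)
label_a = ml_classify(np.array([0.5, 0.2]), m1, m2, cov)   # near m1
label_b = ml_classify(np.array([2.8, 3.1]), m1, m2, cov)   # near m2
```

With equal covariance matrices the log-likelihood ratio is linear in x, which is why this rule draws exactly the kind of planar decision boundary shown in figure 2.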

The second decision function belongs to the category of Fisher discriminant analysis [26]. The method involves finding a direction \bar{\lambda} in feature space such that vectors from the two classes projected onto \bar{\lambda} are maximally discriminated. A projection of a sample feature vector \bar{x}^n onto \bar{\lambda} is given by \bar{\lambda}^t \bar{x}^n, with mean value \bar{\lambda}^t \bar{u}_{\omega_j}. Then, for equal covariance matrices C_1 = C_2 = C, we maximize \phi, the ratio of between-group differences to within-group dispersion:

\phi = \frac{\left( \bar{\lambda}^t \bar{u}_{\omega_1} - \bar{\lambda}^t \bar{u}_{\omega_2} \right)^2}{\bar{\lambda}^t C \bar{\lambda}}    (9)

One choice of \bar{\lambda} satisfying \partial\phi/\partial\bar{\lambda} = 0 is \bar{\lambda} = C^{-1}(\bar{u}_{\omega_1} - \bar{u}_{\omega_2}). Figure 6 illustrates the Fisher classification rule for the simple case of two overlapping classes. By projecting the feature vectors onto the Fisher linear discriminant, \bar{\lambda}, the ratio \phi is maximized and minimal overlap between the classes is achieved. The procedure uses labelled samples to form estimates S, \bar{m}_1, and \bar{m}_2 of the population parameters C, \bar{u}_{\omega_1}, and \bar{u}_{\omega_2}. The assignment of an unlabelled feature vector \bar{x}^n to class \omega_1 is made if (\bar{m}_1 - \bar{m}_2)^t S^{-1} \bar{x}^n is closer to (\bar{m}_1 - \bar{m}_2)^t S^{-1} \bar{m}_1 than to (\bar{m}_1 - \bar{m}_2)^t S^{-1} \bar{m}_2, etc. With multivariate normal distributions the Fisher discriminant function is optimal in the sense that it minimizes the total expected loss, and it reduces to the maximum likelihood classifier discussed above [10].

3.3 Rotation method for estimating the probability of error

Fig. 6. Illustration of the Fisher discriminant as a classification rule for the case of equal covariance matrices in a two-class discrimination problem. The Fisher linear discriminant is that line upon which the projected feature vectors of classes \omega_1 and \omega_2 show maximum separation relative to their common dispersion. The lower the overlap (shaded areas), the higher the classification accuracy.

Only by determining the probability of error associated with a decision rule, formed using a particular feature subset, can the investigator say that one subset is better than another. However, exhaustive search is necessary for choosing the optimal subset [27]. The following algorithm is used to obtain an estimate of P_e, the probability of error for classifying unlabelled feature vectors [17].

The labelled samples are repeatedly partitioned into pairs of mutually exclusive groups called the learning and test sets. The test sets contain fewer samples than the learning sets, and the samples in each test set are chosen at random from the total number of samples. Those samples not chosen for the test set are placed in the corresponding learning set. Using the $i$th learning set, a sample classifier is designed, and its performance is evaluated using the complementary test set. This determines a proportion of errors $P_e[i] = k_i/r$, where $k_i$ is the number of misclassified feature vectors in the $i$th test set and $r$ is the fixed number of sample vectors in each test set. This process is repeated for different pairs of learning and test sets until each feature vector has been a member of a test set.
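The rotation procedure just described can be sketched as follows. This is an illustrative reconstruction, not the authors' software; the `fit` and `predict` callables are hypothetical placeholders for whatever classifier-design step is used.

```python
import numpy as np

def rotation_error_estimate(X, y, r, n_rounds, fit, predict, rng=None):
    """Rotation estimate of the error probability P_e.

    Repeatedly holds out a random test set of fixed size r, designs a sample
    classifier on the complementary learning set, and averages the per-round
    error proportions P_e[i] = k_i / r.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    props = []
    for _ in range(n_rounds):
        test_idx = rng.choice(n, size=r, replace=False)   # random test set
        learn = np.setdiff1d(np.arange(n), test_idx)      # complementary learning set
        model = fit(X[learn], y[learn])
        k_i = sum(predict(model, x) != t
                  for x, t in zip(X[test_idx], y[test_idx]))
        props.append(k_i / r)                             # P_e[i] = k_i / r
    return float(np.mean(props))                          # average over all rounds
```

In practice the rounds would continue until every labelled sample has appeared in some test set, as the text specifies; a fixed `n_rounds` is used here only to keep the sketch short.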

The resulting estimate of $P_e$ is then given by

$$\hat{P}_e = \frac{1}{n}\sum_{i=1}^{n} P_e[i], \qquad (10)$$

where $n$ is the number of learning/test-set pairs. As the number of feature samples approaches infinity, $E\{\hat{P}_e\}$ approaches $P_e$. Confidence intervals for this mean value are derived using the approach of Highleyman [16].

4.  Summary

The method of computer-aided pattern recognition was discussed with reference to a specific ultrasound analysis system. This system will be used in II for the classification of breast tissue in vivo.

It should be noted that the system is only as good as its initial feature set, and if poor discrimination occurs, the original feature set may need modification. If no useful interset features are calculated during feature generation, no amount of data reduction or optimization will yield good classification. Since the original feature space has nothing optimal about it, but was arrived at by a combination of intuition, problem knowledge, etc., there is no guarantee that the minimum possible error rate will have been achieved for a particular application [11]. A number of feature selection and classification procedures have been described. The flexibility apparent in this analysis software reflects the fact that machine pattern recognition is very much a heuristic procedure in which considerable choice exists in selecting the specific components of the system, and some fine tuning may be necessary in order to achieve the most accurate classification.

Acknowledgments

This work was supported in part by BRSG grant S07 RR07002 awarded by the Biomedical Research Support Grant Program, Division of Research Resources, N.I.H. Major funding for this work was provided by a grant from the National Institutes of Health (NIH R01 CA24257). The authors wish to thank Peter H. Bartels for supplying much of the analysis software and for reading the manuscript. We also acknowledge the expertise of Debbie Spargur, who typed the manuscript ad nauseam.

REFERENCES

[1]  Finette, S., Bleier, A.R., and Swindell, W., Breast tissue classification using diagnostic ultrasound and pattern recognition techniques: II. experimental results, Ultrasonic Imaging 5, 71-86 (1983).

[2]  Good, M.S., Rose, J.L., and Goldberg, B.B., Application of pattern recognition techniques to breast cancer detection: ultrasonic analysis of 100 pathologically confirmed tissue areas, Ultrasonic Imaging 4, 378-396 (1982).

[3]  Greenleaf, J.F. and Bahn, R.C., Clinical imaging with transmissive ultrasonic computerized tomography, IEEE Trans. Biomed. Eng. BME-28, 177-185 (1981).

[4]  Lerski, R.A., Barnett, E., Morley, P., Mills, P.R., Watkinson, G., and MacSween, R.N.M., Computer analysis of ultrasonic signals in diffuse liver disease, Ultrasound Med. & Biol. 5, 341-350 (1979).

[5]  Lerski, R.A., Smith, M.J., Morley, P., Barnett, E., Mills, P.R., Watkinson, G., and MacSween, R.N.M., Discriminant analysis of ultrasonic texture data in diffuse alcoholic liver disease. I. fatty liver and cirrhosis, Ultrasonic Imaging 3, 164-172 (1981).

[6]  Preston, Jr., K., Czerwinski, M.J., Skolnik, M.L., and Leb, D.E., Recent Developments in Obtaining Histopathological Information from Ultrasound Tissue Signatures, in Ultrasonic Tissue Characterization II, M. Linzer, ed., NBS Spec. Publ. 525, pp. 303-313 (U.S. Government Printing Office, Washington, DC, 1979).

[7]  Reinhard, D.K., Gift, D.A., Harris, G.I., and De Sosta, C.J., Limitations of tissue differentiation by spectral measures of backscattered ultrasound, Ultrasonic Imaging 3, 108-112 (1981).

[8]  von Seelen, W., Gaca, A., Lock, E., Scheiding, W., and Wessels, G., Recognition of Patterns in Ultrasonic Sectional Pictures of the Prostate for Tumor Diagnosis, in Ultrasonic Tissue Characterization II, M. Linzer, ed., NBS Spec. Publ. 525, pp. 297-302 (U.S. Government Printing Office, Washington, DC, 1979).

[9]  Julesz, B., Experiments in the visual perception of texture, Sci. Am. 232, 39-43 (1975).

[10] Tou, J.T. and Gonzalez, R.C., Pattern Recognition Principles (Addison-Wesley, Ontario, 1977).

[11] Kanal, L., Patterns in pattern recognition: 1968-1974, IEEE Trans. Inform. Theory IT-20, 697-722 (1974).

[12] Foley, D., Considerations of sample and feature size, IEEE Trans. Inform. Theory IT-18, 618-626 (1972).

[13] Young, T.Y. and Calvert, T.W., Classification, Estimation and Pattern Recognition (American Elsevier Publishing Co., New York, 1974).

[14] Meisel, W.S., Computer-Oriented Approaches to Pattern Recognition (Academic Press, New York, 1972).

[15] Toussaint, G.T., Bibliography on estimation of misclassification, IEEE Trans. Inform. Theory IT-20, 472-479 (1974).

[16] Highleyman, W.H., The design and analysis of pattern recognition experiments, Bell Syst. Tech. J. 41, 723-744 (1962).

[17] Toussaint, G.T. and Sharpe, P.M., An efficient method for estimating the probability of misclassification applied to a problem in medical diagnosis, Comput. Biol. Med. 4, 269-278 (1975).

[18] Bartels, P.H. and Olson, G.B., The Design and Analysis of Lymphocyte Images, in Methods of Cell Separation, Vol. 3, N. Catsimpoolas, ed., pp. 1-99 (Plenum Press, New York, 1980).

[19] Walpole, R.E. and Myers, R.H., Probability and Statistics for Engineers and Scientists, 2nd ed. (MacMillan Publishing Co., New York).

[20] Bartels, P.H., Numerical evaluation of cytologic data: III. selection of features for discrimination, Anal. Quant. Cytol. 1, 153-159 (1979).

[21] Genchi, H. and Mori, K., Evaluation and feature extraction on automated pattern recognition system, Denki Tsushin Gakkai, Part 1 (in Japanese), quoted in Tanaka, N., Ikeda, H., Ueno, T., Watanabe, Imasato, Y., and Kashida, R., Fundamental Study of an Automated Cytoscreening System Utilizing the Pattern Recognition System. I. Feature Evaluation in the Pattern Recognition System, in The Automation of Uterine Cancer Cytology, G.L. Wied, G.F. Bahr, and P.H. Bartels, eds., pp. 223-227 (Tutorials of Cytology, Chicago, 1976).

[22] Metz, C.E., Basic principles of ROC analysis, Seminars in Nucl. Med. VIII, 283-298 (1978).

[23] Mucciardi, A.N. and Gose, E.E., A comparison of seven techniques for choosing subsets of pattern recognition properties, IEEE Trans. Comput. C-20, 1023-1031 (1971).

[24] Klecka, W.R., Discriminant Analysis, Series: Quantitative Applications in the Social Sciences, J.L. Sullivan, ed. (Sage Publications, Beverly Hills, CA, 1980).

[25] Cooley, W.W. and Lohnes, P.R., Multivariate Data Analysis (John Wiley and Sons, New York, 1971).

[26] Lachenbruch, P.A., Discriminant Analysis (Hafner Press, New York, 1975).

[27] Sterns, S.D., On Selecting Features for Pattern Classifiers, in Proc. Third Int. Joint Conf. Pattern Recognition, pp. 71-75 (IEEE Cat. No. 76CH1140-3C, 1976).