
Use of an artificial neural network to quantitate risk of malignancy for abnormal mammograms

Richard K. Orr, MD, MPH, Spartanburg, SC

Background. The purpose of this study was to develop a simplified method for standardized categorization of patients with abnormal mammograms by incorporating quantitative risk assessment.

Methods and patients. A prospective collection of 1288 outpatient referrals to a surgeon for abnormal mammograms, 185 (14.4%) with malignancy, was studied. Artificial neural network (ANN) and logistic regression (LR) models were developed and compared with the surgeon's clinical impression. The first 490 patients were used as the training set for each model. The ANN and LR were tested on the remaining patients, who were divided into 2 consecutive groups. The main outcome measures were (1) the accuracy (receiver operating characteristic [ROC] curve analysis) of biopsy recommendations based on the surgeon's impression and created by the 2 models and (2) the percentage of cancers that were falsely categorized as benign by the surgeon or the 2 models.

Results. Although the surgeon's clinical impression showed good discrimination (area under ROC = 0.86), 13 of 708 cases (1.8%) thought to be benign by the surgeon proved to be carcinomas. The neural network (but not the LR model) was statistically superior to the surgeon's impression (ANN: ROC = 0.89, P = .004; LR: ROC = 0.86). Additionally, the computerized models were able to quantitate risk. Those patients predicted to be "benign" by the network (n = 391) had only a 0.5% risk of intraductal carcinoma and no invasive carcinoma, whereas 47% of those patients in the highest risk quartile had cancer. Both computerized models predicted a need for biopsy in 11 of the 13 lesions (85%) missed by the surgeon's impression. Each model missed only 2 cases of intraductal carcinoma in young women.

Conclusions. Computerized risk stratification models, used in routine practice, may help surgeons with decision making. The use of either model helps quantitate risk, thereby facilitating discussions with patients, and may reduce the number of "missed" cancers. (Surgery 2001;129:459-66.)

From the Department of Medical Education (Surgery), Spartanburg Regional Medical Center, Spartanburg, SC

SCREENING MAMMOGRAPHY has led to improvement in the early detection of breast cancer.1 Unfortunately, mammography is nonspecific, and most patients who undergo needle-localized breast biopsies do not have cancer. Breast biopsies have minimal morbidity, but are associated with economic and psychologic costs that may limit the overall effectiveness of breast cancer screening.2 The ease and lower cost of stereotactic core biopsy is appealing but has not reduced biopsy rates.3 Using defined criteria, several authors4,5 have suggested frequent follow-up mammograms or stereotactic core biopsy in lieu of surgical biopsy for lesions of low radiographic suspicion.

Implicit in any algorithm for the management of mammographic abnormalities is the basic concept that lesions may be segregated into specific risk groups and treatment dictated by the likelihood of cancer.6 Unfortunately, the accurate classification of mammographic abnormalities is not trivial, and there are often significant variations between radiologists viewing the same mammograms.7 Artificial neural networks (ANNs) have been used to facilitate such predictions8,9 and appear to perform better than experienced radiologists. Unfortunately, the ANNs in the radiology literature are not practical for busy clinicians, as they require specific evaluation of many (8 to 43) clinical and radiographic features for subsequent incorporation in the ANN. Logistic regression (LR)10 models have advantages over ANNs, but have not been applied to mammographic interpretation.


Table I. Composition of training and 2 test sets

Group      N     Mass (%)   Density (%)   Microcalcifications (%)   Asymmetry (%)   Malignant (%)
Training   490   58.0       11.2          22.2                      8.6             15.3
Test 1     454   52.1       13.9          28.3                      5.7             14.8
Test 2     344   39.1       22.1          36.0                      2.9             11.7

LRs may be created with standard statistical software packages and generate interpretable coefficients and odds ratios, thus avoiding the "black box" effect of neural networks.11

The current study is an extension of the literature on neural networks in mammography. It attempts to answer 2 questions: (1) Can a simplified ANN be useful, using data collected during the routine clinical practice of a general surgeon? and (2) Can a more standard LR model perform as well as the ANN? Both models are compared with the surgeon's clinical impression and with ANN performance estimates from the literature.

PATIENTS AND METHODS

Patients. During a 6-year period, an attempt was made to prospectively collect all mammographic referrals to 1 general surgeon, as an ongoing outcomes study. A total of 1288 patients, aged 17 to 92 years (mean, 56 years), were included in this study, which represented approximately 95% of all such referrals to the surgeon during this period (at the Fallon Clinic, Worcester, Mass). Patients were included in the current study if no lesion was palpable to the surgeon, patient, or referring physician, but were excluded if ultrasonography had established the unequivocal presence of a simple cyst accounting for the mammographic abnormality. The surgeon personally examined all mammograms and reviewed the official mammographic report. Individual consultation was obtained with a mammography expert when appropriate. The radiologists were not constrained in dictating their reports, each preferring a slightly different "freeform" style. None of the reports used the Breast Imaging Reporting and Data System nomenclature. Recommendations for biopsy or follow-up were made on the basis of published criteria12-14 and after discussion with the patient and her family. Patients were counted as having malignancy if a biopsy showed invasive or in situ carcinoma; patients' conditions were considered benign on the basis of biopsy or mammographic follow-up. There were 617 patients (47.9%) who underwent surgical biopsy. The median follow-up was 51 months, and all patients who did not undergo biopsy were followed for at least 12 months.


The surgeon recorded all pre-biopsy data, which included the patient's age and 3 radiographic features, at the time of the initial consultation. Lesions were classified as to type of abnormality: a nonpalpable dominant mass, an area of abnormal density, microcalcifications alone, or asymmetry. They were also categorized as to stability: stable finding, new finding, increasing abnormality, decreasing abnormality, or first mammogram. Finally, each lesion was assigned a degree of suspicion on the basis of the surgeon's overall impression: benign, indeterminate, or suspicious. In all cases, the mammography reports (and occasionally direct consultation with a radiologist) were used as a guideline for the surgeon, but the final overall impression for each case rested on the surgeon's ultimate decision.

To create the models, the data were divided into a training set and 2 test sets (approximately 2 years each) in chronologic order. The training set included 490 patients, test set 1 included 454 patients, and test set 2 included 344 patients. There were 185 mammographic abnormalities that proved to be malignant (14.4%). Fifty-four patients (29.2%) had carcinoma in situ, and 131 patients (70.8%) had invasive cancer. More patients in later years have been referred for microcalcifications (Table I), and there has been a consequent decrease in the proportion of patients with malignancy.

Neural network. The ANN was implemented with a commercially available software package (Neuroshell2; Ward Systems Group, Inc, Frederick, MD). The network architecture is a standard backpropagation network with a single hidden layer. A total of 10 inputs are used. Patient age is accepted directly; degree of suspicion is coded benign = 0, indeterminate = 1, or suspicious = 2; and the other 2 variables (type of lesion, change from previous mammogram) require creating dummy variables, which is easily performed. The output of the network was coded as benign = 0 or invasive or in situ cancer = 1. The ANN was trained by using the default settings on the software package, utilizing 26 hidden neurons (in a single layer), a learning rate of 0.1, and a momentum of 0.1. The software uses an internal stopping rule, essentially training on 75% of the training set and continually testing on the remaining 25%. The ANN stops training after the mean square error of the "withheld" cases stops diminishing. Training the ANN required less than 2 minutes (using a 486 microprocessor), with a minimum error of 0.086.
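For readers who want to reproduce this kind of model with current tools, the following is a minimal sketch, not the author's NeuroShell2 setup. It uses scikit-learn's MLPClassifier as a stand-in for the backpropagation network described above (26 hidden neurons, learning rate 0.1, momentum 0.1, with early stopping on an internal 25% validation split mimicking the 75/25 stopping rule, although scikit-learn monitors validation accuracy rather than mean square error). The dummy coding shown is one plausible way to reach 10 inputs; the paper does not spell out its exact coding, and the patient rows are hypothetical.

```python
# Sketch of a comparable network, with scikit-learn standing in for
# NeuroShell2. The dummy coding below is one plausible way to arrive at
# 10 inputs: age, suspicion, 4 lesion-type indicators, 4 change dummies.
import numpy as np
from sklearn.neural_network import MLPClassifier

LESION_TYPES = ["mass", "density", "microcalcifications", "asymmetry"]
CHANGES = ["stable", "new", "increasing", "decreasing", "first"]
SUSPICION = {"benign": 0, "indeterminate": 1, "suspicious": 2}

def encode(age, suspicion, lesion_type, change):
    """Build a 10-element input vector. Age is scaled here for SGD
    stability; the paper feeds age in directly."""
    x = [age / 100.0, float(SUSPICION[suspicion])]
    x += [1.0 if lesion_type == t else 0.0 for t in LESION_TYPES]
    x += [1.0 if change == c else 0.0 for c in CHANGES[1:]]  # ref: "stable"
    return x

# Hypothetical patients: (age, impression, lesion type, change), label.
rows = [(45, "benign", "mass", "stable", 0),
        (52, "benign", "density", "new", 0),
        (49, "benign", "asymmetry", "stable", 0),
        (58, "indeterminate", "mass", "increasing", 0),
        (61, "indeterminate", "microcalcifications", "first", 1),
        (55, "indeterminate", "density", "new", 1),
        (67, "suspicious", "microcalcifications", "increasing", 1),
        (72, "suspicious", "mass", "increasing", 1)]
X = np.array([encode(a, s, t, c) for a, s, t, c, _ in rows])
y = np.array([label for *_, label in rows])  # 0 = benign, 1 = cancer

# Backpropagation net: one hidden layer of 26 neurons, learning rate 0.1,
# momentum 0.1. early_stopping with a 25% validation split mimics the
# "train on 75%, continually test on 25%" stopping rule described above.
net = MLPClassifier(hidden_layer_sizes=(26,), solver="sgd",
                    learning_rate_init=0.1, momentum=0.1,
                    early_stopping=True, validation_fraction=0.25,
                    max_iter=1000, random_state=0)
net.fit(X, y)
risk = net.predict_proba(X)[:, 1]  # crude network output, read as risk
```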

LR model. To create an LR model, it was necessary to develop a plan to select the most predictive set of variables.15 Several models were created by using various combinations of the predictive variables and various stepwise selection procedures. These candidate models were applied to the training set data, and the model with the highest area under the ROC curve was selected for further analysis. The best model used 5 variables. Age was accepted directly, without transformation. Two dummy variables were created to model the 3 impression states (benign = 00, indeterminate = 10, suspicious = 01). Two dichotomous variables were created to represent the presence or absence of microcalcifications and the presence or absence of mammographic progression. Additional models using variables to represent interactions were also tested but did not improve upon the basic model, nor did other transformations of the candidate variables. The LR model was created by using NCSS software (Dr Jerry Hintze, Kaysville, Utah).

Statistical analysis. For illustrative purposes, univariable and bivariable descriptive analyses were performed. The accuracy and discrimination of the ANN and LR models were compared with the surgeon's impression by using several metrics. Receiver operating characteristic (ROC) curves graph the true-positive fraction against the false-positive fraction16 for all possible cutoff points, thereby enabling calculation of diagnostic accuracy throughout the entire spectrum of values, and are considered a good test of a model's discrimination.17 A test with perfect discrimination would have an area under the ROC curve of 1.0, while a test with no diagnostic value (chance alone) corresponds to an area of 0.5. Trained radiologists reading test mammograms typically perform with areas ranging from 0.8 to 0.9 (Table II).18 ROC curves (and areas) were generated with NCSS software. Areas under the ROC curves were compared by using the method of Hanley and McNeil.19
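As an illustration of this analysis, here is a minimal sketch in Python. It uses scikit-learn's roc_auc_score for the areas and hand-codes the Hanley-McNeil standard error for a single AUC (their 1982 formula); note that the paper's comparisons used Hanley and McNeil's 1983 method for ROC areas derived from the same cases, which additionally corrects for the correlation between the two models' outputs (omitted here for brevity). The outcome and score vectors are hypothetical.

```python
# Sketch: ROC area for a model's continuous output, plus the Hanley-McNeil
# standard error of a single AUC. The paper's actual comparison (same-case
# ROC curves) also subtracts a correlation term, omitted here.
import numpy as np
from sklearn.metrics import roc_auc_score

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of a single AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return np.sqrt(var)

# Hypothetical data: y = 1 for cancer, scores = model's predicted risk.
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.02, 0.05, 0.01, 0.20, 0.40,
                   0.10, 0.55, 0.80, 0.03, 0.30])

auc = roc_auc_score(y, scores)
se = hanley_mcneil_se(auc, n_pos=int(y.sum()), n_neg=int((1 - y).sum()))
print(f"AUC = {auc:.2f} +/- {se:.2f}")
```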

Table II. Published neural networks

Author    Year             Variables   Total number   % Cancer   ROC (impression)   ROC (neural network)
Wu9       1993             43          193            43         0.84               0.89
Floyd18   1994             8           260            35         0.91               0.94
Lo35      1995             9           260            35         0.90               0.95
Baker8    1995             10          206            35         0.85               0.89
Lo42      1999             18          500            35         0.82               0.87
Orr       Current series   4           1288           14         0.86               0.89

RESULTS

Descriptive statistics. There were 185 patients (14.4%) who had malignancies: 8% of the 78 with asymmetry, 19% of the 196 with mammographic densities, 12% of the 654 with mammographic masses, and 16% of the 360 with microcalcifications only. Age was highly correlated with the likelihood of malignancy, but this correlation differed for varying mammographic classifications. For microcalcifications only, the youngest quintile of patients (< 49 years) had a 12% likelihood of malignancy versus a 21% likelihood for the oldest quintile (> 70 years). For those with asymmetry, density, or a mass, the youngest quintile had only a 0.5% likelihood of cancer versus 31% for the oldest quintile. Twenty percent of those patients referred to the surgeon after a first mammogram showing microcalcifications had malignancy, versus 7% of those whose first mammograms showed asymmetry, density, or a mass.

Surgeon's impression. The areas under the ROC curves for the surgeon's impression were: training set, 0.87; test set 1, 0.86; test set 2, 0.83; all cases, 0.86. The surgeon judged 708 cases to be benign; however, 13 of those were actually malignant (1.8%). Six of those lesions harbored intraductal carcinoma, and 7 showed invasive carcinoma. Table III summarizes the 13 mammograms that were incorrectly diagnosed as benign by the surgeon. Nine of the 13 patients with benign-appearing malignant lesions underwent prompt biopsy, usually because of other risk factors or patient preferences. In 4 cases, however, biopsy was performed because of a change in the mammographic appearance 6 to 36 months after the original surgical opinion. At the time of this writing, all 4 patients with delayed diagnoses remain well without evidence of recurrent cancer.

Neural network. The area under the ROC curve for the training set was 0.91.

Table III. Patients with malignancy and benign surgical impression

Age   Abnormality           Change       Network prediction (%)   Final diagnosis
45    Mass                  Stable       0                        Intraductal
54    Mass                  New          3.0                      Invasive
58    Mass                  New          2.1                      Invasive
59    Mass                  Increasing   5.4                      Invasive
66    Mass                  Increasing   8.0                      Invasive
68    Mass                  Increasing   7.9                      Invasive
73    Mass                  Increasing   9.3                      Intraductal
75    Mass                  Increasing   10.0                     Invasive
73    Mass                  Unknown      2.5                      Invasive
76    Mass                  Increasing   10.4                     Intraductal
86    Mass                  Increasing   15.2                     Intraductal
46    Microcalcifications   Increasing   0                        Intraductal
80    Microcalcifications   Increasing   3.1                      Intraductal

Table IV. Network outputs divided into quartiles (training set)

Quartile   Quartile mean output*   Total†   Number malignant   % CA‡
1          0                       141      0                  0
2          0.01                    103      3                  2.9
3          0.13                    124      13                 10.5
4          0.51                    122      60                 49.1

*Neural network crude output.
†These groupings are slightly asymmetric because there is a larger number (141) of patients with 0 network output.
‡Percentage of patients with cancer (invasive or in situ).

The network performed equally well on the initial testing set (ROC = 0.90), with slight performance degradation in the second test set (ROC = 0.84), despite the marked change in case mix and proportion of cancer observed during the 6-year period (Table I). These ANN performances were all better (and statistically significant, P < .01, for all but test set 2) than those of models created with the surgeon's impression alone.

Increasing network output was strongly correlated with increasing likelihood of malignancy, and this relationship was almost linear. When the data were broken down into quartiles, definite risk stratification was possible. Those in the lowest quartile had minimal risk of breast cancer, which increased steadily through the second, third, and fourth quartiles (Table IV). This relationship, in general, was also apparent in the 2 independent test sets (Table V). The crude ANN output ranged from 0 to 0.89 in the training set. No training set patient with a network output of 0 (n = 141) had cancer. Conversely, 91% of the patients in the training set with a network output greater than 0.75 had cancer (20/22). When the 3 sets (testing and training) were combined, only 2 of 391 patients with a network output of 0 had cancer (0.5%; both of these were carcinomas in situ). Those in the highest quartile had a high risk of cancer (139/295 = 47.1%), while the 2 intermediate groups were also well separated (quartile 2: 7/272 [2.6%]; quartile 3: 37/341 [10.9%]). Notably, the neural network would have missed only 2 of the 13 cases missed by the surgeon's impression, both intraductal carcinomas in young women (Table III). The network was able to integrate additional information that contradicted the "benign" appearance of the abnormality. For example, the mean age of the 11 patients with false-negative masses was higher (67 years) than that of the 498 women with true-negative masses (53 years). Only 1 of the 13 false-negative mammograms contained a stable lesion, as compared with 20% of the true-negative mammograms.

Logistic regression. Despite the strong correlation between the LR and ANN outputs (Pearson correlation = 0.97), discrimination by the LR was inferior to that of the ANN. Areas under the ROC curves were: training set, 0.88; test set 1, 0.89; test set 2, 0.80; all cases, 0.86. These performances were all slightly, but significantly (P < .03), worse than the ANN. From a practical standpoint, however, the LR model also quantified risk, with nearly identical results to the ANN.
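To make the quartile-based risk stratification concrete, the following is a minimal sketch of one reasonable reading of the procedure: cut points are derived from the training-set outputs, then frozen and applied unchanged to a test set, which is why the test-set "quartiles" in Table V are asymmetric. The data here are synthetic and purely illustrative.

```python
# Sketch: derive quartile cut points from training-set network outputs,
# then apply those fixed cut points to a test set (per Table V's footnote,
# which is why the test-set groups are not equal in size).
import numpy as np

rng = np.random.default_rng(0)
train_out = rng.beta(0.3, 2.0, size=490)   # hypothetical crude ANN outputs
test_out = rng.beta(0.3, 2.5, size=454)    # hypothetical test-set outputs
test_cancer = rng.random(454) < test_out   # hypothetical outcomes

cuts = np.quantile(train_out, [0.25, 0.50, 0.75])  # training-set cut points
groups = np.digitize(test_out, cuts)               # 0..3 = quartiles 1..4

for q in range(4):
    mask = groups == q
    n, k = int(mask.sum()), int(test_cancer[mask].sum())
    print(f"Quartile {q + 1}: n = {n}, cancer = {k} ({100 * k / n:.1f}%)")
```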

Table V. Prediction of malignancy in test set patients

            Test set 1                   Test set 2
Quartile*   N     Cancer   % Cancer     N     Cancer   % Cancer
1           158   2        1.3          92    0        0
2           86    2        2.3          83    2        2.4
3           111   11       9.9          106   13       12.3
4           104   53       51           69    26       37.7

*These quartiles are based on the cut points established in the training set. Thus, they are asymmetric, reflecting the case mix in the 2 test sets, each of which differs from that in the training set.

Table VI. Logistic regression model

Variable              Coefficient   P value   Odds ratio
Age                   0.03609       .006      1.04
Dummy 1               2.702         < .001    14.9
Dummy 2               4.853         < .001    128.1
Microcalcifications   -0.7052       .049      0.49
Progression           0.6078        .093      1.84
Constant              -6.426        < .001

The 2 cases of intraductal carcinoma missed by the ANN were also missed by the LR, with a model-predicted risk of 0.8% for both cases. A review of the variable odds ratios (Table VI) shows that the LR model was strongly based on the surgeon's impression (dummy variables 1 and 2) but used the other 3 variables as well. For example, a given lesion will be 1.8 times more likely to be malignant if the mammogram showed progression (with the other variables held constant).
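Because Table VI reports the full set of coefficients, the LR model is transportable: any reader can reproduce a predicted risk with the standard logistic formula. The sketch below applies that formula to the published coefficients; the example patient at the end is hypothetical.

```python
# Sketch: predicted risk from the published LR coefficients (Table VI).
# risk = 1 / (1 + exp(-(b0 + b1*age + b2*d1 + b3*d2 + b4*micro + b5*prog)))
# where (d1, d2) encode impression: benign = (0, 0), indeterminate = (1, 0),
# suspicious = (0, 1); micro and prog are 0/1 indicators.
import math

COEF = {"constant": -6.426, "age": 0.03609, "dummy1": 2.702,
        "dummy2": 4.853, "microcalcifications": -0.7052,
        "progression": 0.6078}

IMPRESSION = {"benign": (0, 0), "indeterminate": (1, 0), "suspicious": (0, 1)}

def predicted_risk(age, impression, microcalcifications, progression):
    d1, d2 = IMPRESSION[impression]
    logit = (COEF["constant"] + COEF["age"] * age
             + COEF["dummy1"] * d1 + COEF["dummy2"] * d2
             + COEF["microcalcifications"] * int(microcalcifications)
             + COEF["progression"] * int(progression))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical patient: 60 years old, indeterminate impression,
# microcalcifications present, no mammographic progression.
print(f"{predicted_risk(60, 'indeterminate', True, False):.1%}")
```

As a check against the text, these coefficients reproduce the result quoted above: a benign-impression patient aged 46 with increasing microcalcifications, or aged 45 with a stable mass, computes to a risk of about 0.8%, matching the 2 intraductal carcinomas missed by both models.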

DISCUSSION

Many patients undergo biopsy of benign mammographic abnormalities to avoid delaying a diagnosis of early cancer, as well as to avoid the extreme medicolegal consequences of such a delay.20 Several authors have shown that selective biopsy is appropriate, with avoidance of biopsy in significant numbers of patients and few missed breast cancers.21 To decrease confusion, it would be extremely helpful to have a system that helps clinicians quantify the risk of malignancy for individual patients, thereby facilitating patient discussions and management recommendations. Theoretical work by Dawes22 suggests that clinicians are adept at determining which variables best categorize a diagnosis but weaker at assigning a relative weight to each characteristic. Statistical models may combine the best features of clinicians (variable selection) with a mathematical method for assigning relative weights.

Recently, ANNs have become popular in medical diagnosis.23,24 Published surgical applications include prediction of intensive care unit resource utilization,25 prediction of the mortality risk after heart surgery,26 prediction of common bile duct stones,27 survival after trauma,28 and early outcomes after liver transplantation.29 Although ANN architectures and training algorithms vary, they share one basic function: all networks accept a set of inputs and generate corresponding outputs through a process known as vector mapping.30 ANNs are particularly attractive for diagnostic problems without a linear solution.31 In the current study, for example, it is apparent that mammographic lesions with microcalcifications follow different "rules" than other lesions. Although the risk of malignancy in microcalcifications increases with advancing patient age, the slope of this increase is less than with other lesions. Furthermore, a change in microcalcifications appears to have a different significance than a change in other lesions. Such nonlinearity may explain the success of an ANN for mammographic interpretation. The technical details of ANNs are beyond the scope of this paper and are presented in other publications.32,33

Published neural network models have performed slightly better than well-trained radiologists.8,9 These models have used many radiographic features and a few clinical features as inputs. Recently, Baker et al34 have presented a more complex model using a standard radiology lexicon, the Breast Imaging Reporting and Data System. Such a system, although elegant, is unlikely to gain favor in a busy surgeon's office. Although trained radiologists may fill out the proper data form in 5 minutes, it is likely that less frequent users would take more time to input data on each patient. The simplified model presented here performs as well as the more complex models without requiring excessive data input. Recently, radiologists have been studying simplified versions of ANNs that also predict well with only 4 features,35 similar to the ANN reported here.

This study is the first to compare ANNs with the more standard logistic regression models for mammographic categorization, showing that ANNs discriminate better than LRs. From a practical standpoint, however, the LR was equivalent in identifying those malignant cases misclassified by clinical impression. LR models offer several key advantages over ANNs. LR models are generally understandable to clinicians because the individual contribution of each predictor variable may be expressed as an interpretable odds ratio instead of the more obscure "weight" produced by ANNs. LR models also produce coefficients for each variable (Table VI) that enable transportability of the model to other users, as opposed to ANNs, which require actual possession of the trained ANN. Additionally, LR models may be created with most statistical software packages, thus removing the need to purchase additional ANN software. LR models are also easily reproducible by different users and different software, as opposed to ANNs, which are user and software dependent.

The current investigation differs somewhat from the published series. The patient population in this study included many women with probably benign lesions desiring second opinions from a surgeon. Consequently, the prevalence of cancer was 14% in this series, as compared with the more typical figure of 35% in other series based on patients undergoing biopsy. Additionally, this study used a different approach to variable selection. In contrast to other studies, only 3 radiographic variables are recorded for each patient in this model. Age is an additional important variable and requires no effort to obtain.36 A review of the network shows that the surgeon's impression of the likelihood of cancer is the critical variable that drives the network. This feature alone gives reasonable separation between groups; those classified as benign had a small risk of malignancy (1.8%), while the intermediate group had a higher risk (15.9%), and suspicious lesions had a very high risk (63%). The areas under the ROC curves for the surgeon's prediction were very similar to those reported in published series by radiologists, which cluster around 0.86. In effect, the models fine-tuned the mammographic opinion by integrating the 4 variables. This fine-tuning enabled the models to predict a need for biopsy in 11 of the 13 malignant lesions (85%) that appeared benign to the surgeon. An additional advantage is the quantitative information that the statistical models provide, which may be useful in terms of medicolegal documentation and institutional quality review.

Statistical models (and ANNs in particular) have a tendency to "memorize" patterns seen in the training set, creating models that appear valid but are not likely to accurately classify new data sets.37 To minimize the possibility of memorization by the ANN and LR models, this study used cross-validation with new test sets. In this cross-validation, the training and testing sets are totally separate, minimizing the potential for overlearning and giving a better estimate of the true predictive power of the ANN. Despite the simplicity of the ANN and LR, they performed well on 2 different data sets taken from varied time periods. Nevertheless, the performance in the second test set degraded substantially, especially for the LR. Of note, however, is that those test cases were obtained 2 to 4 years after the training set cases and reflected a marked difference in the surgeon's referral pattern (and perhaps the surgeon's developing classification skills). For example, in the training set, 12.6% of the patients had questionable microcalcifications, a very difficult group to classify. Conversely, 22.8% of patients in the most recent group (test set 2) had questionable microcalcifications.

The current project has important limitations. With modern software, creation of an ANN or LR is relatively easy and does not require computer programming skills. It is not necessary to know the mathematics or technical details of ANNs to accomplish usable results. Nevertheless, the creation of ANNs has a learning curve, and the creation of valid LRs requires adherence to methodologic rules.15 Correct selection and scaling of the input variables is often critical to the eventual success or failure of the ANN and requires trial and error to learn. Certain technical details of network construction are important and may be critical in more complex applications.38 The interested reader is referred to a recent series of well-referenced articles about medical applications of ANNs.32,33,39

The second limitation reflects the importance of 1 subjective variable (overall impression) in the network's performance. Although the author used published criteria (and the attending radiologist's written report), it is possible that this network may be very dependent on the mammogram reader. Current studies are under way to investigate this issue, but previous studies with computer-assisted abdominal pain diagnosis have shown that these computer-assisted models may not be transportable between institutions.40

The third limitation relates to statistical power. Because of the nonlinearity of the data set, theoretically, the ANN should outperform LR models. This study shows a statistical difference between the ANN and LR models but not between the LR and clinical impression. It is possible that a larger number of patients may have shown a statistical advantage to the LR, but this may be moot, as the LR demonstrably diminishes the number of missed cancers despite its weaker "statistical" performance.

A fourth limitation is that the advantage seen by the ANN over the clinician may simply relate to the larger number of potential diagnostic classes that the ANN creates. The clinical impression creates only 3 classes, whereas the ANN and LR create a new class for each nonidentical data entry. It is possible that a discerning surgeon or radiologist could achieve better results by creating 5 or 6 risk categories, but that seems unlikely to this author.

A fifth limitation relates to the length of follow-up for presumed benign lesions. Although most patients have several years of follow-up (median, 51 months), some have as few as 1 year. Although it is possible that an occasional patient will be found to have cancer, this number will represent a very small proportion of the study patients.

This study has shown that relatively simple computerized models are useful for quantitative characterization of an abnormal mammogram. Nevertheless, the actual situation with an individual patient is more complicated than simply creating a numerical risk assessment.41 Patient factors, such as family history, other risk indicators, and attitude toward even a minimal risk of delayed diagnosis of a potentially curable breast cancer, are critical. The situation is made even more complex because the biologic consequences of a 6- to 12-month delay in diagnosis are not completely clear, and such consequences may be exaggerated by the American judicial system.20 Models like these may be best used as a method for second-guessing a surgeon's (or radiologist's) recommendation for follow-up. Those patients felt to be at low risk by the surgeon and carrying a low risk (< 1%) from the model may be counseled as to the low (but not zero) possibility of malignancy. The final decision for follow-up as an alternative to biopsy is made only after a thorough discussion with the patient and her family.

The author thanks the departments of radiology at The Fallon Clinic and St Vincent Hospital (Worcester, Mass) for their expertise and assistance with this project.

REFERENCES

1. Costanza ME, D'Orsi CJ, Greene HL, Gaw VP, Karellas A, Zapka JG. Feasibility of universal screening mammography: lessons from a community intervention. Arch Intern Med 1991;151:1851-6.
2. Cyrlak D. Induced costs of low-cost screening mammography. Radiology 1988;168:661-3.
3. Klem D, Jacobs HK, Jorgensen R, Facenda LS, Baker DA, Altimari A. Stereotactic breast biopsy in a community hospital setting. Am Surg 1999;65:737-40.
4. Hamby LS, McGrath PC, Stelling CB, Baker KF, Sloan DA, Kenady DE. Management of mammographic indeterminate lesions. Am Surg 1993;59:4-8.
5. McManus V, Desautels JEL, Benediktsson H, Pasieka J, Lafreniere R. Enhancement of true-positive rates for nonpalpable carcinoma of the breast through mammographic selection. Surg Gynecol Obstet 1992;175:212-8.
6. Morrow M, Schmidt R, Cregger B, Hassett C, Cox S. Preoperative evaluation of abnormal mammograms to avoid unnecessary breast biopsies. Arch Surg 1994;129:1091-6.
7. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994;331:1492-9.
8. Baker JA, Kornguth PJ, Lo JY, Williford ME, Floyd CE Jr. Breast cancer: prediction with artificial neural network based on BI-RADS standardized lexicon. Radiology 1995;196:817-22.
9. Wu Y, Giger ML, Doi K, Vyborny CJ, Schmidt RA, Metz CE. Artificial neural networks in mammography: application to decision making in the diagnosis of breast cancer. Radiology 1993;187:81-7.
10. Harrell FE, Lee KL, Pollock BG. Regression models in clinical studies: determining relationships between predictors and response. J Natl Cancer Inst 1988;80:1198-202.
11. Hart A, Wyatt J. Evaluating black-boxes as medical decision aids: issues arising from a study of neural networks. Med Inform (Lond) 1990;15:229-36.
12. Moskowitz M. The predictive value of certain mammographic signs in screening for breast cancer. Cancer 1983;51:1007-11.
13. Sickles EA. Breast calcifications: mammographic evaluation. Radiology 1986;160:289-93.
14. Kopans DB, Swann CA, White G, McCarthy KA, Hall DA, Belmonte SJ, et al. Asymmetric breast tissue. Radiology 1989;171:639-43.
15. Hosmer DW, Lemeshow S. Applied logistic regression. New York: John Wiley; 1989.
16. Dwyer AJ. In pursuit of a piece of the ROC. Radiology 1997;202:621-5.
17. Meistrell ML. Evaluation of neural network performance by receiver operating characteristic (ROC) analysis: examples from the biotechnology domain. Comput Methods Programs Biomed 1990;32:73-80.
18. Floyd CE Jr, Lo JY, Yum AJ, Sullivan DC, Kornguth PJ. Prediction of breast cancer malignancy using an artificial neural network. Cancer 1994;74:2944-8.
19. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983;148:839-43.
20. Berlin L. The missed breast cancer: perceptions and realities. AJR Am J Roentgenol 1999;173:1161-7.
21. Sickles EA. Probably benign breast lesions: when should follow-up be recommended and what is the optimal follow-up protocol? Radiology 1999;213:11-4.
22. Dawes R. The robust beauty of improper linear models in decision making. Am Psychol 1979;34:571-82.
23. Sawyer MD. Invited commentary: artificial neural networks - an introduction. Surgery 2000;127:1-2.
24. Baxt WG. Application of artificial neural networks to clinical medicine. Lancet 1995;346:1135-8.
25. Buchman TG, Kubos KL, Seidler AJ, Siegforth MJ. A comparison of statistical and connectionist models for the prediction of chronicity in a surgical intensive care unit. Crit Care Med 1994;22:750-62.
26. Orr RK. Use of a probabilistic neural network to estimate the risk of mortality after cardiac surgery. Med Decis Making 1997;17:178-85.
27. Golub R, Cantu RC Jr, Tan M. The prediction of common bile duct stones using a neural network. J Am Coll Surg 1998;187:584-90.
28. McGonigal MD, Cole J, Schwab W, Kauder DR, Rotondo MF, Angood PB. A new approach to probability of survival scoring for trauma quality assurance. J Trauma 1993;34:863-70.
29. Doyle HR, Dvorchik I, Mitchell S, Marino IR, Ebert FH, McMichael J, et al. Predicting outcomes after liver transplantation: a connectionist approach. Ann Surg 1994;219:408-15.
30. Wasserman PD. Advanced methods in neural computing. New York: Van Nostrand Reinhold; 1993. p. 35-55.
31. Baxt WG. Complexity, chaos, and human physiology: the justification for non-linear neural computational analysis. Cancer Lett 1994;77:85-93.
32. Drew PJ, Monson JRT. Artificial neural networks. Surgery 2000;127:3-11.
33. Cross SS, Harrison RF, Kennedy RL. Introduction to neural networks. Lancet 1995;346:1075-9.
34. Baker JA, Kornguth PJ, Floyd CE Jr. Breast imaging reporting and data system standardized mammography lexicon: observer variability in lesion description. AJR Am J Roentgenol 1996;166:773-8.
35. Lo JY, Baker JA, Kornguth PJ, Floyd CE. Computer-aided diagnosis of breast cancer: artificial neural network approach for optimized merging of mammographic features. Acad Radiol 1995;2:841-50.
36. Kerlikowske K, Grady D, Barclay J, Sickles EA, Eaton A, Ernster V. Positive predictive value of screening mammography by age and family history of breast cancer. JAMA 1993;270:2444-50.
37. Astion ML, Wener MH, Thomas RG, Hunder GG, Bloch DA. Overtraining in neural networks that interpret clinical data. Clin Chem 1993;39:1998-2004.
38. Doyle HR, Parmanto B, Munro PW, Marino IR, Aldrighetti L, Doria C, et al. Building clinical classifiers using incomplete observations: a neural network ensemble for hepatoma detection in patients with cirrhosis. Methods Inf Med 1995;34:253-8.
39. Guerriere MRJ, Detsky AS. Neural networks: what are they? Ann Intern Med 1991;115:906-7.
40. Pesonen E, Ohmann C, Eskelinen M, Juhola M. Diagnosis of acute appendicitis in two databases: evaluation of different neighborhoods with an LVQ neural network. Methods Inf Med 1998;37:59-63.
41. Sterns EE. Changing emphasis in breast diagnosis: the surgeon's role in evaluating mammographic abnormalities. J Am Coll Surg 1997;184:297-302.
42. Lo JY, Baker JA, Kornguth PJ, Floyd CE. Effect of patient history data on the prediction of breast cancer from mammographic findings with artificial neural networks. Acad Radiol 1999;6:10-5.