Measuring problem solving in computer environments: current and future states☆

Computers in Human Behavior 18 (2002) 609–622

Eva L. Baker a,*, Harold F. O’Neil Jr. a,b

a University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 300 Charles E Young Drive North, Room 301, Los Angeles, CA 90095-1522, USA
b Rossier School of Education, 600 Waite Phillips Hall, University of Southern California, University Park, Los Angeles, CA 90089-0031, USA

Abstract

The first part of this article will review the status of technology and educational reform; the second will treat traditional approaches to problem-solving measurement common today in technological settings; and the third will propose how problem-solving assessment may change and, in particular, the role that authoring strategies may play in measuring problem-solving performance successfully in computer environments. We will suggest four areas that must be considered in which computers would increase the fidelity and validity of measures of complex problem solving: (1) the intentions and skills of assessment designers; (2) the range of performance that counts as problem solving; (3) the ways in which validity evidence can be sought; and (4) the degree to which the measurement produces results that generalize across tasks and contexts.

© 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Assessment; Problem solving; Educational reform; Technology

This article focuses on the measurement of performance, proficiency, or achievement in problem solving, using computer settings. Performance can mean within-course, end-of-course, or academic or on-the-job demonstrations of significant accomplishment or skills in a variety of domains including academic knowledge or applied settings. The discussion will be organized in three parts. The first part will review the status of technology and educational reform; the second will treat traditional approaches to problem-solving measurement common today in technological settings; and the third will propose how problem-solving assessment may change and, in particular, the role that authoring strategies may play in measuring problem-solving performance successfully in computer environments.

☆ The findings and opinions expressed in this report do not reflect the positions or policies of the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, or the US Department of Education, nor do they reflect the positions or policies of the Office of Naval Research.
* Corresponding author. Tel.: +1-310-206-1530; fax: +1-310-267-0152. E-mail address: [email protected] (E.L. Baker).

0747-5632/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S0747-5632(02)00019-5

E.L. Baker, H.F. O’Neil, Jr. / Computers in Human Behavior 18 (2002) 609–622

1. Technology and educational reform

Two trends in technology are certain: the cost of computer technology will continue to drop, and technology of all sorts will become easier to use (wireless, wearable, and wondrously intuitive in its design). The lure of the technological solutions will be amplified by continued pressures to improve the learning of children and adults, in both formal and less traditional settings. After decades of effort, children in the United States are not yet able to demonstrate high performance in academic achievement, either in a criterion sense, meeting national standards (Campbell, Hombo, & Mazzeo, 2000), or in a comparative sense, as judged by their middling to poor showing in international achievement studies [National Center for Education Statistics, 2001; Third International Mathematics and Science Study (TIMSS), 1997, 1998]. American children coming to school are poorer, more diverse culturally, and more linguistically variable than ever before. They are more likely to be in childcare and have two working parents. Their teachers are staying in the profession for ever shorter intervals, and the schools of the most needy (and poorest performing) students are staffed with growing proportions of less well prepared teachers. Clearly, learning technologies may be the major effective option, moving from a role that generally supports learning to one that assumes more direct responsibility for standardized content access and for instruction.

Adult settings have similar imperatives. Business and industry invest in training not only to compensate for inadequate education but to target and refine skills for particular business contexts and roles. Universities need to find mechanisms to maintain or raise their prestige and attractiveness to students and parents, to reduce the insupportable growth in postsecondary education costs, and to demonstrate their effectiveness to a competitive marketplace.
Adult education, voluntarily sought, also must fit around the busy work schedules of mature students. For adults, technology-based learning is a growth industry. It plays out within or across university campuses, in adult distance learning programs, and in distance learning programs designed by the military and by businesses. In mission-oriented education and training, of the sort central to both the military and business sectors, there are desires to shorten the period of “school-house” training (O’Neil, Drenz, & Lewis, 2000), increase effectiveness, and move personnel, as soon as possible, into productive, operational roles. Individuals themselves wish to condense the time between when they decide to train (or retrain) for a career and the day they can step into a needed role as an employee.

The confluence of increased technological access and functionality and the continuing increase in societal demand for performance places higher expectations on computer-supported learning environments, in that the limited flow of high-quality teachers and the delivery of just-in-time training and education present no other
feasible option. Yet, to date, technology interventions, with a few notable exceptions, have not been worthy of our hopes and dreams. Evidence has been presented that they teach traditional material somewhat faster and with a long amortization period, at lower cost (Fletcher, in press; Fletcher, Hawley, & Piele, 1990; Kulik, 1994). But the data are far from convincing at present, and examination of many current technology offerings is not reassuring. Dreary intellectually, predictable pedagogically, despite cuter, more active graphics, our learning systems will need massive rethinking to make them useful for the challenges facing instruction for both children and adults. One key to their ultimate utility will be the degree to which technology can be used simultaneously to teach and to measure better, more deeply and speedily, the complex tasks and propensities needed for learners to achieve and to continue to learn in a rapidly changing society. Not surprisingly, in the light of this special issue, attention is inevitably focused on problem-solving tasks measured in technological settings. Measurement is a well-known focus in precollegiate education. Policy attention continues to be directed to assessment and accountability, and performance and proficiency (Improving America’s Schools Act of 1994; No Child Left Behind Act of 2001) will continue to capture the attention of policymakers. The watchwords of reform for the last 10 or so years have been accountability and performance, and now the pressure is growing for technology-based systems to support these functions. 
Some responses have been made to support and develop new forms of assessment using technology (Baker, 1999; Chung, O’Neil, & Herl, 1999; Herl, O’Neil, Chung, & Schacter, 1999; Kim, Kolko, & Park, in press; Minstrell, 2000; Orvis, Wisher, Bonk, & Olson, in press; Stevens, Ikeda, Casillas, Palacio-Cayetano, & Clyman, 1999; vanLehn & Martin, 1998; Webb et al., 2001; Yeagley, 2001), but the field is in its infancy. Measurement will be embedded in technology, both to facilitate the design, administration, scoring, and reporting of traditional forms of instruction and to evaluate interventions that use technology to teach. The long-term impact of the integration of accountability and technology has consequences both for systems designed to teach and approaches intended to measure and report performance. Where technology is used for teaching, the present practice of developing rapid prototypes and proofs of concept will no longer suffice; evidence of effectiveness will be demanded.

2. Problem-solving measurement

We turn our attention to the broad area of problem solving, a key component of learning for children, youth, and adults. Problem solving is a family of cognitive demands that can be required in many subject areas. The term problem solving goes far beyond the application of algorithms (e.g. subtraction rules) to simple tasks. Our definition of problem solving (Baker & Mayer, 1999; O’Neil, 1999) is already an important component of educational reform efforts designed to raise the expertise of students. With access to digitized data a reality, the measurement of learning outcomes shifts from tapping students’ storage and retrieval of knowledge to their
acquisition and use of information for particular purposes. The revolution in computers by itself probably would have pushed us to this realization, but supported by cognitive approaches intended to characterize the requirements of complex problem solving, our task should become more tractable as knowledge and experience accumulate.

2.1. Current strategies for measuring problem solving in computer environments

How are computers and supporting technology at the present time used in measuring problem solving? We will consider traditional achievement or aptitude testing administered in computer settings, which emphasize efficiency goals, and we will suggest ways in which computers can increase the fidelity and validity of measures of complex problem solving. To improve the quality of measurement in this area, four things must be considered: (1) the intentions and skills of assessment designers (e.g. purposes held for the measurement); (2) the range of performance that counts as problem solving; (3) the ways in which validity evidence can be sought; and (4) the degree to which the measurement produces results that generalize across tasks and contexts.

2.2. Traditional testing and technology

An excellent early article on this general topic of testing and computerization was written by Bunderson, Inouye, and Olsen (1989) and has informed some of this analysis. Computer-based testing has been an area of long-standing development. In its earliest implementations, its major purpose was to increase the efficiency of the testing process without diminishing the validity or credibility of results. From primitive devices to simplify and then automate the scoring of multiple-choice responses, achievement testing has exploited the efficiency of the computer. For the most part, responses have been constrained (students choosing among alternatives, for example, in arithmetic problems, filling in boxes under the numerals comprising the right values).
Characterized as the first generation of computerized testing by Bunderson et al. (1989), the first implementations of computerized testing were intended to transfer paper-and-pencil test models onto the computer for administration. Bunderson et al. (1989) characterized some of the contributions of such tests, including richer item displays, some online scoring, data storage, and so on. Little effort, however, was devoted to rethinking the underlying measurement model or re-conceiving the domain of assessed tasks. In this form, computers regularize interactions between the test and the examinee. A prearranged set of tasks can be delivered to the examinee, and both the process (keystrokes and latency) and outcomes (right or wrong, or in between) of student responses can be monitored and reported. Other traditionally formatted tests are offered on computer. For instance, released items from the Math and Science literacy scale of the Third International Mathematics and Science Study (TIMSS), a test that was developed for large-scale paper administration, have been made available for computer administration.
Although the original intent of the test was to make international comparisons among carefully drawn samples from 41 countries, the availability of some of the test items for delivery in computer-supported environments has resulted in other uses. Through Web-based administration, students can take items from the test, get feedback, and see how their answers compare to those of others. The system provides item-by-item comparisons of the examinee’s responses to the US average performance and to performance in other countries (e.g. Japan; see Baker & Mitchell, in press). In this computerized setting, the purpose of the examination has shifted from the test designers’ original goal of international system monitoring to self-assessment by the learner, and the comparison of local performance (e.g. an 8th-grade mathematics class in Indiana) to the international results. To date, approximately 30,000 students have voluntarily taken these released items from the TIMSS for these purposes. Although the test focuses on problem solving, most response modes call for selection among alternatives. To take some advantage of the technology, the transformation of TIMSS into computer-administered format was augmented by a heavy use of animation graphics to heighten the examinee’s interest and to provide feedback in less traditional ways.

Computer adaptive testing (CAT), the second generation of computer testing described by Bunderson et al. (1989), administers to the examinee different test items drawn from a large pool of items in an attempt to focus the measurement on the area of the domain that brackets the individual examinee’s level of expertise. Time for test administration is reduced by reducing the number of items taken through the process of successively limiting the domain of individual examinee achievement. This is accomplished by applying algorithms designed to target the examinee’s level of performance.
Computer adaptive testing is now in widespread use in admissions and placement tests (e.g. the SAT, the Graduate Record Examination, and the Test of English as a Foreign Language). For the most part, little has occurred to advance the definition of problem solving, but the systems tend to rely on different scaling techniques, such as item response theory (IRT), that have been used in classical testing approaches. The 1999 Standards for Educational and Psychological Testing [American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA, APA, NCME), 1999] included 18 standards related to computerized test design and interpretation. Among difficulties in this area are the role of proprietary algorithms in commercial publishing and the difficulty in making that knowledge available to users. Even though different examinees receive different examination items (an idea that seems at odds with common interpretations of ‘‘standardized’’ administrations), CAT is reasonably well accepted by the public. CAT is used for testing purposes that can support the development and maintenance of the item infrastructure needed, so for the most part, its uses have been targeted primarily to high-volume purposes, admissions and selection, where many thousands of examinees participate on an annual basis. The settings of use, therefore, tend to be formalized and time-bound, and for the most part, the systems rely on multiple-choice formats.
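
The adaptive procedure described above (drawing items from a pool and successively narrowing in on the examinee's level) can be illustrated with a short sketch. It is not the algorithm of any operational test, which, as noted, is often proprietary. It assumes a one-parameter (Rasch) IRT model, selection of the unused item with maximum information at the current ability estimate, and a grid-based expected a posteriori (EAP) estimate with a standard normal prior. All function names, the item bank, and the simulated examinee are hypothetical.

```python
import math


def p_correct(theta, b):
    """Rasch (1PL) probability that an examinee of ability theta answers
    an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def estimate_theta(responses, grid=None):
    """EAP ability estimate on a fixed grid, with a standard normal prior.

    responses: list of (difficulty, correct) pairs seen so far.
    """
    if grid is None:
        grid = [g / 10.0 for g in range(-40, 41)]  # -4.0 .. 4.0
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior density
        for b, correct in responses:
            p = p_correct(t, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


def run_cat(item_bank, answer_fn, max_items=5):
    """Administer up to max_items items adaptively.

    item_bank: dict mapping item id -> difficulty (assumed pre-calibrated).
    answer_fn: callable (item_id, difficulty) -> truthy if answered correctly.
    Returns the final ability estimate and the order of administered items.
    """
    theta = 0.0                     # start at the prior mean
    remaining = dict(item_bank)     # copy, so the caller's bank is untouched
    responses, administered = [], []
    while remaining and len(administered) < max_items:
        # Pick the unused item that is most informative at the current estimate.
        item_id = max(remaining,
                      key=lambda i: item_information(theta, remaining[i]))
        b = remaining.pop(item_id)
        responses.append((b, bool(answer_fn(item_id, b))))
        administered.append(item_id)
        theta = estimate_theta(responses)
    return theta, administered


if __name__ == "__main__":
    bank = {"a": -2.0, "b": -1.0, "c": 0.0, "d": 1.0, "e": 2.0, "f": 0.5}
    # Hypothetical examinee who answers every item easier than 1.0 correctly.
    theta, order = run_cat(bank, lambda item_id, b: b < 1.0, max_items=4)
    print(order, round(theta, 2))
```

Because each examinee's path depends on earlier responses, different examinees see different items, which is the property the text notes as seeming at odds with common interpretations of "standardized" administration.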


3. Evolving approaches to problem-solving assessment

How can problem solving be measured with non-multiple-choice formats using computer technology, and how can the validity and fidelity of assessments be supported by the availability of computer services? Let us consider a typology of problem-solving tasks.

Problem solving can be a focus of computer interventions that are dedicated to a particular content area and for which appropriate solutions are known in advance (e.g. to solve chemistry problems through simulation, or to land an F-14 on a carrier deck through simulation), where the goal is to determine whether the trainee has enough specific knowledge and strategies to accomplish one or more tasks needed to meet a known standard. In these domains, although there will be context variations, the approach to problem solving is to determine the characteristics of challenging conditions (e.g. compounds with very similar properties in chemistry, or high seas for the aircraft carrier deck landing) and figure out which of a series of procedures to apply. A problem-solving task may be incredibly complex. There may be many alternative hypotheses to be tested intellectually, internally, before any particular overt procedure is put into place. The examinee may need strategies to apply in order to recover from error (e.g. if he or she selects an approach that may lead to a warning, such as imminent creation of a noxious compound, or danger to the aircraft and ship). Last, the examinee may, especially in the latter case, operate under time constraints such that a leisurely examination and selection of alternatives is not possible. Some problem solving has to have an intuitive component (continual monitoring and understanding of context and changes) and a resulting level or degree of automaticity.

A second class of problem-solving tasks may involve the application of a tool set to a broadly ranging set of topics.
In this case, examinees would be expected to demonstrate proficiency (e.g. in searching the Internet to find important and relevant information). Tools might be applied iteratively, or different tools applied sequentially, but the rules for defining the tasks and the criteria to be used for rating examinee performance would be judgmental and applied to divergent answers rather than to predictable ones.

Problem-solving tasks can take a third form, dealing with simulations and problems for which there is not a known solution, but which present, like the first case, a rapidly changing scenario, for instance, with chance or the probability of existing faults occurring as “surprises” during the examination sequence. Here the intellectual task for the learner varies and includes assimilation and incorporation of useful strategies and a running internal record of the degree to which any combination of procedures or actions is likely to optimize the outcome. As in the second example of problem solving, responses are judgmentally scored.

Problem-solving tasks may also vary in the degree to which the nature of the problem is made clear to the examinee. In a recent review of sample problems, Sawaki and Stoker (in preparation) found that the seeming complexity of difficult problems was caused by complex language forms rather than by complexity in the domain knowledge. The question when considering less well specified problems is
whether the examinee is detecting the problem using the domain of skills that are the target of education or training (e.g. mathematics, or troubleshooting), or whether the problem is made difficult unnecessarily by masking it in impenetrable language. While it is clear that some human communication is garbled, and that complex language may be an organic barrier to the understanding of the problem, it is similarly likely that difficult language structures are sometimes put in place to increase the problem’s difficulty and thereby the spread among examinee responses. This construct-irrelevant variance (see AERA, APA, NCME, 1999) undermines the potential for drawing correct inferences from examinee performance. Not only can problems be obscured or embedded in distracting settings, or presented in complex language; problems also can be provided sequentially to learners in a computerized setting. Solving the first part of a task in a particular way may lead to a conditional representation of the second part of the task. Contingent tasks may, on the one hand, approximate reality, for there are consequences of correct and incorrect paths. However, there are other ways in which sequential or contingent performance can provide useful information about competence. In simulations available on computer, examinees can also be afforded tasks that allow them to show their expertise along the entire course of the problem-solving requirement. For example, if in a paper-and-pencil setting, one infers the wrong problem, then all additional work—computations, strategies, explanations and answers—will be wrong. Despite the fact it has internal logic, it is likely that such work will receive minimal credit. In a computer environment, however, the examinee can submit the articulation of the problem and receive either permission to continue or opportunity to correct, or be provided with the appropriate problem situation. 
Following this presentation, the examinee can proceed to exhibit expertise in applying strategies and procedures to solve the problem. Similarly, if a decision is incorrect, and will result in the examinee’s inability to show any other relevant skills, the system can display the correct partial solution and allow the examinee to continue. Scoring of such complex sequences will take into account the parts incorrectly addressed, but allow the collection of information to establish the degree of partial knowledge possessed by the examinee. Such approaches will allow suboptimal performance to be addressed without necessarily re-teaching every aspect of the task. While this approach may have benefit for the examinee (i.e. learning may occur), its purpose is not the same as a branching system of instruction. The purpose of a conditional and corrected path in problem solving is to obtain the best estimate of the examinee’s performance on parts of a task as well as the total task outcome.

It is obvious that computer-based problem solving is not necessarily an individual task, although there are often clear desires to make attributions of performance to an individual for certification or performance-recording purposes. As problem-solving tasks further approximate the real life conditions of performance, team and group solutions may be called for (see Chung et al., 1999). In the measurement of performance, it is important to determine which aspects of “teamness” itself are measured, as opposed to team members’ specialized knowledge contributing to the solution. If team performance efficiency were the topic, then measures of speed,
focus, level of distraction, and so on would be measured. If the component contribution to the effort were the focus, then the contribution of every member would need to be assessed and weighted in order to determine the outcome. The good news is that many of the difficulties in measuring team performance have been reduced by sophisticated networking options and data capturing approaches. For the most part, however, this type of problem solving is under-used.

The logical combination, of course, is marrying complex problem solving, with either open or convergent answers, to computer technology. In some ways, intelligent tutors have attempted to create systems where complex tasks are presented or inferred by learners, and contingent instruction occurs as a function of the student’s responses. The creation of a model of a student’s knowledge is the basis for the sequencing of experience in the system (Mislevy, Steinberg, & Almond, 1999; vanLehn & Martin, 1998). For the most part, such systems are expensive to develop and have focused often on technical areas (Gott & Lesgold, 2000; Lesgold, 1994). The single well-known exception is the Algebra tutors by Anderson and his colleagues (Anderson, Corbett, Koedinger, & Pelletier, 1995; Milson, Matthew, & Anderson, 1990). Tutoring systems are the archetypal example of learning and assessment embedded in a content domain. It is key to understanding alternatives in the evolution of computer-supported problem-solving measurement to understand the relationship between problem solving and subject matter.

3.1. Domain-dependent and domain-independent attributes of problem solving

Both domain-independent and domain-dependent knowledge are usually essential for problem solving. Domain-dependent analyses focus on the subject matter as the source of all needed information. For example, in multiplication, the procedure of “carrying” is a domain-specific requirement for an examinee to demonstrate.
In physics, the analyses done by Minstrell (1992) and Hunt and Minstrell (1996) focused on partial understandings and misconceptions of knowledge about very specific aspects of force and motion. Much of the analyses provided as examples relevant to measurement in the recent volume from the National Research Council (Pellegrino, Chudowsky, & Glaser, 2001) focused on domain-dependent analyses of learning intended to guide measurement models. Domain-independent analyses are those that attempt to capture the general strategies that are in use across subject matters. These approaches should engender not general notions of intelligence testing, but rather the attributes of performance that could be expected to transfer over a wide task domain. One example we can think of is written composition. The entire academy teaching writing skills operates on a notion that writing is a problem-solving activity focused on communication, and that general strategies, such as rhetoric and organization, can be applied to particular writing purposes, for example, communication with an intention to persuade the reader of a certain viewpoint. The subject-matter content of the domain is of central importance, whether it is an analysis of a novel by a contemporary European writer or the performance report of an individual’s service record on the job. But the content itself is the screen on which the domain-independent features are projected.

The assumption of this approach to teaching and measuring writing performance is that performance will transfer, so that the writer of a good essay in a composition classroom situation will, with appropriate background knowledge, be able to write similarly effective pieces of work in history, science, or Shakespeare courses, as well as analytic efforts on the job. For the same reason, employers often wish to see writing samples by prospective employees; they believe such skills generalize.

Considering the history of assessment and testing to date, for the most part investment has been made in domain-dependent measures: measuring biology, astronomy, or grammar. This is logical, for it is how courses, disciplines, and academic training have been organized. But one question that must focus our attention relates to the future, not the past. How certain are we that what we teach and measure today, especially in technical settings, will be useful in the future? How much retraining will there need to be? How different will tasks be? Will features that operate in one system be largely transferred to another, although the examples may change, the displays may differ, and the goal may be slightly modified? Imagining a future where common attributes of problem solving get successively applied to different contexts, using different prior knowledge, would lead one to think about measurement of problem solving differently. Designers would want to import the common-features (domain-independent) strategies from one task to another so that they did not need a new analysis for every new task. Furthermore, examining performance with a domain-independent component could heighten the examinee’s ability to transfer his expertise to new domains, and such competency could be routinely included in validity studies of the measures.

Wenger (1987) considered alternative approaches to the design of expert systems.
This frame is comparable to our interest in describing the domain of problem solving we want to measure. Wenger analyzed knowledge-specific approaches (he termed these “bottom-up” analyses). He reported that they tend to produce more isolated, redundant, and efficient (in the short term) representations. They were truly focused on what they are about. Their downside was that these domain representations were poor in supporting modification, adaptability, or transfer to new domains. In considering domains that were constructed from general principles and rules, that is, domain-independent representations (in Wenger’s terminology, “top-down or articulated”), different properties were found. They will be at the outset more time consuming to develop; however, they will be modifiable and support transfer to other domains.

Choosing the right balance between domain-specific and domain-independent approaches to problem solving requires the consideration of a number of criteria. The first set deals with the purpose of the measure and the prospective roles of the examinee.

• If the purpose of the measure is to certify performance in a particular domain or on a particular piece of equipment, measures must use the domain and equipment, or high-fidelity simulations, sampling both instances and examples where context changes.
• If the purpose of the measure is the evaluation of a course for a specific and limited purpose, problem solving can be measured in domain-specific ways.
• If the role of the examinee is anticipated to change (for example, if the examinee will be expected to solve problems other than the explicit ones measured, or is anticipated to progress to higher levels of responsibility and broader oversight), domain-independent features of problem solving should be included and measured, and transfer ability selectively measured.
• If there is an overwhelming amount of technical knowledge, the system should include both strong domain-specific elements and domain-independent measures to support schemata needed to organize and access knowledge.

A second set of criteria deals with the degree to which the design and the development of computer-based problem-solving assessments are intended to be computer supported. Specifically, we move beyond the computer administration, scoring, and reporting of performance, to computer-supported authoring systems. One way to think about computer-supported authoring is in terms of what outputs can reasonably be expected and what “smarts” have to be in an assessment design system. In this consideration, we must be concerned with the level of expertise of the designer and the quality of the measure needed. At the low end, there are authoring systems that purport to help novice assessment designers. Vendors sell Web-based testing services, including templates designed to help people write multiple-choice tests for classroom or end-of-course use. These systems have very little independent knowledge included in them, and other than stepping individuals through pages where they are directed to fill in the blanks of already-formatted items, they offer little assistance.

Real authoring systems should have the goal to produce measures that can stand up to empirical reviews related to reliability and validity for various purposes.
These systems need to know who the users are and the level of expertise they have, both in subject matter and in measuring common tasks, such as problem solving. The systems need to provide outputs that meet the constraints of cost and give trustworthy information about performance. Investment in such systems involves the analysis of task attributes that are domain-independent and that can be transferred within a general domain (e.g. history or electronics). If this is not possible, investment should be in the design of the one and only test useful for the particular task. But a future orientation argues against that approach, and efforts are underway at the UCLA Center for Research on Evaluation, Standards, and Student Testing (CRESST) to create authoring systems for problem-solving tasks. The goal of these systems is to create tests for particular purposes (e.g. diagnosis, certification). In doing so, separate tests should become more coherent, so that, for example, approaches to measuring on-the-job performance in using a particular system could influence the approach taken to measure in-process training performance. Second, these systems can compensate for lack of knowledge by the designer. Where the designer is weak in some aspect of content knowledge, the system can

E.L. Baker, H.F. O’Neil, Jr. / Computers in Human Behavior 18 (2002) 609–622


import information and examples. Where the designer is weak on measurement, the system can offer default conditions designed to provide a safety net for test quality.

Three approaches to authoring are under exploration at the present time at CRESST. One is to provide needed background information to guide developers in solving specific problems or in understanding details of requirements. Constructed as a Web-based system, measurement and evaluation knowledge will be available to a range of users who have various purposes, such as test design or validation. This system is to serve as a precursor to the use of an automated, or partially automated, system.

The second approach is to develop some archetypal authoring templates. One of these, the Knowledge Mapper (Chung, Baker, & Cheak, 2001), allows individuals to author online assessments for use during courses and for certification and evaluation purposes. The Knowledge Mapper allows expert representations (links and nodes) to be inserted in the system and used as the basis of judging performance. Evidence for this knowledge representation approach to measure problem solving is available (Herl et al., 1999; Hsieh & O’Neil, in press; O’Neil, Wang, Chung, & Herl, 2000; Schacter, Herl, Chung, Dennis, & O’Neil, 1999).

The third approach is to develop an object-oriented system of authoring. In this, the designer would select from a menu one or more purposes of the measure to be developed, and the architecture of the problem. Architecture might involve types of cloaking of the problem to be addressed, the range of strategies to be employed, the degree of discrimination required among potential interim steps, the speed of performance, and so on. Each of these objects would control the software displays provided to the designer and would generate a set of features to be included in the test prepared for the examinee population. In this approach, certain initial questions will need to be answered:

• What objects are reliably applied to a range of subject matter?
• What features yield particular patterns of empirical results?
• How does the combination of features influence outcomes?
• How should domain knowledge be imported into the system?
• When should human review of outputs be made?
• What is the range of user characteristics for successful application?
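The object-oriented approach described above could be sketched roughly as follows. This is only an illustration in Python, not the CRESST design: the names `Purpose`, `Architecture`, and `AssessmentSpec`, and every feature string, are hypothetical, and the rule that a diagnostic purpose switches on interim-step logging is an assumed example of a default that protects test quality.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    """Purposes a designer might pick from a menu (illustrative values)."""
    DIAGNOSIS = "diagnosis"
    CERTIFICATION = "certification"
    EVALUATION = "evaluation"


@dataclass(frozen=True)
class Architecture:
    """Hypothetical problem architecture; all field names are assumptions."""
    cloaking: str = "none"              # how heavily the problem is disguised
    max_strategies: int = 1             # range of strategies open to the examinee
    score_interim_steps: bool = False   # discriminate among interim steps?
    timed: bool = False                 # is speed of performance measured?


@dataclass(frozen=True)
class AssessmentSpec:
    """Combines the selected purposes with the problem architecture."""
    purposes: frozenset
    architecture: Architecture

    def features(self) -> list:
        """Derive the features to include in the test given to examinees."""
        arch = self.architecture
        out = [f"cloaking:{arch.cloaking}", f"strategies:{arch.max_strategies}"]
        # Assumed default: diagnosis needs step-level evidence, so the system
        # enables it even if the designer did not ask for it.
        if arch.score_interim_steps or Purpose.DIAGNOSIS in self.purposes:
            out.append("log_interim_steps")
        if arch.timed:
            out.append("record_response_time")
        if Purpose.CERTIFICATION in self.purposes:
            out.append("report_pass_fail")
        return out


spec = AssessmentSpec(
    purposes=frozenset({Purpose.DIAGNOSIS}),
    architecture=Architecture(cloaking="heavy", max_strategies=3, timed=True),
)
print(spec.features())
```

Keeping purposes and architecture as separate objects mirrors the menu-driven selection in the text, and makes the first three questions above testable: each object or feature can be varied independently and its effect on empirical results observed.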

Problem-solving measurement will have to be scalable and available for rapid transition to real-world environments. Analyzing how computers can support both the collection and scoring of performance and the authoring process reveals many needs. As these needs are met, future analyses will also need to include motivational aspects of problem solving.
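As one illustration of computer-supported scoring, the expert-map comparison mentioned earlier for the Knowledge Mapper (expert links and nodes used as the basis of judging performance) might be sketched as below. This is a minimal sketch under stated assumptions, not the Knowledge Mapper's actual scoring rule: the function name `score_knowledge_map`, the representation of a link as a `(source, relation, target)` tuple, and the exact-match criterion are all hypothetical.

```python
def score_knowledge_map(student_links, expert_links):
    """Compare a student's map with an expert's map, link by link.

    Each link is a (source_node, relation, target_node) tuple. A student
    link counts only if the same tuple appears in the expert map
    (an assumed, deliberately strict matching rule).
    """
    expert = set(expert_links)
    student = set(student_links)
    matched = student & expert
    return {
        "matched": len(matched),
        # Share of the expert's links that the student reproduced.
        "proportion_of_expert": len(matched) / len(expert) if expert else 0.0,
        # Student links with no expert counterpart, useful for human review.
        "unmatched": sorted(student - expert),
    }


# Made-up example maps for an electronics topic.
expert_map = [
    ("battery", "powers", "circuit"),
    ("switch", "part of", "circuit"),
    ("resistor", "limits", "current"),
]
student_map = [
    ("battery", "powers", "circuit"),
    ("resistor", "causes", "current"),
]
result = score_knowledge_map(student_map, expert_map)
print(result["matched"], result["unmatched"])
```

Returning the unmatched links as well as the counts leaves room for the human review of outputs that the authoring questions call for.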

Acknowledgements
The work reported herein was supported in part under the Educational Research and Development Centers Program, PR/Award Number R305B60002 and Award Number R305B960002-01, as administered by the Office of Educational Research and Improvement, US Department of Education, and in part under Office of Naval Research Award Number N00014-02-1-0179, as administered by the Office of Naval Research.

References
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Anderson, J. R., Corbett, A. T., Koedinger, K. R., & Pelletier, R. (1995). Cognitive tutors: lessons learned. The Journal of the Learning Sciences, 4, 167–207.
Baker, E. L. (1999, Summer). Technology: Something’s coming—something good (CRESST Policy Brief 2). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.
Baker, E. L., & Mayer, R. E. (1999, May/July). Computer-based assessment of problem solving. Computers in Human Behavior, 15, 269–282.
Baker, E. L., & Mitchell, D. International self-assessment: TIMSS Online (Technical Report). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing (in press).
Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of computerized educational measurement. In R. Linn (Ed.), Educational measurement (3rd ed.) (pp. 367–408). New York: Macmillan.
Campbell, J. R., Hombo, C. M., & Mazzeo, J. (2000). NAEP 1999 trends in academic progress: three decades of student performance (NCES 2000-469). Washington, DC: US Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
Chung, G. K. W. K., Baker, E. L., & Cheak, A. M. (2001). Knowledge mapper authoring system prototype (Final deliverable to OERI). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
Chung, G. K. W. K., O’Neil, H. F., Jr., & Herl, H. E. (1999). The use of computer-based collaborative knowledge mapping to measure team processes and team outcomes. Computers in Human Behavior, 15, 463–493.
Fletcher, D. Cost effectiveness of a computer-based maintenance problem solving strategy. Computers in Human Behavior (Submitted for publication).
Fletcher, J. D., Hawley, D. E., & Piele, P. K. (1990). Costs, effects, and utility of microcomputer assisted instruction in the classroom. American Educational Research Journal, 27, 783–806.
Gott, S. P., & Lesgold, A. M. (2000). Competence in the workplace: how cognitive performance models and situated instruction can accelerate skill acquisition. In R. Glaser (Ed.), Advances in instructional psychology: educational design and cognitive science (pp. 239–327). Mahwah, NJ: Lawrence Erlbaum Associates.
Herl, H. E., O’Neil, H. F., Jr., Chung, G. K. W. K., & Schacter, J. (1999). Reliability and validity of a computer-based knowledge mapping system to measure content understanding. Computers in Human Behavior, 15, 315–333.
Hsieh, I.-L., & O’Neil, H. F., Jr. Types of feedback in a computer-based collaborative problem-solving group task. Computers in Human Behavior, Xref: SO7475632(02)000250.
Hunt, E., & Minstrell, J. (1996). Effective instruction in science and mathematics: psychological principles & social constraints. Issues in Education: Contributions from Education Psychology, 2(2), 123–162.
Improving America’s Schools Act of 1994, Pub. L. No. 103–382, 108 Stat. 3518 (1994).
Kim, S., Kolko, B. E., & Park, O. C. Problem-solving in Web-based problem-based learning: third-year medical students’ participation in end-of-life care virtual clinic. Computers in Human Behavior, Xref: SO7475632(02)000298.
Kulik, J. (1994). Meta-analytic studies of findings on computer-based instruction. In E. L. Baker, & H. F. O’Neil Jr. (Eds.), Technology assessment in education and training (pp. 9–33). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lesgold, A. (1994). Assessment of intelligent training technology. In E. L. Baker, & H. F. O’Neil Jr. (Eds.), Technology assessment in education and training (pp. 97–116). Hillsdale, NJ: Lawrence Erlbaum Associates.
Milson, R. L., Matthew, W., & Anderson, J. R. (1990). The teacher’s apprentice project: Building an algebra tutor. In R. Freedle (Ed.), Artificial intelligence and the future of testing (pp. 53–71). Hillsdale, NJ: Lawrence Erlbaum Associates.
Minstrell, J. (1992). Facets of students’ knowledge and relevant instruction. In R. Duit, F. Goldberg, & H. Nieddere (Eds.), Proceedings of the International Workshop: Research in Physics Learning—Theoretical Issues and Empirical Studies. Kiel, Germany: The Institute of Science Education at the University of Kiel (IPN).
Minstrell, J. (2000). Student thinking and related assessment: creating a facet-based learning environment. In N. S. Raju, J. W. Pellegrino, M. W. Berthenthal, K. J. Mitchell, & L. R. Jones (Eds.), Grading the nation’s report card: research from the evaluation of NAEP (Report of the Committee on the Evaluation of National and State Assessments of Educational Progress, Commission on Behavioral and Social Sciences and Education, National Research Council) (pp. 44–73). Washington, DC: National Academy Press.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (1999). On the roles of task model variables in assessment design (CSE Tech. Rep. No. 500). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
National Center for Education Statistics. (2001, May). Highlights from the Third International Mathematics and Science Study-Repeat (TIMSS-R) (NCES 2001-027). Washington, DC: US Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
No Child Left Behind Act of 2001, Pub. L. No. 107–110, 115 Stat. 1425 (2002).
O’Neil, H. F., Jr. (1999). Perspectives on computer-based performance assessment of problem solving: Editor’s introduction. Computers in Human Behavior, 15, 255–268.
O’Neil, H. F., Jr., Drenz, C., & Lewis, F. (2000, July). Technical and tactical opportunities for revolutionary advances in rapidly deployable joint ground forces in the 2015–2025 era (Army Science Board—1999–2000 Summer Study—Training Dominance Panel). Arlington, VA: Army Science Board.
O’Neil, H. F., Jr., Wang, S., Chung, K. W. K., & Herl, H. E. (2000). Assessment of teamwork skills using computer-based teamwork simulations. In H. F. O’Neil Jr., & D. Andrews (Eds.), Aircrew training and assessment (pp. 245–276). Mahwah, NJ: Lawrence Erlbaum Associates.
Orvis, K. L., Wisher, R. A., Bonk, C. J., & Olson, T. M. Communication patterns during synchronous Web-based military training in problem solving. Computers in Human Behavior, Xref: SO7475632(02)000183.
Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.). (2001). Knowing what students know: the science and design of educational assessments (Committee on the Foundations of Assessment; Board on Testing and Assessment, Center for Education. Division on Behavioral and Social Sciences and Education). Washington, DC: National Academy Press.
Sawaki, Y., & Stoker, G. Framework for linguistic analysis as a LEARNOME component. University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing (CRESST) (in preparation).
Schacter, J., Herl, H. E., Chung, G. K. W. K., Dennis, R. A., & O’Neil, H. F., Jr. (1999). Computer-based performance assessments: a solution to the narrow measurement and reporting of problem solving. Computers in Human Behavior, 15, 403–418.
Stevens, R., Ikeda, J., Casillas, A., Palacio-Cayetano, J., & Clyman, S. (1999). Artificial neural network-based performance assessments. Computers in Human Behavior, 15, 295–313.
Third International Mathematics and Science Study. (1997, September). Performance assessment in IEA’s Third International Mathematics and Science Study (TIMSS). International Association for the Evaluation of Educational Achievement.
Third International Mathematics and Science Study. (1998). Pursuing excellence: a study of US twelfth-grade mathematics and science achievement in international context (NCES 98–049). Washington, DC: US Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
vanLehn, K., & Martin, J. (1998). Evaluation of an assessment system based on Bayesian student modelling. International Journal of Artificial Intelligence in Education, 8, 179–221.
Webb, N. L., Mason, S., Choppin, J., Green, L. Y., Thorn, C., Watson, J., Andrekopolous, W., & Szopinski, E. (2001). Study of electronic information systems in Milwaukee Public Schools. Second year report (Technical report to the Joyce Foundation). Madison: University of Wisconsin, Wisconsin Center for Education Research.
Wenger, E. (1987). Artificial intelligence and tutoring systems: computational and cognitive approaches to the communication of knowledge. Los Altos, CA: Morgan Kaufmann.
Yeagley, R. (2001, April). Data in your hands. The School Administrator. Retrieved 25 March 2002 from