
Research Report RR-00-12

Howard Wainer


Princeton, New Jersey 08541

September 2000

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from the

Research Publications Office

Mail Stop 07-R

Educational Testing Service

Princeton, NJ 08541

CATs: Whither and Whence

Introduction and Background

Throughout the history of testing there has been a tradeoff between individual testing and group testing. An individually administered test does not contain too many inappropriately chosen items and, furthermore, the examinee is very likely to understand what he or she is to do on the task. A group-administered test has the advantage of greater uniformity of the testing situation for all examinees, as well as a vastly reduced cost of testing. Throughout the first 90 years of the 20th century, the choice has almost always been in favor of the mass-administered test.

A critical problem facing a mass-administered test is that, under most circumstances, it must be assumed that there is a relatively broad range of ability to be tested. To effectively measure everyone, the test must contain items whose difficulties match this range (i.e., some easy items for the less proficient, some difficult ones for the more proficient). If the test did not have difficult items, we might not, for example, be able to distinguish among the proficient examinees who got all the easy items correct. Similarly, if there were no very easy items on the test, we might not be able to distinguish among the less proficient examinees who got the more moderate items all wrong. If making these kinds of discriminations is important, the test must contain as broad a range of item difficulties as the ability range of the population to be tested. The accuracy with which a test measures at any particular proficiency level is (roughly) proportional to the number of items whose difficulties match that level.
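The rough proportionality described above can be illustrated with item response theory, which this passage does not itself invoke. The following is a minimal sketch that assumes a Rasch (one-parameter logistic) model, under which an item of difficulty b contributes information P(1 - P) at ability theta, and contributes most when b is close to theta. The function names and the difficulty values are hypothetical and chosen only for illustration.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def information_at(theta, difficulties):
    """Sum of the item informations P * (1 - P) at ability level theta."""
    total = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        total += p * (1.0 - p)
    return total

# Hypothetical test: most items of moderate difficulty, few at the extremes.
peaked_test = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.0, 0.5, 1.0, 2.0]

info_middle = information_at(0.0, peaked_test)   # examinee of middling ability
info_extreme = information_at(3.0, peaked_test)  # very proficient examinee
```

Under these assumptions the test yields more information for the middling examinee than for the very proficient one, because more of its items sit near the middling examinee's ability level.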

Fortunately for mass-administered testing, Lincoln's observation¹ that the good Lord must have loved the common man “because he made so many of them” remains valid. Most examinees' abilities seem to lie in the middle of the continuum. Thus, mass-administered tests, whose purpose is to discriminate among examinees, match this by having most of their items of moderate difficulty with fewer items at the extremes.

The consequence of this kind of test structure has historically been that the most proficient examinees have had to wade through substantial numbers of too-easy items before reaching any that provided substantial amounts of information about their ability. This wasted time and effort and introduced possibly extraneous variables into the measurement process, for instance, the chance of careless errors induced by boredom. Less proficient examinees face a different problem. For them, the easy items provide a reasonable test of ability, whereas the difficult ones yield little information to the examiner. They can, however,

¹ See Dennett (1988, p. 143)

–  –  –

In the early 1970s, the possibility of a flexible mass-administered test that would alleviate these problems began to suggest itself. The pioneering work of Frederic Lord (1970, 1971a,b,c,d) is of particular importance. He worked out both the theoretical structure of a mass-administered but individually tailored test and many of the practical details.

The basic notion of an adaptive test is to mimic automatically what a wise examiner would do. Specifically, if an examiner asked a question that turned out to be too difficult for the examinee, the next question asked would be considerably easier. This stems from the observation that we learn little about an individual's ability if we persist in asking questions that are far too difficult or far too easy for that individual. We learn the most when we accurately direct our questions at the same level as the examinee's ability. An adaptive test first asks a question in the middle of the prospective ability range. If it is answered correctly, the next question asked is more difficult. If it is incorrectly answered, the next one is easier. This continues until we have established the examinee's ability to within some predetermined level of accuracy.
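The procedure just described can be sketched in a few lines. This is a deliberately simplified illustration, not the item selection method of any operational CAT: it assumes a numeric difficulty scale from 0 to 100, an examinee who always answers correctly when an item is at or below his or her ability, and a stopping rule based on the width of the remaining ability interval. The names `run_adaptive_test` and `answers_correctly` are hypothetical.

```python
def run_adaptive_test(answers_correctly, low=0.0, high=100.0, tolerance=1.0):
    """Narrow an ability interval by always asking at its midpoint.

    answers_correctly(difficulty) -> bool is a hypothetical stand-in for
    presenting an item of that difficulty and scoring the response.
    Returns the ability estimate and the list of difficulties asked.
    """
    asked = []
    while high - low > tolerance:
        difficulty = (low + high) / 2.0  # start in the middle of the range
        asked.append(difficulty)
        if answers_correctly(difficulty):
            low = difficulty   # correct answer: the next question is harder
        else:
            high = difficulty  # incorrect answer: the next question is easier
    return (low + high) / 2.0, asked

# Simulated examinee whose (unknown to the test) ability is 62.
estimate, asked = run_adaptive_test(lambda difficulty: difficulty <= 62)
```

With these assumptions the simulated examinee is located to within one scale point after seven questions. Real examinees respond probabilistically, so an operational test would replace the deterministic interval with a statistical estimate of ability and its precision.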

Early attempts to implement adaptive tests were clumsy and/or expensive.

The military, through various agents (e.g., Office of Naval Research; Navy Personnel Research and Development Center; Air Force Human Resources Laboratory; Army Research Institute), recognized early on the potential benefits of adaptive testing and supported extensive theoretical research efforts. Through this process much of the psychometric machinery needed for adaptive testing was built. Nevertheless, the first real opportunity to try this out in a serious way awaited the availability of cheap, high-powered computing. The 1980s saw this, and the program to develop and implement a computerized adaptive test (CAT) began in earnest (see Sands et al., 1997, for a detailed description of the development of the CAT-ASVAB, and Wainer et al., 2000, for a reasonably up-to-date textbook on CAT).

This work was aimed at improving the entire measurement process. In addition to the increased efficiency of testing, the other advantages expected of a CAT (from Green, 1983) were:

1. Improved test security, to the extent that a test is safer in a computer than in a desk drawer. Moreover, because what is contained in the computer is the item pool, rather than merely those specific items that will make up the examinee’s test, it is more difficult to artificially boost one’s score by merely learning a few items. This is analogous to making available a dictionary to a student prior to a spelling test and saying, “All the items of the test are in here.” Learning all of the items in the pool is

–  –  –

2. Individuals can work at their own pace, and the speed of response can be used as additional information in assessing proficiency. Aside from the practical necessity of having rough limits on the time of testing (even testing centers must close up and clean the floors occasionally), we can allow for a much wider range of response styles than is practical with traditional standardized tests.

3. Each individual stays busy productively — everyone is challenged but not discouraged. Most items are focused at an appropriate range of difficulty for each individual examinee.

4. The physical problems of answer sheets are solved. No longer would a person's score be compromised because the truck carrying the answer sheets overturned in a flash flood — or other such calamity. There is no ambiguity about erasures, no problems with response alternatives being marked unwittingly.

5. The test can be scored immediately, providing immediate feedback for the student. This has profound implications for using tests diagnostically.

6. Pretesting items can be easily accomplished by having the computer slip new items unobtrusively into the sequence. Methods for doing this effectively are still under development.

7. Faulty items can be immediately expunged, and an allowance for communication between examinee and examiner can be made.

8. A greater variety of questions can be included in the test builder's kit. The multiple-choice format need not be adhered to completely — numerical answers to arithmetic problems can just be typed in. Memory can be tested by use of successive frames. With voice synthesizers, we can include a spelling test, as well as aural comprehension of spoken language. Video disks showing situations can replace long-winded explanations on police or firefighter exams.

The Present

With such convincing cheerleading, it is no wonder that the actual use of computerized testing for operational tests took off in the decade of the 1990s. Figure 1 shows (on a logarithmic scale) the number of CATs given in four testing programs: the Graduate Record Examinations (GRE) General Test, the Graduate Management Admission Test (GMAT), the Test of English as a Foreign Language (TOEFL), and the Armed Services Vocational Aptitude Battery (ASVAB). These four tests constitute four of the largest operational testing programs that have “gone CAT.” We see that in 1990 only a few hundred CATs were administered, but by 1999 this figure had grown to more than a million. The growth over this decade was exponential, and while it is hard to predict how much longer it will remain so, it is clear that CAT utilization is a long way from leveling off.

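[Figure 1. Number of CATs administered in four testing programs (GRE, GMAT, TOEFL, ASVAB), shown on a logarithmic scale.]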

At the same time that CAT utilization has been booming there has been a movement toward “distance learning.”² The idea of using internet technology to reach distant students is the latest attempt to spread the scarce resource of first-class education more broadly than is possible within the bounds of face-to-face instruction. On-line internet instruction is the 21st century version of a 20th century correspondence course. But when the student is at a distance, how can we measure the efficacy of the instruction? How much has the student learned? Correspondence courses would often include extensive written exercises and exams that would be mailed in for teacher evaluation. It is natural to think that if the course were provided electronically, over the internet, so too would be the evaluation. And, if a computer is administering the test, efficiency would suggest that it might as well be made adaptive. With this scenario of utilization looming on the near horizon, how could anyone doubt the bright future of CATs? The issue yet to be resolved is the same one faced by correspondence courses: “How can we know at a distance who is answering the questions?”

The administration of more than a million CATs a year becomes even more impressive when one considers the circumstances under which a CAT is administered. Most typically it is done in a small room with no more than 8 to 10 testing stations, each in a separate cubicle, overseen by a test administrator. The administrator has a monitor that allows him or her to see what each examinee is doing. Compare the cost of such a set-up with the more familiar situation for mass administration of tests, in which a gymnasium is filled with desks and a couple of proctors roam the room keeping an eye out for improper behavior. Typically a measure of test security is added through the use of two or three different forms of the same test that are “spiraled”³ throughout the examinees in the room.

² Until secure and valid “distance assessment” is operational, it is probably more accurately called “distance teaching,” or more honestly, “distant students.”

³ “Spiraled” is the term that is often used to describe the process of interleaving different test forms in the shipping box so that when they are passed out to examinees people sitting next to one another do not have the same test form. This makes copying from your neighbor’s test futile.

“A pessimist is an optimist with data.” (Linda Steinberg, 1999)

With the administration of more than three million operational CATs, the decade of the 1990s has provided us with an enormous amount of testing information. The importance of the enterprise has also increased the closeness with which those data were scrutinized. This examination revealed practical limitations to the technology that were not apparent earlier. As the glow of initial enthusiasm faded and as our eyes became accustomed to the darker reality, previously unsuspected problems emerged. With our increasing awareness of practical limitations has come the requirement that we reevaluate old assumptions.

The questions we now must address deal less with “how to use it?” and more with “under what circumstances and for what purposes should we use it?” The future surely holds possibilities for testing that are hard to foresee, but tests will still need to fulfill the age-old canons of validity that characterize good practice.

Let us reconsider Green’s eight points with the wisdom of both data and hindsight.

1. Test security. Test security remains an essential element for the validity of most tests, and how to maintain security at a distance remains an unsolved problem. Current economic realities mean that CATs are given continuously. Thus the item pool is constantly being exposed. In addition, the CAT item selection algorithm does not choose all items with equal likelihood. In fact, a very small proportion of the item pool accounts for a large proportion of the items administered (Wainer, 2000); a common finding is that between 15 and 20 percent of the item pool accounts for more than 50 percent of the test items administered.⁴ Thus, although we might provide a dictionary as the corpus of items for a spelling test, the item selection algorithm would choose some words much more often than others (Zipf, 1949). Hence the effective size of the item pool is much smaller than the actual size.

This is an enormous problem since item exposure, and hence test security problems, seems to increase logarithmically with item pool size (Wainer, 2000).

Since item writing costs are linear with pool size, this means that costs increase exponentially with linear increases in test volume. This contrasts sharply with the economics of mass administered, paper and pencil tests, in which costs decline with increased volume; indeed the marginal cost of a paper and pencil test goes almost to zero.

⁴ A reviewer of this paper commented that “Item usage in adaptive testing is directly dependent upon the population of test takers. The reason that a smaller proportion of the item pool accounts for a larger number of item administrations is that most test takers are of middle ability. Why is this a surprise?” Let me respond. The distribution of item difficulties in the pool usually resembles the distribution of ability in the population; there are more items at levels of moderate difficulty than at either extreme. Despite this only a small proportion of these items are used very much. Roughly 20% of all items in the pool are not used at all, and these are distributed evenly across the ability range (Wainer, 2000).
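The exposure concentration discussed under point 1 (a small share of the pool accounting for most administrations, and some items never used) can be measured directly from administration counts. The sketch below is only an illustration: the function names are hypothetical, and the counts are invented for the example and are not data from any testing program.

```python
def share_from_most_used(administration_counts, top_fraction):
    """Fraction of all administrations accounted for by the most-used
    top_fraction of the item pool."""
    if not administration_counts:
        raise ValueError("item pool is empty")
    ranked = sorted(administration_counts, reverse=True)
    n_top = max(1, round(top_fraction * len(ranked)))
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n_top]) / total

def unused_fraction(administration_counts):
    """Fraction of items in the pool that were never administered."""
    if not administration_counts:
        raise ValueError("item pool is empty")
    unused = sum(1 for count in administration_counts if count == 0)
    return unused / len(administration_counts)

# Invented counts for a ten-item pool (one entry per item).
counts = [400, 300, 90, 80, 50, 40, 30, 10, 0, 0]

top_share = share_from_most_used(counts, 0.20)  # share from the top 20% of items
never_used = unused_fraction(counts)
```

In this invented pool the two most-used items account for 70 percent of all administrations and two of the ten items are never used, so the effective pool is far smaller than its nominal size.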
