RR-00-12

RESEARCH REPORT

CATs: WHITHER AND WHENCE

Howard Wainer

Princeton, New Jersey 08541
September 2000
Research Reports provide preliminary and limited
dissemination of ETS research prior to publication. They are
available without charge from the
Research Publications Office
Mail Stop 07-R
Educational Testing Service
Princeton, NJ 08541
CATs: Whither and Whence
Introduction and Background
Throughout the history of testing there has been a tradeoff between individual testing and group testing. An individually administered test contains few inappropriately chosen items, and the examinee is very likely to understand what he or she is to do on the task. A group-administered test has the advantage of greater uniformity of the testing situation for all examinees, as well as a vastly reduced cost of testing. Throughout the first 90 years of the 20th century, the choice almost always favored the mass-administered test.
A critical problem facing a mass-administered test is that, under most circumstances, it must be assumed that there is a relatively broad range of ability to be tested. To effectively measure everyone, the test must contain items whose difficulties match this range (i.e., some easy items for the less proficient, some difficult ones for the more proficient). If the test did not have difficult items, we might not, for example, be able to distinguish among the proficient examinees who got all the easy items correct. Similarly, if there were no very easy items on the test, we might not be able to distinguish among the less proficient examinees who got the more moderate items all wrong. If making these kinds of discriminations is important, the test must contain as broad a range of item difficulties as the ability range of the population to be tested. The accuracy with which a test measures at any particular proficiency level is (roughly) proportional to the number of items whose difficulties match that level.
Fortunately for mass-administered testing, Lincoln's observation1 that the good Lord must have loved the common man “because he made so many of them,” remains valid. Most examinees' abilities seem to lie in the middle of the continuum.
Thus, mass-administered tests, whose purpose is to discriminate among examinees, match this by having most of their items of moderate difficulty with fewer items at the extremes.
The consequence of this kind of test structure has historically been that the most proficient examinees had to wade through substantial numbers of too-easy items before reaching any that provided substantial information about their ability. This wasted time and effort, and it introduced possibly extraneous variables into the measurement process, for instance, the chance of careless errors induced by boredom. Less proficient examinees face a different problem. For them, the easy items provide a reasonable test of ability, whereas the difficult ones yield little information to the examiner. They can, however,

1 See Dennett (1988, p. 143).
In the early 1970s, the possibility of a flexible mass-administered test that would alleviate these problems began to suggest itself. The pioneering work of Frederic Lord (1970, 1971a,b,c,d) is of particular importance. He worked out both the theoretical structure of a mass-administered, but individually tailored test, as well as many of the practical details.
The basic notion of an adaptive test is to mimic automatically what a wise examiner would do. Specifically, if an examiner asked a question that turned out to be too difficult for the examinee, the next question asked would be considerably easier. This stems from the observation that we learn little about an individual's ability if we persist in asking questions that are far too difficult or far too easy for that individual. We learn the most when we accurately direct our questions at the same level as the examinee's ability. An adaptive test first asks a question in the middle of the prospective ability range. If it is answered correctly, the next question asked is more difficult. If it is incorrectly answered, the next one is easier. This continues until we have established the examinee's ability to within some predetermined level of accuracy.
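The up/down logic described above behaves like a binary search on the difficulty scale. A toy illustration follows (the deterministic response rule and halving step are simplifications of my own; operational CATs use probabilistic item response models, maximum-information item selection, and statistical ability estimation):

```python
def simulate_response(theta, b):
    """Deterministic stand-in for an examinee: correct iff ability >= difficulty."""
    return theta >= b

def adaptive_test(theta_true, n_items=10):
    """Toy step-rule CAT: start mid-range, then halve the step each item."""
    b, step = 0.0, 2.0               # first item in the middle of the scale
    for _ in range(n_items):
        if simulate_response(theta_true, b):
            b += step                # correct -> next item is harder
        else:
            b -= step                # incorrect -> next item is easier
        step /= 2.0                  # narrow the search, as in binary search
    return b                         # final difficulty ~ ability estimate

print(adaptive_test(1.3))            # → 1.30078125
```

After ten items the toy procedure has located the examinee's ability to within a few thousandths of a logit, illustrating why a tailored test needs far fewer items than a fixed form of equivalent precision.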
Early attempts to implement adaptive tests were clumsy and/or expensive.
The military, through various agents (e.g., Office of Naval Research; Navy Personnel Research and Development Center; Air Force Human Resources Laboratory; Army Research Institute), recognized early on the potential benefits of adaptive testing and supported extensive theoretical research efforts. Through this process much of the psychometric machinery needed for adaptive testing was built. Nevertheless, the first real opportunity to try this out in a serious way awaited the availability of cheap, high-powered computing. The 1980s saw this and the program to develop and implement a computerized adaptive test (CAT) began in earnest (see Sands et al.
1997, for a detailed description of the development of the CAT-ASVAB, and Wainer et al. 2000, for a reasonably up-to-date textbook on CAT).
This work was aimed at improving the entire measurement process. In addition to the increased efficiency of testing the other advantages expected of a
CAT (from Green, 1983) were:
1. Improved test security, to the extent that a test is safer in a computer than in a desk drawer. Moreover, because what is contained in the computer is the item pool, rather than merely the specific items that will make up the examinee's test, it is more difficult to artificially boost one's score by merely learning a few items. This is analogous to making a dictionary available to a student prior to a spelling test and saying, "All the items of the test are in here." Learning all of the items in the pool is tantamount to learning the subject itself.
2. Individuals can work at their own pace, and the speed of response can be used as additional information in assessing proficiency. Aside from the practical necessity of having rough limits on the time of testing (even testing centers must close up and clean the floors occasionally), we can allow for a much wider range of response styles than is practical with traditional standardized tests.
3. Each individual stays busy productively — everyone is challenged but not discouraged. Most items are focused at an appropriate range of difficulty for each individual examinee.
4. The physical problems of answer sheets are solved. No longer would a person's score be compromised because the truck carrying the answer sheets overturned in a flash flood — or other such calamity. There is no ambiguity about erasures, no problems with response alternatives being marked unwittingly.
5. The test can be scored immediately, providing immediate feedback for the student. This has profound implications for using tests diagnostically.
6. Pretesting items can be easily accomplished by having the computer slip new items unobtrusively into the sequence. Methods for doing this effectively are still under development.
7. Faulty items can be immediately expunged, and an allowance for communication between examinee and examiner can be made.
8. A greater variety of questions can be included in the test builder's kit. The multiple-choice format need not be adhered to completely — numerical answers to arithmetic problems can just be typed in. Memory can be tested by use of successive frames. With voice synthesizers, we can include a spelling test, as well as aural comprehension of spoken language. Video disks showing situations can replace long-winded explanations on police or firefighter exams.
The Present

With such convincing cheerleading, it is no wonder that the actual use of computerized testing for operational tests took off in the decade of the 1990s. Figure 1 shows (on a logarithmic scale) the number of CATs given in four testing programs: the Graduate Record Examinations (GRE) General Test, the Graduate Management Admission Test (GMAT), the Test of English as a Foreign Language (TOEFL), and the Armed Services Vocational Aptitude Battery (ASVAB).
These are four of the largest operational testing programs that have "gone CAT." We see that in 1990 only a few hundred CATs were administered, but by 1999 this figure had grown to more than a million. The growth over this decade was exponential, and while it is hard to predict how much longer it will remain so, it is clear that CAT utilization is a long way from leveling off.
At the same time that CAT utilization has been booming, there has been a movement toward "distance learning."2 The idea of using internet technology to reach distant students is the latest attempt to spread the scarce resource of first-class education more broadly than is possible within the bounds of face-to-face instruction. On-line internet instruction is the 21st century version of a 20th century correspondence course. But when the student is at-a-distance, how can we measure the efficacy of the instruction? How much has the student learned? Correspondence courses would often include extensive written exercises and exams that were mailed in for teacher evaluation. It is natural to think that if the course is provided electronically, over the internet, so too should be the evaluation. And, if a computer is administering the test, efficiency suggests that it might as well be made adaptive. With this scenario of utilization looming on the near horizon, how could anyone doubt the bright future of CATs? The issue yet to be resolved is the same one faced by correspondence courses: how can we know at-a-distance who is answering the questions?

The administration of more than a million CATs a year becomes even more impressive when one considers the circumstances under which a CAT is administered. Most typically it is done in a small room with no more than 8 to 10 testing stations, each in a separate cubicle, overseen by a test administrator. The administrator has a monitor that allows him or her to see what each examinee is doing.
Compare the cost of such a set-up with the more familiar situation for mass administration of tests in which a gymnasium is filled with desks and a couple of proctors roam the room keeping an eye out for improper behavior. Typically a measure of test security is added through the use of two or three different forms of the same test that are “spiraled3” throughout the examinees in the room.
2 Until secure and valid "distance assessment" is operational, it is probably more accurately called "distance teaching," or, more honestly, teaching "distant students."

3 "Spiraled" is the term often used to describe the process of interleaving different test forms in the shipping box so that, when they are passed out, examinees sitting next to one another do not have the same test form. This makes copying from a neighbor's test futile.
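The spiraling of forms just described amounts to a round-robin assignment of forms to seats. A trivial sketch (the form names and seat count are hypothetical):

```python
# Hypothetical: 3 test forms spiraled across a row of 10 seats,
# so that adjacent examinees never receive the same form.
forms = ["A", "B", "C"]
seats = 10
assignment = [forms[seat % len(forms)] for seat in range(seats)]
print(assignment)  # → ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A']
```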
"A pessimist is an optimist with data." (Linda Steinberg, 1999)

With the administration of more than three million operational CATs, the decade of the 1990s has provided us with an enormous amount of testing information. The importance of the enterprise has also increased the closeness with which those data were scrutinized. This examination revealed practical limitations of the technology that were not apparent earlier. As the glow of initial enthusiasm faded and our eyes became accustomed to the darker reality, previously unsuspected problems emerged. With our increasing awareness of practical limitations has come the requirement that we reevaluate old assumptions.
The questions we now must address deal less with "how do we use it?" and more with "under what circumstances and for what purposes should we use it?" The future surely holds possibilities for testing that are hard to foresee, but tests will still need to fulfill the age-old canons of validity that characterize good practice.
Let us reconsider Green’s eight points with the wisdom of both data and hindsight.
1. Test security. Test security remains an essential element of the validity of most tests, and how to maintain security at-a-distance remains an unsolved problem. Current economic realities mean that CATs are given continuously, so the item pool is constantly being exposed. In addition, the CAT item selection algorithm does not choose all items with equal likelihood. In fact, a very small proportion of the item pool accounts for a large proportion of the items administered (Wainer, 2000); a common finding is that between 15 and 20 percent of the item pool accounts for more than 50 percent of the test items administered4. Thus, although we might provide a dictionary as the corpus of items for a spelling test, the item selection algorithm would choose some words much more often than others (Zipf, 1949). Hence the effective size of the item pool is much smaller than its actual size.
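The concentration of exposure can be illustrated with a toy simulation of my own devising (the pool size, test length, examinee count, and item parameters are all hypothetical). Under greedy maximum-information selection with a two-parameter (2PL) model and no exposure control, a handful of highly discriminating items dominates the administrations:

```python
import math
import random
from collections import Counter

random.seed(1)

# Hypothetical pool: 100 2PL items whose difficulties b mirror the
# ability distribution, but whose discriminations a vary item to item.
pool = [(random.lognormvariate(0.0, 0.5), random.gauss(0.0, 1.0))
        for _ in range(100)]

def info(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

counts = Counter()
for _ in range(1000):                       # simulated examinees
    theta = random.gauss(0.0, 1.0)
    # Greedy maximum-information selection of a 5-item test,
    # with no exposure control.
    best = sorted(range(100), key=lambda i: -info(theta, *pool[i]))[:5]
    counts.update(best)

usage = sorted(counts.values(), reverse=True)
share = sum(usage[:20]) / sum(usage)        # share of the 20 most-used items
print(f"top 20% of the pool -> {share:.0%} of all administrations")
```

Because information scales with the square of discrimination, a few high-discrimination items near the center of the ability distribution are selected for almost everyone, so the top fifth of the pool captures well over half of all administrations, mirroring the finding quoted above.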
This is an enormous problem, since the protection a pool affords against item exposure, and hence against test security problems, grows only logarithmically with pool size (Wainer, 2000).
Since item-writing costs are linear in pool size, keeping exposure in check means that costs increase exponentially with linear increases in test volume. This contrasts sharply with the economics of mass-administered, paper-and-pencil tests, in which costs decline with increased volume; indeed, the marginal cost of a paper-and-pencil test goes almost to zero.

4 A reviewer of this paper commented that "Item usage in adaptive testing is directly dependent upon the population of test takers. The reason that a smaller proportion of the item pool accounts for a larger number of item administrations is that most test takers are of middle ability. Why is this a surprise?" Let me respond. The distribution of item difficulties in the pool usually resembles the distribution of ability in the population; there are more items at levels of moderate difficulty than at either extreme. Despite this, only a small proportion of these items are used very much. Roughly 20% of all items in the pool are not used at all, and these are distributed evenly across the ability range (Wainer, 2000).