Testing

Standardized Tests and High-Stakes Assessment

Assessment is the process of collecting data to measure the knowledge or performance of a student or group. Written tests of students' knowledge are a common form of assessment, but data from homework assignments, informal observations of student proficiency, evaluations of projects, oral presentations, or other samples of student work may also be used in assessment. The word assessment carries with it the idea of a broader and more comprehensive evaluation of student performance than a single test.

In an age when testing is controversial, assessment has become the preferred term because of its connotation of breadth and thoroughness. The National Assessment of Educational Progress (NAEP) is an example of a comprehensive assessment worthy of the name. Also known as the "Nation's Report Card," NAEP administers achievement tests to a representative sample of U.S. students in reading, mathematics, science, writing, U.S. history, civics, geography, and the arts. The achievement measures used by NAEP in each subject area are so broad that each participating student takes only a small portion of the total assessment. Not all assessment programs, however, are of such high quality. Some administer much narrower and more limited tests, but still use the word assessment because of its popular appeal.

Standardized tests are tests administered and scored under a consistent set of procedures. Uniform conditions of administration are necessary to make it possible to compare results across individuals or schools. For example, it would be unfair if the performance of students taking a test in February were to be compared to the performance of students tested in May or if one group of students had help from their teacher while another group did not. The most familiar standardized tests of achievement are traditional machine-scorable, multiple-choice tests such as the California Achievement Test (CAT), the Comprehensive Tests of Basic Skills (CTBS), the Iowa Tests of Basic Skills (ITBS), the Metropolitan Achievement Test (MAT), and the Stanford Achievement Test (SAT). Many other assessments, such as open-ended performance assessments, personality and attitude measures, English-language proficiency tests, or Advanced Placement essay tests, may also be standardized so that results can be interpreted on a common scale.

High-stakes testing is a term that was first used in the 1980s to describe testing programs that have serious consequences for students or educators. Tests are high-stakes if their outcomes determine such important things as promotion to the next grade, graduation, merit pay for teachers, or school rankings reported in a newspaper. When test results have serious consequences, the requirements for evidence of test validity are correspondingly higher.

Purposes of Assessment

The intended use of an assessment, its purpose, determines every other aspect of how the assessment is conducted. Purpose determines the content of the assessment (What should be measured?); methods of data collection (Should the procedures be standardized? Should data come from all students or from a sample of students?); technical requirements of the assessment (What level of reliability and validity must be established?); and finally, the stakes or consequences of the assessment, which in turn determine the kinds of safeguards necessary to protect against potential harm from fallible assessment-based decisions.

In educational testing today, it is possible to distinguish at least four different purposes for assessment: (1) classroom assessment used to guide and evaluate learning; (2) selection testing used to identify students for special programs or for college admissions; (3) large-scale assessment used to evaluate programs and monitor trends; and (4) high-stakes assessment of achievement used to hold individual students, teachers, and schools accountable. Assessments designed for one of these purposes may not be appropriate or valid if used for another purpose.

In classrooms, assessment is an integral part of the teaching and learning process. Teachers use both formal and informal assessments to plan and guide instruction. For individual students, assessments help to gauge what things students already know and understand, where misconceptions exist, what skills need more practice in context, and what supports are needed to take the next steps in learning. Teachers also use assessment to evaluate their own teaching practices so as to adjust and modify curricula, instructional activities, or assignments that did not help students grasp key ideas. To serve classroom purposes, assessments must be closely aligned with what children are learning, and the timing of assessments must correspond to the specific days and weeks when children are learning specific concepts. While external accountability tests can help teachers examine their instructional program overall, external, once-per-year tests are ill-suited for diagnosis and targeting of individual student learning needs. The technical requirements for the reliability of classroom assessments are less stringent than for other testing purposes because assessment errors on any given day are readily corrected by additional information gathered on subsequent days.

Selection and placement tests may be used to identify students for gifted and talented programs, to provide services for students with disabilities, or for college admissions. Because selection tests are used to evaluate students with a wide variety of prior experiences, they tend to be more generic than standardized achievement tests so as not to presume exposure to a specific curriculum. Nonetheless, performance on selection measures is strongly influenced by past learning opportunities. In contrast to the thinking behind IQ tests of the past, it is no longer assumed that any test can measure innate learning ability. Instead, measures of current learning and reasoning abilities are used as practical predictors of future learning. Because all tests have some degree of error associated with them, professional standards require that test scores not be the sole determiner of important decisions. For example, college admissions tests are used in conjunction with high school grades and recommendations. School readiness tests are sometimes used as selection tests to decide whether five-year-old children should start school, but this is an improper use of the tests. None of the existing school readiness measures has sufficient reliability and validity to support such decisions.

Large-scale assessments, such as the National Assessment of Educational Progress (NAEP) or the Third International Mathematics and Science Study (TIMSS), serve a monitoring and comparative function. Assessment data are gathered about groups of students in the aggregate and can be used by policymakers to make decisions about educational programs. Because there is not a single national or international curriculum, assessment content must be comprehensive and inclusive of all of the curricular goals of the many participating states or nations. Obviously, no one student could be expected to master all of the content in a test spanning many curricula, but, by design, individual student scores are not reported in this type of assessment. As a result, the total assessment can include a much broader array of tasks and problem types to better represent the content domain, with each student being asked to complete only a small sample of tasks from the total set. Given that important policy decisions may follow from shifts in achievement levels or international comparisons of achievement, large-scale assessments must meet high standards of technical accuracy.
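The idea that each student completes only a small sample of tasks from the total set can be illustrated with a short sketch. The example below is a simplified illustration only, not the procedure used by NAEP or any other program: the function names, the ten task blocks, the two blocks per student, and the simple rotation scheme are all assumptions made for the example.

```python
def assign_blocks(student_ids, num_blocks, blocks_per_student):
    """Rotate through the task blocks so each student gets a small subset.

    Hypothetical helper: student i receives blocks i, i+1, ... (wrapping
    around), which spreads the blocks evenly across the sample.
    """
    assignment = {}
    for i, sid in enumerate(student_ids):
        assignment[sid] = [(i + k) % num_blocks for k in range(blocks_per_student)]
    return assignment


def block_coverage(assignment, num_blocks):
    """Count how many students were assigned each block."""
    counts = [0] * num_blocks
    for blocks in assignment.values():
        for block in blocks:
            counts[block] += 1
    return counts


# Assumed numbers: 300 sampled students, 10 task blocks, 2 blocks each.
students = [f"student_{i}" for i in range(300)]
assignment = assign_blocks(students, num_blocks=10, blocks_per_student=2)
coverage = block_coverage(assignment, num_blocks=10)
```

In this sketch no student sees more than a fifth of the assessment, yet every block is answered by the same number of students, so results can be summarized for the group on the full content domain while no individual score is produced.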

High-stakes assessments of achievement that are used to hold individual students, teachers, and schools accountable are similar to large-scale monitoring assessments, but clearly have very different consequences. In addition, these tests, typically administered by states or school districts, must be much more closely aligned with the content standards and curriculum for which participants are being held accountable. As a practical matter, accountability assessments are often more limited in the variety of formats and tasks included, both because each student must take the same test and because states and districts may lack the resources to develop and score more open-ended performance measures. Regardless of practical constraints, high-stakes tests must meet the most stringent technical standards because of the harm to individuals that would be caused by test inaccuracies.

A Short History of High-Stakes Testing

Accountability testing in the United States started in 1965 as part of the same legislation (Title I of the Elementary and Secondary Education Act [ESEA]) that first allocated federal funds to improve the academic achievement of children from low-income families. Federal dollars came with a mandate that programs be evaluated to show their effectiveness. The early accountability movement did not assume, however, that public schools were bad. In fact, the idea behind ESEA was to extend the benefits of an excellent education to poor and minority children.

The public's generally positive view of America's schools changed with the famous SAT score decline of the early 1970s. A blue-ribbon panel commissioned by the College Board reported in 1977 that two-thirds to three-fourths of the score decline was attributable to an increase in the number of poor and minority students gaining access to college and not to a decline in the quality of education. Nonetheless, all subsequent accountability efforts were driven by the belief that America's public schools were failing.

The minimum competency testing movement of the 1970s was the first in a series of educational reforms where tests were used not just as measures of the effectiveness of reforms, but also as the primary drivers of reform. Legislators mandated tests of minimum academic skills or survival skills (e.g., balancing a checkbook), intending to "put meaning back into the high school diploma." By 1980, thirty-seven states had taken action to mandate minimum competency standards for grade-to-grade promotion or high school graduation. It was not long, however, before the authors of A Nation at Risk (1983) concluded that minimum competency examinations were part of the problem, not part of the solution, because the "'minimum' [required of students] tends to become the 'maximum,' thus lowering educational standards for all" (p. 20).

Following the publication of A Nation at Risk, the excellence movement sought to ratchet up expectations by reinstating course-based graduation requirements, extending time in the school day and school year, requiring more homework, and, most importantly, requiring more testing. Despite the rhetoric of rigorous academic curricula, the new tests adopted in the mid-1980s were predominantly multiple-choice, basic-skills tests, a step up from minimum competency tests but not much of one. By the end of the 1980s, evidence began to accrue showing that impressive score gains on these tests might not be a sign of real learning gains. For example, John Cannell's 1987 study, dubbed the "Lake Wobegon Report," showed that all fifty states claimed their test scores were above the national average.

Standards-based reforms, which began in the 1990s and continued at the start of the twenty-first century, were both a rejection and extension of previous reforms. Rejecting traditional curricula and especially rote activities, the standards movement called for the development of much more challenging curricula, focused on reasoning, conceptual understanding, and the ability to apply one's knowledge. At the same time, the standards movement continued to rely heavily on large-scale accountability assessments to leverage changes in instruction. However, standards-based reformers explicitly called for a radical change in the content and format of assessments to forestall the negative effects of "teaching the test." Various terms, such as authentic, direct, and performance-based, were used in standards parlance to convey the idea that assessments themselves had to be reformed to more faithfully reflect important learning goals. The idea was that if tests included extended problems and writing tasks, then it would be impossible for scores to go up on such assessments without there being a genuine improvement in learning.

Effects of High-Stakes Testing

By the end of the 1980s, concerns about dramatic increases in the amount of testing and potential negative effects prompted Congress to commission a comprehensive report on educational testing. This report summarized research documenting the ill effects of high-pressure accountability testing, including that high-stakes testing led to test score inflation, meaning that test scores went up without a corresponding gain in student learning. Controlled studies showed that test score gains on familiar and taught-to tests could not be verified by independent tests covering the same content. High-stakes testing also led to curriculum distortion, which helped to explain how spurious score gains may occur. Interview and survey data showed that many teachers eliminated science and social studies, especially in high-poverty schools, because more time was needed for math and reading. Teaching to the test also involved rote drill in tested subjects, so that students were unable to use their knowledge in any other format.

It should also be noted that established findings from the motivational literature have raised serious questions about test-based incentive systems. Students who are motivated by trying to do well on tests, instead of working to understand and master the material, are consistently disadvantaged in subsequent endeavors. They become less intrinsically motivated, they learn less, and they are less willing to persist with difficult problems.

To what extent do these results, documented in the late 1980s, still hold true for standards-based assessments begun in the 1990s? Recent studies still show the strong influence that high-stakes tests have on what gets taught. To the extent that the content of assessments has improved, there have been corresponding improvements in instruction and curriculum. The most compelling evidence of positive effects is in the area of writing instruction. In extreme cases, writing has been added to the curriculum in classrooms, most often in urban settings, where previously it was entirely absent.

Unfortunately, recent studies on the effects of standards-based reforms also confirm many of the earlier negative effects of high-stakes testing. The trend to eliminate or reduce social studies and science, because state tests focused only on reading, writing, and mathematics, has been so pervasive nationwide that experts speculate it may explain the recent downturn in performance in science on NAEP. In Texas, Linda McNeil and Angela Valenzuela found that a focus on tested content and test-taking skills was especially pronounced in urban districts.

In contrast with previous analysts who used test-score gains themselves as evidence of effectiveness, it is now widely understood by researchers and policymakers that some independent confirmation is needed to establish the validity of achievement gains. For example, two different studies by researchers at the RAND Corporation used NAEP as an independent measure of achievement gains and documented both real and spurious aspects of test-score gains in Texas. A 2000 study by David Grissmer, Ann Flanagan, Jennifer Kawata, and Stephanie Williamson found that Texas students performed better than expected based on family characteristics and socioeconomic factors. However, a study by Stephen Klein and colleagues found that gains on NAEP were nothing like the dramatic gains reported on Texas's own test, the Texas Assessment of Academic Skills (TAAS). Klein et al. also found that the gap in achievement between majority and minority groups had widened for Texas students on NAEP whereas the gap had appeared to be closing on the TAAS. Both of these studies could be accurate, of course. Texas students could be learning more in recent years, but not as much as claimed by the TAAS. Studies such as these illustrate the importance of conducting research to evaluate the validity and credibility of results from high-stakes testing programs.

Professional Standards for High-Stakes Testing

The Standards for Educational and Psychological Testing (1999) is published jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. The Standards establish appropriate procedures for test development, scaling, and scoring, as well as the evidence needed to ensure validity, reliability, and fairness in testing. Drawing from the Standards, the American Educational Research Association issued a position statement in 2000 identifying the twelve conditions that must be met to ensure sound implementation of high-stakes educational testing programs.

First, individual students should be protected from tests being used as the sole criterion for critically important decisions. Second, students and teachers should not be sanctioned for failing to meet new standards if sufficient resources and opportunities to learn have not been provided. Third, test validity must be established for each separate intended use, such as student certification or school evaluation. Fourth, the testing program must fully disclose any likely negative side effects of testing. Fifth, the test should be aligned with the curriculum and should not be limited to only the easiest-to-test portion of the curriculum. Sixth, the validity of passing scores and achievement levels should be analyzed, as well as the validity of the test itself. Seventh, students who fail a high-stakes test should be provided with meaningful opportunities for remediation consisting of more than drilling on materials that imitate the test. Eighth, special accommodations should be provided for English language learners so that language does not interfere with assessment of content area knowledge. Ninth, provision should be made for students with disabilities so that they may demonstrate their proficiency on tested content without being impeded by the format of the test. Tenth, explicit rules should be established for excluding English language learners or students with disabilities so that schools, districts, or states cannot improve their scores by excluding some students. Eleventh, test results should be sufficiently reliable for their intended use. Twelfth, an ongoing program of research should be established to evaluate both the intended and unintended consequences of high-stakes testing programs.

Professional standards provide a useful framework for understanding the limitations and potential benefits of sound assessment methodologies. Used appropriately, tests can greatly enhance educational decision-making. However, when used in ways that go beyond what tests can validly claim to do, tests could very likely do more harm than good.

BIBLIOGRAPHY

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION; AMERICAN PSYCHOLOGICAL ASSOCIATION; and NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION. 1999. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

BEATTY, ALEXANDRA; GREENWOOD, M. R. C.; and LINN, ROBERT L., eds. 1999. Myths and Tradeoffs: The Role of Tests in Undergraduate Admissions. Washington, DC: National Academy Press.

CANNELL, JOHN J. 1987. Nationally Normed Elementary Achievement Testing in America's Public Schools: How All 50 States Are Above the National Average. Daniels, WV: Friends for Education.

CANNELL, JOHN J. 1989. The Lake Wobegon Report: How Public Educators Cheat on Achievement Tests. Albuquerque, NM: Friends for Education.

COLLEGE BOARD. 1977. On Further Examination: Report of the Advisory Panel on the Scholastic Aptitude Test Score Decline. New York: College Board.

GRISSMER, DAVID; FLANAGAN, ANN; KAWATA, JENNIFER; and WILLIAMSON, STEPHANIE. 2000. Improving Student Achievement: What State NAEP Test Scores Tell Us. Santa Monica, CA: RAND.

KLEIN, STEPHEN P., et al. 2000. What Do Test Scores in Texas Tell Us? Santa Monica, CA: RAND.

MCNEIL, LINDA, and VALENZUELA, ANGELA. 2000. The Harmful Impact of the TAAS System of Testing in Texas: Beneath the Accountability Rhetoric. Cambridge, MA: Harvard University Civil Rights Project.

NATIONAL COMMISSION ON EXCELLENCE IN EDUCATION. 1983. A Nation at Risk: The Imperative of Educational Reform. Washington, DC: U.S. Department of Education.

STIPEK, DEBORAH. 1998. Motivation to Learn: From Theory to Practice, 3rd edition. Boston: Allyn and Bacon.

U.S. CONGRESS, OFFICE OF TECHNOLOGY ASSESSMENT. 1992. Testing in American Schools: Asking the Right Questions. Washington, DC: U.S. Government Printing Office.

INTERNET RESOURCES

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION. 2000. "AERA Position Statement Concerning High-Stakes Testing in Pre-K–12 Education." <www.aera.net/about/policy/stakes.htm>.

NATIONAL ASSOCIATION FOR THE EDUCATION OF YOUNG CHILDREN. 1995. "Position Statement on School Readiness." <www.naeyc.org/resources/position_statements/psredy98.htm>.

LORRIE A. SHEPARD
