
Teacher Evaluation

OVERVIEW, METHODS

OVERVIEW
Robert F. McNergney
Scott R. Imig

METHODS
Mari A. Pearlman

OVERVIEW

Baseball is known as the national pastime of the United States, but teacher evaluation beats it hands down. Everybody does it–some with a vengeance, others with the casual disregard that physical and emotional distance afford. Most enthusiasts grow up with the game, playing a sandlot version as they go through school. Indeed, familiarity with the job of teaching and the widespread practice of judging teachers have shaped the history of teacher evaluation.

History of Teacher Evaluation

Donald Medley, Homer Coker, and Robert Soar (1984) describe succinctly the modern history of formal teacher evaluation–that period from the turn of the twentieth century to about 1980. This history might be divided into three overlapping periods: (1) The Search for Great Teachers; (2) Inferring Teacher Quality from Student Learning; and (3) Examining Teaching Performance. At the beginning of the twenty-first century, teacher evaluation appears to be entering a new phase of disequilibrium; that is, a transition to a period of Evaluating Teaching as Professional Behavior.

The Search for Great Teachers began in earnest in 1896 with the report of a study conducted by H.E. Kratz. Kratz asked 2,411 students from the second through the eighth grades in Sioux City, Iowa, to describe the characteristics of their best teachers. Kratz thought that by making desirable characteristics explicit he could establish a benchmark against which all teachers might be judged. Some 87 percent of those young Iowans mentioned "helpfulness" as the most important teacher characteristic. But a stunning 58 percent mentioned "personal appearance" as the next most influential factor.

Arvil Barr's 1948 compendium of research on teaching competence noted that supervisors' ratings of teachers were the metric of choice. A few researchers, however, examined average gains in student achievement for the purpose of Inferring Teacher Quality from Student Learning. They assumed, for good reason, that supervisors' opinions of teachers revealed little or nothing about student learning. Indeed, according to Medley and his colleagues, these early findings were "most discouraging." The average correlation between teacher characteristics and student learning, as measured most often by achievement tests, was zero. Some characteristics related positively to student achievement gains in one study and negatively in another study. Most showed no relation at all. Simeon J. Domas and David Tiedeman (1950) reviewed more than 1,000 studies of teacher characteristics, defined in nearly every way imaginable, and found no clear direction for evaluators. Jacob Getzels and Philip Jackson (1963) called once and for all for an end to research and evaluation aimed at linking teacher characteristics to student learning, arguing it was an idea without merit.

Medley and his colleagues note several reasons for the failure of early efforts to judge teachers by student outcomes. First, student achievement varied, and relying on average measures of achievement masked differences. Second, researchers failed to control for the regression effect in student achievement–extreme high and low scores automatically regress toward the mean in second administrations of tests. Third, achievement tests were, for a variety of reasons, poor measures of student success. Perhaps most important, as the researchers who ushered in the period of Examining Teaching Performance were to suggest, these early approaches were conceptually inadequate, and even misleading. Student learning as measured by standardized achievement tests simply did not depend on a teacher's education, intelligence, gender, age, personality, attitudes, or any other personal attribute. What mattered was how teachers behaved when they were in classrooms.
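The regression effect named in the second reason can be illustrated with a small simulation. The sketch below is not drawn from Medley and his colleagues; the ability scale (mean 100, standard deviation 15), the noise level, and the function names are all assumptions chosen for illustration. Each student has a fixed ability, and each test score adds independent measurement noise, so no real learning or loss occurs between the two administrations.

```python
import random
import statistics


def simulate_retest(n_students=10_000, noise_sd=10.0, seed=0):
    """Simulate two test administrations with no real change in ability."""
    rng = random.Random(seed)
    # Hypothetical true ability: mean 100, standard deviation 15.
    ability = [rng.gauss(100, 15) for _ in range(n_students)]
    # Each observed score is ability plus independent measurement noise.
    first = [a + rng.gauss(0, noise_sd) for a in ability]
    second = [a + rng.gauss(0, noise_sd) for a in ability]
    return first, second


def group_means(first, second, fraction=0.10):
    """Mean first and second scores for the lowest and highest first-test scorers."""
    order = sorted(range(len(first)), key=lambda i: first[i])
    k = int(len(first) * fraction)
    bottom, top = order[:k], order[-k:]
    return {
        "bottom": (statistics.mean([first[i] for i in bottom]),
                   statistics.mean([second[i] for i in bottom])),
        "top": (statistics.mean([first[i] for i in top]),
                statistics.mean([second[i] for i in top])),
    }


first, second = simulate_retest()
for name, (m1, m2) in group_means(first, second).items():
    print(f"{name}: first={m1:.1f} second={m2:.1f} change={m2 - m1:+.1f}")
```

Under these assumptions the lowest-scoring group appears to gain and the highest-scoring group appears to lose on the second test, even though no student's ability changed. An evaluator who ignored this effect could credit or blame teachers for movement that is purely statistical.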

The period of Examining Teaching Performance abandoned efforts to identify desirable teacher characteristics and concentrated instead on identifying effective teaching behaviors; that is, those behaviors that were linked to student learning. The tack was to describe clearly and precisely teaching behaviors and relate them to student learning–as measured most often by standardized achievement test scores. In rare instances, researchers conducted experiments for the purpose of arguing that certain teaching behaviors actually caused student learning. Like Kratz a century earlier, these investigators assumed that "principles of effective teaching" would serve as new and improved benchmarks for guiding both the evaluation and education of teachers. Jere Brophy and Thomas Good produced the most conceptually elaborate and useful description of this work in 1986, while Marjorie Powell and Joseph Beard's 1984 extensive bibliography of research done from 1965 to 1980 is a useful reference.

Goals of Teacher Evaluation

Although there are multiple goals of teacher evaluation, they are perhaps most often described as either formative or summative in nature. Formative evaluation consists of evaluation practices meant to shape, form, or improve teachers' performances. Clinical supervisors observe teachers, collect data on teaching behavior, organize these data, and share the results in conferences with the teachers observed. The supervisors' intent is to help teachers improve their practice. In contrast, summative evaluation, as the term implies, has as its aim the development and use of data to inform summary judgments of teachers. A principal observes teachers in action, works with them on committees, examines their students' work, talks with parents, and the like. These actions, aimed at least in part at obtaining evaluative information about teachers' work, inform the principal's decision to recommend either continuing a teacher's contract or terminating employment. Decisions about initial licensure, hiring, promoting, rewarding, and terminating are examples of the class of summative evaluation decisions.

The goals of summative and formative evaluation may not be so different as they appear at first glance. If an evaluator is examining teachers collectively in a school system, some summary judgments of individuals might be considered formative in terms of improving the teaching staff as a whole. For instance, the summative decision to add a single strong teacher to a group of other strong teachers results in improving the capacity and value of the whole staff.

In a slightly different way, individual performance and group performance affect discussions of merit and worth. Merit deals with the notion of how a single teacher measures up on some scale of desirable characteristics. Does the person exhibit motivating behavior in the classroom? Does she take advantage of opportunities to continue professional development? Do her students do well on standardized achievement tests? If the answers to these types of questions are "yes," then the teacher might be said to be "meritorious." Assume for a moment that the same teacher is one of six members of a high school social studies team in a rural school district. Assume also that one of the two physics teachers just quit, the special education population is growing rapidly, and the state education department recently replaced one social science requirement for graduation with a computer science requirement. Given these circumstances, the meritorious teacher might not add much value to the school system; that is, other teachers, even less meritorious ones, might be worth more to the system.

The example of the meritorious teacher suggests yet another important distinction in processes of evaluating teachers: the difference between domain-referenced and norm-referenced teacher evaluation. When individual teachers are compared to a set of externally derived, publicly expressed standards, as in the case of merit decisions, the process is one of domain-referenced evaluation. What counts is how the teacher compares to the benchmarks of success identified in a particular domain of professional behavior. In contrast, norm-referenced teacher evaluation consists of grouping teachers' scores on a given set of measures and describing these scores in relation to one another. What is the mean score of the group? What is the range or standard deviation of the scores? What is the shape of the distribution of the scores? These questions emanate from a norm-referenced perspective–one often adopted in initial certification or licensure decisions.
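The two perspectives can be contrasted with a short sketch. Everything in it is hypothetical: the six teachers, their scores on a 0 to 100 scale, the benchmark of 70, and the function names are assumptions made for illustration, not data or procedures from any actual evaluation system.

```python
import statistics

# Hypothetical evaluation scores for six teachers on a 0-100 scale.
scores = {"A": 62, "B": 71, "C": 75, "D": 80, "E": 84, "F": 90}
STANDARD = 70  # hypothetical externally set benchmark


def domain_referenced(scores, standard):
    """Compare each teacher to the fixed standard, ignoring other teachers."""
    return {name: score >= standard for name, score in scores.items()}


def norm_referenced(scores):
    """Express each teacher's score relative to the group as a z-score."""
    values = list(scores.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation of the group
    return {name: (score - mean) / sd for name, score in scores.items()}


meets_standard = domain_referenced(scores, STANDARD)
z_scores = norm_referenced(scores)
for name in scores:
    print(f"{name}: meets standard={meets_standard[name]}, z={z_scores[name]:+.2f}")
```

In this made-up group, teacher C meets the external standard yet falls below the group mean, so the same score reads as a success from the domain-referenced perspective and as below average from the norm-referenced one.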

The work of John Meyer and Brian Rowan (1977) suggests that there are yet other goals driving the structure and function of teacher evaluation systems. If school leaders intend to maintain public confidence and support, they must behave in ways that assure their constituents and the public at large that they are legitimate. Schools must innovate to be healthy organizations, but if school leaders get too far ahead of the pack–look too different, behave too radically–they do so at their own peril. When they incorporate acceptable ideas, schools protect themselves. The idea that teachers must be held accountable, or in some way evaluated, is an easy one to sell to the public, and thus one that enhances a leader's or system's legitimacy.

Trends, Issues, and Controversies

With the standards movement of the late 1990s came increased expectations for student performance and renewed concerns about teacher practice. Driven by politicians, parents, and, notably, teacher unions, school districts began an analysis of teacher evaluation goals and procedures. The traditional model of teacher evaluation, based on scheduled observations of a handful of direct instruction lessons, came under fire. "Seventy years of empirical research on teacher evaluation shows that current practices do not improve teachers or accurately tell what happens in classrooms" (Peterson, p. 14). Not surprisingly, in this climate, numerous alternative evaluative practices have been developed or reborn.

In the early twenty-first century, the first line of teacher evaluation consists of state and national tests created as barriers for entry to the profession. Some forty states use basic skills and subject matter assessments provided by the Praxis Series examinations for this purpose. Creators of the examinations assume teachers should be masters of grammar, mathematics, and the content they intend to teach. Though many states use the same basic skills tests, each sets its own passing score. The movement to identify and hire quality teachers based on test scores has resulted in some notable legal cases. Teachers who graduate from approved teacher education programs yet fail to pass licensure tests have challenged the validity of such tests, as well as the assignment of culpability. If a person pays for teacher education and is awarded a degree, who is to blame when that person fails a licensure examination? This is not an insignificant concern. In 1998, for example, the state of Massachusetts implemented a new test that resulted in a 59 percent failure rate for prospective teachers. Once a teacher has assumed a job, however, that teacher is rarely, if ever, tested again. In-service teachers typically succeed at resisting pressure to submit to periodic examinations because of the power of their numbers and their political organization.

Despite the well-known difficulties of measuring links between teaching and learning, the practice of judging teachers by the performance of their students is enjoying a resurgence of interest. Polls indicate that a majority of the American public favors this idea. School leaders routinely praise or chastise schools, and by implication teachers, for students' test results. Despite researchers' inabilities to examine the complexity of life in schools and in classrooms, studies of relationships between teaching and learning often become political springboards for policy formulation. For example, William Sanders (1996) suggests that teacher effectiveness is the single greatest factor affecting academic growth. His work has been seized upon by accountability proponents to argue that teachers must be held accountable for students' low test scores.

Although there may be much to be gained from focusing educators on common themes of accountability through the use of standards and accompanying tests, there may be much to lose as well. The upside can be measured over time in greater collective attention to common concerns. The downside results when people assume teachers can influence factors outside their control–factors that affect students' test scores, such as students' experiences, socioeconomic status, and parental involvement. A focus on scores as the sole, or even primary, indicator of accountability also creates the possibility for academic misconduct, such as ignoring important but untested material, teaching to the test, or cheating.

As researchers have demonstrated, those schools that need the most help are often least likely to get it. Daniel L. Duke, Pamela Tucker, and Walter Heinecke (2000) studied sixteen high schools involved in initial efforts to meet the challenges of new accountability standards that emphasize student test scores. These schools represented various combinations of need and ability. The researchers found that the schools with high need and low ability (those with poor test scores and low levels of financial resources) reported the highest concerns about staffing, morale, instruction, and students. Thus, the schools that needed the most help, the ones that were the primary targets of new accountability efforts, appeared in this study to be put at greater risk by the accountability movement.

Teachers' jobs involve far more than raising test scores. An evaluation strategy borrowed from institutions of higher education and business, sometimes referred to as 360-degree feedback, acknowledges the necessity of considering the bigger picture. The intent of this holistic approach is to gather information from everyone with knowledge of a teacher's performance to create a complete representation of a teacher's practice and to identify areas for improvement. Multiple data sources, including questionnaires and surveys, student achievement, observation notes, teacher-developed curricula and tests, parent reports, teacher participation on committees, and the like, assure a rich store of information on which to base evaluation decisions. Current models tend to place the responsibility with administrators to interpret and respond to the data. To be sure, there are risks involved. The strategy asks children to evaluate their teachers, and it gathers feedback from individuals who possess only a secondary knowledge of a teacher's practices, namely parents and fellow teachers. Nonetheless, different kinds of information collected from different vantage points encourage full and fair representation of teachers' professional lives.

Toward Evaluating Teaching as Professional Behavior

At the turn of the twenty-first century, people continue to debate whether teaching is a true profession. Questions persist about educators' lack of self-regulation, the nebulously defined knowledge base upon which teaching rests, the lack of rigid entrance requirements to teacher education programs (witness alternative licensure routes), the level of teachers' salaries, and the locus of control in matters of evaluation. Yet school districts, state governments, the federal government, and national professional and lay organizations appear intent as never before on building and strengthening teaching as a profession.

One simple example of a changing attitude toward teaching as a profession is that of the use of peer evaluation. Two decades ago, in Toledo, Ohio, educators advanced processes of peer review as a method of evaluation. At its most basic level, peer review consists of an accomplished teacher observing and assessing the pedagogy of a novice or struggling veteran teacher. School districts that use peer review, however, often link the practice with teacher intervention, mentoring programs, and, in some instances, hiring and firing decisions. Columbus, Ohio's Peer Assistance and Review Program, seemingly representative of many review systems, releases expert teachers from classroom responsibilities to act as teaching consultants. Driven by the National Education Association's 1997 decision to reverse its opposition to peer review, the idea has enjoyed a resurgence of popularity in recent years.

Founded in 1987, the National Board for Professional Teaching Standards (NBPTS) is yet another example of people from different constituencies working together to advance the concept of teaching as a profession. The NBPTS attempts to identify and reward the highest caliber teachers, those who represent the top end of the quality distribution. Based on the medical profession's concept of board-certified physicians, the NBPTS bestows certification only on those teachers who meet what board representatives perceive to be the highest performance standards. By the end of the year 2000, nearly 10,000 teachers had received board certification–though this amounts to a tiny fraction of the nation's 2.6 million teachers. Widespread political and financial support, from both political conservatives and liberals, suggests this idea may have staying power.

Teacher evaluation will grow and develop as the concept of teaching as a profession evolves. Computer technology is only beginning to suggest how new methods of formative and summative evaluation can alter the landscape. Perhaps most important is that as reformers confront the realities of life in schools, public knowledge of what it means to be a teacher increases. More people in more walks of life are recognizing how complex and demanding teaching can be, and how important teachers are to society as a whole. Teacher evaluators of the future will demonstrate much higher levels of knowledge and skill than their predecessors, leaving the teaching profession better than they found it.

See also: SUPERVISION OF INSTRUCTION.

BIBLIOGRAPHY

BARR, ARVIL. 1948. "The Measurement and Prediction of Teaching Efficiency: A Summary of Investigations." Journal of Experimental Education 16 (4):203–283.

BROPHY, JERE, and GOOD, THOMAS. 1986. "Teacher Behavior and Student Achievement." In Handbook of Research on Teaching, ed. Merlin C. Wittrock. New York: Macmillan.

DOMAS, SIMEON J., and TIEDEMAN, DAVID V. 1950. "Teacher Competence: An Annotated Bibliography." Journal of Experimental Education 19:99–218.

DUKE, DANIEL L.; TUCKER, PAMELA; and HEINECKE, WALTER. 2000. Initial Responses of Virginia High Schools to the Accountability Initiative. Charlottesville, VA: Thomas Jefferson Center for Educational Design, University of Virginia.

GAGE, NATHANIEL L., and NEEDELS, MARGARET C. 1989. "Process-Product Research on Teaching: A Review of Criticisms." The Elementary School Journal 89 (3):253–300.

GETZELS, JACOB W., and JACKSON, PHILIP W. 1963. "The Teacher's Personality and Characteristics." In Handbook of Research on Teaching: A Project of the American Educational Research Association, ed. Nathaniel L. Gage. New York: Macmillan.

HERBERT, JOANNE M. 1999. "An Online Learning Community: Technology Brings Teachers Together for Professional Development." American School Board Journal March:39–40.

MCNERGNEY, ROBERT F.; HERBERT, JOANNE M.; and FORD, R. E. 1993. "Anatomy of a Team Case Competition." Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, Georgia.

MEDLEY, DONALD M. 1979. "The Effectiveness of Teachers." In Research on Teaching: Concepts, Findings, and Implications, ed. Penelope L. Peterson and Herbert J. Walberg. Berkeley, CA: McCutchan.

MEDLEY, DONALD M.; COKER, HOMER; and SOAR, ROBERT S. 1984. Measurement-Based Evaluation of Teacher Performance: An Empirical Approach. New York: Longman.

MEYER, JOHN W., and ROWAN, BRIAN. 1977. "Institutionalized Organizations: Formal Structure as Myth and Ceremony." American Journal of Sociology 83 (2):340–363.

PETERSON, KENNETH D. 2000. Teacher Evaluation: A Comprehensive Guide to New Directions and New Practices, 2nd edition. Thousand Oaks, CA: Corwin Press.

POWELL, MARJORIE, and BEARD, JOSEPH W. 1984. Teacher Effectiveness: An Annotated Bibliography and Guide to Research. New York: Garland.

SANDERS, WILLIAM L., and RIVERS, JUNE C. 1996. Cumulative and Residual Effects of Teachers on Future Student Academic Achievement. Knoxville: University of Tennessee Value-Added Research and Assessment Center.

SCRIVEN, MICHAEL. 1967. "The Methodology of Evaluation." In Perspectives of Curriculum Evaluation, ed. Ralph W. Tyler, Robert M. Gagné, and Michael Scriven. Chicago: Rand McNally.

ROBERT F. MCNERGNEY

SCOTT R. IMIG

METHODS

In the decade from 1991 to 2001, a number of developments in public policy and assessment practices significantly altered the landscape for teacher evaluation practices. The single most important shift in the public policy arena has been the emergence of a tidal wave of support for what is loosely called "teacher accountability." What this seems to mean in effect is a growing insistence on measurement of teacher quality and teacher performance in terms of student achievement, which is often poorly defined, crudely measured, and unconnected to what educators regard as significant learning.

There is still little consensus about acceptable ways to meet the substantial challenges of linking measures of student achievement to conclusions about teacher effectiveness. That this issue nonetheless dominates current discourse about teacher evaluation is significant, and somewhat alarming. Neither the effort nor the issue is new; what is new is the heated insistence that student achievement is the single most important criterion for establishing a teacher's effectiveness. Simply put, most past efforts to connect student achievement to individual teacher performance have foundered on the following weaknesses:

The measurement does not take into account teaching context as a performance variable.
The measurement is unreliable, in part because it does not include time as a variable–both the teacher's time with a cohort of students and some model or models of sufficient time to see learning effects in students.
The measures used to reflect student achievement are not congruent with best practice and philosophy of instruction in modern education.

The link between teacher performance and student achievement is both so intuitively compelling as a major part of a teacher's performance evaluation and so very difficult to implement that it has never really been systematically achieved in the United States. The pressure to forge such links is immense in the early twenty-first century, and it is critical to the health and vitality of the education workforce that the link be credible and valid. A foundational validity issue is, of course, the quality and integrity of the methods states and districts have developed or adopted to measure student achievement. The teaching workforce has long disdained standardized national tests, the most commonly used assessments in school districts across the United States to represent student achievement, arguing persuasively that actual local and state curricula–and thus instruction–are not adequately aligned (or aligned at all) with the content of these tests. Furthermore, education reformers have almost universally excoriated these tests for two decades as reductive and not representative of the skills and abilities students really need to develop for the new millennium.

An evaluative commentary on the use of student tests for the purpose of high stakes accountability decisions was given by incoming American Educational Research Association President Robert Linn (2002), who evaluated fifty years of student testing in the U.S. education system, and the effects of that testing:

I am led to conclude that in most cases the instruments and technology have not been up to the demands that have been placed on them by high-stakes accountability. Assessment systems that are useful monitors lose much of their dependability and credibility for that purpose when high stakes are attached to them. The unintended negative effects of the high-stakes accountability uses often outweigh the intended positive effects. (p. 14)

Given the policy climate in the early twenty-first century, this is a sobering and cautionary conclusion, coming as it does from such a major figure in the measurement community, and one known for his even-handed and judicious treatment of measurement issues. It is clear that the most widely used current measures of student achievement, primarily standardized norm-referenced multiple-choice tests developed and sold off-the-shelf by commercial test publishers, are useful for many educational purposes, but not valid for school accountability. Indeed, they may be positively misleading at the school level, and certainly a distortion of teaching effectiveness at the individual teacher level. Concerns about the increased dependence on high-stakes testing have prompted a number of carefully worded technical cautions from important policy bodies as well. Although it is possible to imagine a program of student testing that aligns the assessments used to the standards for learning and to the curriculum actually taught, and that employs multiple methods and occasions to evaluate student learning, the investment such a program would demand would increase the cost of student assessment significantly. Furthermore, involving teachers in the conceptual development and interpretation of assessment measures that would be instructionally useful (particularly when those measures may have a direct effect on the teachers' performance evaluation and livelihood) is no closer to the realities of assessment practice than it has ever been; it is, in general, simply not part of the practice of school districts in the United States.

The emphasis on teacher quality has gained considerable momentum from the body of empirical evidence substantiating the linkage between teacher competence and student achievement. The "value-added" research, typified by the work of William Sanders and colleagues (1996; 1997; 1998) reinforces the assumption that the teacher is the most significant factor that affects student achievement. Sanders's work in this area is the best known and, increasingly, most influential among policymakers. In the measurement community, however, independent analyses of Sanders's data and methods have just begun. There appear to be controversial issues associated with both the statistical model Sanders uses and replicability of his findings.
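The basic intuition behind "value-added" measures can be shown with a deliberately simplified gain-score calculation. This is not the statistical model Sanders uses, which is considerably more elaborate; it is only a sketch of the underlying idea of comparing students' growth across teachers. The records, teacher labels, and function names below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, prior-year score, current-year score).
records = [
    ("Teacher X", 50, 58), ("Teacher X", 70, 76), ("Teacher X", 60, 70),
    ("Teacher Y", 55, 57), ("Teacher Y", 80, 84), ("Teacher Y", 65, 68),
]


def mean_gain_by_teacher(records):
    """Average year-over-year score gain of each teacher's students."""
    gains = defaultdict(list)
    for teacher, prior, current in records:
        gains[teacher].append(current - prior)
    return {teacher: mean(g) for teacher, g in gains.items()}


def relative_gain(records):
    """Each teacher's mean gain minus the mean gain across all students."""
    overall = mean([current - prior for _, prior, current in records])
    return {t: g - overall for t, g in mean_gain_by_teacher(records).items()}


for teacher, value in relative_gain(records).items():
    print(f"{teacher}: {value:+.1f} points relative to the overall mean gain")
```

A calculation this simple ignores teaching context, the time a teacher spends with a cohort, and measurement error in the tests, which are among the weaknesses listed earlier and part of why the more sophisticated models remain under scrutiny.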

Teacher Evaluation

At the beginning of the twenty-first century there is more teacher testing for various purposes than ever before. Some of this testing serves traditional purposes; for example, for admission into programs of professional preparation in colleges and universities or for licensure. For the first time in the United States there is a high-stakes assessment for purposes of certification, the National Board for Professional Teaching Standards (NBPTS) certification assessments, which are modeled on medical specialty board certification. Finally, there is a growing use of performance assessments of actual teaching for both formative purposes–during a teacher's initial years of practice, or the induction period–and for summative purposes, to grant an initial or more advanced teaching license. Performance-assessment-based licensure has been in place in Connecticut since 2000 and is being implemented in 2002 in Ohio and in 2003 in Arkansas. In addition, California plans to implement a teaching performance assessment for all of its beginning teachers starting in 2004.

Both the policy climate and the standards movement have had profound effects on teacher testing. States set passing standards on licensing tests, often rigorous, for demonstrations of sufficient skill and knowledge to be licensed. For example, as of the year 2000, thirty-nine states require all licensed teachers to pass a basic skills test (reading, mathematics, and writing), twenty-nine require secondary teachers to pass subject-specific tests in their prospective teaching fields, and thirty-nine require prospective secondary teachers to have a major, minor, or equivalent course credits for a subject-specific license. This means that a number of states require all three hurdles to be cleared before granting a license. In addition, most states require that the teacher's preparation institution recommend the candidate for the license. In every state but New Jersey, however, the state has the power to waive all of these requirements "either by granting licenses to individuals who have not met them or by permitting districts to hire such people" (Edwards, p. 8). And, perhaps most discouraging, only about twenty-five of the fifty states even have accessible records of "the numbers and percentages of teachers who hold various waivers" (Jerald and Boser, p. 44). Thus, reliance on rigorous state testing and preparation requirements to assure the quality of the education workforce is likely to lead to disappointment.

In 2000, thirty-six of the thirty-nine states that require teachers to pass a basic skills test waived that requirement and permitted a teacher to enter a classroom as teacher of record without passing the test. In sixteen states, this waiver can be renewed indefinitely, so long as the hiring school district asserts its inability to find a qualified applicant. Of the twenty-nine states that require secondary teachers to pass subject matter exams–most often only multiple-choice tests, even though more sophisticated tests are available–only New Jersey denies a license and therefore a job to candidates who have not passed the tests. Eleven of these twenty-nine states allow such candidates to remain in the job indefinitely, and all twenty-nine but New Jersey waive the course work completion requirement for secondary teachers if the hiring district claims that it cannot find a more qualified applicant for the position.

Thus, while initial licensing tests have become increasingly sophisticated–they are based on K–12 student and disciplinary standards and offer both multiple-choice and constructed response formats–the requirements for their use are not only widely variable, but also not rigorously enforced.

As of 2001 the NBPTS certification assessments represent the first-ever, widely accepted national recognition of excellence in the teaching profession. The program has grown exponentially since 1994 when eighty-six teachers, the first National Board Certified Teachers, were announced. In 2001 approximately 14,000 candidates in nineteen different fields were assessed; the NBPTS expects that the number of National Board Certified Teachers nationwide will rise from 9,534 (2000) to approximately 15,000. The certification assessment consists of a classroom-based portfolio, including videotapes and student work samples with detailed analytical commentaries by the teacher-candidates, and a computer-delivered written assessment focused primarily on content and pedagogical-content knowledge. The NBPTS assessments have established a number of new benchmarks for teacher evaluation. The assessments themselves are both elaborate and very rigorous; it takes approximately nine months for a teacher to complete the assessment process. Almost universally regarded by candidates as the single most profound learning and professional development experience they have ever had, the assessment process is being widely used as a model for teacher professional development. The scoring process, which requires extensive training of peer teachers, is itself a substantial professional development opportunity.

In addition, the technical quality of the scoring has contradicted the long-held opinion that complex human judgments sacrifice reliability (consistency) for validity (credibility). The NBPTS scoring reliability is extremely high. The expense of the assessment ($2,300 per candidate in 2002) has been borne largely by states and local governments as part of their support for teacher quality initiatives. That level of public support for a high-stakes, voluntary assessment is unprecedented in education in the United States. In 2001, the NBPTS published the first in a series of validity studies that showed substantive differences between National Board Certified Teachers and non-certified teachers in terms of what actually goes on in the classrooms and in student learning.

The third area of change and innovation in teacher evaluation has taken place in states' provision for mentoring and formative assessment in the initial period of a beginning teacher's career, a period commonly called the induction period. States vary in the nature of the support they provide, with twenty-eight states requiring or providing funds for beginning-teacher induction programs, but only ten states doing both. The most sophisticated induction programs exist in Connecticut (the Connecticut BEST program), California (the CFASST program), and Ohio (the Ohio FIRST program). Each of these programs uses structured portfolio-based learning experiences to guide a new teacher and a mentor through a collaborative first year of practice.

Few states assess the actual teaching performance of new teachers. Twenty-seven states require that the school principal evaluate each new teacher. As of 2000, only four states (Kentucky, Louisiana, Oklahoma, and South Carolina) go beyond this requirement and require that the principal and a team of other educators from outside the school, trained to a common set of criteria, participate in the new teacher's evaluation. As of 2001, Connecticut, New York, and Ohio will all have performance-based licensure tests for beginning teachers at the end of the first or second year of teaching. Connecticut requires a subject-specific teaching portfolio; New York requires a videotape of teaching; Ohio will use Praxis III, an observation-based licensing assessment developed by Educational Testing Service. In 2002, Arkansas will begin using Praxis III as well; by 2004, California will make its work-sample-based Teaching Performance Assessment operational for initial licensure of all California teachers.

BIBLIOGRAPHY

BOND, LLOYD; SMITH, TRACY; BAKER, WANDA K.; and HATTIE, JOHN A. 2000. The Certification System of the National Board for Professional Teaching Standards: A Construct and Consequential Validity Study. Washington, DC: National Board for Professional Teaching Standards.

EDWARDS, VIRGINIA B. 2000. "Quality Counts 2000." Education Week 19 (18) [entire issue].

EDWARDS, VIRGINIA B., ed. 2000. "Who Should Teach? The States Decide." Education Week 19 (18):8–9.

ELMORE, RICHARD F., and ROTHMAN, ROBERT, eds. 1999. Testing, Teaching, and Learning: A Guide for States and School Districts. Washington, DC: National Academy Press.

FEUER, MICHAEL J., et al., eds. 1999. Uncommon Measures: Equivalence and Linkage Among Educational Tests. Washington, DC: National Academy Press.

HEUBERT, JAY P., and HAUSER, ROBERT M., eds. 1999. High Stakes: Testing for Tracking, Promotion, and Graduation. Washington, DC: National Academy Press.

JERALD, CRAIG D., and BOSER, ULRICH. 2000. "Setting Policies for New Teachers." Education Week 19 (18):44–47.

KORETZ, DANIEL, et al. 2001. New Work on the Evaluation of High-Stakes Testing Programs. Symposium conducted at the National Council on Measurement in Education's Annual Meeting, Seattle, WA.

LINN, ROBERT L. 2000. "Assessments and Accountability." Educational Researcher 29:4–16.

MADAUS, GEORGE E., and O'DWYER, LAURA M. 1999. "A Short History of Performance Assessment: Lessons Learned." Phi Delta Kappan 80:688–695.

MILLMAN, JASON, ed. 1997. Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluation Measure? Thousand Oaks, CA: Corwin.

SANDERS, WILLIAM L., and HORN, SANDRA P. 1998. "Research Findings from the Tennessee Value-Added Assessment System (TVAAS) Database: Implications for Educational Evaluation and Research." Journal of Personnel Evaluation in Education 12:247–256.

SANDERS, WILLIAM L., and RIVERS, JUNE C. 1996. Cumulative and Residual Effects of Teachers on Future Student Academic Achievement. Knoxville, TN: University of Tennessee Value-Added Research and Assessment Center.

WRIGHT, S. PAUL; HORN, SANDRA P.; and SANDERS, WILLIAM L. 1997. "Teacher and Classroom Context Effects on Student Achievement: Implications for Teacher Evaluation." Journal of Personnel Evaluation in Education 11:57–67.

INTERNET RESOURCES

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION. 2000. "AERA Position Statement Concerning High-Stakes Testing in PreK–12 Education." <www.aera.net/about/policy/stakes.htm>.

BARTON, PAUL. 1999. "Too Much Testing of the Wrong Kind: Too Little of the Right Kind in K–12 Education." <www.ets.org/research/pic>.

MARI A. PEARLMAN
