Thomas R. Guskey

Howard R. Pollio


Few issues have created more controversy among educators than those associated with grading and reporting student learning. Despite the many debates and multitudes of studies, however, prescriptions for best practice remain elusive. Although teachers generally try to develop grading policies that are honest and fair, strong evidence shows that their practices vary widely, even among those who teach at the same grade level within the same school.

In essence, grading is an exercise in professional judgment on the part of teachers. It involves the collection and evaluation of evidence on students' achievement or performance over a specified period of time, such as nine weeks, an academic semester, or an entire school year. Through this process, various types of descriptive information and measures of students' performance are converted into grades or marks that summarize students' accomplishments. Although some educators distinguish between grades and marks, most consider these terms synonymous. Both imply a set of symbols, words, or numbers that are used to designate different levels of achievement or performance. They might be letter grades such as A, B, C, D, and F; symbols such as ✓+, ✓, and ✓−; descriptive words such as Exemplary, Satisfactory, and Needs Improvement; or numerals such as 4, 3, 2, and 1. Reporting is the process by which these judgments are communicated to parents, students, or others.

A Brief History

Grading and reporting are relatively recent phenomena in education. In fact, prior to 1850, grading and reporting were virtually unknown in schools in the United States. Throughout much of the nineteenth century most schools grouped students of all ages and backgrounds together with one teacher in one-room schoolhouses, and few students went beyond elementary studies. The teacher reported students' learning progress orally to parents, usually during visits to students' homes.

As the number of students increased in the late 1800s, schools began to group students in grade levels according to their age, and new ideas about curriculum and teaching methods were tried. One of these new ideas was the use of formal progress evaluations of students' work, in which teachers wrote down the skills each student had mastered and those on which additional work was needed. This was done primarily for the students' benefit, since they were not permitted to move on to the next level until they demonstrated their mastery of the current one. It was also the earliest example of a narrative report card.

With the passage of compulsory attendance laws at the elementary level during the late nineteenth and early twentieth centuries, the number of students entering high schools increased rapidly. Between 1870 and 1910 the number of public high schools in the United States increased from 500 to 10,000. As a result, subject area instruction in high schools became increasingly specific and student populations became more diverse. While elementary teachers continued to use written descriptions and narrative reports to document student learning, high school teachers began using percentages and other similar markings to certify students' accomplishments in different subject areas. This was the beginning of the grading and reporting systems that exist today.

The shift to percentage grading was gradual, and few American educators questioned it. The practice seemed a natural by-product of the increased demands on high school teachers, who now faced classrooms with growing numbers of students. But in 1912 a study by two Wisconsin researchers seriously challenged the reliability of percentage grades as accurate indicators of students' achievement.

In their study, Daniel Starch and Edward Charles Elliott showed that high school English teachers in different schools assigned widely varied percentage grades to two identical papers from students. For the first paper the scores ranged from 64 to 98, and for the second from 50 to 97. Some teachers focused on elements of grammar and style, neatness, spelling, and punctuation, while others considered only how well the message of the paper was communicated. The following year Starch and Elliott repeated their study using geometry papers submitted to math teachers and found even greater variation in math grades. Scores on one of the math papers ranged from 28 to 95, a 67-point difference. While some teachers deducted points only for a wrong answer, many others took neatness, form, and spelling into consideration.

These demonstrations of wide variation in grading practices led to a gradual move away from percentage scores to scales that had fewer and larger categories. One was a three-point scale that employed the categories of Excellent, Average, and Poor. Another was the familiar five-point scale of Excellent, Good, Average, Poor, and Failing (or A, B, C, D, and F). This reduction in the number of score categories served to reduce the variation in grades, but it did not solve the problem of teacher subjectivity.

To ensure a fairer distribution of grades among teachers and to bring into check the subjective nature of scoring, the idea of grading based on the normal probability, bell-shaped curve became increasingly popular. By this method, students were simply rank-ordered according to some measure of their performance or proficiency. A top percentage was then assigned a grade of A, the next percentage a grade of B, and so on. Some advocates of this method even specified the precise percentages of students that should be assigned each grade, such as the 6-22-44-22-6 system.
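The rank-and-assign procedure can be made concrete with a short sketch. The Python below is an illustration only, not a procedure taken from any historical source: the function name `grade_on_curve`, the rounding rule for class sizes that do not divide evenly, and the arbitrary handling of tied scores are all assumptions. It applies the 6-22-44-22-6 split to a hypothetical class of fifty students.

```python
def grade_on_curve(scores, shares=(6, 22, 44, 22, 6), letters="ABCDF"):
    """Assign letter grades by rank alone, using fixed percentage shares.

    Assumptions (not from the source): cumulative shares are rounded to the
    nearest whole student, and tied scores are split in list order.
    """
    n = len(scores)
    # Rank student positions from highest score to lowest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    grades = [None] * n
    start = 0
    cumulative = 0
    for share, letter in zip(shares, letters):
        cumulative += share
        end = round(n * cumulative / 100)  # last rank that receives this letter
        for i in order[start:end]:
            grades[i] = letter
        start = end
    return grades


# Hypothetical class: fifty students with scores 51 through 100.
scores = list(range(51, 101))
grades = grade_on_curve(scores)
print({letter: grades.count(letter) for letter in "ABCDF"})
# prints {'A': 3, 'B': 11, 'C': 22, 'D': 11, 'F': 3}
```

In this sketch the grade depends only on a student's rank. The same three students would fail if every score in the class were above 90, because the shares are fixed before any work is examined.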

Grading on the curve was considered appropriate at that time because it was well known that the distribution of students' intelligence test scores approximated a normal probability curve. Since innate intelligence and school achievement were thought to be directly related, such a procedure seemed both fair and equitable. Grading on the curve also relieved teachers of the difficult task of having to identify specific learning criteria. Fortunately, most educators of the early twenty-first century have a better understanding of the flawed premises behind this practice and of its many negative consequences.

In the years that followed, the debate over grading and reporting intensified. A number of schools abolished formal grades altogether, believing they were a distraction in teaching and learning. Some schools returned to using only verbal descriptions and narrative reports of student achievement. Others advocated pass/fail systems that distinguished only between acceptable and failing work. Still others advocated a mastery approach, in which the only important factor was whether or not the student had mastered the content or skill being taught. Once mastered, that student would move on to other areas of study.

At the beginning of the twenty-first century, lack of consensus about what works best has led to wide variation in teachers' grading and reporting practices, especially among those at the elementary level. Many elementary teachers continue to use traditional letter grades and record a single grade on the reporting form for each subject area studied. Others use numbers or descriptive categories as proxies for letter grades. They might, for example, record a 1, 2, 3, or 4, or they might describe students' achievement as Beginning, Developing, Proficient, or Distinguished. Some elementary schools have developed standards-based reporting forms that record students' learning progress on specific skills or learning goals. Most of these forms also include sections for teachers to evaluate students' work habits or behaviors, and many provide space for narrative comments.

Grading practices are generally more consistent and much more traditional at the secondary level, where letter grades still dominate reporting systems. Some schools attempt to enhance the discriminatory function of letter grades by adding plusses or minuses, or by pairing letter grades with percentage indicators. Because most secondary reporting forms allow only a single grade to be assigned for each course or subject area, however, most teachers combine a variety of diverse factors into that single symbol. In some secondary schools, teachers have begun to assign multiple grades for each course in order to separate achievement grades from marks related to learning skills, work habits, or effort, but such practices are not widespread.

Research Findings

Over the years, grading and reporting have remained favorite topics for researchers. A review of the Educational Resources Information Center (ERIC) system, for example, yields a reference list of more than 4,000 citations. Most of these references are essays about problems in grading and what should be done about them. The research studies consist mainly of teacher surveys. Although this literature is inconsistent both in the quality of studies and in results, several points of agreement exist. These points include the following:

Grading and reporting are not essential to the instructional process. Teachers do not need grades or reporting forms to teach well, and students can and do learn many things well without them. It must be recognized, therefore, that the primary purpose of grading and reporting is other than facilitation of teaching or learning.

At the same time, significant evidence shows that regularly checking on students' learning progress is an essential aspect of successful teaching, but checking is different from grading. Checking implies finding out how students are doing, what they have learned well, what problems or difficulties they might be experiencing, and what corrective measures may be necessary. The process is primarily a diagnostic and prescriptive interaction between teachers and students. Grading and reporting, however, typically involve judgment of the adequacy of students' performance at a particular point in time. As such, they are primarily evaluative and descriptive.

When teachers do both checking and grading, they must serve dual roles as both advocate and judge for students, roles that are not necessarily compatible. Ironically, this incompatibility is usually recognized when administrators are called on to evaluate teachers, but it is generally ignored when teachers are required to evaluate students. Finding a meaningful compromise between these dual roles is discomforting to many teachers, especially those with a child-centered orientation.

Grading and reporting serve a variety of purposes, but no one method serves all purposes well. Various grading and reporting methods are used to: (1) communicate the achievement status of students to their parents and other interested parties; (2) provide information to students for self-evaluation; (3) select, identify, or group students for certain educational paths or programs; (4) provide incentives for students to learn; and (5) document students' performance to evaluate the effectiveness of instructional programs. Unfortunately, many schools try to use a single method of grading and reporting to achieve all of these purposes and end up achieving none of them very well.

Letter grades, for example, offer parents and others a brief description of students' achievement and the adequacy of their performance. But using letter grades requires the abstraction of a great deal of information into a single symbol. In addition, the cut-offs between grades are always arbitrary and difficult to justify. Letter grades also lack the richness of other, more detailed reporting methods such as narratives or standards-based reports.

These more detailed methods also have their drawbacks, however. Narratives and standards-based reports offer specific information that is useful in documenting student achievement. But good narratives take time to prepare, and as teachers complete more narratives, their comments become increasingly standardized. Standards-based reports are often too complicated for parents to understand and seldom communicate the appropriateness of student progress. Parents often are left wondering if their child's achievement is comparable with that of other children or in line with the teacher's expectations.

Because no single grading method adequately serves all purposes, schools must first identify their primary purpose for grading, and then select or develop the most appropriate approach. This process involves the difficult task of seeking consensus among diverse groups of stakeholders.

Grading and reporting require inherently subjective judgments. Grading is a process of professional judgment, and the more detailed and analytic the grading process, the more likely it is that subjectivity will influence results. This is why, for example, holistic scoring procedures tend to have greater reliability than analytic procedures. However, being subjective does not mean that grades lack credibility or are indefensible. Because teachers know their students, understand various dimensions of students' work, and have clear notions of the progress made, their subjective perceptions can yield very accurate descriptions of what students have learned.

Negative consequences result when subjectivity translates to bias. This occurs when factors apart from students' actual achievement or performance affect their grades. Studies have shown, for example, that cultural differences among students, as well as their appearance, family backgrounds, and lifestyles, can sometimes result in biased evaluations of their academic performance. Teachers' perceptions of students' behavior can also significantly influence their judgments of academic performance. Students with behavior problems often have no chance to receive a high grade because their infractions overshadow their performance. These effects are especially pronounced in judgments of boys. Even the neatness of students' handwriting can significantly affect teachers' judgments. Training programs help teachers identify and reduce these negative effects and can lead to greater consistency in judgments.

Grades have some value as rewards, but no value as punishments. Although educators would undoubtedly prefer that motivation to learn be entirely intrinsic, the existence of grades and other reporting methods is an important factor in determining how much effort students put forth. Most students view high grades as positive recognition of their success, and some work hard to avoid the consequences of low grades.

At the same time, no studies support the use of low grades or marks as punishments. Instead of prompting greater effort, low grades usually cause students to withdraw from learning. To protect their self-image, many regard the low grade as irrelevant and meaningless. Other students may blame themselves for the low mark, but feel helpless to improve.

Grading and reporting should always be done in reference to learning criteria, never "on the curve." Although using the normal probability curve as a basis for assigning grades yields highly consistent grade distributions from one teacher to the next, there is strong evidence that it is detrimental to relationships among students and between teachers and students. Grading on the curve pits students against one another in a competition for the few rewards (high grades) distributed by the teacher. Under these conditions, students readily see that helping others threatens their own chances for success.

Modern research has also shown that the seemingly direct relationship between aptitude or intelligence and school achievement depends on instructional conditions. When the quality of instruction is high and well matched to students' learning needs, the magnitude of this relationship diminishes drastically and approaches zero. Moreover, the fairness and equity of grading on the curve is a myth.

Relating grading and reporting to learning criteria, however, provides a clearer picture of what students have learned. Students and teachers alike generally prefer this approach because they consider it fairer. The types of learning criteria teachers use for grading and reporting typically fall into three general categories:

  1. Product criteria are favored by advocates of standards-based approaches to teaching and learning. These educators believe the primary purpose of grading and reporting is to communicate a summative evaluation of student achievement and performance. In other words, they focus on what students know and are able to do at a particular point in time. Teachers who use product criteria base grades exclusively on final examination scores, final products (reports or projects), overall assessments, and other culminating demonstrations of learning.
  2. Process criteria are emphasized by educators who believe product criteria do not provide a complete picture of student learning. From this perspective, grading and reporting should reflect not just the final results but also how students got there. Teachers who consider effort or work habits when reporting on student learning are using process criteria. So are teachers who count regular classroom quizzes, homework, class participation, or attendance.
  3. Progress criteria, often referred to as improvement scoring, learning gain, or value-added grading, consider how much students have gained from their learning experiences. Teachers who use progress criteria look at how far students have come over a particular period of time, rather than just where they are. As a result, grading criteria may be highly individualized. Most of the research evidence on progress criteria in grading and reporting comes from studies of differentially paced instructional programs and special education programs.

Teachers who base their grading and reporting procedures on learning criteria typically use some combination of these three types. Most also vary the criteria they employ from student to student, taking into account individual circumstances. Although usually done in an effort to be fair, the result is a "hodgepodge grade" that includes elements of achievement, effort, and improvement.

Researchers and measurement specialists generally recommend the use of product criteria exclusively in determining students' grades. They point out that the more process and progress criteria come into play, the more subjective and biased grades are likely to be. If these criteria are included at all, they recommend reporting them separately.


The issues of grading and reporting on student learning continue to challenge educators. However, more is known at the beginning of the twenty-first century than ever before about the complexities involved and how certain practices can influence teaching and learning. To develop grading and reporting practices that provide quality information about student learning requires clear thinking, careful planning, excellent communication skills, and an overriding concern for the well-being of students. Combining these skills with current knowledge on effective practice will surely result in more efficient and more effective grading and reporting practices.


Over the course of an academic career the average student will be exposed to a variety of grading systems and procedures. Although some of these systems may be qualitative in nature, such as an annual or semiannual written narrative, the vast majority are quantitative and depend upon numerical or alphanumerical metrics. Perhaps the most familiar of these involves the letters "A" through "F," where "A" is usually given a value of 4.0 and is characterized in words as outstanding or excellent and "F" is given a value of 0.0 and is described as unsatisfactory or failing. The grades of A through F are usually derived from some more differentiated quantitative value such as a test score, and the specific relationship between grade and test score may take a variety of different forms (e.g., an A is defined by a score of 90% or better or by a value that falls in the top 5–10% of scores independent of absolute value, and so on). Regardless of the specific translation of test performance into letter grade, the point to keep in mind is that the A–F scale defines the most frequent grading system used in higher education over the past half century or more.

Variations in the Grading System

Like all prototypes, the A–F system admits many variations. These often take the form of plusses and minuses, thereby producing a scale having the possibility of fifteen distinct units: A+, A, A–, B+, B … F–. In actual practice, the grade of A+ is scarcely ever used, and the same is true for D+, D–, F+, and F–, thereby yielding a scale of eight to ten units. Generally speaking, the greater the number of units in the grading system, the more precisely it hopes to quantify student performance. What is interesting in this regard are fluctuations in the actual number of units used in different historical eras. Without going too deeply into the relevant historical facts, it is clear that certain historical periods, such as the 1960s, reduced the grading system to two or so units, Pass and No Credit (P/NC), whereas other periods, such as the 1980s, expanded it to ten, eleven, or twelve units.

Variations in the breadth of the grading system would seem to have significant educational implications. At a minimum, these differences may be taken to imply that scales having a large number of units indicate a relative comfort in making precise distinctions, whereas those having fewer units suggest a relative discomfort in making such distinctions. In the case of more differentiated systems, distinctions and rankings are significant, and individual achievement is emphasized; in the case of less differentiated systems, distinctions and rankings are de-emphasized and interstudent competition is minimized. To some degree, it is possible to view fluctuations in American grading systems as reflecting a more general ambivalence the society has in regard to competition and cooperation, between individual recognition and social equity. Educational institutions sometimes emphasize strict evaluation, competition, and individual achievement, whereas at other times they emphasize less precise evaluation, cooperation, and sympathetic understanding for students of all achievement levels.

Another property of grading systems is that individual class grades often are combined to produce an overall metric called the grade point average or GPA. Unlike its constituent values, which usually are carried to only one (or no) numerically significant place, the GPA presents a metric of 400 units, yielding the possibility that a GPA of 3.00 will locate the student in the category of "good" whereas a value of 2.99 will exclude him or her from this category. In the same way, honors, admission to graduate school, preliminary selection for interviews by a desirable company, and so forth, may be defined by a single point difference on the GPA scale (e.g., 3.50 versus 3.49 for Phi Beta Kappa, etc.).
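A small sketch may help show how such a threshold can turn on a single course grade. The Python below is illustrative and rests on several assumptions not stated in the text: the intermediate point values (B = 3.0, C = 2.0, D = 1.0), weighting by credit hours, rounding to two decimal places, and the hypothetical transcripts themselves. Only the endpoints A = 4.0 and F = 0.0 and the 3.00 cut-off for "good" come from the discussion above.

```python
# Point values: A = 4.0 and F = 0.0 follow the text; B, C, and D are the
# conventional intermediate values and are an assumption here.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def gpa(courses):
    """Credit-weighted mean of grade points, rounded to two decimals.

    `courses` is a list of (letter_grade, credit_hours) pairs.
    """
    total_points = sum(GRADE_POINTS[letter] * credits for letter, credits in courses)
    total_credits = sum(credits for _, credits in courses)
    return round(total_points / total_credits, 2)


# Two hypothetical transcripts of 32 three-credit courses that differ in
# exactly one course grade.
student_1 = [("B", 3)] * 32
student_2 = [("B", 3)] * 31 + [("C", 3)]

for name, transcript in [("student 1", student_1), ("student 2", student_2)]:
    value = gpa(transcript)
    print(name, value, "good" if value >= 3.00 else "below the cut-off")
# prints:
# student 1 3.0 good
# student 2 2.97 below the cut-off
```

Under these assumptions, one grade in one course out of thirty-two is enough to move a student across the category boundary.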

Because GPAs are significant in categorizing student performance, a number of evaluations have been made of their reliability and validity. One issue to be addressed here concerns field of study, where it is well documented that classes in the natural sciences and business produce lower overall grades than those in the humanities or social sciences. What this means is that it is unreasonable to equate grade values across disciplines. It also suggests that the GPA is composed of unequal components and that students may be able to secure a higher GPA by a judicious selection of courses.

Although other factors may be mentioned aside from academic discipline (such as SAT level of school, quality and nature of tests, etc.), the conclusion must be that the GPA is a poor measure and should not be used by itself in coming to significant decisions about the quality of student performance or differences between departments and/or educational institutions. The GPA is also a relatively poor basis on which to predict future performance, which perhaps explains why such attempts are never very impressive. In fact, a number of meta-analyses of this relationship, conducted every ten years or so since 1965, reveal that the median correlation between GPA and future performance is 0.18, a value that is neither very useful nor impressive. The strongest relationship between GPA and future achievement is usually found between undergraduate GPA and first-year performance in graduate or professional school.

Despite such difficulties in understanding the exact meanings of grades and the GPA, they remain important social metrics and sometimes yield heated discussions over issues such as grade inflation. Although grade inflation has many different meanings, it usually is defined by an increase in the absolute number of As and Bs over some period of years. The tacit assumption here seems to be that any continuing increase in the overall percentage of "good grades" or in the overall GPA implies a corresponding decline in academic standards. Although historically there have been periods in which the number of good grades decreased (so-called grade deflation), significant social concerns usually only accompany the grade inflation pattern. This one-sided emphasis suggests that grade inflation is as much a sociopolitical issue as an educational one and depends upon the dubious equating of grades with money. What really seems of concern here is a value issue, not a cogent analogy that reveals anything significant about grades or money.

How Grades Are Produced

Grading systems represent just one aspect of an interconnecting network of educational processes, and any attempt to describe grading systems without considering other aspects of this network must necessarily be incomplete. Perhaps the most important of these processes concerns the procedures used to produce grades in the first place, namely, the classroom test. Here, of course, are purely formal differences; for example, between multiple choice and essay tests, or between in-class and take-home tests or papers. Also to be included is the quality of the test items themselves, not only in terms of content but also in terms of the clarity of the question and, in the case of multiple choice tests, of the distractors.

One way to capture the complexity of possible ways in which grades are produced is to consider the set of implicit choices that lie behind an instructor's use of a specific testing and/or grading procedure. Included here are such questions as: What evaluation procedure should I use? Term papers, classroom discussions, or in-class tests? If I choose tests, what kind(s)? Essay, true/false, fill-in-the-blank, matching, or multiple-choice? If I choose multiple-choice, what grading model should I use? Normal curve, percent-correct, improvement over preceding tests? If I choose percent-correct, how many tests should I give? Final only, two in-class tests and a final, one midterm and one final? How should I weight each test if I choose the midterm-final pattern? Midterm equals final, midterm is equivalent to twice the final exam grade, final equals twice the midterm grade? What grade report system should I use? P/F; A, B, C, D, F; or A+, A, A–, B+, … F? An examination of this collection of possible choices suggests that instructors have a large number of options as to how to go about testing and grading their students.
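The weighting question in particular lends itself to a worked example. The sketch below is hypothetical: the function name `composite`, the scheme labels, and the student's scores (95 on the midterm, 83 on the final) are invented for illustration. The 90 percent cut-off for an A is the percent-correct definition mentioned earlier in this article.

```python
def composite(midterm, final, midterm_weight, final_weight):
    """Weighted average of two percent-correct test scores."""
    total_weight = midterm_weight + final_weight
    return (midterm_weight * midterm + final_weight * final) / total_weight


# The three midterm-final weighting patterns listed in the text,
# written as (midterm_weight, final_weight).
SCHEMES = {
    "midterm equals final": (1, 1),
    "midterm counts double": (2, 1),
    "final counts double": (1, 2),
}

# Hypothetical student: 95 on the midterm, 83 on the final.
for name, (wm, wf) in SCHEMES.items():
    score = composite(95, 83, wm, wf)
    print(f"{name}: {score} -> {'A' if score >= 90 else 'below A'}")
# prints:
# midterm equals final: 89.0 -> below A
# midterm counts double: 91.0 -> A
# final counts double: 87.0 -> below A
```

With identical test performance, the student in this example earns an A under one weighting pattern and misses it under the other two, which shows how much an instructor's choices shape the reported grade.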

Any consideration of the ways in which testing and grading relate to one another must also deal with the ways in which one or both of these activities relate to learning and teaching. The relationship between learning and testing is a fairly direct (if neglected) one, especially if tests are used not only to evaluate student achievement but also to reinforce or promote learning itself. Thus it is easy to develop a classroom question or exercise that requires the student to read some material before being able to answer the question or complete the exercise. Teaching, on the other hand, would seem to be somewhat further removed from issues of testing and grading, although the specific testing and grading plan used by the instructor does inform the student as to what constitutes relevant knowledge as well as what attitude the instructor holds toward precise evaluation and academic competition.

Students are not immune to testing and grade procedures, and educational researchers have made the distinction between students who are grade oriented and those who are learning oriented. Although this distinction is surely too one-dimensional, it does suggest that for some students the classroom is a place where they experience and enjoy learning for its own sake. For other students, however, the classroom is experienced as a crucible in which they are tested and in which the attainment of a good grade becomes more important than the learning itself. When students are asked how they became grade (or learning) oriented, they usually point to the actions of their teachers in emphasizing grades as a significant indicator of future success; alternatively, they describe instructors who are excited by promoting new learning in their classrooms. When college instructors are asked about the reason(s) for their emphasis on grades, they report that student behaviors, such as arguing over the scoring of a single question, make it necessary for them to maintain strict and well-defined grading standards in their classrooms. The ironic point is that both the student and the instructor see the "other" as emphasizing grades over learning, and neither sees this as a desirable state of affairs. What seems missing in this context is a clear recognition by both the instructor and the student that grades are best construed as a type of communication. When grades (and tests) are thought about in this way, they can be used to improve learning. As it now stands, however, the communicative purpose of grading is usually submerged in its more ordinary use as a means of rating and sorting students for social and institutional purposes not directly tied to learning. Only when grades are integrated into a coherent teaching and learning strategy do they serve the purpose of providing useful and meaningful feedback not only to the larger culture but to the individual student as well.



