Research Methods

School And Program Evaluation

Program evaluation is research designed to assess the implementation and effects of a program. Its purposes vary and can include (1) program improvement, (2) judging the value of a program, (3) assessing the utility of particular components of a program, and (4) meeting accountability requirements. Results of program evaluations are often used for decisions about whether to continue a program, improve it, institute similar programs elsewhere, allocate resources among competing programs, or accept or reject a program approach or theory. Through these uses, program evaluation is viewed as a way of rationalizing policy decision-making.

Program evaluation is conducted for a wide range of programs, from broad social programs such as welfare, to large multisite programs such as the preschool intervention program Head Start, to program funding streams such as the U.S. Department of Education's Title I program that gives millions of dollars to high-poverty schools, to small-scale programs with only one or a few sites such as a new mathematics curriculum in one school or district.

Scientific Research versus Evaluation

There has been some debate about the relationship between "basic" or scientific research and program evaluation. For example, in 1999 Peter Rossi, Howard Freeman, and Mark Lipsey described program evaluation as the application of scientific research methods to the assessment of the design and implementation of a program. In contrast, Michael Patton in 1997 described program evaluation not as the application of scientific research methods, but as the systematic collection of information about a program to inform decision-making.

Both agree, however, that in many circumstances the design of a program evaluation that is sufficient for answering evaluation questions and providing guidance to decision-makers would not meet the high standards of scientific research. Further, program evaluations are often not able to strictly follow the principles of scientific research because evaluators must confront the politics of changing actors and priorities, limited resources, short timelines, and imperfect program implementation.

Another dimension on which scientific research and program evaluation differ is their purpose. Program evaluations must be designed to maximize the usefulness for decision-makers, whereas scientific research does not have this constraint. Both types of research might use the same methods or focus on the same subject, but scientific research can be formulated solely from intellectual curiosity, whereas evaluations must respond to the policy and program interests of stakeholders (i.e., those who hold a stake in the program, such as those who fund or manage it, or program staff or clients).

How Did Program Evaluation Evolve?

Program evaluation began proliferating in the 1960s, with the dawn of social antipoverty programs and the government's desire to hold the programs accountable for positive results. Education program evaluation in particular also expanded because of the formal evaluation requirements of the National Science Foundation–sponsored mathematics and science curriculum reforms that were a response to the 1957 launch of Sputnik by the Soviet Union, as well as the evaluation requirements instituted as part of the Elementary and Secondary Education Act of 1965.

Experimentation versus Quasi-experimentation

The first large-scale evaluations in education were the subject of much criticism. Two influential early evaluations were Paul Berman and Milbrey McLaughlin's RAND Change Agent study (1973–1978) of four major federal programs: the Elementary and Secondary Education Act, Title VII (bilingual education), the Vocational Education Act, and the Right to Read Act; and a four-year study of Follow Through, which sampled 20,000 students and compared thirteen models of early childhood education. Critics charged that these evaluations were conducted under too short a time frame, used crude measures that did not look at incremental or intermediate change, had statistical inadequacies including invalid assumptions, used poorly supported models and inappropriate analyses, and did not consider the social context of the program.

These criticisms led to the promotion of the use of experiments for program evaluation. Donald Campbell wrote an influential article in 1969 advocating the use of experimental designs in social program evaluation. The Social Science Research Council commissioned Henry Riecken and Robert Boruch to write the 1978 book Social Experimentation, which served as both a "guidebook and manifesto" for using experimentation in program evaluation. The best example of the use of experimentation in social research is the New Jersey negative income tax experiment sponsored by the Office of Economic Opportunity of the federal Department of Health, Education, and Welfare.

Experiments are the strongest designs for assessing impact, because through random sampling from the population of interest and random assignment to treatment and control groups, experiments rule out other factors besides the program that might explain program success. There are several practical disadvantages to experiments, however. First, they require that the program be a partial coverage program; that is, there must be people who do not participate in the program and who can serve as the control group. Second, experiments require large amounts of resources that are not always available. Third, they require that the program be firmly and consistently implemented, which is frequently not the case. Fourth, experiments do not provide information about how the program achieved its effects. Fifth, program stakeholders sometimes feel that random assignment to the program is unethical or politically infeasible. Sixth, an experimental design in a field study is likely to produce no more than an approximation of a true experiment, because of such factors as systematic attrition from the program, which leaves the evaluator with a biased sample of participants (e.g., those who leave the program, or attrite, might be those who are the hardest to influence, so successful program outcomes would be biased in the positive direction).
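The attrition problem described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name simulate_attrition, the sample size, the size of the true effect, and the normally distributed gain scores are all hypothetical and are not drawn from any evaluation discussed in this entry.

```python
import random
from statistics import mean


def simulate_attrition(n=1000, true_effect=5.0, dropout_share=0.2, seed=0):
    """Return (full_sample_estimate, completers_only_estimate).

    All parameters are illustrative assumptions, not values from a real study.
    """
    rng = random.Random(seed)

    # Random assignment: shuffle participant ids, then split them in half.
    ids = list(range(n))
    rng.shuffle(ids)
    treated, control = ids[: n // 2], ids[n // 2:]

    # Hypothetical gain scores (posttest minus pretest) for each person.
    gain = {i: rng.gauss(true_effect, 3.0) for i in treated}
    gain.update({i: rng.gauss(0.0, 3.0) for i in control})

    control_mean = mean(gain[i] for i in control)
    full_estimate = mean(gain[i] for i in treated) - control_mean

    # Systematic attrition: the treated participants with the smallest gains
    # (the "hardest to influence") leave before outcomes are measured.
    n_drop = int(len(treated) * dropout_share)
    completers = sorted(treated, key=lambda i: gain[i])[n_drop:]
    completers_estimate = mean(gain[i] for i in completers) - control_mean

    return full_estimate, completers_estimate


full, completers_only = simulate_attrition()
print(f"All assigned participants: {full:.2f}")
print(f"Completers only:           {completers_only:.2f}")
```

Because the participants who leave are, by construction, those with the smallest gains, the completers-only estimate can never fall below the full-sample estimate, which is the positive bias the paragraph describes. Real attrition is rarely this clean, so the sketch shows the direction of the bias, not its likely size.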

When experiments are not appropriate or feasible, quasi-experimental techniques are used. Set forth by Donald Campbell and Julian Stanley in 1963, quasi-experimentation involves a number of different methods of conducting research that do not require random sampling and random assignment to treatment and control groups. One common example is an evaluation that matches the program participants to nonparticipants who share similar characteristics (e.g., race) and measures outcomes of both groups before and after the program. The challenge to quasi-experimentation is to rule out what Campbell and Stanley termed internal validity threats, or factors that might be alternative explanations for program results besides the program itself, which in turn would reduce confidence in the conclusions of the study. Unlike experimental design, which protects against just about all possible internal validity threats, quasi-experimental designs generally leave one or several of them uncontrolled.
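The matched before-and-after comparison mentioned above can be sketched as a simple calculation. The records, the matching characteristic, and the function name matched_diff_in_diff are hypothetical, chosen only to show the arithmetic; this is one of many possible quasi-experimental analyses, not a method prescribed by Campbell and Stanley.

```python
from statistics import mean

# Hypothetical records: (group, matching characteristic, pretest, posttest).
records = [
    ("program", "urban", 40, 52),
    ("program", "rural", 45, 55),
    ("comparison", "urban", 41, 46),
    ("comparison", "rural", 44, 48),
    ("comparison", "suburban", 60, 61),  # no matching participant; excluded
]


def matched_diff_in_diff(rows):
    """Program group's mean pre-to-post change minus the matched
    comparison group's mean pre-to-post change."""
    program = [r for r in rows if r[0] == "program"]
    # Keep only nonparticipants who share a characteristic with a participant.
    traits = {r[1] for r in program}
    comparison = [r for r in rows if r[0] == "comparison" and r[1] in traits]

    program_change = mean(post - pre for _, _, pre, post in program)
    comparison_change = mean(post - pre for _, _, pre, post in comparison)
    return program_change - comparison_change


print(matched_diff_in_diff(records))  # 11 - 4.5 = 6.5
```

Subtracting the comparison group's change removes influences that affect both groups equally, such as maturation or a general trend. It does not remove threats tied to how participants came to be in the program, because the groups were matched on an observed characteristic instead of being formed by random assignment.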


In addition to focusing on the relative strengths and weaknesses of experiments and quasi-experiments, criticisms of early large-scale education evaluations highlighted the importance of measuring implementation. For example, Berman and McLaughlin's RAND Change Agent study and the Follow Through evaluation demonstrated that implementation of a specific program can differ a great deal from one site to the next. If an evaluation is designed to attribute effects to a program, varying implementation of the same program reduces the value of the evaluation, because it is unclear how to define the program. Thus, it is necessary to include in a program evaluation a complete description of how the program is being implemented, to allow the examination of implementation fidelity to the original design, and to discover any cross-site implementation differences that would affect outcomes.

In 1967 Michael Scriven first articulated the idea that there were two types of evaluation: one focused on evaluating implementation, called formative evaluation, and one focused on evaluating the impact of the program, called summative evaluation. He argued that emerging programs should be the subject of formative evaluations, which are designed to see how well a program was implemented and to improve implementation; and that summative evaluations should be reserved for programs that have been well-established and have stable and consistent implementation.

Related to the idea of formative and summative evaluation is a controversy over the extent to which the evaluator should be a program insider or an objective third party. In formative evaluations, it can be argued that the evaluator needs to become somewhat of an insider, in order to become part of the formal and informal feedback loop that makes providing program improvement information possible. In contrast, summative evaluations conducted by a program insider foster little confidence in the results, because of the inherent conflict of interest.

Stakeholder and Utilization Approaches

Still another criticism of early education evaluations was that stakeholders felt uninvolved in the evaluations; did not agree with the goals, measures, and procedures; and thus rejected the findings. This discovery of the importance to the evaluation of stakeholder buy-in led to what Michael Patton termed stakeholder or utilization-focused evaluation. Stakeholder evaluation bases its design and execution on the needs and goals of identified stakeholders or users, such as the funding organization, a program director, the staff, or clients of the program.

In the context of stakeholder evaluation, Patton in 1997 introduced the idea that it is sometimes appropriate to conduct goal-free evaluation. He suggested that evaluators should be open to the idea of conducting an evaluation without preconceived goals because program staff might not agree with the goals and because the goals of the program might change over time. Further, he argued that goal-free evaluation avoids missing unanticipated outcomes, removes the negative connotation to side effects, eliminates perceptual biases that occur when goals are known, and helps to maintain evaluator objectivity. Goals are often necessary, however, to guide and focus the evaluation and to respond to the needs of policymakers. As a result, Patton argued that the use of goals in program evaluation should be decided on a case-by-case basis.

Theory-Based Evaluations

Besides stakeholder and goal-free evaluation, Carol Weiss in 1997 advocated for theory-based evaluations, or evaluations that are grounded in the program's theory of action. Theory-based evaluation aims to make clear the theoretical underpinnings of the program and use them to help structure the evaluation. In her support of theory-based evaluation, Weiss wrote that if the program theory is outlined in a phased sequence of cause and effect, then the evaluation can identify weaknesses in the system or at what point in the chain of effects results can be attributed. Also, articulating a programmatic theory can have positive benefits for the program, including helping the staff address conflicts, examine their own assumptions, and improve practice.

Weiss explained that theory-based approaches have not been widespread because there may be more than one theory that applies to a program and no guidance about which to choose, and because the process of constructing theories is challenging and time consuming. Further, theory-based approaches require large amounts of data and resources. A theory-based evaluation approach does, however, strengthen the rigor of the evaluation and link it more with scientific research, which by design is a theory-testing endeavor.

Data Collection Methods

Within different types of evaluation (e.g., formative, stakeholder, theory-based), there have been debates about which type of methodology is appropriate, with these debates mirroring the debates in the larger social science community. The "scientific ideal" of using social experiments and randomized experiments, which supports the quantification of implementation and outcomes, is contrasted with the "humanistic ideal" that the program should be seen through the eyes of the clients and defies quantification, which supports an ethnographic or observational methodology.

Campbell believed that the nature of the research question should determine the method, and he encouraged evaluations that have both qualitative and quantitative assessments, with these assessments supporting each other. In the early twenty-first century, program evaluations commonly use a combination of qualitative and quantitative data collection techniques.

Does Evaluation Influence Policy?

Although the main justification for program evaluation is its role in rationalizing policy, program evaluation results rarely have a direct impact on decision-making. This is because of the diffuse and political nature of policy decision-making and because people are generally resistant to change. Most evaluations are undertaken and disseminated in an environment where decision-making is decentralized among several groups and where program and policy choices result from conflict and accommodation across a complex and shifting set of players. In this environment, evaluation results cannot have a single and clear use, nor can the evaluator be sure how the results will be interpreted or used.

While program evaluations may not directly affect decisions, evaluation does play a critical role in contributing to the discourse around a particular program or issue. Information generated from program evaluation helps to frame the policy debate by bringing conflict to the forefront, providing information about trade-offs, influencing the broad assumptions and beliefs underlying policies, and changing the way people think about a specific issue or problem.

Evaluation in the Early Twenty-First Century

In the early twenty-first century, program evaluation is an integral component of education research and practice. The No Child Left Behind Act of 2001 (reauthorization of the U.S. government's Elementary and Secondary Education Act) calls for schools to use "research-based practices." This means practices that are grounded in research and have been proven through evaluation to be successful. Owing in part to this government emphasis on the results of program evaluation, there is an increased call for the use of experimental designs.

Further, as the evaluation field has developed in sophistication and increased its requirements for rigor and high standards of research, the lines between scientific research and evaluation have faded. There is a move to design large-scale education evaluations to respond to programmatic concerns while simultaneously informing methodological and substantive inquiry.

While program evaluation is not expected to drive policy, if conducted in a rigorous and systematic way that adheres to the principles of social research as closely as possible, the results of program evaluations can contribute to program improvement and can provide valuable information that both advances scholarly inquiry and informs important policy debates.


BERMAN, PAUL, and MCLAUGHLIN, MILBREY. 1978. Federal Programs Supporting Educational Change, Vol. IV: The Findings in Review. Santa Monica, CA: RAND.

CAMPBELL, DONALD. 1969. "Reforms as Experiments." American Psychologist 24:409–429.

CAMPBELL, DONALD, and STANLEY, JULIAN. 1963. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.

CHELIMSKY, ELEANOR. 1987. "What Have We Learned about the Politics of Program Evaluation?" Evaluation News 8 (1):5–22.

COHEN, DAVID, and GARET, MICHAEL. 1975. "Reforming Educational Policy with Applied Social Research." Harvard Educational Review 45 (1):17–43.

COOK, THOMAS D., and CAMPBELL, DONALD T. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand McNally.

CRONBACH, LEE J. 1982. Designing Evaluations of Educational and Social Programs. San Francisco: Jossey-Bass.

CRONBACH, LEE J.; AMBRON, SUEANN ROBINSON; DORNBUSCH, SANFORD; HESS, ROBERT; PHILLIPS, D. C.; WALKER, DECKER; and WEINER, STEPHEN. 1980. Toward Reform of Program Evaluation: Aims, Methods, and Institutional Arrangements. San Francisco: Jossey-Bass.

HOUSE, ERNEST; GLASS, GENE; MCLEAN, LESLIE; and WALKER, DECKER. 1978. "No Simple Answer: Critique of the Follow Through Evaluation." Harvard Educational Review 48:128–160.

PATTON, MICHAEL. 1997. Utilization-Focused Evaluation, 3rd edition. Thousand Oaks, CA: Sage.

RIECKEN, HENRY, and BORUCH, ROBERT. 1978. Social Experimentation: A Method for Planning and Evaluating Social Intervention. New York: Academic Press.

ROSSI, PETER; FREEMAN, HOWARD; and LIPSEY, MARK. 1999. Evaluation: A Systematic Approach, 6th edition. Thousand Oaks, CA: Sage.

SCRIVEN, MICHAEL. 1967. "The Methodology of Evaluation." In Perspectives of Curriculum Evaluation, ed. Robert E. Stake. Chicago: Rand McNally.

SHADISH, WILLIAM R.; COOK, THOMAS; and LEVITON, LAURA. 1991. Foundations of Program Evaluation: Theories of Practice. Newbury Park, CA: Sage.

U.S. OFFICE OF EDUCATION. 1977. National Evaluation: Detailed Effects. Volumes II-A and II-B of the Follow Through Planned Variation Experiment Series. Washington, DC: Government Printing Office.

WEISS, CAROL. 1972. Evaluation Research: Methods for Assessing Program Effectiveness. Englewood Cliffs, NJ: Prentice Hall.

WEISS, CAROL. 1987. "Evaluating Social Programs: What Have We Learned?" Society 25:40–45.

WEISS, CAROL. 1988. "Evaluation for Decisions: Is Anybody There? Does Anybody Care?" Evaluation Practice 9:5–20.

WEISS, CAROL. 1997. "How Can Theory-Based Evaluation Make Greater Headway?" Evaluation Review 21:501–524.

