~ Archive for September, 2006 ~

An inventory: 11 issues with value-added studies (evaluations based on student test scores)

Even though most standardized tests for K-12 students were designed with the individual student’s learning in mind, we often gravitate toward using their scores when we seek to trace the success of a teacher, program, or school. They seem so objective, so unambiguous, so well suited to the task. And often our impulse is to reward teachers whose students score highest, to demand more of the rest, and perhaps to direct more resources their way.

Now at some point it may become apparent to folks that schools and districts with the highest average scores are also those with the most affluent student populations. (Systematic studies consistently find that income accounts for 60-80% of the variation in test scores among different groups.) Realizing this, many will call instead for tests at the start and end of the school year and advocate rewarding the schools and teachers whose students show the greatest improvement. This is an eminently natural position to take. But it turns out to be fraught with an extraordinary number of challenges.

Statistical models for assessing contributions to student learning (value-added models, if you can stomach that awkward yet adhesive term) have received intense scrutiny among educational researchers in the past five years. I became attuned to this literature after I ran into some serious roadblocks in an effort to isolate contributors to reading and math improvement for children in a school touched by the WIDE World program. I began to see that gathering more complete data and using more sophisticated methods (hierarchical linear models, propensity scores) could solve only some of my problems. And I began to collect my own and more distinguished researchers’ impressions of the hurdles one might need to overcome to develop a sound explanatory model of test score change. What follows is a list of these issues. While no study is likely to involve all of them, most will bump up against quite a few.

1. Studies can easily confuse effects from individual students; from being among certain students; from the teacher; from the intervention; and from the school. To what should we compare a certain result: to the results that would have occurred if the student(s) had not been in school at all? If they had been in school, but had stayed in the previous grade? If they had been in a different school? With a different set of classmates? With different teachers? Questions such as these are too often neglected, detracting from the soundness of research claims. (1), (5)

2. Few students study under just one teacher, making it perilous to try to attribute gains or losses to an individual teacher.

3. What students in some classes learn may spill over and reach students in other classes (“contamination”).

4. Groups of students do not often stay together long-term, so while students may exert effects on one another — which are difficult enough to measure — these are extremely difficult to track longitudinally. (1), (5)

5. It’s difficult to separate past effects (of any of these types — from teacher, school, or set of classmates) from more recent ones. (1)

6. Variations in the policies by which schools assign students to special ed or ESL programs can distort results, as can any pattern of biased exclusion of students from testing. Students who are not promoted will be left out of any calculation of year-to-year change, even though including them would lower the group score. (According to Walt Haney, this is one source of the spurious “Texas Education Miracle” of the late 1990s.) (1), (2)

7. Including schools that differ too much from one another in a study forces one to extrapolate beyond what is reasonable. E.g., suppose that, within schools with 0%-30% limited English-proficient (LEP) students, each difference of 10 percentage points in LEP is linked with an average test score difference of 3 points. That would mean a 3-point score difference for a 0% LEP school compared to a 10% LEP school, and a 6-point difference for a 0% school compared to a 20% school. However, for a school with a % LEP far outside that range, such as 60%, that relationship may not hold at all. The slope might get much flatter or much steeper. In such cases trying to adjust or control for % LEP would yield misleading results. (1)

8. Thomas Kane and Douglas Staiger have shown that 50-85% of year-to-year variation in group test scores can be attributed simply to yearly fluctuations in the academic levels of incoming student cohorts. In other words, to noise: to something that has nothing to do with the teacher’s or program’s effectiveness. Differences between student groups within a year figure to be subject to noise as well. The authors also convincingly show that, because group averages fluctuate much more for small groups, it is the smaller schools that are more apt to suddenly rise to the top or sink to the bottom, netting them undeserved rewards or penalties. Such outstanding schools almost always end up closer to the middle of the pack the following year, demonstrating the principle of regression to the mean. Their exceptional standing is due not to anything noteworthy such as an instructional change, but only to chance. (3)

9. Student performance in different subjects must be assessed via different instruments. It would be pointless to try to use a single instrument such as the SAT whether testing reading, world history, or Advanced Placement physics. And different tests vary in their propensity to show change, either because of differences in the relative difficulties of pre- and post-test versions or because of differences in either version’s validity and reliability. This fact complicates any study involving multiple subject areas or multiple grades.

10. Since virtually all (all?) standardized tests in education rely to some degree on students’ reading ability, value-added research results in all subjects other than reading will be compromised unless all students have achieved a certain reading level. One’s ability to think effectively with social science, math, or science material will not be picked up by a test unless that test is properly matched to the student’s reading ability. Moreover, group comparisons are potentially invalidated if some groups are more affected by this problem than others.

11. It is often desirable to try to relate student outcomes to some kind of indicator of baseline teaching effectiveness. Some examples are years of teaching experience; type of certification or teacher preparation program; educational degree; professional development points; and experts’/administrators’ ratings. Unfortunately, the first two of these have been fairly conclusively shown to be largely unrelated to test score outcomes, based on a recent, very large-scale study in New York City. (4) The other three variables seem unpromising based on WIDE’s recent evaluation work, including an unpublished urban school study involving about 25 teachers and 300 students. This is not to say that teacher quality itself does not matter. Indeed, Kane et al.’s recent paper makes evident the very great need for some usable measure that can serve as a proxy for teacher quality.
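
Issue 8 lends itself to a small simulation. The sketch below is not Kane and Staiger's analysis; it simply assumes that every student's score is drawn from the same distribution each year (a hypothetical true mean of 50 and standard deviation of 10), so that nothing about teaching ever changes. The school sizes of 25 and 400, the 200 simulated years, and all function names are likewise assumptions made for illustration.

```python
import random
import statistics


def simulate_school_means(n_students, n_years, rng, true_mean=50.0, student_sd=10.0):
    """Yearly average scores for one school whose instruction never changes.

    Each year's cohort is a fresh random draw, so any year-to-year
    movement in the school average is pure cohort noise.
    """
    means = []
    for _ in range(n_years):
        cohort = [rng.gauss(true_mean, student_sd) for _ in range(n_students)]
        means.append(statistics.fmean(cohort))
    return means


rng = random.Random(2006)  # fixed seed so the sketch is repeatable
small = simulate_school_means(n_students=25, n_years=200, rng=rng)
large = simulate_school_means(n_students=400, n_years=200, rng=rng)

# Spread of the yearly averages: the small school swings far more.
small_sd = statistics.stdev(small)
large_sd = statistics.stdev(large)

# Regression to the mean: take the small school's best 10% of years
# and look at what happens in the year that follows each of them.
cutoff = sorted(small)[int(0.9 * len(small))]
top_years = [i for i in range(len(small) - 1) if small[i] >= cutoff]
top_avg = statistics.fmean([small[i] for i in top_years])
next_avg = statistics.fmean([small[i + 1] for i in top_years])

print(f"SD of yearly averages, small school: {small_sd:.2f}")
print(f"SD of yearly averages, large school: {large_sd:.2f}")
print(f"Small school, average in its top years: {top_avg:.2f}")
print(f"Small school, average in the following years: {next_avg:.2f}")
```

Under these assumptions the small school's average bounces around several times more than the large school's, and its best years are followed by ordinary ones, even though no instructional change has been simulated at all.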

Rubin, Stuart, and Zanutto (1) and Damian Betebenner (5) make several suggestions that I find to be key for thoughtful research using value-added models. Three seem to be the most important:

  • Randomize to the extent possible.
  • Collect data on as many relevant variables as possible; statistical control of these, while far inferior to equalizing through randomization, is still useful.
  • Be very careful to think through, and make explicit, your assumptions. The best analytical method for a particular study and research question will depend on these assumptions. Example: Is it reasonable to expect that no improvement would occur absent a certain intervention? If so, it makes sense to analyze gain scores, as with analysis of variance. Is it instead reasonable to think that all students would improve to some degree even without the intervention, and that their posttest score could be predicted as a linear function of their pretest score? If so, analysis of covariance would make sense.
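
The third suggestion can be made concrete with a small simulation in the spirit of Lord's Paradox. It is a sketch under stated assumptions, not a recommended analysis: the two groups start at different pretest levels (hypothetical means of 40 and 50), posttest scores are generated as a linear function of pretest scores with a slope of 0.6, and the true intervention effect is set to 3 points. All numbers and function names are invented for illustration. Because the data are built to satisfy the analysis-of-covariance assumption, that method recovers the effect here; had the data been built so that no one changes absent the intervention, the gain-score analysis would have been the right one.

```python
import random
import statistics


def simulate_group(n, pre_mean, effect, rng):
    """Pretest and posttest scores for one group (hypothetical model)."""
    pre = [rng.gauss(pre_mean, 10.0) for _ in range(n)]
    # Assumed data-generating model: post = 20 + 0.6 * pre + effect + noise
    post = [20.0 + 0.6 * x + effect + rng.gauss(0.0, 5.0) for x in pre]
    return pre, post


def gain_score_effect(pre_t, post_t, pre_c, post_c):
    """Difference between groups in average (post - pre) gain."""
    gain_t = statistics.fmean(post_t) - statistics.fmean(pre_t)
    gain_c = statistics.fmean(post_c) - statistics.fmean(pre_c)
    return gain_t - gain_c


def ancova_effect(pre_t, post_t, pre_c, post_c):
    """Group difference in posttest, adjusted by the pooled within-group slope."""
    sxy = 0.0
    sxx = 0.0
    for pre, post in ((pre_t, post_t), (pre_c, post_c)):
        mx = statistics.fmean(pre)
        my = statistics.fmean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
    slope = sxy / sxx
    post_diff = statistics.fmean(post_t) - statistics.fmean(post_c)
    pre_diff = statistics.fmean(pre_t) - statistics.fmean(pre_c)
    return post_diff - slope * pre_diff


rng = random.Random(39)  # fixed seed so the sketch is repeatable
pre_t, post_t = simulate_group(2000, pre_mean=40.0, effect=3.0, rng=rng)
pre_c, post_c = simulate_group(2000, pre_mean=50.0, effect=0.0, rng=rng)

gain_estimate = gain_score_effect(pre_t, post_t, pre_c, post_c)
ancova_estimate = ancova_effect(pre_t, post_t, pre_c, post_c)

print(f"Gain-score estimate of the effect: {gain_estimate:.2f}")
print(f"ANCOVA estimate of the effect:     {ancova_estimate:.2f}")
```

The two methods give clearly different answers from the same data. Neither is wrong as arithmetic; they answer different questions, and which one matches the research question depends on the assumption one is willing to defend.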

I suppose it is clear by now that I am pessimistic about the prospects of modeling standardized test scores, or changes therein, as a way of isolating the contributing factors in student achievement/improvement. Rubin et al. take a stronger stand (p. 18):

[... We] do not think that [most value-added] analyses are estimating causal quantities, except under extreme and unrealistic assumptions. We argue that models such as these should not be seen as estimating causal effects of teachers or schools, but rather as providing descriptive measures. It is the reward structures based on such value-added models that should be the objects of assessment, since they can actually be (and are being) implemented.

***
(1) Donald B. Rubin, Elizabeth A. Stuart, and Elaine L. Zanutto (2003). A potential outcomes view of value-added assessment in education.
(2) Walt Haney (2000). The myth of the Texas miracle in education.
(3) Thomas Kane and Douglas Staiger (2002). Volatility in school test scores: Implications for test-based accountability systems.
(4) Thomas Kane, Jonah Rockoff, and Douglas Staiger (2006). What does certification tell us about teacher effectiveness? Evidence from New York City.
(5) Damian Betebenner (2006). Lord’s Paradox with Three Statisticians. (Presented at 2006 AERA Annual Meeting in San Francisco; seems to be temporarily unavailable on the Internet.)

[RBS]

Are WIDE World surveys representative?

WIDE World’s evaluation efforts rely quite a bit on surveys. Since response rates have been about 55% for our end-of-course evaluations and about 20% for our one-year follow-up surveys, it is natural to ask what gives us confidence that those who respond constitute a representative sample of course participants. Our web page, Course Evaluations, briefly discusses the ways we try to address the issue (both at the top and bottom of the page). The current piece will fill in some gaps and add some detail.

We regularly check to see whether survey respondents have, on average, characteristics similar to those of nonrespondents. For example, we check whether our findings are applicable regardless of region, teaching subject, level of experience, or educational degree. Generally we find that they are, and the one glaring exception turns out to be less important than one might think. Respondents tend to have amassed more participation points than nonrespondents have. But level of participation shows such low correlations (r ~ 0.0 to 0.2) with important outcomes such as course satisfaction, or appraisal of the course’s effects, that for these outcomes nonrepresentativeness in participation is scarcely an issue. And to reiterate what was described in the page listed above, in 2005 our special effort to individually reach nonrespondents (especially those with low participation) revealed surprisingly few differences between their opinions of the courses and those of respondents.
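
A back-of-the-envelope calculation shows why a low correlation limits the damage. The sketch below assumes that an outcome is related to participation roughly linearly and that responding depends on the outcome only through participation; under those assumptions, the expected shift in the outcome among respondents (in standard deviation units) is about r times the respondents' excess in participation (also in standard deviation units). The 0.5 SD participation gap is a hypothetical figure chosen for illustration, not a WIDE World result, and the function name is invented.

```python
def approx_outcome_bias(r, participation_gap_sd):
    """Approximate shift in an outcome among respondents, in outcome SD units.

    Assumes a linear relation between participation and the outcome, and
    that nonresponse is related to the outcome only through participation.
    """
    return r * participation_gap_sd


# Hypothetical: respondents average 0.5 SD more participation points
# than course participants as a whole.
hypothetical_gap = 0.5

for r in (0.0, 0.1, 0.2):
    bias = approx_outcome_bias(r, hypothetical_gap)
    print(f"r = {r:.1f}: outcome shifted by about {bias:.2f} SD")
```

Even at the top of the observed range of correlations, a sizable imbalance in participation translates into only a small shift in the outcome under these assumptions.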

As for our recent one-year follow-up survey, it is true that respondents were more likely than others to have gone on to become coaches in our program. However, the numbers in this group were small, and adjusting for this imbalance would have exerted only a slight effect on the findings reported at One Year Follow-up. Our paper entitled Beyond Self-Report, p. 12, provides another examination of such factors as they pertain to a follow-up survey from the previous year.

[RBS]
