The Many Failings of Value-Added Modeling

Scott Alexander reviews the research on value-added models measuring teacher quality¹. While Scott’s overview is perfectly fine, any such effort is akin to a circa 1692 overview of the research literature on alchemy. Quantifying teacher quality will, I believe, be understood in those terms soon enough.

High School VAM is Impossible

I have many objections to the whole notion of modeling what value a teacher adds, but top of the idiocy heap is how little attention is paid to the fact that VAM is only even possible with elementary school teachers. First, reading and basic math are the primary learning objectives of years 1-5. Second, elementary schools think of reading and math ability in terms of grade level. Finally, elementary teachers or their schools have considerable leeway in allocating instruction time by subject.

Now, go to high school (of which middle school is, as always, a pale imitation with similar issues). We don’t evaluate student reading skills by grade level, but rather “proficiency”. We don’t say “this 12th grader reads at the 10th grade level”. We have 12th graders who read at the 8th grade level, of course. We have 12th graders who read at the third grade level. But we don’t acknowledge this in our test scores, and so high school tests can’t measure reading progress. Which is good, because high school teachers aren’t tasked with reading instruction, so we wouldn’t expect students to make much progress. What’s that? Why don’t we teach reading instruction in high school, if kids can’t read at high school level, you ask? Because we aren’t allowed to. High school students with remedial level skills have to wait until college acknowledges their lack of skills.

And that’s reading, where at least we have a fighting shot of measuring progress, even though the tests don’t currently measure it–if we had yearly tests, which of course we don’t. Common Core ended yearly high school tests in most states. In math, it’s impossible because we pass most kids (regardless of ability) into the next class the next year, so there’s no “progress”, unless we measure kids at the beginning and end of the year, which introduces more tests and, of course, would show that the vast majority of students entering, say, algebra 2 don’t in fact understand algebra 1. Would the end of year tests measure whether or not the students had learned algebra 1, or algebra 2?

Nor can high schools legally just allocate more time to reading and math instruction, although they can put low-scoring kids in double block instruction, which is a bad, bad thing.

Scope Creep

Most teachers at all levels don’t teach tested subjects and, frankly, no one really cares about teacher quality and test scores in anything other than math or reading; we just pretend on everything else. Which leads to a question that proponents answer implicitly by picking one and ignoring the other: do we measure teacher quality to improve student outcomes or to spend government dollars effectively?

If the first, then what research do we have that art teachers, music teachers, gym teachers, or, god save us, special education teachers improve student outcomes? (answer: none.) If the second, then what evidence do we have that the additional cost of testing in all these additional topics, as well as the additional cost of defending the additional lawsuits that will inevitably arise as these teachers attack the tests as invalid, will be less strain on the government coffers than the cost of the purportedly inadequate teachers? What research do we have that any such tests on non-academic subjects are valid even as measures of knowledge, much less as evidence of teacher quality?

None, of course. Which is why you see lawsuits by elective teachers pointing out it’s a tad unfair to be judged on the progress of students they’ve never actually met, much less taught. While many of those lawsuits get thrown out on the grounds that the practice is unfair but not unconstitutional, the idiocy of these efforts played no small part in why the newest version of the federal ESEA, the ESSA, killed the student growth measure (SGM) requirement.

So while proponents might argue that math and English score growth have some relationship to teacher quality in those subjects, they can’t really argue for testing all subjects. Sure, people can pretend (a la Common Core) that history and science teachers have an impact on reading skills, but we have no mechanism to, and are years away from, changing instruction and testing in these topics to require reading content and measuring the impact of that specific instruction in that specific topic. And again, that’s just reading. Not math, where it’s easy enough to test students on their understanding of math in science and history, but very difficult to tease out where that instruction came from. Of course, this is only an issue after elementary school. See point one.

Abandoning False Gods

For the past 20 years or so, school policy has been about addressing “preparation”, which explains the obsession with elementary school. Originally, the push for school improvement began in high school. Few people realize or acknowledge these days that A Nation at Risk, that polemic seen as groundbreaking by education reformers but kind of, um, duh? by any regular people who take the time to read it, was entirely focused on high school, as can be ascertained by a simple perusal of its findings and recommendations. Stop coddling kids with easy classes, make them take college prep courses! That’s the ticket. It’s the easy courses, the low high school standards that cause the problem. Put all kids in harder classes. And so we did, with pretty disastrous results through the 80s. Many schools began tracking, but Jeannie Oakes and disparate impact lawsuits put an end to that.

I’m not sure when the obsession with elementary school began because I wasn’t paying close attention to ed policy during the 90s. But at some point in the early 90s, it began to register that putting low-skilled kids in advanced high school classes was perhaps not the best idea, leading to either fraud or a lot of failing grades, depending on school demographics. And so, it finally dawned on education reformers that many high school students weren’t “academically prepared” to manage the challenging courses that they had in mind. Thus the dialogue turned to preparing “underserved” students for high school. Enter KIPP and all the other “no excuses” charters which, as I’ve mentioned many times, focus almost entirely on elementary school students.

In the early days of KIPP, the scores seemed miraculous. People were bragging that KIPP completely closed the achievement gap back then, rather than the more measured “slight improvement controlling for race and SES” that you hear today. Ed reformers began pushing for all kids to be academically prepared, that is, hey! Let’s make sure no child is left behind! And so came the law, which led to an ever increasing push for earlier reading and math instruction, because hey, if we can just be sure that all kids are academically prepared for challenging work by high school, all our problems will be fixed.

Except, alas, they weren’t. I believe that the country is nearing the end of its faith in the false god of elementary school test scores, the belief that the achievement gap in high school is caused simply by not sufficiently challenging black and Hispanic kids in elementary school. Two decades of increasing elementary scores to the point that they appear to have topped out, with nary a budge in high school scores, have given pause. Likewise, Rocketship, KIPP, and Success Academy have all faced questions about how their high-scoring students do in high school and college.

As I’ve said many times, high school is brutally hard compared to elementary school. The recent attempt to genuinely shove difficulty down earlier in the curriculum went over so well that the new federal law gave a whole bunch of education rights back to the states as an apology. Kidding. Kind of.

And so, back to VAM….Remember VAM? This is an essay about VAM. Well, all the objections I pointed out above–the problems with high school, the problems with specific subject teachers–were mostly waved away early on, because come on, folks, if we fix elementary school and improve instruction there, everything will fall into place! Miracles will happen. Cats will sleep with dogs. Just like the NCLB problem with 100% above average was waved away because hey, by then, the improvements will be sooooo wonderful that we won’t have to worry about the pesky statistical impossibilities.

I am not sure, but it seems likely that the feds’ relaxed attitude towards test scores has something to do with the abandonment of this false idol, which leads inevitably to the reluctant realization that perhaps A Nation at Risk was wrong, perhaps something else is involved with academic achievement besides simply plopping kids in the right classes. I offer in support the fact that Jerry Brown, governor of California, has remained almost entirely unscathed for shrugging off the achievement gap, saying hey, life’s a meritocracy. Who’s going to be a waiter if everyone’s “elevated” into some important job? Which makes me wonder if Jerry reads my blog.

So if teachers don’t make any difference and VAM is pointless, how come any yutz can’t become a teacher?

No one, ever, has argued that teachers don’t make any difference. What they do say is that individual teacher qualities make very little difference in student test scores and/or student academic outcomes, and the differences aren’t predictable or measurable.

If I may quote myself:

Teaching, like math, isn’t aspirin. It’s not medicine. It’s not a cure. It is an art enhanced by skills appropriate to the situation and medium, that will achieve all outcomes including success and failure based on complex interactions between the teachers and their audience. Treat it as a medicine, mandate a particular course of treatment, and hundreds of thousands of teachers will simply refuse to comply because it won’t cure the challenges and opportunities they face.

And like any art, teaching is not a profession that yields to market justice. Van Gogh died penniless. Bruces Dern and Davison are better actors than Chrisses Hemsworth and Evans, although their paychecks would never know it. Teaching, like art and acting, runs the range from velvet Elvis paint by numbers to Renoir, from Fast and Furious to Short Cuts. There are teaching superstars, and journeyman teachers, and the occasional lousy teacher who keeps working despite this–just as Rob Schneider still finds work, despite being so bad that Roger Ebert wrote a book about it.

Unlike art and acting, teaching is a government job. So while actors will get paid lots of money to pretend to be teachers, the job itself will never lead to the upside achieved by the private sector, despite the many stories about famous Korean tutors. Upside, practicing our craft won’t usually lead to poverty, except perhaps in North Carolina.

Most teachers understand this. It’s the outside world and the occasional short-termers who want teachers to be rewarded for excellence. Most teachers don’t support merit pay and vehemently oppose “student growth measures”.

The country appears to be moving towards a teacher shortage. I anticipate all talk of VAM to vanish. But if you want to improve teacher quality beyond its current much-better-than-it’s-credited condition, I suggest we consider limiting the scope of public education. Four of these five education policy proposals will do just that.

**************************************************************************
¹ I was writing this up in the comments section of Scott Alexander’s commentary on teacher VAM research, when I remembered I was behind on my post quota. What the heck. I’m turning this into a post. It’s a long answer, but not as long-winded as Scott Alexander, the one blogger who makes me feel brusque.

29 responses to “The Many Failings of Value-Added Modeling”

  • Daniel Freeman

    “No one, ever, has argued that teachers don’t make any difference. What they do say is that individual teacher qualities make very little difference in student test scores and/or student academic outcomes, and the differences aren’t predictable or measurable.”

    Has there ever been a study that compared the academic outcomes of students that received traditional teaching versus students that were just given a pile of material to learn on their own?

  • Aldo Rustichini

    I am not sure why the very interesting post “Algebra 1 Growth in Geometry and Algebra II, Spring 2013” should prove that VAM is impossible. It seems that the ratio of the sum of the realized vertical displacements over the total of the possible vertical displacements in the two scatter-plots on that post is a good measure of VAM.

  • Model, Mirror, Mentor – spottedtoad

    […] was I shocked by any of my former colleagues’ or friends’ scores? No, of course not. Is VAM mere chicanery, no more scientific than 16th century alchemy? Probably, though those alchemists had some insights into practical chemistry that came in handy […]

  • anonymousskimmer

    “Bruces Dern and Davison are better actors than Chrisses Hemsworth and Evans, although their paychecks would never know it. Teaching, like art and acting, runs the range from velvet Elvis paint by numbers to Renoir, from Fast and Furious to Short Cuts.”

    The only objective evaluation here is paint-by-numbers vs Renoir. In all other cases you’re letting your sense of taste dominate over objective judgment.

    I haven’t seen enough of Bruce Dern to evaluate him at all, but for the other three actors I’ve seen enough of their basic personality bleed through their acting that I cannot distinguish between good acting and good typecasting. I believe that it would take an expert to make such an evaluation (or a relevant test of truly varied roles).

    Thanks for mentioning Jeannie Oakes. I hadn’t heard of her before.

  • Mark Roulo

    “No one, ever, has argued that teachers don’t make any difference. What they do say is that individual teacher qualities make very little difference in student test scores and/or student academic outcomes, and the differences aren’t predictable or measurable.”

    Do you think you could design a measurement of teaching effectiveness that (a) was fairly stable/repeatable, and (b) measured some/much of the things “we” actually care about?

    Or do you think that this is inherently unmeasurable? Not just with today’s tests and mandates, but no matter what.

    • educationrealist

      No. I do think they could capture the qualities that are good for engagement of hard to engage populations, or at least a group of different profiles that work. Might not raise test scores, but will get kids working and planning. But once they identify those qualities, those teachers would have to be paid more.

      • Nuclear Lab Rat

        With regards to Mark’s question, what are the “things we actually care about”? That is, would you say that “these things” are measurable or even specifiable to any useful degree of accuracy or precision over a broad population? Or even a narrow one?

      • educationrealist

        What I think we care about are engaged kids working at the best of their ability and using the information they acquire to make decisions about their values and careers.

        Hey, that’s pretty good. I think I’ll steal that.

  • Mark Roulo

    From SlateStarCodex: “It leads to teachers in top schools with 100% of kids above average getting terrible VAM scores because only 86% instead of 92% of the kids hit their expected target–and that was because they were at 100% the year before, and were at 98% the next year. When in fact, that’s almost certainly due to test variances. (In some states, the third grade reading test is notably harder than the second grade’s, for example.)”

    I think this mostly goes away if the tests don’t have an artificial ceiling. I don’t expect anyone involved in this to actually *FIX* this, but what you describe seems to be an artifact of a poorly designed (for the use to which it is put) test.

  • Purple Tortoise

    It could be worse. At the college level, teacher quality is measured via student evaluations.

    • educationrealist

      Yes. That would be horrible, I agree.

    • NuclearLabRat

      At the college level, nobody who makes decisions about staffing or tenure really cares whatsoever about teacher quality – it’s all about how much money did you bring in.

      And at the upper division college level, student evaluations of teacher performance, in my experience, can be very indicative of a teacher’s performance. I would never suggest this method be used to measure teachers in anything below college courses, and there are some major caveats to their use even there, but in general I would say that in college student ratings are not used enough, rather than the other way around. Teaching simply isn’t a priority at major US colleges. At all.

      • educationrealist

        Yes, I agree. But I’m just not certain that students are any better at judging.

      • Purple Tortoise

        NLR is correct that teaching isn’t a priority at research universities, but most college students are not at research universities. At lesser ranked colleges, student evaluations do play a large role because they give the administration a simple and cheap metric on which to gauge teaching without ever having to step into a classroom. Contra NLR, I definitely disagree that student evaluations are indicative of a teacher’s performance. The students can accurately report whether the instructor showed up on time and gave organized lectures, but the students are not in a position to judge whether the instruction was actually effective for genuine learning. Moreover, many studies have shown a substantial correlation between student evaluation scores and easy grading, lack of rigor, and the physical attractiveness of the instructor.

      • Purple Tortoise

        A clarification of my above statement: written comments are sometimes useful because they provide a perspective on what happened inside the classroom, but numerical scores from student evaluations are worthless. If Prof. A gets a 3.4 average in a giant lower-division GE class and Prof. B gets a 4.1 in an upper-division special topics class, can we really say Prof. B is a better instructor? And when a student writes in an evaluation, “All I wanted was an easy A, but the instructor gave too much work,” can we really trust the numerical scores at all?

  • Jokah Macpherson

    I don’t understand the North Carolina reference.

  • Nuclear Lab Rat

    Would you agree that part of the reason it is so hard to measure teacher quality is that current licensing requirements have resulted in such a large restriction of range effect that all (or nearly all) current teachers are beyond the point of diminishing returns in terms of how much effort one should spend to find a “good” teacher? That is, if we suddenly started allowing anyone with, say, a middle school education to teach, would we then be able to find some useful measure of teacher quality?

    • educationrealist

      Yes. I think we’re well above the basement. In fact, I think we could lower the basement for elementary school by splitting off K-6 into K-4 and 5-8, which would give more social workers a chance to be school teachers. I think the high school one is set about right in math, and is probably a bit too easy in English and history. I took the credential tests in a state with notoriously difficult ones, and aced English and Social Science in minimal time (one of the English sections took me 20 minutes). Now, these are my best subjects and I’m a bit of a genius. Also, I’m not saying the tests were simple. They were well-done and challenging. But they have about a 70% pass rate, and I know the Praxis tests are slightly easier, with a slightly higher pass rate. But we’re just talking about the edges.

    • Ford

      My experience going to school, and talking among friends, is that this is far from true.

      The difference between how much I learned from my good teachers vs my bad teachers was night and day. The difference in how much I know about history (taught by an amazing teacher) versus how much my wife does not know (who went to a top-ranked school, but with mediocre history teachers) is astonishing to me.

      That said, absolutely none of that difference in quality of education was reflected in my ability to succeed in college or in the job market… So I’ll agree that measuring it from a distance seems almost impossible.

  • Vocational Ed and the Elephant | educationrealist

    […] cycle began when 1983’s Nation at Risk forced radical changes in high school education in a failed attempt to raise standards. Nation badly damaged what successful vocational ed we had by arguing we […]

  • Word from the Dark Side, 6/4/16 | SovietMen

    […] Slate Star Codex examines the evidence about how much difference teachers make.  Education Realist considers Value-Added measures of teaching […]

  • End of Education Reform? | educationrealist

    […] percent of the variability in test scores”. As I wrote earlier, I don’t think VAM will last much longer. Teachers are being judged by test scores in some states, but the energy is on rolling back those […]

  • In Which Ed Explains Induction | educationrealist

    […] back in the 90s, it finally began to occur to folks that not all kids were ready for this material, but rather than change the requirements, they […]
