Why Merit Pay and Value Added Assessment Won’t Work, Part I

The year I taught Algebra I, I did a lot of data collection, some of which I discussed in an earlier post. Since I’ve been away from that school for a while, I thought it’d be a good time to finish the discussion.

I’m not a super stats person. I’m not even a mathematician. To the extent I know math, it’s applied math, with the application being “high school math problems”. This is not meant to be a statistically sound analysis, comparing Treatment A to Treatment B. But it does reveal some interesting big picture information.

This data wasn’t just sitting around. A genuine DBA could have probably whipped up the report in a few hours. I know enough SQL to get what I want, but not enough to get it quickly. I had to run reports for both years, figure out how to get the right fields, link tables, blah blah blah. I’m more comfortable with Excel than SQL, so I dumped both years to Excel files and then linked them with student id. Unfortunately, the state data did not include the subject name of each test. So I could get 2010 and 2011 math scores, but it took me a while to figure out how to get the 2010 test taken—and that was a big deal, because some of the kids whose transcripts said algebra had, in fact, taken the pre-algebra (general math) test. Not that I’m bitter, or anything.
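For readers who would rather script the linking step than do it in Excel, here is a minimal sketch of the same idea: match two yearly exports on student id and keep only the students who appear in both. The rows, field layout, and the `link_years` helper are hypothetical stand-ins, not the actual district export.

```python
# Hypothetical rows standing in for the two yearly exports:
# (student_id, test_name, score). Not real student data.
scores_2010 = [
    (101, "Algebra I", 287),
    (102, "General Math", 301),
    (103, "Algebra I", 342),
]
scores_2011 = [
    (101, "Algebra I", 310),
    (103, "Algebra I", 338),
    (104, "Algebra I", 295),  # no 2010 score on file
]


def link_years(prev_rows, curr_rows):
    """Pair each student's two scores; students missing either year are dropped."""
    prev_by_id = {sid: (test, score) for sid, test, score in prev_rows}
    linked = []
    for sid, test, score in curr_rows:
        if sid in prev_by_id:
            prev_test, prev_score = prev_by_id[sid]
            linked.append({
                "student_id": sid,
                "test_2010": prev_test,  # which test the student actually took
                "score_2010": prev_score,
                "score_2011": score,
                "change": score - prev_score,
            })
    return linked


linked = link_years(scores_2010, scores_2011)
for row in linked:
    print(row)
```

Keeping the 2010 test name on each linked row is what makes it possible to separate the kids who took the algebra test from those who took general math.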

Teachers can’t get this data easily. I haven’t yet figured out how to get the data for my current school, or if it’s even possible. I don’t know what my kids’ incoming scores are, and I still haven’t figured out how my kids did on their graduation tests.

So the data you’re about to see is not something teachers or the general public generally have access to.

At my last school, in the 2010-11 school year, four teachers taught algebra to all but 25 of over 400 students. I had the previous year’s test scores for about 75% of the kids, 90% of whom had taken algebra the year before, the other 10% or so having taken pre-algebra. This is a slightly modified version of my original graph; I put in translations of the scores and percentages.


You should definitely read the original post to see all the issues, but the main takeaway is this: Teacher 4 has a noticeably stronger population than the other three teachers, with over 40% of her class having scored Basic or Higher the year before, usually in Algebra. I’m Teacher 3, with by far the lowest average incoming scores.

The graph includes students for whom I had 2010 school year math scores in any subject. Each teacher has 8-12 pre-algebra student scores included in their averages. Some pre-algebra kids are very strong; they just hadn’t been put in algebra as 8th graders due to an oversight. Most are extremely weak. Teachers are assessed on the growth of kids repeating algebra as well as the kids who are taking it for the first time. Again, 80% of the kids in our classes had taken algebra once. 10-20% had taken it twice (our sophomores and juniors).

Remember that at the time of these counts, I had 125 students. Two of the other teachers (T1 and T4) had just under 100, the third (T2) had 85 or so. The kids not in the counts didn’t have 2010 test scores. Our state reports student growth for those with previous years’ scores and ignores the rest. The reports imply, however, that the growth is for all students. Thanks, reports! In my case, three or four of my strongest students were missing 2010 scores, but the bulk of my students without scores were below average.

So how’d we do?

I limited the main comparison to the 230 students who took algebra for both years and had scores for both years and had one of 4 teachers.


Here are the pre-algebra and algebra intervention growth–pre-algebra is not part of the above scores, but the algebra intervention is a sub-group. These are tiny groups, but illustrative:


The individual teacher category gains/slides/pushes are above; here they are in total:

(Arrrggh, I just realized I left off the years. Vertical is 2010, horizontal is 2011.)

Of the 230 students who took algebra two years in a row, the point gain/loss categories went like this:

Score change > +50 points: 57
Score change < -20 points: 27
-20 points < score change < +50 points: 146
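As a quick check on those three counts, this sketch turns them into shares of the 230 students. It assumes the middle row is the group that dropped by more than 20 points; the labels gain, slide, and push are borrowed from the chart description above.

```python
# Counts taken from the list above; the label wording is an assumption.
counts = {
    "gain (more than +50)": 57,
    "slide (worse than -20)": 27,
    "push (between -20 and +50)": 146,
}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print("total students:", total)
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

The three groups add up to 230, with roughly a quarter of the students gaining and a bit under two thirds landing in the push group.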

Why the Slice and Dice?

As I wrote in the original post, Teacher 1 and I were positive that Teacher 4 had a much stronger student population than we did, and the data supports that belief. Consequently I suspected that no matter how I sliced the data, Teacher 4 would have the best numbers. But I wanted a much better idea of how I’d done, based on the student population.

Because one unshakeable fact kept niggling at me: our school had a tremendous year in 2010-2011, based largely on our algebra scores. We knew this all throughout the year—benchmark tests, graduation tests—and our end of year tests confirmed it, giving us a huge boost in the metrics that principals and districts cared about. And I’d taught far more algebra students than any other teacher. Yet my numbers based on the district report looked mediocre or worse. I wanted to square that circle.

The district reports the data on the right. We were never given average score increase. A kid who had a big bump in average score was irrelevant if he or she didn’t change categories, while a kid who increased 5 points from the top of one category to the bottom of another was a big win. All that mattered were category bumps. From this perspective, my scores look terrible.
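To make the category-bump point concrete, here is a small sketch. The cut scores are HYPOTHETICAL, spaced 50 points apart purely for illustration; they are not the state's actual cutoffs, and `category_bump` is a made-up helper name.

```python
from bisect import bisect_right

# HYPOTHETICAL cut scores (lowest score in each category above the first).
CUTOFFS = [250, 300, 350, 400]
LABELS = ["Far Below Basic", "Below Basic", "Basic", "Proficient", "Advanced"]


def category(score):
    """Return the category label for a scale score under the assumed cutoffs."""
    return LABELS[bisect_right(CUTOFFS, score)]


def category_bump(before, after):
    """Number of categories moved: positive is up, negative is down."""
    return bisect_right(CUTOFFS, after) - bisect_right(CUTOFFS, before)


# Student A gains 5 points but crosses a cutoff.
# Student B gains 45 points but stays inside one category.
for name, before, after in [("A", 297, 302), ("B", 301, 346)]:
    print(name, after - before, "points:",
          category(before), "->", category(after),
          "| bump:", category_bump(before, after))
```

Under these assumed cutoffs, student A counts as a win on a category report and student B does not, even though B improved nine times as much.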

I wanted to know about the data on the left. For example Teacher 1 had far better “gain” category numbers than I did. But we had the same mean improvement overall, of 5%, with comparable increases in each category. Broken down further, Teacher 4’s spectacular numbers are accompanied by a huge standard deviation—she improved some kids a lot. The other three teachers might not have had as dramatic a percentage increase, but the kids moved up more consistently. In three cases, the average score declined, but was accompanied by a big increase in standard deviation, suggesting many of the kids in that category improved a bit, while a few had huge drops. Teacher 2 and I had much tighter achievement numbers—I may have moved my students less far, but I moved a lot of them a little bit. None of this is to argue for one teacher’s superiority over another.
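The difference between average gain and spread is easy to see with two made-up classes. The numbers below are hypothetical score changes, chosen only so that both classes have the same mean; they are not any teacher's real results.

```python
from statistics import mean, stdev

# Hypothetical score changes for two classes of five students each.
steady = [10, 12, 15, 18, 20]    # everyone moves up a little
uneven = [-30, -20, 5, 40, 80]   # a few big gains, a few big drops

for name, changes in [("steady", steady), ("uneven", uneven)]:
    print(f"{name}: mean change = {mean(changes)}, "
          f"standard deviation = {stdev(changes):.1f}")
```

Both classes average the same gain, but the second one gets there with a much larger standard deviation, which is the pattern to look for when an impressive average sits next to a wide spread.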

Of course, once I broke the data down by initial ability, group size became relevant but I don’t have the overall numbers for each teacher, each category, to calculate the confidence interval or a good sample size. I like 10. Eleven of the 18 categories hit that mark.

How many kids have scores for both years?

The 2011 scores for our school show that just over 400 students took the algebra test. My fall 2010 graph above shows 307 students with 2010 scores (in any subject) who began the year. Kick in another 25 for the teacher I didn’t include and we had about 330 kids with 2010 scores. My results show 230 kids with algebra scores for both years, and the missing teacher had 18, making 248. Another 19 kids had pre-algebra scores for the first year, although the state’s reports wouldn’t have cared about that. So 257 of the kids had scores for both years, or about 63% of the students tested.

Notice that I had the biggest fall off in student count. I think five of my kids were expelled before the tests, another four or so left to alternative campuses. I remember that two went back to Mexico; one moved to his grandparents’ in Iowa. Three of my intervention students were so disruptive during the tests that they were ejected, so their test results were not scored (the next year our school had a better method of dealing with disruptive students). Many of the rest finished the year and took the tests, but they left the district over the summer (not sure if they are included in the state reports, but I couldn’t get their data). I think I had the biggest fall-off over the year in the actual student counts; I went from 125 to 95 by year-end.

What about the teachers?

Teacher 1: TFA, early-mid 20s, Asian, first year teacher. Had a first class honors masters degree in Economics from one of the top ten universities in Europe. She did her two years, then left teaching and is now doing analytics for a fashion firm in a city where “fashion firm” is a big deal. She was the best TFAer I’ve met, and an excellent new teacher.

Teacher 2: About 60. White. A 20-year teacher who started in English, took time off to be a mom, then came back and got a supplemental math credential. She is only qualified to teach algebra. She is the prototype for the Teacher A I described in my last post, an algebra specialist widely regarded as one of the finest teachers in the district, a regard I find completely warranted.

Teacher 3: Me. 48 at the time, white. Second career, second year teacher, English major originally but a 15-year techie. Went to one of the top-rated ed schools in the country.

Teacher 4: Asian, mid-late 30s. Math degree from a solid local university, teaches both advanced math and algebra. She became the department head the next year. The reason her classes are top-loaded with good students: the parents request her. Very much the favorite of administration and district officials.

And so, a Title I school, predominantly Hispanic population (my classes were 80% Hispanic), teachers that run the full gamut of desirability—second career techie from a good ed school, experienced pro math major, experienced pro without demonstrated higher math ability, top-tier recent college grad.

Where was the improvement? Case 1: Educational Policy Objectives

So what is “improvement”? Well, there’s a bunch of different answers. There’s “significant” improvement as researchers would define it. Can’t answer that with this data. But then, that’s not really the point. Our entire educational policy is premised on proficiency. So what improvement does it take to reach “proficiency”, or at least to change categories entirely?

Some context: In our state, fifty points is usually enough to move a student from the bottom of one category to the bottom of another. So a student who was at the tip top of Below Basic could increase 51 points and make it to the bottom of Proficient, which would be a bump of two categories. An increase of 50 points is, roughly, a 17% increase. Getting from the bottom of Far Below Basic to Below Basic requires an increase of 70%, but since the kids were all taking Algebra for the second time, the boost needed to get them from FBB to BB was a more reasonable 15-20%. To get from the top of the Far Below Basic category to Proficient—the goal that we are supposed to aim for—would require a 32% improvement. Improving from top of Basic to bottom of Advanced requires a 23% improvement.
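The percentages above are plain percent-change arithmetic, sketched below. The sample scores are hypothetical, picked only to show the calculation; the one figure taken from the text is that 50 points is roughly a 17% increase.

```python
def pct_increase(before, after):
    """Percent change from one scale score to another."""
    return 100 * (after - before) / before


# Hypothetical example: a 50-point gain from a score of 300.
example = pct_increase(300, 350)
print(f"300 -> 350 is a {example:.1f}% increase")

# Working backwards from the text: if 50 points is about 17%,
# the starting score must be somewhere near 50 / 0.17.
implied_start = 50 / 0.17
print(f"implied starting score: about {implied_start:.0f}")
```

The same 50 points is a bigger percentage for a low starting score and a smaller one for a high starting score, which is why the required percentages differ from category to category.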

Given that context, only two of the teachers, in one category each, moved the needle enough to even think about those kinds of gains, and both categories had 6-8 students. Looking at categories with at least ten students, none of the teachers had average gains that would achieve our educational policy goals. In fact, from that perspective, the teachers are all doing roughly the same.

I looked up our state reports. Our total population scoring Proficient or Advanced increased 1%.

Then there’s this chart again:


32 students moved from “not proficient” to “proficient/advanced”. 9 students moved from “proficient” to “advanced”. I’ll throw them in. 18% of our students were improved to the extent that, officially, 100% are supposed to achieve.

So educational policy-wise, not so good.

Where was the improvement? Case 2: Absolute Improvement

How about at the individual level? The chart helps with that, too:


Only 18 students were “double gainers”, moving up two categories instead of one. Twelve of those students belonged to Teacher 4; 4 belonged to Teacher 1, while Teacher 2 and I had only 1 each (although I had two more that just missed by under 3 points). Teachers 1, 2, and 3 had one “double slider” each, who dropped two categories.

(I interviewed all the teachers on the double gainers; in all cases, the gains were unique to the students. The teachers all shrugged—who knew why this student improved? It wasn’t some brilliant aha moment unique to that teacher’s methods, nor was it due to the teacher’s inspiring belief and/or enthusiasm. Two of the three echoed my own opinion: the students’ cognitive abilities had just developed over the past year. Or maybe for some reason they’d blown off the test the year before. I taught two of the three “double sliders”—one was mine, one I taught the following year in geometry, so I had the opportunity to ask them about their scores. Both said “Oh, yeah, I totally blew off the test.” )

So a quarter of the students had gains sufficient to move from the middle of one category to the middle of another. The largest improvement was 170 points, with about 10 students seeing >100 point improvement. The largest decline was 169 points, with 2 students seeing over 100 point decline. Another oddity: only one of these two students was a “double slider”. The other two “double sliders” had less than 100 point declines. My double slider had a 60 point decline; my largest point decline was 89 points, but only dropped one category.

However, the primary takeaway from our data is that 63% of the students forced to take algebra twice were, score-wise if not category-wise, a “push”. They dropped or gained slightly, may have moved from the bottom of one category to the middle of the same, or maybe from the top of one category to the bottom of another.

One might argue that we wasted a year of their lives.

State reports say our average algebra score from 2010 to 2011 nudged up half a point.

So it’s hard to find evidence that we made much of a difference to student achievement as a whole.

I know this is a long post, so I’ll remind the reader that all of the students in my study have already taken algebra once. Chew on that for a while, will you?

Where was the improvement? Case 3: Achievement Gap

I had found no answer to my conundrum in my above numbers, although I had found some comfort. Broken down by category, it’s clear I’m in the hunt. But the breakdown doesn’t explain how we had such a stupendous year.

But when I thought of comparing our state scores from year to year, I got a hint. The other way that schools can achieve educational policy objectives is by closing the achievement gap.

All of this data comes from the state reports for our school, and since I don’t want to discuss who I am on this blog, I can’t provide links. You’ll have to take my word for it—but then, this entire post is based on data that no one else has, so I guess the whole post involves taking my word for it.

Group        2010-11 Change
Overall      +0.5
Whites       -7.2
Hispanics    +4
EcDis Hisp   1
ELL          +7

Wow. Whites dropped by seven points, Hispanics overall increased by 4, and non-native speakers (almost entirely Hispanic and economically disadvantaged) increased by 7 points.

So clearly, when our administrator was talking about our great year, she was talking about our cleverness in depressing white scores whilst boosting Hispanics.

Don’t read too much into the decline. For example, I personally booted 12 students, most of them white, out of my algebra classes because they’d scored advanced or proficient in algebra the previous year. Why on earth would they be taking the subject again? No other teacher did this, but I know that these students told their friends that they could get out of repeating Algebra I simply by demanding to be put in geometry. So it’s quite possible that much of the loss is due to fewer white advanced or proficient students taking algebra in the first place.
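Here is a minimal sketch of how that can happen. The four students and their scores are entirely hypothetical: the two strongest leave the group, the other two repeat and score exactly what they scored before.

```python
from statistics import mean

# Hypothetical scores for one group of algebra test-takers.
year1 = {"a": 410, "b": 380, "c": 310, "d": 290}

# Students a and b move on to geometry and no longer take the algebra test.
# Students c and d repeat the course and score exactly the same.
year2 = {"c": 310, "d": 290}

avg1 = mean(year1.values())
avg2 = mean(year2.values())
group_change = avg2 - avg1

# Change for the students who actually have scores in both years.
matched_changes = [year2[s] - year1[s] for s in year2 if s in year1]

print(f"group average: {avg1} -> {avg2} ({group_change:+})")
print("matched student changes:", matched_changes)
```

The group average falls by 47.5 points while no individual student's score changes at all, so a year-over-year drop in a group average says nothing by itself about how the remaining students were taught.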

So who was teaching Hispanics and English Language Learners? While I can’t run reports anymore, I did have my original file of 2010 scores. So this data is incoming students with 2010 scores, not the final 2011 students. Also, in the file I had, the ED and ELL overlap was 100%, and I didn’t care about white or black EDs for this count. Disadvantaged non-ELL Asians in algebra is a tiny number (hell, even with ELL). So I kept ED out of it.

      Hisp   ELL
t1    30     21
t2    32     38
t3    48     37
t4    39     12

Well, now. While Teacher 4 has a hefty number of Hispanics, very few of them are poor or ELLs. Teacher 2 seems to have Asian ELLs in addition to Hispanic ELLs. I have a whole bunch of Hispanics, most of them poor and ELL.

So I had the most mediocre numbers, but we had a great year for Hispanic and ELL scores, and I had the most Hispanic and ELL students. So maybe I was inadvertently responsible for depressing white scores by booting all those kids to geometry, but I had to have something to do with raising scores.

Or did I? Matthew DiCarlo is always warning against confusing comparing year to year scores, which are a cross-section of data at a point in time, with comparing student progress at two different points in time. In fact, he would probably say that I don’t have a conundrum, that it’s quite possible for me to have been a crappy teacher who had minimal impact on student achievement compared point to point, while the school’s “cross-section” data, which doesn’t compare students directly, could have some other reason for the dramatic changes.

Fair enough. In that case, we didn’t have a great year, right? It was just random happenstance.

This essay is long enough. So I’ll leave anyone interested to explain why this data shows that merit pay and value added scores are pointless. I’m not sure when I’ll get back to it, as I’ve got grades to do.


About educationrealist
