Monday, December 8, 2008

Houston ISD ASPIRE: Good idea, bad implementation?

Houston ISD has implemented a new evaluation system called ASPIRE, developed in collaboration with Battelle for Kids and based on the SAS Educational Value-Added Assessment System (EVAAS). It is a new initiative to track student progress year-over-year (longitudinally), instead of comparing this year's third grade class to next year's and last year's (a cross-sectional study). The idea behind the new evaluation is that if you track the same population over time, you can see how much progress students make under different teachers, year after year. It's a nice idea, but as far as I can tell, this implementation has at least two major flaws. I'd appreciate comments from educators and statisticians either confirming or rebutting these observations; I'm neither, and I'd like to hear from experts.

Before I outline what I consider to be ASPIRE's weaknesses, I'd like to go on record as a supporter of the concept, at least in theory. It's clear that the old method of measuring a teacher's "performance" year after year, with a changing population each year, is unfair to the teacher because it does not control for possibly wide swings in the class's demographics. If one year a teacher has an eager student body overall, and the next year one that comes in lacking skills or focus, the teacher's "performance" will vary. The exit scores for each of those classes will differ; one year he or she will look like a success, and the next, possibly a failure.

The implementation of ASPIRE is an attempt to measure a class's incoming and outgoing level of achievement, and to determine what effect, if any, the teacher has on the students. At first glance, that seems reasonable; however, the implementation at HISD falls short for the following reasons.
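Since a concrete example helps me think about this, here's a toy sketch (in Python) of the difference between the old cross-sectional comparison and the longitudinal, value-added idea. To be clear, the real SAS EVAAS model is a far more sophisticated statistical analysis than a simple average of gains; all the names and scores below are invented for illustration.

    # Toy sketch: cross-sectional comparison vs. longitudinal (value-added) gain.
    # Invented scores; the real EVAAS model is a much more elaborate analysis.

    def mean(xs):
        return sum(xs) / len(xs)

    # Old method: compare this year's 3rd grade to last year's 3rd grade --
    # two different populations, so demographic swings look like teacher effects.
    last_years_3rd_grade = [72, 78, 81, 85]
    this_years_3rd_grade = [60, 66, 70, 74]   # a less-prepared incoming class
    print(mean(this_years_3rd_grade) - mean(last_years_3rd_grade))   # -11.5: looks like a failure

    # Longitudinal method: follow the same students from one year to the next
    # and attribute the average gain to the classroom.
    same_students_entering = [60, 66, 70, 74]
    same_students_leaving  = [68, 75, 78, 83]
    gains = [b - a for a, b in zip(same_students_entering, same_students_leaving)]
    print(mean(gains))   # +8.5: the class actually made solid progress

The same teacher, with the same less-prepared class, looks like a failure under the first calculation and a success under the second. That, as I understand it, is the whole point of ASPIRE.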

The first potential flaw in the system is that schools with an advanced student population will show little to no year-over-year improvement. This is an effect of the tests chosen for the ASPIRE metrics: the Texas Assessment of Knowledge and Skills (TAKS), and a normed test (Stanford or Aprenda). Neither of these tests can differentiate among the students in the 99th percentile, and that effect may even hold for a larger population (perhaps down to the 95th percentile). If a child stays in the 99th percentile year after year, or bounces around within the 95th-99th percentile band, they show no (or even negative!) progress according to those tests. It's not fair to penalize that child's (or that class's) teachers because the test can't measure student progress at that level. The same goes for the TAKS: it's designed to measure a very basic level of subject mastery, and for schools with advanced students, that mastery happens as a matter of course, possibly even early in the school year. What happens in the classroom beyond that (a deeper investigation of the subject, a broader survey of related topics) is not measured in ASPIRE at all.
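To make the ceiling effect concrete, here's a small sketch. The specific cap and scores are my own assumptions for illustration, not anything from ASPIRE's documentation; the point is simply that a test that tops out can't register growth at the top.

    # Ceiling effect: a test that tops out at the 99th percentile cannot show
    # growth for students already there, so their "gain" is zero or negative.
    # (Illustrative numbers only.)

    TEST_CEILING = 99   # highest percentile the test can report

    def reported_percentile(true_ability_percentile):
        """What the test reports: anything above the ceiling is clipped to it."""
        return min(true_ability_percentile, TEST_CEILING)

    # An advanced student who genuinely improves from the 99th percentile to a
    # (hypothetical) 99.8th percentile shows zero measured gain...
    year1 = reported_percentile(99.0)
    year2 = reported_percentile(99.8)
    print(year2 - year1)    # 0.0 -- the improvement is invisible to the test

    # ...and ordinary measurement noise can even make the gain look negative.
    year2_noisy = reported_percentile(99.8) - 1.5   # e.g. a couple of careless errors
    print(year2_noisy - year1)                      # -1.5 -- "negative progress"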

When HISD implements "performance-based pay" on top of such a system, the intent is to reward the teachers who take struggling students and help them reach new levels of accomplishment during the school year. However, they run the risk of leaving behind the teachers who teach average or advanced students, and that's not fair. By publishing this "value-add" data on the web, they also give the misleading impression that the schools with more accomplished students have flat or failing performance.

Let me be clear: this is not a problem for all student populations. For students who are not advanced, an improvement year after year would be meaningful if it were measured correctly.

That brings me to the second potential flaw: the data may not reflect what ASPIRE needs to measure. The first input is TAKS performance, a measure of mastery in various subject areas. This is probably a good indicator for reading (comprehension) and mathematics, which are measured every year; those are subjects where each year builds upon the student's previous knowledge, and an improvement may signal a significant change in understanding. The other areas measured (science and writing) are less obviously incremental and aren't tested every year; and other parts of the curriculum (history, art, music, foreign language, and health, for example) aren't measured at all.

The other test used as input is either the Stanford or the Aprenda (for Spanish-speaking students). Unfortunately for this effort, these are nationally normed tests, which makes them useless for measuring student progress. Very briefly, a norm-referenced test is one in which the student is assigned a rank against their peers that year; the questions are not assessing subject-matter knowledge, but are instead chosen as differentiators between students. To see the effect of the first characteristic, the peer-relative ranking, just think of how a student will score differently depending on which other kids take the test; the same student could be in the 70th, 80th, or 90th percentile depending on who else is taking it. Clearly, this is not simply measuring achievement; while a large part of how well a student does on the test depends on their knowledge, a significant factor is the set of other students, over which they have no control.
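A tiny sketch makes the cohort dependence obvious; the cohorts and scores below are invented, and real norming uses much larger national samples, but the mechanism is the same.

    # Norm-referenced scoring: the same raw score maps to very different
    # percentile ranks depending on who else takes the test. (Invented data.)

    def percentile_rank(score, cohort):
        """Percent of the cohort scoring at or below this score."""
        return 100.0 * sum(1 for s in cohort if s <= score) / len(cohort)

    student_raw_score = 82

    weaker_cohort   = [55, 60, 64, 68, 70, 73, 75, 78, 80, 85]
    stronger_cohort = [70, 75, 80, 83, 85, 88, 90, 92, 95, 98]

    print(percentile_rank(student_raw_score, weaker_cohort))    # 90.0 -- near the top of this group
    print(percentile_rank(student_raw_score, stronger_cohort))  # 30.0 -- same knowledge, different rank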

The second problem with normed tests is more subtle. The questions on the test are not chosen to assess what a student knows; instead, they're effectively chosen for how "tricky" they are, so that they expose differences between sets of students. The purpose of a normed test is to rank all the test-takers along a continuum of scores, and you can't do that if a large number of questions are ones that everyone gets right or wrong. On the TAKS, which is a criterion-referenced assessment measuring comprehension and mastery, it's fine for everyone to get all the questions right; that means all the students in Texas have mastered that subject that year. The normed tests do not serve that purpose; such a result on the Stanford or Aprenda would be a serious failure.
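Here's a rough way to see that, using an item's score variance as a crude stand-in for the item statistics that test designers actually use. This is only an illustration of the principle, not how the Stanford or Aprenda is actually constructed.

    # Why a norm-referenced test avoids questions everyone answers correctly:
    # such an item carries no information for ranking. A crude measure of an
    # item's ranking power is its score variance p*(1-p), where p is the
    # fraction of students answering correctly. (Illustration only.)

    def item_variance(responses):
        """responses: list of 1 (correct) / 0 (incorrect) across students."""
        p = sum(responses) / len(responses)
        return p * (1 - p)

    mastery_item = [1, 1, 1, 1, 1, 1, 1, 1]   # everyone got it right
    tricky_item  = [1, 0, 1, 0, 1, 1, 0, 0]   # splits the group roughly in half

    print(item_variance(mastery_item))   # 0.0  -- useless for ranking; fine for a criterion-referenced test
    print(item_variance(tricky_item))    # 0.25 -- maximally useful for separating students

A criterion-referenced test is happy to keep the first kind of question, because universal mastery is exactly what it's trying to detect; a norm-referenced test has to throw it out.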

The final issue with ASPIRE involves the more general debate about whether these standardized tests actually provide a relevant measure of student accomplishment and accurately reflect the effects of good or poor teachers (as opposed to a good or bad curriculum, inappropriately homogenized pedagogical methods, or external factors such as days lost to weather). You clearly cannot improve a system such as public education without being able to measure it; however, there's a valid debate over whether we know how to describe and measure the effects of a successful education. Until we get to that point, I'm supportive of attempts to assess educational effectiveness, and skeptical of punishing or rewarding teachers simply on the results of those potentially ineffective efforts.

The idea of tracking each student population's progress longitudinally (year-over-year) and measuring its improvement is a good one; however, I'm disappointed that HISD and Battelle seem to have gotten the implementation wrong with ASPIRE. I can't tell whether they use the TAKS and Stanford/Aprenda metrics simply because that's what they have at hand (and they don't want to change the tests or add new ones), or because they fundamentally don't understand how poorly those tests measure what they're trying to track. Perhaps ASPIRE will get better over time; it may also be that my analysis above is flawed in one or many ways. If I'm way off base, I'd love the reassurance of being proven wrong.
