
Tuesday, February 17, 2009

HISD - College Ready?

Ericka Mellon wrote an article in the Houston Chronicle about how high school students at HISD schools performed on a "college ready" assessment.

The article describes advances HISD has made in the number of "college ready" students on the TAKS Exit Level exam. This is defined (as far as I can tell) by a student receiving a scale score of 2200 in each of the Math and Reading sections of the exam. My first two questions are:

  • Do these scale scores correlate with any future success, such as college performance or acceptance?
  • Scale scores are a way of applying a "fudge factor" to try to normalize score reporting across different years' tests (just like the SAT scores, for example); are we sure they represent something meaningful? Can they easily be adjusted year to year to affect the numbers?

Here's an example of the scale score issue: one year, a student can get 23 of 36 questions correct and receive a scale score of 2100. The next year, the test is determined to be "easier", so a student needs 26 of 37 correct for the same scale score. What's not clear to me is whether the composition of those 23 or 26 correct answers affects the student's scale score at all. On the first test, is a student who shows complete mastery of most topics, but gets 13 questions wrong because of 0% mastery of two core topics, really equivalent to a student who misses 13 questions scattered here and there while showing a decent grasp of all the concepts? Can you compare two students who each get 23 correct - one who answers 23 of the 24 "easy" questions and none of the 12 "hard" ones, and another whose 23 correct are a mix of hard and easy questions? Does it make sense to map a scale score to just a raw score, or should the questions or their distributions be weighted? For the details, refer to the TEA web site documenting the conversion of raw scores to scale scores on the TAKS.
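To make the concern concrete, here's a minimal sketch (in Python, with invented cutoffs; the TEA publishes the real conversion tables for each TAKS administration) of how a raw-to-scale lookup behaves. The point is that the lookup sees only the count of correct answers, never which questions were missed.

```python
# Hypothetical raw-score-to-scale-score lookup. The cutoffs below are
# invented for illustration; the TEA publishes the real tables for each
# TAKS administration.
RAW_TO_SCALE_YEAR1 = {23: 2100, 30: 2200}   # e.g., 23 of 36 correct -> 2100
RAW_TO_SCALE_YEAR2 = {26: 2100, 33: 2200}   # an "easier" year needs more correct

def scale_score(raw_correct, table):
    """Return the highest scale score whose raw-score cutoff the student met."""
    met = [scale for cutoff, scale in table.items() if raw_correct >= cutoff]
    return max(met, default=None)

# The lookup depends only on *how many* answers were correct, not on which
# topics those answers covered -- exactly the concern raised above.
print(scale_score(23, RAW_TO_SCALE_YEAR1))  # 2100
print(scale_score(23, RAW_TO_SCALE_YEAR2))  # None (below that year's 2100 cutoff)
```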

Never mind. Let's say students who meet the scale-score cutoff are all equally "ready for college". One of the documents Ms. Mellon attaches at the bottom of her article shows the achievement levels per high school in HISD. The numbers are interesting. DeBakey has an impressive record of preparing their students for the TAKS exit exam, the best in town: almost all their kids score at least proficient in Math and Reading. Bellaire last year saw 82% of their kids "pass" in Math and 76% in Reading. Carnegie, 95%/94%, and that's way up from 87%/72% (!!) in 2007. I love the HSPVA numbers, which kind of buck the trend of doing better in math than reading: 84%/96%. Lamar's numbers are 65%/63%.

What does this mean for a parent trying to decide which HS is right for their kid? On the one hand, if you are not worried about your kids passing these standards, maybe these aggregate numbers aren't that important to your individual case. On the other hand, I worry that schools have been or will be looking at these numbers, setting campus goals, and then expending a great deal of resources trying to get those numbers up. Although that's not a bad reaction (again, assuming these metrics actually measure something meaningful), in practice I fear it means that few or no resources at those schools will be focused on the students who are in no danger of missing these goals - the advanced kids who could also use more attention to better develop their own skills and interests. I worry more and more that campus educational resources are a zero-sum game, and when the balance shifts inordinately toward bringing the bottom students up, the top students get less attention. What should that balance be?

As parents, should we focus on sending our kids to schools with the best records, assuming their staff already feels confident in their students' performance and can focus on deeper or broader curricula? Or should we worry that schools at the top are there because they're focusing so many resources on passing these tests, and our children may languish?

Monday, December 8, 2008

Houston ISD ASPIRE: Good idea, bad implementation?

Houston ISD has implemented a new evaluation system called ASPIRE, developed in collaboration with Battelle for Kids and based on the SAS Educational Value-Added Assessment System (EVAAS). This is a new initiative to track student progress year-over-year (longitudinally), instead of comparing this year's third grade class to next year's and last year's (a cross-sectional study). The idea behind the new evaluation is that if you track the same population over time, you can see how much progress students make under each teacher, year after year. It's a nice idea, but as far as I can tell, this implementation has at least two major flaws. I'd appreciate comments from educators and statisticians either confirming or rebutting these observations; I'm neither, and I like to hear from experts.

Before I outline what I consider are ASPIRE's weaknesses, I'd like to go on record as a supporter of the concept at least in theory. It's clear that the old method of measuring a teacher's "performance" year after year, with a changing population each year, is unfair to the teacher because it does not control for possibly wide swings in their class's demographics. If one year a teacher has an overall eager student body, and the next, one that comes in with a lack of skills or a lack of focus, the teacher's "performance" will vary. The exit scores for each of those classes will differ; one year he or she will look like a success, and the next, possibly a failure.

The implementation of ASPIRE is an attempt to measure a class's incoming and outgoing level of achievement, and determine what if any effect the teacher has on the students. On first sight, that seems reasonable; however, the implementation at HISD falls short for the following reasons.
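As a cartoon of that "incoming vs. outgoing" idea, here is a tiny sketch in Python. The real EVAAS analysis is a far more elaborate statistical model, and the teacher names and scores below are entirely made up; this just shows the basic shape of a gain-based comparison.

```python
# Toy version of a gain-based ("value-added") comparison. The real SAS EVAAS
# model is much more sophisticated; these records are fabricated.
students = [
    # (teacher, score entering the year, score leaving the year)
    ("Teacher A", 2050, 2190),
    ("Teacher A", 1980, 2120),
    ("Teacher B", 2300, 2310),
    ("Teacher B", 2290, 2280),
]

gains = {}
for teacher, incoming, outgoing in students:
    gains.setdefault(teacher, []).append(outgoing - incoming)

for teacher, g in sorted(gains.items()):
    print(teacher, sum(g) / len(g))   # average measured gain per teacher
```

Even in this toy version, Teacher B's students started near the top of the scale, so their measured "gain" is tiny no matter what happened in the classroom, which leads directly to the first flaw.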

The first potential flaw in the system is that schools with an advanced student population will show little to no year-over-year improvement. This is an effect of the tests chosen for the ASPIRE metrics: the Texas Assessment of Knowledge and Skills (TAKS), and a normed test (Stanford or Aprenda). Neither of these tests can differentiate among the students in the 99th percentile, and the effect may extend to a larger population (perhaps down to the 95th percentile). If a child stays in the 99th percentile year after year, or bounces around within the 95th-99th percentile band, they show no (or negative!) progress according to those tests. It's not fair to penalize that child's (or that class's) teachers because the test can't measure student progress at that level. The same goes for the TAKS: it's designed to measure a very basic level of subject mastery, and for schools with advanced students, that mastery happens as a matter of course, possibly even early in the school year. What happens in the classroom beyond that (a deeper investigation of the subject, a broader survey of related topics) is not measured in ASPIRE.
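Here's a hypothetical illustration of that ceiling effect, assuming (as is typical of these score reports) that the reported national percentile rank tops out at 99. The percentile values are made up.

```python
# Hypothetical ceiling effect on a normed test: the reported national
# percentile rank is capped at 99, so a student already at the top can show
# no growth -- or even an apparent decline -- regardless of what happened
# in the classroom. These percentile values are invented.
def reported(true_percentile):
    """Percentile rank as it appears on the score report (capped at 99)."""
    return min(round(true_percentile), 99)

history = [98.2, 99.1, 99.7, 98.3]   # one student's standing over four years
changes = [reported(b) - reported(a) for a, b in zip(history, history[1:])]
print(changes)   # [1, 0, -1]: the student never left the top few percent,
                 # yet the last year reads as a decline
```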

When HISD implements "performance-based pay" on top of such a system, the intent is to reward the teachers who take struggling students and help them reach new levels of accomplishment during the school year. However, they run the risk of leaving behind the teachers who are teaching the average or advanced students, and that's not fair. By publishing this "value-added" data on the web, they also give the misleading impression that the schools with more accomplished students have flat or failing performance.

Let me be clear: this is not a problem for all student populations. For students who are not advanced, an improvement year after year would be meaningful if it were measured correctly.

That brings me to the second potential flaw: the data may not be reflecting what ASPIRE needs to measure. The first input is TAKS performance, a measure of mastery in various subject areas. This is probably a good indicator when used for reading (comprehension) and mathematics, which are measured every year; those are subjects where each year builds upon the student's previous knowledge, and an improvement may signal a significant change in understanding. The other areas measured (science and writing) are less obviously incremental, and aren't tested every year; and other parts of the curriculum (history, art, music, foreign language, health, for example) aren't measured at all.

The other test used as input is either the Stanford or the Aprenda (for Spanish-speaking students). Unfortunately for this effort, these are nationally normed tests, useless for measuring student progress. Very briefly, a norm-referenced test is one in which the student is assigned a rank against their peers that year; the questions are not assessing subject matter knowledge, but are instead chosen as differentiators between students. To see the effect of the first characteristic, consider how the same student will score differently depending on the set of kids taking the test; they could land in the 70th, 80th, or 90th percentile depending on who else is tested. Clearly, this is not simply measuring achievement: a large part of how well a student does on the test depends on their knowledge, but a significant factor is the set of other students, over which they have no control.
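To illustrate that first point, here's a small sketch (with invented scores) showing how the same raw score lands in very different percentile ranks depending on the cohort it's normed against.

```python
# Why a percentile rank reflects the cohort as much as the student: the same
# raw score lands in very different percentiles depending on who else took
# the test. All scores here are invented.
from bisect import bisect_left

def percentile_rank(score, cohort_scores):
    """Percent of the cohort scoring strictly below the given score."""
    ordered = sorted(cohort_scores)
    return 100 * bisect_left(ordered, score) / len(ordered)

student_score = 52
weaker_cohort   = [30, 35, 40, 41, 44, 45, 47, 48, 50, 58]
stronger_cohort = [45, 50, 53, 55, 56, 58, 60, 61, 63, 65]

print(percentile_rank(student_score, weaker_cohort))    # 90.0
print(percentile_rank(student_score, stronger_cohort))  # 20.0
```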

The second problem with normed tests is more subtle. The questions on the test are not chosen to assess what a student knows; instead, they're effectively chosen for how "tricky" they are, so they expose a difference between sets of students. The purpose of a normed test is to rank all the test-takers along a continuum of scores; you can't do that if there are a large number of questions on the test that everyone gets right or wrong. On the TAKS, which is a criterion-referenced assessment and is measuring comprehension and mastery, it's OK for everyone to get all the questions right; that means that all the students in Texas have mastered that subject that year. The normed tests are not serving that same purpose; such a result on the Stanford or Aprenda would be a serious failure.
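A rough sketch of that item-selection logic, using fabricated item statistics, shows the difference in purpose: a norm-referenced test tends to discard items that nearly everyone answers correctly (or incorrectly) because they don't separate students, while a criterion-referenced test like the TAKS has no reason to drop them.

```python
# Fabricated item statistics: the fraction of pilot test-takers who answered
# each item correctly. A norm-referenced test wants items that spread
# students out, so it keeps only those in a middle difficulty band.
items = {
    "item_1": 0.97,
    "item_2": 0.55,
    "item_3": 0.48,
    "item_4": 0.05,
    "item_5": 0.72,
}

keep_for_normed_test = [name for name, p in items.items() if 0.20 <= p <= 0.80]
print(keep_for_normed_test)   # ['item_2', 'item_3', 'item_5']

# A criterion-referenced test has no reason to drop item_1: if everyone gets
# it right, that just means everyone mastered the standard it measures.
```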

The final issue with ASPIRE involves the more general debate about whether these standardized tests actually provide a relevant measure of student accomplishment, and accurately reflect the effects of good or poor teachers (as opposed to a good or bad curriculum, inappropriately homogenized pedagogical methods, external factors such as days lost to weather, etc.). You clearly cannot improve a system such as public education without being able to measure it; however, there's a valid debate over whether we know how to describe and measure the effects of a successful education. Until we get to that point, I'm supportive of attempts to assess educational effectiveness, and skeptical of punishing or rewarding teachers simply by using the results of those potentially ineffective efforts.

The idea of tracking each student population's progress longitudinally (year-over-year) and measuring their improvement is a good one; however, I'm disappointed that HISD and Battelle seem to have gotten the implementation wrong with ASPIRE. I can't tell if they use the TAKS and Stanford/Aprenda metrics simply because that's what they have at hand (and they don't want to change the tests, or add new ones), or if it's just because they fundamentally don't understand how poorly the tests measure what they're trying to track. Perhaps ASPIRE will get better over time; it may also be that my analysis above is flawed in one or many ways. If I'm way off base, I'd love the reassurance of being proven wrong.