Here we go again on testing – or at least the more general category of high-stakes measures. This one is by Larry Cuban. You know who Larry Cuban is – former high school social studies teacher (14 years, including in the District), district superintendent in Arlington, VA, for seven years, and professor emeritus of education at Stanford University for the past 20 years.

An interesting way for Cuban to close. Perhaps he was making a political argument in an attempt to reach the single-measure aficionados. The research is pretty clear that the use of multiple measures – particularly the highly error-prone and inaccurate ones we have (mincomp tests and ratings, where the greater problem is actually the consistency of the raters) – simply produces corruption of all of them.



“Why Do Good Policy Makers Use Bad Indicators?”*

Larry Cuban

January 29, 2012

Test scores are the coin of the educational realm in the U.S. In No Child Left Behind, they are used to reward and punish districts, schools, and teachers for how well or poorly students score on state tests. In pursuit of federal dollars, the Race to the Top competition has shoved state after state into legislating that teacher evaluations include student test scores as part of judging teacher effectiveness.

Numbers glued to high-stakes consequences, however, corrupt performance. Since the mid-1970s, social scientists have documented the untoward results of attaching high stakes to quantitative indicators, not only in education but also across numerous institutions. They have pointed out that those who implement policies using specific quantitative measures will change their practices to ensure better numbers.

The work of social scientist Donald T. Campbell and others about the perverse outcomes of incentives was available and known to many but went ignored. In Assessing the Impact of Planned Social Change, Campbell wrote:

“The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor” (p. 49).

Campbell drew instances of distorted behavior from cases in which police officials used clearance rates in solving crimes, the Soviets set numerical goals for farming and industry, and the U.S. military used “body counts” in Vietnam as evidence of winning the war.

That was nearly forty years ago. In the past decade, medical researchers have found similar patterns when health insurers and Medicare have used quantitative indicators to measure physician performance. For example, Medicare requires, as a quality measure, that doctors administer antibiotics to a pneumonia patient within six hours of the patient's arrival at the hospital. As one physician said: “The trouble is that doctors often cannot diagnose pneumonia that quickly. You have to talk to and examine the patient and wait for blood tests, chest X-rays and so on.” So what happens is that “more and more antibiotics are being used in emergency rooms today, despite all-too-evident dangers like antibiotic-resistant bacteria and antibiotic-associated infections.” He and other doctors also know that some surgeons pick reasonably healthy patients for heart bypass operations and ignore elderly ones who have 3-5 chronic ailments to ensure that results look good.

More examples.

TV stations charge for advertising on the basis of how many viewers they have during “sweeps” months (November, February, May, and July). The Nielsen company has boxes in two million homes (representative of the nation’s viewership) that register whether the TV is on and what families are watching during those months. It also has viewers fill out diaries. Nielsen assumes that what the station shows in those months represents programming for the entire year (see 2011-2012-Sweeps-Dates). Nope. During those “sweeps,” TV networks and cable companies program new shows, films, extravaganzas, and sports that will draw viewers so they can charge higher advertising rates. They game the system and corrupt the measure (see p. 80).

And just this week, ripped from the headlines of the daily paper, online vendors secretly ask purchasers of their products to write reviews and rate the products with five stars in exchange for a kickback of the price the customer paid. Another corrupted measure.

Of course, educational researchers also have documented the link between standardized test scores and narrowed instruction to prepare students for test items, instances of state policymakers fiddling with cut-off scores on tests, increased dropouts, and outright cheating by a few administrators (see Dan Koretz, Measuring Up).

What Donald Campbell said in 1976 about “highly corruptible indicators” applies not only to education but also to many different institutions.

So why do good policy makers use bad indicators? The answer is that numbers are highly prized in the culture because they are easy to grasp and use in making decisions. The simpler the number – wins/losses, products sold, profits made, test scores – the easier to judge worth. When numbers have high stakes attached to them, they then become incentives (either as a carrot or a stick) to make the numbers look good. And that is where indicators turn as bad as sour milk whose expiration date has long passed.

The best policymakers, not merely good ones, know that multiple measures for a worthy goal reduce the possibility of reporting false performance.


*Steven Glazerman and Liz Potamites, “False Performance Gains: A Critique of Successive Cohort Indicators,” Working Paper, Mathematica Policy Research, December 2011, p. 13.