I recently read Ian Ayres’ excellent book, Super Crunchers. For folks who read and enjoyed Freakonomics, this book is a must-read, covering more cases where clever statistical analyses have uncovered interesting and useful results. The goal in writing the book, according to Ayres, was to encourage people to learn to think statistically. On the other side of the link is a discussion of some errors in experimental design, why their treatment in Ayres’ book frustrates me and why the average person should care.
My chief complaint about this book is that it offers no insight into the hygiene of the data used. This may seem like a nitpicky complaint, given that the book was written for a general audience, but since the book focuses on case studies where hardcore statistical analysis yielded interesting and useful results, I think an understanding of how you make sure your data is clean would be helpful. An example of this is the chapter on direct instruction (DI). I don’t question the studies or their data showing that DI is an effective method, since they have almost certainly been peer reviewed by excellent statisticians. I do want to point out a couple of different types of errors that often exist in otherwise clean datasets. It turns out, as I will explain below, that one of these errors, combined with the omission of any information about how the data used in this analysis were collected, is highly relevant to another very real-world problem – how your bosses decide whether or not you get a raise.
The first, and most common, error in experimental design is an uncontrolled variable. This happens when you have measurements that you assume vary as a function of your controlled variables, but that in actuality are most highly correlated with a variable you’re not controlling. This sounds like an easy thing to avoid, but in reality, systems are messy and you will not always design your experiment correctly the first time around. You will usually miss some things that affect your experiment, due to unfamiliarity if nothing else. An example of this kind of error would be measuring plant growth as a function of days of sunshine, but not controlling for the amount of water the plants receive. Most savvy experimental designers take this into account by limiting what can be inferred from the correlation – this is what people mean when they say that correlation doesn’t imply causation. They’re being careful about what they can and cannot infer from the data. Oddly, I don’t recall Ayres explicitly mentioning this issue in any of the case studies, but we can assume that any obvious errors in this regard would have been quickly picked out by the reviewers of the papers from which the case studies were drawn.
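To make the plant example concrete, here is a minimal simulation of an uncontrolled variable. Everything in it is invented for illustration: growth is driven entirely by water, but because the sunnier plots also happen to receive more water, sunshine alone looks strongly predictive.

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical study: 500 plots, sunshine measured in days.
sunshine = [random.uniform(0, 10) for _ in range(500)]
# Uncontrolled variable: watering happens to track sunshine.
water = [0.8 * s + random.gauss(0, 1) for s in sunshine]
# Growth depends ONLY on water, not on sunshine directly.
growth = [2.0 * w + random.gauss(0, 1) for w in water]

print(pearson(sunshine, growth))  # strong, despite no direct causal link
print(pearson(water, growth))     # stronger still -- the real driver
```

If you only recorded sunshine, the first correlation would look like a clean result; only by also measuring water does the confound become visible.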
The second error, and the one that I think is actually most relevant, is what I call proxy error. This is error or bias that enters when you are forced to measure a proxy for a property, rather than the property itself. This is not uncommon at all – most things that are interesting are also not directly measurable. A great example of this is measuring the quality of books. People have different tastes, and different opinions about what they value in a book. If it’s fiction, some people might prefer stories with gripping, twisty plots, while others might prefer interesting characters they can identify with. It’s important for the publishing industry to have an objective measure of how good a book is, so they can try to maximize their earnings by publishing what people want to read. There is thus an implicit assumption made that books with high sales are good books. This is true from the perspective of a business, make no mistake, but from the perspective of a book-lover this is clearly not the case, as anyone who has read The Da Vinci Code or the latest Laurell Hamilton novel can attest. What we see here is a proxy error, in that we believe that something which appeals to everyone must have high quality. By that same token, we might argue that McDonald’s is higher quality than your favorite local fine dining restaurant. In those terms, we can quickly see the absurdity.
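The book-sales example can be sketched numerically. In this made-up model, each book has an intrinsic “quality” (what a devoted reader values) and a separate “broad appeal”, and sales – the measurable proxy – track appeal far more than quality. Ranking by the proxy then tells you almost nothing about quality:

```python
import random

random.seed(1)

# Hypothetical catalog of 1000 books. Quality and broad appeal are
# modeled as independent traits; sales are dominated by appeal.
books = []
for _ in range(1000):
    quality = random.gauss(0, 1)
    appeal = random.gauss(0, 1)
    sales = 10 * appeal + 1 * quality + random.gauss(0, 1)
    books.append((quality, sales))

# Average quality of the 100 bestsellers (selected by the proxy)...
top_by_sales = sorted(books, key=lambda b: b[1], reverse=True)[:100]
avg_quality_top = sum(q for q, _ in top_by_sales) / 100

# ...versus the 100 genuinely best books (selected by the property itself).
top_by_quality = sorted(books, key=lambda b: b[0], reverse=True)[:100]
avg_quality_best = sum(q for q, _ in top_by_quality) / 100

print(avg_quality_top)   # barely above the population mean of 0
print(avg_quality_best)  # far higher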
In the case of the chapter on direct instruction in Super Crunchers, Ayres comes to the conclusion that DI has marked efficacy in both imparting core skills and improving creativity. What I want to point out is that there are serious issues with drawing that unequivocal conclusion on the basis of the “super crunching” that was done, at least according to what was reported in the book. The first issue is that there is no precise definition of creativity and thus it cannot be quantitatively measured. This means that in absence of a clear discussion about what was meant by creativity and how it was measured, I don’t think a conclusion can be drawn. Additionally, the lack of a precise definition means that you have to construct proxies that do have precise definitions and are measurable. Now you have two problems to solve: you have to construct a proxy and then measure it in a statistically reliable way.
For creativity specifically, I’m aware of several proxies. When we interview potential candidates for a research scientist/engineer position, creativity is one of the key traits we probe for. We therefore not only have to construct reliable, repeatable, and robust proxies for creativity, we have to construct them such that they can be measured within the constraints of our interview process. We might thus draw the following conclusion: “If a candidate is a good brainstormer and can use that modality effectively to solve problems placed before her, we believe that she will likely be a highly creative individual.” This comes through in the process as “so-and-so had some really creative ideas and response to my questions, so I ranked her highly on creativity.”
The researchers who design the educational testing that generated the data on DI went through a similar, though probably much more in-depth and rigorous, thought process when deciding how to measure creativity. Their design criteria were also necessarily different. Rather than being constrained to an arbitrary process like our recruiting process, their design was likely focused much more heavily on reliability and repeatability. We sacrifice some of those, knowing that our process may reject some qualified candidates as a result of that sacrifice, because the cost of doing scientifically reliable and repeatable measurement is too great for the payback on the margin.
But no matter how careful your design is, there are simply some cases where any proxy will not accurately reflect the property for which they are proxy. In the case of sociometric and psychometric testing, that error is itself unmeasurable and unbounded. You might be able gain some understanding of the error by constructing several different proxies and measuring them as well, but my sense is that this is rarely done in a single study. And in this particular case, the danger is that the children being tested had learned to do well on the tests rather than actually having learned to be creative. In other words, the children only appeared to be creative because they had altered their behaviors to meet the expectations of the researchers.
This subject is one that is highly relevant to the general audience, however, because of how often it affects peoples’ lives. It has become particularly interesting to me as we have developed better metrics for recruiting and rewarding researchers and managing our project portfolio. Setting metrics by which a group of PhD scientists and engineers are measured and rewarded is tricky, because of the richness in the strategies we’ll employ to maximize our benefit within the metrics given. It is therefore critical that the metrics are proxies for activities and outcomes that are truly valuable to the organization, since they will likely profoundly alter behaviors. Constructing these metrics is a difficult job and I believe that feedback on the metrics themselves is critical in getting it right. If people in general understood these connections and subtleties of constructing effective human metrics, then a lot of the pathos that exists within the business world around the subject would disappear. While this would be terrible for Scott Adams, since he’d lose a lot of material for Dilbert, I think we’d handle the loss just fine.
It’s also worth noting that my interest in this subject probably sensitized me to the proxy error that could be present in this data. Its possible, although I think unlikely, that the study accounted for this somehow. Its also very likely that I missed potential proxy errors in other chapters and that I’d probably not be competent to discuss them even if I did notice them. That doesn’t mean that they aren’t there, of course. Before anyone accuses me of picking on one part of the book unfairly, let me assure you that my criticisms stem only from my wish that he’d treated this subject in a little more depth, which is possibly an unreasonable request for a general audience book. Regardless, I think that if we’re teaching people to think statistically, as Ayres advocates in the book, we need to also teach them the first half of the problem, which is good experimental design, at the same time. After all, the statistics are no better than the data from which they are computed.