I read Cathy O'Neil's Weapons of Math Destruction a few weeks ago and I'm still mulling it over, so I want to spend some time expanding on one small part of the book. In WMD, Ms. O'Neil talks a fair bit about how models can lead to terrible outcomes through a fusion of their particular blind spots and perverse incentives. One thing I would like to expand upon is how these blind spots can develop naturally and be obscured by the naive performance metrics one typically uses to judge how well a model is functioning.
Consider the following example: suppose I am an education reformer and I believe that society is overmedicating our kids, slapping labels like ADHD on children who are really just non-conformist in ways that should be encouraged. I claim that I have a black-box algorithm that can correctly identify ADHD in kids, using only their school records, with ~95% accuracy. I have tested it on a large sample of records obtained from the Edmonton Public School Board, and empirically 94.9% of kids were correctly sorted into the ADHD or no-ADHD categories. More importantly, the false positive rate is tiny: in fact, 0% of kids were incorrectly attributed ADHD. So this is a wonderful new technique for identifying kids who truly need medication while sparing the rest from our pervasive pharmaceutical culture that "cures" kids of their creativity and individuality, right? Well, here's the code:
def has_ADHD(student): return False
Nothing I said before was wrong; it was merely irrelevant or misleading. ADHD is relatively uncommon, with about 5% of school-age kids having a diagnosis, so in a large sample of students from the general population most students don't have it. That hides the fact that my algorithm doesn't do anything: sure, it has a 0% false positive rate, but it has a 0% true positive rate as well.
The example I gave was silly, since the algorithm literally did nothing, but it is easy to imagine cases where a particular minority is excluded for structural reasons. For example, ADHD is more common in boys, so an algorithm that (all else being equal) preferentially assigned ADHD diagnoses to boys, or was biased against girls having ADHD, might appear to work fairly well while systematically excluding girls with ADHD. Such an algorithm could easily score >95% accuracy too, for reasons similar to those behind my silly algorithm above.
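To make that concrete, here is a deliberately extreme caricature (all numbers and prevalence rates are made up for illustration): a classifier that is perfect on boys but structurally blind to girls still posts excellent overall accuracy, while catching exactly zero girls with ADHD.

```python
import random

random.seed(1)

# Hypothetical records: ADHD is made more prevalent among boys, per the text.
def make_student():
    sex = random.choice(["boy", "girl"])
    prevalence = 0.07 if sex == "boy" else 0.03
    return {"sex": sex, "has_adhd": random.random() < prevalence}

students = [make_student() for _ in range(10_000)]

# An extreme caricature of structural bias: perfect on boys, blind to girls.
def biased_has_ADHD(student):
    return student["sex"] == "boy" and student["has_adhd"]

accuracy = sum(biased_has_ADHD(s) == s["has_adhd"] for s in students) / len(students)
girls_with_adhd = [s for s in students if s["sex"] == "girl" and s["has_adhd"]]
girls_caught = sum(biased_has_ADHD(s) for s in girls_with_adhd)

print(f"overall accuracy: {accuracy:.1%}")  # looks excellent
print(f"girls with ADHD flagged: {girls_caught}/{len(girls_with_adhd)}")  # 0/...
```

A headline accuracy number summarizes the whole population, so a subgroup this small simply cannot drag it down much, no matter how badly the model serves them.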
This sort of problem, where an algorithm performs well for the vast majority of cases but falls flat on its face for a small minority, is common. It is a big and persistent problem in data science, one that isn't insurmountable by any means but one that requires thought. Regardless, when we see that some new algorithm performs wonderfully, we should always ask when it doesn't perform well and why. Exploring a model's failures is often far more instructive than basking in its glory.
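One practical way to ask "when does it fail?" is to look past accuracy at a confusion-matrix view. With hypothetical counts for 10,000 students at 5% prevalence and the always-False classifier from earlier, recall exposes the failure that accuracy hides:

```python
# Hypothetical confusion-matrix counts for the always-False classifier
# on 10,000 students at 5% prevalence: it flags no one, so every kid
# with ADHD is a false negative.
TP, FP, FN, TN = 0, 0, 500, 9500

accuracy = (TP + TN) / (TP + FP + FN + TN)      # 0.95 -- looks great
recall = TP / (TP + FN) if (TP + FN) else 0.0   # 0.0 -- it finds no one

print(f"accuracy: {accuracy:.0%}, recall: {recall:.0%}")
```

Reporting recall (and ideally per-subgroup breakdowns of it) alongside accuracy is a cheap habit that catches both of the failure modes described above.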