This isn't directly engineering related, but I think it is an interesting example of how data can lead you to the wrong conclusions.
Where I work we have a central document management database that manages the flow of documents from creation through approval and release. So, for example, I write a report and put it into the system, and the system handles getting the appropriate review and approval for said report. The time it takes a document to traverse the system is one of our quality system metrics: basically, we aim to have documents processed in a reasonably short period of time. This makes sense.
Where things have gone awry is in the presentation of this data. Most of the approvals happen within a few days, however there are some (ridiculous) outliers. In any given month a handful of reports get lost in the system; for whatever reason everyone forgets about them, and it can take up to (no joke) 200 days for someone to find one and sheepishly approve the damn thing, or delete it (which is usually what was warranted). The metric for how well we're performing is simply the time difference between when a report was entered into the system and when it was approved. This is presented as a series of graphs, with the x-axis being the date a report went into the system and the y-axis the time it took to complete: the duration.
For the sake of this post, I'm going to generate some random data to play with.
```python
%matplotlib inline
from scipy.stats import zipf
from random import randrange
import numpy as np
import matplotlib.pyplot as plt

data = []
duration_dist = zipf(1.75)  # The data more or less follows a power law
today = 365    # Today, in days from the start, so one year of data
max_docs = 25  # Max number of documents created in a day

for day in range(today):
    num_docs = randrange(max_docs)
    durations = duration_dist.rvs(num_docs)
    data.extend([(day, dur) for dur in durations if dur < 365])

orig_data = np.array(data)
# A document is only visible today if it has already been approved,
# i.e. if its entry day plus its duration is before today.
filt = np.array([dur < today - day for day, dur in orig_data])
visible_data = orig_data[filt]
invisible_data = orig_data[np.logical_not(filt)]
```
On any given day a random number of reports are created, up to a maximum of 25; the time it takes for a report to traverse the system is sampled from a discrete power law distribution (since I only care about whole numbers of days). When plotted as a simple scatter plot it looks like this:
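For reference, a minimal sketch of how such a scatter plot can be produced from the `(day, duration)` rows built above (the function name and styling are mine; I use the non-interactive Agg backend so it also runs as a plain script):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

def scatter_durations(data, today=365):
    """Scatter of completion time vs. the day a document entered the system.

    data: iterable of (entry_day, duration) rows, as generated above.
    """
    data = np.asarray(data)
    fig, ax = plt.subplots()
    ax.scatter(data[:, 0], data[:, 1], alpha=0.4)
    ax.set_xlim(0, today)
    ax.set_xlabel("Day entered into the system")
    ax.set_ylabel("Days to approval")
    return fig, ax

# Example with a few hand-made points:
fig, ax = scatter_durations([(0, 5), (10, 200), (300, 3)])
```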
The above data is randomly generated, and I think that illustrates the problem: as of day 100 we appear to have gotten our act together and steadily improved! Hooray! Except we know we haven't, since the data was randomly generated by the same process throughout; what we're seeing is an illusion.
We have all the data from before 100 days or so, including all the extreme poor performers, but we only have some of the data for more recent months; specifically, we are missing those poor performers. Those reports are still open in the system and thus don't have a defined duration yet, which biases us toward thinking the poor performance only existed in the past. What happened to all the bad reports from recent months? They are still there, but we won't know how poor the performance is until sometime in the future, when they are finally approved. This is clearly shown if I include the "future" data in the plot:
This is the same data set but with the future performance added in. Our sense of improvement was just an illusion! The solid blue line shows the cut-off: any document that takes longer than this line to traverse the system is still open, and so is invisible to us today.
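The cut-off line is just arithmetic: a document entered on day d with duration t is visible today only if d + t < today. A small helper (names are mine) makes the censoring explicit:

```python
import numpy as np

def split_by_visibility(data, today):
    """Split (entry_day, duration) rows into documents already approved
    by `today` (visible) and those still open (invisible)."""
    data = np.asarray(data)
    approved = data[:, 0] + data[:, 1] < today
    return data[approved], data[~approved]

# Entered day 100, took 5 days -> long since approved, so visible.
# Entered day 360, takes 10 days -> still open on day 365, so invisible.
visible, invisible = split_by_visibility([(100, 5), (360, 10)], today=365)
```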
I used randomly sampled data here, but this is exactly what we see with the real data. At any given moment we don't know if we are getting better, because we don't have an accurate accounting of our worst performance yet. By excluding these data points, for the seemingly good reason that they don't exist yet, we convince ourselves that performance is improving. The present will always look better than the past, for any arbitrary choice of present.
How could we do this better? I don't know, actually. We could track our median performance, which shouldn't change much with or without the poor performers included, but do we really care about our median performance? Unless our performance is terrible across the board, we are actually most interested in those extreme outliers, so that we can work to eliminate them.
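To illustrate why the median barely moves, here is a toy example (the numbers are made up): adding two extreme, still-hidden outliers leaves the median untouched, while the mean blows up.

```python
import numpy as np

visible_durations = np.array([1, 1, 2, 2, 3])  # documents already approved
still_open = np.array([150, 200])              # the hidden outliers
all_durations = np.concatenate([visible_durations, still_open])

print(np.median(visible_durations), np.median(all_durations))  # 2.0 2.0
# The means tell a very different story: the outliers dominate.
print(np.mean(visible_durations), np.mean(all_durations))
```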
One possibility is to not care exactly how long it takes to complete a report and instead partition the data around a particular target. Suppose we want all reports to be approved within a 10 day window. We can then show all data up to 10 days ago, and look at the percentage of reports that PASS versus those that FAIL. This lets us see the outliers, because we no longer care how much of an outlier each one is, merely that it is one. It gives us a better indicator of our performance, but at a cost: we have to pick an (arbitrary) PASS/FAIL criterion.
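A sketch of that PASS/FAIL metric under a hypothetical 10 day target (function and variable names are mine). Only documents entered at least 10 days ago are scored, and note that a still-open document in that window has necessarily already blown past the target, so the censoring no longer hides the failures:

```python
import numpy as np

def pass_rate(data, today, target=10):
    """Fraction of documents, among those entered at least `target` days
    before `today`, that were approved within `target` days.  `data` holds
    (entry_day, duration) rows; a document still open after `target` days
    counts as FAIL even though its final duration is not yet known.
    """
    data = np.asarray(data)
    scored = data[data[:, 0] <= today - target]  # old enough to judge
    return np.mean(scored[:, 1] <= target)

# Two scorable documents: approved in 5 days (PASS) and 20 days (FAIL).
# The day-360 document is too recent to judge and is excluded.
rate = pass_rate([(0, 5), (5, 20), (360, 1)], today=365)  # -> 0.5
```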
Putting aside the specifics of this particular metric, I think this highlights the importance of missing data. In this case the missing data has a structure that matters a great deal: the probability of a given data point being missing depends strongly on how close it is to the present.
As usual, the IPython notebook is on GitHub.