This is one of those questions I could probably answer with research, but I'm lazy so I am going to do simulations. Anyways, it came up in my life to check a data-set to see if the values are normally distributed. There are a couple of ways of doing this (I lean towards doing a KS test) but one that was recommended was to do a $\chi^2$ test. Of course the $\chi^2$ test typically requires the data to be in discrete bins, and this got me thinking: surely the test itself is highly dependent upon the bin size I choose so, presumably, I could fiddle with that variable to get whatever answer I wanted. Presumably.

This is not a question I have an immediate answer to. My old statistics book does do $\chi^2$ tests on *already* binned, normally distributed data with no discussion of how you arrive at those bins (most of the chapter on $\chi^2$ testing for goodness of fit is devoted to discrete distributions). A quick google search turned up a bunch of people arguing about what the best method is, but I didn't immediately find an answer to my question *per se*.

I could, however, do some simulations and see what that looks like. It won't tell me what is optimal or best in some abstruse mathematical sense but it could give me a sense for what "works" (for some definition of "works"). Sort of analogously to back when I tried out different ways of parameter estimation to see whether the differences were really spectacular in a few particular cases.

### The Plan of Attack

I am going to take a range of bin counts, from 4 bins to 50 going up by 2, and for each bin count I will

- Randomly draw 100 data points $ Y_{i} \sim N \left( \mu = 1, \sigma^2 = 4 \right)$,
- Calculate $ \hat{\mu} = \bar{Y} $ and $ \hat{\sigma}^2 = \frac{1}{99} \sum_{i=1}^{100} \left( Y_i - \bar{Y} \right)^{2} $
- Calculate the Pearson $c$ statistic relative to a normal distribution $ N\left( \hat{\mu}, \hat{\sigma}^{2} \right)$,
- Calculate the critical value $\chi^{2}_{0.95, k-3}$, where $k$ is the number of bins,
- Reject the hypothesis that the data is normally distributed when $c \ge \chi^{2}_{0.95, k-3}$.

Then I repeat this procedure 10,000 times and add up how many times the null hypothesis got rejected. Basically I am counting how many times I commit a type I error.
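The procedure above can be sketched in a few lines of Python. Note that the choice of bin edges is an assumption on my part (the function name and the equal-probability binning under the fitted normal are mine, not necessarily what the notebook does); I use bins that each hold equal probability under the fitted $N(\hat{\mu}, \hat{\sigma}^2)$, so every bin has the same expected count of $n/k$.

```python
import numpy as np
from scipy import stats

def type1_rate(n_bins, n=100, n_sims=10_000, mu=1.0, sigma=2.0, seed=0):
    """Estimate the type I error rate of the chi-squared GOF test
    for a given number of bins, by simulation.

    Assumption: bin edges are equal-probability quantiles of the
    *fitted* normal, so each bin has expected count n / n_bins.
    """
    rng = np.random.default_rng(seed)
    # df = k - 1 - 2, since two parameters (mu, sigma) are estimated
    crit = stats.chi2.ppf(0.95, df=n_bins - 3)
    rejections = 0
    for _ in range(n_sims):
        y = rng.normal(mu, sigma, size=n)
        mu_hat, s_hat = y.mean(), y.std(ddof=1)
        # bin edges: quantiles of N(mu_hat, s_hat^2); outer edges are +-inf
        edges = stats.norm.ppf(np.linspace(0, 1, n_bins + 1), mu_hat, s_hat)
        observed, _ = np.histogram(y, bins=edges)
        expected = n / n_bins
        c = ((observed - expected) ** 2 / expected).sum()
        if c >= crit:
            rejections += 1
    return rejections / n_sims
```

Equal-width bins over the data's range would be another reasonable choice, and part of the point of this exercise is that such choices matter.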

### Running the Simulation and Collecting the Results

After the simulation I end up with a huge table of results that I need to aggregate in some meaningful way. I am going to make a few assumptions right off the bat.

- For bin count $k$, the number of rejections $ X_{k} \sim B \left( n=10000, p_{k} \right) $,
- I can estimate the probability of rejection $ \hat{p}_{k} = { x_{k} \over 10000 } $ and its standard error $ SE_{\hat{p}_{k}} = \sqrt{ { \hat{p}_{k} ( 1 - \hat{p}_{k} ) \over 10000 }}$,
- Assuming 10,000 trials is enough to approximate the distribution of $\hat{p}_{k}$ as normal, the 95% confidence interval is $ \hat{p}_{k} \pm 1.96 \cdot SE_{\hat{p}_{k}} $.
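This aggregation is simple enough to write out directly. A minimal sketch (the function name `rejection_ci` is mine):

```python
from math import sqrt

def rejection_ci(x_k, n_sims=10_000, z=1.96):
    """Point estimate and normal-approximation 95% CI for the
    rejection probability, given x_k rejections out of n_sims trials."""
    p_hat = x_k / n_sims
    se = sqrt(p_hat * (1 - p_hat) / n_sims)
    return p_hat, p_hat - z * se, p_hat + z * se
```

For example, 1,000 rejections out of 10,000 trials gives $\hat{p} = 0.10$ with a confidence interval of roughly $0.10 \pm 0.006$.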

Of course one set of data points isn't super useful, so I ran the simulation again with 150 samples and 200 samples. These are plotted below.

Obviously I can't generalize a whole lot from this, since this is just one small set of simulations. From these results it appears that the actual rejection rate (which is a false rejection, mind you; the data *is* normally distributed) increases with the number of bins up to some mid-range value, then flattens out around 10%. This is far from reassuring, as it implies that, with a change in bin size, you can almost double the chance that you reject the null hypothesis (i.e. conclude your data is not normal). Note also that since the data was deliberately sampled from a normal distribution, I am not considering at all the possibility that you could *mistake* your data for being normal simply because you binned it in a particular way.

I think this provides a good reason for using a KS-test (or something else), as it does not rely on binning to make its decision. There isn't that extra knob to twiddle until you get the results you want.

As usual, the IPython notebook is on GitHub.